linux-unionfs.vger.kernel.org archive mirror
* [PATCH 00/34] fs: idmapped mounts
@ 2020-10-29  0:32 Christian Brauner
  2020-10-29  0:32 ` [PATCH 01/34] namespace: take lock_mount_hash() directly when changing flags Christian Brauner
                   ` (37 more replies)
  0 siblings, 38 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

Hey everyone,

I vanished for a little while to focus on this work here so sorry for
not being available by mail for a while.

For quite a long time we have had issues with sharing mounts between
multiple unprivileged containers with different id mappings, sharing a
rootfs between multiple containers with different id mappings, and also
sharing regular directories and filesystems between users with different
uids and gids. The latter use-cases have become even more important with
the availability and adoption of systemd-homed (cf. [1]) to implement
portable home directories.

The solutions we have tried and proposed so far include the introduction
of fsid mappings, a tiny overlay based filesystem, and an approach to
call override creds in the vfs. None of these solutions have covered all
of the above use-cases.

The solution proposed here has its origins in multiple discussions
at Linux Plumbers 2017, during and after the containers
microconference.
To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
James, and myself. A variant of the solution proposed here has also been
discussed, again to the best of my knowledge, after a Linux conference
in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
after Linux Plumbers.
I've taken the time to finally implement a working version of this
solution over the last weeks to the best of my abilities. Tycho has
signed up for this slightly crazy endeavour as well and he has helped
with the conversion of the xattr codepaths.

The core idea is to make idmappings a property of struct vfsmount
instead of tying it to a process being inside of a user namespace which
has been the case for all other proposed approaches.
It means that idmappings become a property of bind-mounts, i.e. each
bind-mount can have a separate idmapping. This has the obvious advantage
that idmapped mounts can be created inside of the initial user
namespace, i.e. on the host itself instead of requiring the caller to be
located inside of a user namespace. This enables such use-cases as e.g.
making a usb stick available in multiple locations with different
idmappings (see the vfat port that is part of this patch series).

The vfsmount struct gains a new struct user_namespace member. The
idmapping of the user namespace becomes the idmapping of the mount. A
caller that is either privileged with respect to the user namespace of
the superblock of the underlying filesystem or a caller that is
privileged with respect to the user namespace a mount has been idmapped
with can create a new bind-mount and mark it with a user namespace. The
user namespace the mount will be marked with can be specified by passing
a file descriptor referring to the user namespace as an argument to the
new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes should be observed. All mapping
operations are nops for the initial user namespace.

When a file/inode is accessed through an idmapped mount the i_uid and
i_gid of the inode will be remapped according to the user namespace the
mount has been marked with. When a new object is created based on the
fsuid and fsgid of the caller they will similarly be remapped according
to the user namespace of the mount they are created from.
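
As a rough userspace illustration of the remapping described above (not
kernel code from this series), a single mapping extent can be modelled
as follows. The struct, the function names, and the choice of
UINT32_MAX as the "no mapping" marker are assumptions made for this
sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Marker for "no mapping exists"; a choice made for this sketch. */
#define SKETCH_INVALID_ID UINT32_MAX

/* Hypothetical model of one idmapping extent of a mount. */
struct sketch_idmap {
	uint32_t disk_first; /* first id as stored in the inode */
	uint32_t mnt_first;  /* first id as seen through the mount */
	uint32_t count;      /* number of consecutive ids covered */
};

/* Reading: translate a stored i_uid/i_gid into the id seen through the mount. */
uint32_t sketch_id_into_mnt(const struct sketch_idmap *map, uint32_t disk_id)
{
	assert(map != NULL);
	if (disk_id < map->disk_first || disk_id - map->disk_first >= map->count)
		return SKETCH_INVALID_ID;
	return map->mnt_first + (disk_id - map->disk_first);
}

/* Creating: translate the caller's fsuid/fsgid back into the id to store. */
uint32_t sketch_id_from_mnt(const struct sketch_idmap *map, uint32_t mnt_id)
{
	assert(map != NULL);
	if (mnt_id < map->mnt_first || mnt_id - map->mnt_first >= map->count)
		return SKETCH_INVALID_ID;
	return map->disk_first + (mnt_id - map->mnt_first);
}
```

With the extent { 1000, 1001, 1 }, sketch_id_into_mnt() turns a stored
uid of 1000 into 1001, and sketch_id_from_mnt() turns a caller fsuid of
1001 back into 1000. The stored value itself is only read, never
rewritten.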

This means the user namespace of the mount needs to be passed down into
a few relevant inode_operations. This mostly includes inode operations
that create filesystem objects or change file attributes. Some of them
such as ->getattr() don't even need to change since they pass down a
struct path and thus the struct vfsmount is already available. Other
inode operations need to be adapted to pass down the user namespace the
vfsmount has been marked with. Al was nice enough to point out that he
will not tolerate struct vfsmount being passed to filesystems and that I
should pass down the user namespace directly; which is what I did.
The inode struct itself is never altered whenever the i_uid and i_gid
need to be mapped, i.e. i_uid and i_gid are only remapped at the time of
the check. An inode once initialized (during lookup or object creation)
is never altered when accessed through an idmapped mount.

To limit the amount of noise in this first iteration we have not changed
the existing inode operations but rather introduced a few new struct
inode operation methods such as ->mkdir_mapped which pass down the user
namespace of the mount they have been called from. Should this solution
be worth pursuing we have no problem adapting the existing inode
operations instead.

In order to support idmapped mounts, filesystems need to be changed and
mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. In this first
iteration I tried to illustrate this by changing three different
filesystems with different levels of complexity, of course with some bias
towards urgent use-cases and filesystems I was at least a little more
familiar with. However, Tycho and I (and others) have no problem
converting each filesystem one-by-one. This first iteration includes fat
(msdos and vfat), ext4, and overlayfs (both with idmapped lower and
upper directories and idmapped merged directories). I'm sure I haven't
gotten everything right for all three of them in the first version of
this patch.

I have written a simple tool (cf. [2]) for creating idmapped mounts so
people can play with this patch series. Here are a few illustrations:

1. Create a simple idmapped mount of another user's home directory

u1001@f2-vm:/$ sudo ./mount-idmapped --map-mount b:1000:1001:1 /home/ubuntu/ /mnt
u1001@f2-vm:/$ ls -al /home/ubuntu/
total 28
drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ ls -al /mnt/
total 28
drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ touch /mnt/my-file
u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file
u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file
u1001@f2-vm:/$ ls -al /mnt/my-file
-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file
u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file
u1001@f2-vm:/$ getfacl /mnt/my-file
getfacl: Removing leading '/' from absolute path names
# file: mnt/my-file
# owner: u1001
# group: u1001
user::rw-
user:u1001:rwx
group::rw-
mask::rwx
other::r--
u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
getfacl: Removing leading '/' from absolute path names
# file: home/ubuntu/my-file
# owner: ubuntu
# group: ubuntu
user::rw-
user:ubuntu:rwx
group::rw-
mask::rwx
other::r--

2. Create mapping of the whole ext4 rootfs without a mapping for uid and gid 0

ubuntu@f2-vm:~$ sudo /mount-idmapped --map-mount b:1:1:65536 / /mnt/
ubuntu@f2-vm:~$ findmnt | grep mnt
└─/mnt                                /dev/sda2  ext4       rw,relatime
  └─/mnt/mnt                          /dev/sda2  ext4       rw,relatime
ubuntu@f2-vm:~$ sudo mkdir /AS-ROOT-CAN-CREATE
ubuntu@f2-vm:~$ sudo mkdir /mnt/AS-ROOT-CANT-CREATE
mkdir: cannot create directory ‘/mnt/AS-ROOT-CANT-CREATE’: Value too large for defined data type
ubuntu@f2-vm:~$ mkdir /mnt/home/ubuntu/AS-USER-1000-CAN-CREATE

3. Create a vfat usb mount and expose to user 1001 and 5000

ubuntu@f2-vm:/$ sudo mount /dev/sdb /mnt
ubuntu@f2-vm:/$ findmnt  | grep mnt
└─/mnt                                /dev/sdb vfat       rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro
ubuntu@f2-vm:/$ ls -al /mnt
total 12
drwxr-xr-x  2 root root 4096 Jan  1  1970 .
drwxr-xr-x 34 root root 4096 Oct 28 22:24 ..
-rwxr-xr-x  1 root root    4 Oct 28 03:44 aaa
-rwxr-xr-x  1 root root    0 Oct 28 01:09 bbb
ubuntu@f2-vm:/$ sudo /mount-idmapped --map-mount b:0:1001:1 /mnt /mnt-1001/
ubuntu@f2-vm:/$ ls -al /mnt-1001/
total 12
drwxr-xr-x  2 u1001 u1001 4096 Jan  1  1970 .
drwxr-xr-x 34 root  root  4096 Oct 28 22:24 ..
-rwxr-xr-x  1 u1001 u1001    4 Oct 28 03:44 aaa
-rwxr-xr-x  1 u1001 u1001    0 Oct 28 01:09 bbb
ubuntu@f2-vm:/$ sudo /mount-idmapped --map-mount b:0:5000:1 /mnt /mnt-5000/
ubuntu@f2-vm:/$ ls -al /mnt-5000/
total 12
drwxr-xr-x  2 5000 5000 4096 Jan  1  1970 .
drwxr-xr-x 34 root root 4096 Oct 28 22:24 ..
-rwxr-xr-x  1 5000 5000    4 Oct 28 03:44 aaa
-rwxr-xr-x  1 5000 5000    0 Oct 28 01:09 bbb

4. Create an idmapped rootfs mount for a container

root@f2-vm:~# ls -al /var/lib/lxc/f2/rootfs/
total 68
drwxr-xr-x 17 20000 20000 4096 Sep 24 07:48 .
drwxrwx---  3 20000 20000 4096 Oct 16 19:26 ..
lrwxrwxrwx  1 20000 20000    7 Sep 24 07:43 bin -> usr/bin
drwxr-xr-x  2 20000 20000 4096 Apr 15  2020 boot
drwxr-xr-x  3 20000 20000 4096 Oct 16 19:26 dev
drwxr-xr-x 61 20000 20000 4096 Oct 16 19:26 etc
drwxr-xr-x  3 20000 20000 4096 Sep 24 07:45 home
lrwxrwxrwx  1 20000 20000    7 Sep 24 07:43 lib -> usr/lib
lrwxrwxrwx  1 20000 20000    9 Sep 24 07:43 lib32 -> usr/lib32
lrwxrwxrwx  1 20000 20000    9 Sep 24 07:43 lib64 -> usr/lib64
lrwxrwxrwx  1 20000 20000   10 Sep 24 07:43 libx32 -> usr/libx32
drwxr-xr-x  2 20000 20000 4096 Sep 24 07:43 media
drwxr-xr-x  2 20000 20000 4096 Sep 24 07:43 mnt
drwxr-xr-x  2 20000 20000 4096 Sep 24 07:43 opt
drwxr-xr-x  2 20000 20000 4096 Apr 15  2020 proc
drwx------  2 20000 20000 4096 Sep 24 07:43 root
drwxr-xr-x  2 20000 20000 4096 Sep 24 07:45 run
lrwxrwxrwx  1 20000 20000    8 Sep 24 07:43 sbin -> usr/sbin
drwxr-xr-x  2 20000 20000 4096 Sep 24 07:43 srv
drwxr-xr-x  2 20000 20000 4096 Apr 15  2020 sys
drwxrwxrwt  2 20000 20000 4096 Sep 24 07:44 tmp
drwxr-xr-x 13 20000 20000 4096 Sep 24 07:43 usr
drwxr-xr-x 12 20000 20000 4096 Sep 24 07:44 var
root@f2-vm:~# /mount-idmapped --map-mount b:20000:10000:100000 /var/lib/lxc/f2/rootfs/ /mnt
root@f2-vm:~# ls -al /mnt
total 68
drwxr-xr-x 17 10000 10000 4096 Sep 24 07:48 .
drwxr-xr-x 34 root  root  4096 Oct 28 22:24 ..
lrwxrwxrwx  1 10000 10000    7 Sep 24 07:43 bin -> usr/bin
drwxr-xr-x  2 10000 10000 4096 Apr 15  2020 boot
drwxr-xr-x  3 10000 10000 4096 Oct 16 19:26 dev
drwxr-xr-x 61 10000 10000 4096 Oct 16 19:26 etc
drwxr-xr-x  3 10000 10000 4096 Sep 24 07:45 home
lrwxrwxrwx  1 10000 10000    7 Sep 24 07:43 lib -> usr/lib
lrwxrwxrwx  1 10000 10000    9 Sep 24 07:43 lib32 -> usr/lib32
lrwxrwxrwx  1 10000 10000    9 Sep 24 07:43 lib64 -> usr/lib64
lrwxrwxrwx  1 10000 10000   10 Sep 24 07:43 libx32 -> usr/libx32
drwxr-xr-x  2 10000 10000 4096 Sep 24 07:43 media
drwxr-xr-x  2 10000 10000 4096 Sep 24 07:43 mnt
drwxr-xr-x  2 10000 10000 4096 Sep 24 07:43 opt
drwxr-xr-x  2 10000 10000 4096 Apr 15  2020 proc
drwx------  2 10000 10000 4096 Sep 24 07:43 root
drwxr-xr-x  2 10000 10000 4096 Sep 24 07:45 run
lrwxrwxrwx  1 10000 10000    8 Sep 24 07:43 sbin -> usr/sbin
drwxr-xr-x  2 10000 10000 4096 Sep 24 07:43 srv
drwxr-xr-x  2 10000 10000 4096 Apr 15  2020 sys
drwxrwxrwt  2 10000 10000 4096 Sep 24 07:44 tmp
drwxr-xr-x 13 10000 10000 4096 Sep 24 07:43 usr
drwxr-xr-x 12 10000 10000 4096 Sep 24 07:44 var
root@f2-vm:~# lxc-start f2 # uses /mnt as rootfs
root@f2-vm:~# lxc-attach f2 -- cat /proc/1/uid_map
         0      10000      10000
root@f2-vm:~# lxc-attach f2 -- cat /proc/1/gid_map
         0      10000      10000
root@f2-vm:~# lxc-attach f2 -- ls -al /
total 52
drwxr-xr-x  17 root   root    4096 Sep 24 07:48 .
drwxr-xr-x  17 root   root    4096 Sep 24 07:48 ..
lrwxrwxrwx   1 root   root       7 Sep 24 07:43 bin -> usr/bin
drwxr-xr-x   2 root   root    4096 Apr 15  2020 boot
drwxr-xr-x   5 root   root     500 Oct 28 23:39 dev
drwxr-xr-x  61 root   root    4096 Oct 28 23:39 etc
drwxr-xr-x   3 root   root    4096 Sep 24 07:45 home
lrwxrwxrwx   1 root   root       7 Sep 24 07:43 lib -> usr/lib
lrwxrwxrwx   1 root   root       9 Sep 24 07:43 lib32 -> usr/lib32
lrwxrwxrwx   1 root   root       9 Sep 24 07:43 lib64 -> usr/lib64
lrwxrwxrwx   1 root   root      10 Sep 24 07:43 libx32 -> usr/libx32
drwxr-xr-x   2 root   root    4096 Sep 24 07:43 media
drwxr-xr-x   2 root   root    4096 Sep 24 07:43 mnt
drwxr-xr-x   2 root   root    4096 Sep 24 07:43 opt
dr-xr-xr-x 232 nobody nogroup    0 Oct 28 23:39 proc
drwx------   2 root   root    4096 Oct 28 23:41 root
drwxr-xr-x  12 root   root     360 Oct 28 23:39 run
lrwxrwxrwx   1 root   root       8 Sep 24 07:43 sbin -> usr/sbin
drwxr-xr-x   2 root   root    4096 Sep 24 07:43 srv
dr-xr-xr-x  13 nobody nogroup    0 Oct 28 23:39 sys
drwxrwxrwt  11 root   root    4096 Oct 28 23:40 tmp
drwxr-xr-x  13 root   root    4096 Sep 24 07:43 usr
drwxr-xr-x  12 root   root    4096 Sep 24 07:44 var
root@f2-vm:~# lxc-attach f2 -- ls -al /my-file
-rw-r--r-- 1 root root 0 Oct 28 23:43 /my-file
root@f2-vm:~# ls -al /var/lib/lxc/f2/rootfs/my-file
-rw-r--r-- 1 20000 20000 0 Oct 28 23:43 /var/lib/lxc/f2/rootfs/my-file
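
The numbers in this example compose two translations: the mount maps
on-disk id 20000 to 10000, and the container's uid_map line
"0 10000 10000" maps host id 10000 to container id 0. The sketch below
models that composition in userspace. All names are hypothetical and the
code is an illustration of the arithmetic, not of the kernel
implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INVALID_ID UINT32_MAX

/* One extent: ids [from, from + count) correspond to [to, to + count). */
struct sketch_extent {
	uint32_t from;
	uint32_t to;
	uint32_t count;
};

/* Translate from the "from" side to the "to" side of an extent. */
uint32_t sketch_forward(const struct sketch_extent *e, uint32_t id)
{
	assert(e != NULL);
	if (id == SKETCH_INVALID_ID || id < e->from || id - e->from >= e->count)
		return SKETCH_INVALID_ID;
	return e->to + (id - e->from);
}

/* Translate from the "to" side back to the "from" side of an extent. */
uint32_t sketch_backward(const struct sketch_extent *e, uint32_t id)
{
	assert(e != NULL);
	if (id == SKETCH_INVALID_ID || id < e->to || id - e->to >= e->count)
		return SKETCH_INVALID_ID;
	return e->from + (id - e->to);
}

/* On-disk id -> id seen through the mount -> id seen inside the container. */
uint32_t sketch_disk_to_container(const struct sketch_extent *mnt,
				  const struct sketch_extent *userns,
				  uint32_t disk_id)
{
	return sketch_forward(userns, sketch_forward(mnt, disk_id));
}

/* Container id -> host id -> id stored on disk for a newly created object. */
uint32_t sketch_container_to_disk(const struct sketch_extent *mnt,
				  const struct sketch_extent *userns,
				  uint32_t container_id)
{
	return sketch_backward(mnt, sketch_backward(userns, container_id));
}
```

With mnt = { 20000, 10000, 100000 } and userns = { 10000, 0, 10000 },
on-disk id 20000 shows up as 0 inside the container, and a file created
by container root is stored with id 20000, matching the two ls outputs
for my-file above.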

[1]: https://systemd.io/HOME_DIRECTORY/
     "If the UID assigned to a user does not match the owner of the home
      directory in the file system, the home directory is automatically
      and recursively chown()ed to the correct UID."
      This has huge performance impact and is also problematic since it
      chowns all files independent of ownership.
[2]: https://github.com/brauner/mount-idmapped

In no particular order I'd like to say thanks to:
Al for pointing me in the right direction to avoid inode alias issues during
lookup. David for various discussions around this. Tycho for helping
with this series and on future patches if this is in any shape or form
acceptable. Alban Crequy for pointing out more application container
use-cases. Stéphane for various valuable input on various use-cases and
letting me work on this. Amir for explaining and discussing aspects of
overlayfs with me.
I'd like to especially thank Seth Forshee because he provided a lot of
good analysis, suggestions, and participated in short-notice discussions
in both chat and video.

This series can be found and pulled in three locations:
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=idmapped_mounts
https://github.com/brauner/linux/tree/idmapped_mounts
https://gitlab.com/brauner/linux/-/commits/idmapped_mounts

Thanks!
Christian

Christian Brauner (32):
  namespace: take lock_mount_hash() directly when changing flags
  namespace: only take read lock in do_reconfigure_mnt()
  fs: add mount_setattr()
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add id translation helpers
  capability: handle idmapped mounts
  namei: add idmapped mount aware permission helpers
  inode: add idmapped mount aware init and permission helpers
  attr: handle idmapped mounts
  acl: handle idmapped mounts
  commoncap: handle idmapped mounts
  stat: add mapped_generic_fillattr()
  namei: handle idmapped mounts in may_*() helpers
  namei: introduce struct renamedata
  namei: prepare for idmapped mounts
  namei: add lookup helpers with idmapped mounts aware permission
    checking
  open: handle idmapped mounts in do_truncate()
  open: handle idmapped mounts
  af_unix: handle idmapped mounts
  utimes: handle idmapped mounts
  would_dump: handle idmapped mounts
  exec: handle idmapped mounts
  fs: add helpers for idmap mounts
  apparmor: handle idmapped mounts
  audit: handle idmapped mounts
  ima: handle idmapped mounts
  ext4: support idmapped mounts
  expfs: handle idmapped mounts
  overlayfs: handle idmapped lower directories
  overlayfs: handle idmapped merged mounts
  fat: handle idmapped mounts

Tycho Andersen (2):
  xattr: handle idmapped mounts
  selftests: add idmapped mounts xattr selftest

 arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
 arch/arm/tools/syscall.tbl                    |   1 +
 arch/arm64/include/asm/unistd32.h             |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
 arch/s390/kernel/syscalls/syscall.tbl         |   1 +
 arch/sh/kernel/syscalls/syscall.tbl           |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
 fs/Kconfig                                    |   6 +
 fs/attr.c                                     | 142 ++-
 fs/coredump.c                                 |  12 +-
 fs/exec.c                                     |  12 +-
 fs/exportfs/expfs.c                           |   4 +-
 fs/ext4/acl.c                                 |  11 +-
 fs/ext4/acl.h                                 |   3 +
 fs/ext4/ext4.h                                |  14 +-
 fs/ext4/file.c                                |   4 +
 fs/ext4/ialloc.c                              |   7 +-
 fs/ext4/inode.c                               |  27 +-
 fs/ext4/ioctl.c                               |  18 +-
 fs/ext4/namei.c                               | 145 ++-
 fs/ext4/super.c                               |   4 +
 fs/ext4/symlink.c                             |   9 +
 fs/ext4/xattr_hurd.c                          |  22 +-
 fs/ext4/xattr_security.c                      |  18 +-
 fs/ext4/xattr_trusted.c                       |  18 +-
 fs/fat/fat.h                                  |   2 +
 fs/fat/file.c                                 |  27 +-
 fs/fat/namei_msdos.c                          |   7 +
 fs/fat/namei_vfat.c                           |   7 +
 fs/inode.c                                    |  66 +-
 fs/internal.h                                 |   9 +
 fs/namei.c                                    | 597 ++++++++----
 fs/namespace.c                                | 446 ++++++++-
 fs/open.c                                     |  52 +-
 fs/overlayfs/copy_up.c                        | 104 +-
 fs/overlayfs/dir.c                            | 219 +++--
 fs/overlayfs/export.c                         |   3 +-
 fs/overlayfs/file.c                           |  23 +-
 fs/overlayfs/inode.c                          | 121 ++-
 fs/overlayfs/namei.c                          |  64 +-
 fs/overlayfs/overlayfs.h                      | 158 +++-
 fs/overlayfs/ovl_entry.h                      |   1 +
 fs/overlayfs/readdir.c                        |  34 +-
 fs/overlayfs/super.c                          | 109 ++-
 fs/overlayfs/util.c                           |  38 +-
 fs/posix_acl.c                                | 130 ++-
 fs/stat.c                                     |  18 +-
 fs/utimes.c                                   |   4 +-
 fs/xattr.c                                    | 264 ++++--
 include/linux/audit.h                         |  10 +-
 include/linux/capability.h                    |  12 +-
 include/linux/fs.h                            | 254 ++++-
 include/linux/ima.h                           |  15 +-
 include/linux/lsm_hook_defs.h                 |  10 +-
 include/linux/lsm_hooks.h                     |   1 +
 include/linux/mount.h                         |  20 +-
 include/linux/namei.h                         |   6 +
 include/linux/posix_acl.h                     |  14 +-
 include/linux/posix_acl_xattr.h               |  12 +-
 include/linux/security.h                      |  36 +-
 include/linux/syscalls.h                      |   3 +
 include/linux/xattr.h                         |  29 +
 include/uapi/asm-generic/unistd.h             |   4 +-
 include/uapi/linux/mount.h                    |  26 +
 ipc/mqueue.c                                  |   8 +-
 kernel/auditsc.c                              |  29 +-
 kernel/capability.c                           |  22 +-
 net/unix/af_unix.c                            |   2 +-
 security/apparmor/domain.c                    |   9 +-
 security/apparmor/file.c                      |   5 +-
 security/apparmor/lsm.c                       |  12 +-
 security/commoncap.c                          |  50 +-
 security/integrity/ima/ima.h                  |  19 +-
 security/integrity/ima/ima_api.c              |  10 +-
 security/integrity/ima/ima_appraise.c         |  14 +-
 security/integrity/ima/ima_asymmetric_keys.c  |   2 +-
 security/integrity/ima/ima_main.c             |  28 +-
 security/integrity/ima/ima_policy.c           |  17 +-
 security/integrity/ima/ima_queue_keys.c       |   2 +-
 security/security.c                           |  18 +-
 security/selinux/hooks.c                      |  13 +-
 security/smack/smack_lsm.c                    |  11 +-
 tools/include/uapi/asm-generic/unistd.h       |   4 +-
 tools/testing/selftests/Makefile              |   1 +
 .../testing/selftests/idmap_mounts/.gitignore |   1 +
 tools/testing/selftests/idmap_mounts/Makefile |   8 +
 tools/testing/selftests/idmap_mounts/config   |   1 +
 tools/testing/selftests/idmap_mounts/xattr.c  | 389 ++++++++
 .../selftests/mount_setattr/.gitignore        |   1 +
 .../testing/selftests/mount_setattr/Makefile  |   7 +
 tools/testing/selftests/mount_setattr/config  |   1 +
 .../mount_setattr/mount_setattr_test.c        | 888 ++++++++++++++++++
 102 files changed, 4109 insertions(+), 912 deletions(-)
 create mode 100644 tools/testing/selftests/idmap_mounts/.gitignore
 create mode 100644 tools/testing/selftests/idmap_mounts/Makefile
 create mode 100644 tools/testing/selftests/idmap_mounts/config
 create mode 100644 tools/testing/selftests/idmap_mounts/xattr.c
 create mode 100644 tools/testing/selftests/mount_setattr/.gitignore
 create mode 100644 tools/testing/selftests/mount_setattr/Makefile
 create mode 100644 tools/testing/selftests/mount_setattr/config
 create mode 100644 tools/testing/selftests/mount_setattr/mount_setattr_test.c


base-commit: 3650b228f83adda7e5ee532e2b90429c03f7b9ec
-- 
2.29.0



* [PATCH 01/34] namespace: take lock_mount_hash() directly when changing flags
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-11-01 14:41   ` Christoph Hellwig
  2020-10-29  0:32 ` [PATCH 02/34] namespace: only take read lock in do_reconfigure_mnt() Christian Brauner
                   ` (36 subsequent siblings)
  37 siblings, 1 reply; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)

Changing mount options always ends up taking lock_mount_hash(), but when
MNT_READONLY is requested and neither the mount nor the superblock is
already read-only we end up taking the lock, dropping it, and
retaking it to change the other mount attributes. Instead of this,
acquire the lock once when changing mount properties. This simplifies
the locking in these codepaths, makes them easier to reason about, and
avoids having to reacquire the lock right after dropping it.
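
The pattern can be sketched in userspace with a pthread mutex standing
in for lock_mount_hash(). Everything below (the names, the flag value,
the acquisition counter) is hypothetical and only illustrates "helpers
assume the lock is held, the caller takes it once"; the real helpers in
fs/namespace.c do considerably more.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_READONLY 0x1u

struct sketch_mount {
	unsigned int flags;
};

pthread_mutex_t sketch_mount_lock = PTHREAD_MUTEX_INITIALIZER;
unsigned int sketch_lock_acquisitions; /* counts lock round trips */

/* Caller must hold sketch_mount_lock. */
void sketch_set_readonly(struct sketch_mount *mnt, bool readonly)
{
	if (readonly)
		mnt->flags |= SKETCH_READONLY;
	else
		mnt->flags &= ~SKETCH_READONLY;
}

/* Caller must hold sketch_mount_lock. */
void sketch_set_attributes(struct sketch_mount *mnt, unsigned int flags)
{
	mnt->flags = flags;
}

/* Take the lock once around both steps instead of once per helper. */
void sketch_reconfigure(struct sketch_mount *mnt, unsigned int flags)
{
	assert(mnt != NULL);
	pthread_mutex_lock(&sketch_mount_lock);
	sketch_lock_acquisitions++;
	sketch_set_readonly(mnt, (flags & SKETCH_READONLY) != 0);
	sketch_set_attributes(mnt, flags);
	pthread_mutex_unlock(&sketch_mount_lock);
}
```

A single sketch_reconfigure() call updates both the read-only state and
the remaining attributes with one lock acquisition.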

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/namespace.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index cebaa3e81794..20ee291a7af4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -463,7 +463,6 @@ static int mnt_make_readonly(struct mount *mnt)
 {
 	int ret = 0;
 
-	lock_mount_hash();
 	mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
 	/*
 	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -497,15 +496,12 @@ static int mnt_make_readonly(struct mount *mnt)
 	 */
 	smp_wmb();
 	mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
-	unlock_mount_hash();
 	return ret;
 }
 
 static int __mnt_unmake_readonly(struct mount *mnt)
 {
-	lock_mount_hash();
 	mnt->mnt.mnt_flags &= ~MNT_READONLY;
-	unlock_mount_hash();
 	return 0;
 }
 
@@ -2517,11 +2513,9 @@ static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
  */
 static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
 {
-	lock_mount_hash();
 	mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
 	mnt->mnt.mnt_flags = mnt_flags;
 	touch_mnt_namespace(mnt->mnt_ns);
-	unlock_mount_hash();
 }
 
 static void mnt_warn_timestamp_expiry(struct path *mountpoint, struct vfsmount *mnt)
@@ -2567,9 +2561,11 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
 		return -EPERM;
 
 	down_write(&sb->s_umount);
+	lock_mount_hash();
 	ret = change_mount_ro_state(mnt, mnt_flags);
 	if (ret == 0)
 		set_mount_attributes(mnt, mnt_flags);
+	unlock_mount_hash();
 	up_write(&sb->s_umount);
 
 	mnt_warn_timestamp_expiry(path, &mnt->mnt);
@@ -2610,8 +2606,11 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
 		err = -EPERM;
 		if (ns_capable(sb->s_user_ns, CAP_SYS_ADMIN)) {
 			err = reconfigure_super(fc);
-			if (!err)
+			if (!err) {
+				lock_mount_hash();
 				set_mount_attributes(mnt, mnt_flags);
+				unlock_mount_hash();
+			}
 		}
 		up_write(&sb->s_umount);
 	}
-- 
2.29.0



* [PATCH 02/34] namespace: only take read lock in do_reconfigure_mnt()
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
  2020-10-29  0:32 ` [PATCH 01/34] namespace: take lock_mount_hash() directly when changing flags Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 03/34] fs: add mount_setattr() Christian Brauner
                   ` (35 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)

do_reconfigure_mnt() used to take the down_write(&sb->s_umount) lock
which seems unnecessary since we're not changing the superblock. We're
only checking whether it is already read-only. Setting other mount
attributes is protected by lock_mount_hash() afaict and not by s_umount.

So I think the history of down_write(&sb->s_umount) lock being taken
when setting mount attributes dates back to the introduction of
MNT_READONLY in [2]. Afaict, this introduced the concept of having
read-only mounts in contrast to just having a read-only superblock. When
it got introduced it was simply plumbed into do_remount() which already
took down_write(&sb->s_umount) because it was only used to actually
change the superblock before [2]. Afaict, it would've already been
possible back then to only use down_read(&sb->s_umount) for
MS_BIND | MS_REMOUNT since actual mount options were protected by
the vfsmount lock already. But that would've meant special casing the
locking for MS_BIND | MS_REMOUNT in do_remount() which people might not
have considered worth it.
Then in [1] MS_BIND | MS_REMOUNT mount option changes were split out of
do_remount() into do_reconfigure_mnt() but the down_write(&sb->s_umount)
lock was simply copied over.
Now that this is a separate helper, only take
the down_read(&sb->s_umount) lock since we're only interested in
checking whether the super block is currently read-only and blocking any
writers from changing it. Essentially, checking that the super block is
read-only has the advantage that we can avoid having to go into the
slowpath and through MNT_WRITE_HOLD and can simply set the read-only
flag on the mount in set_mount_attributes().

[1]: commit 43f5e655eff7 ("vfs: Separate changing mount flags full remount")
[2]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/namespace.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 20ee291a7af4..6e0d5bb63197 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2507,10 +2507,6 @@ static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
 	return __mnt_unmake_readonly(mnt);
 }
 
-/*
- * Update the user-settable attributes on a mount.  The caller must hold
- * sb->s_umount for writing.
- */
 static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
 {
 	mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
@@ -2560,13 +2556,17 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
 	if (!can_change_locked_flags(mnt, mnt_flags))
 		return -EPERM;
 
-	down_write(&sb->s_umount);
+	/*
+	 * We're only checking whether the superblock is read-only not changing
+	 * it, so only take down_read(&sb->s_umount).
+	 */
+	down_read(&sb->s_umount);
 	lock_mount_hash();
 	ret = change_mount_ro_state(mnt, mnt_flags);
 	if (ret == 0)
 		set_mount_attributes(mnt, mnt_flags);
 	unlock_mount_hash();
-	up_write(&sb->s_umount);
+	up_read(&sb->s_umount);
 
 	mnt_warn_timestamp_expiry(path, &mnt->mnt);
 
-- 
2.29.0


* [PATCH 03/34] fs: add mount_setattr()
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
  2020-10-29  0:32 ` [PATCH 01/34] namespace: take lock_mount_hash() directly when changing flags Christian Brauner
  2020-10-29  0:32 ` [PATCH 02/34] namespace: only take read lock in do_reconfigure_mnt() Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-11-01 14:42   ` Christoph Hellwig
  2020-10-29  0:32 ` [PATCH 04/34] tests: add mount_setattr() selftests Christian Brauner
                   ` (34 subsequent siblings)
  37 siblings, 1 reply; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

This implements the mount_setattr() syscall. While the new mount api
allows changing the properties of a superblock, there is currently no
way to change the properties of a mount or mount tree using the mount
file descriptors which the new mount api is based on. In addition, the
old mount api has the restriction that mount options cannot be
applied recursively. This hasn't changed since changing mount options on
a per-mount basis was implemented in [1] and has been a frequent
request.
The legacy mount() syscall is currently unable to accommodate this
behavior without introducing a whole new set of flags because
MS_REC | MS_REMOUNT | MS_BIND | MS_RDONLY | MS_NOEXEC | [...] only
applies the mount option to the topmost mount. Changing MS_REC to apply
to the whole mount tree would mean introducing a significant uapi change
and would likely cause significant regressions.

The new mount_setattr() syscall allows recursively clearing and setting
mount options in one shot. Multiple calls requesting the same changes
are idempotent:

int mount_setattr(int dfd, const char *path, unsigned flags,
                  struct mount_attr *uattr, size_t usize);
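
As a rough illustration of the idempotency claim above, the per-mount
update can be modelled as clear-then-set on a flag word. The helper name
apply_mount_attrs() and the standalone userspace form are assumptions
made for this sketch; the kernel code works on mnt->mnt.mnt_flags while
holding the mount hash lock.

```c
#include <stdint.h>

/*
 * Hypothetical userspace model of the per-mount flag update:
 * first lower everything in attr_clr, then raise everything in attr_set.
 */
static inline uint32_t apply_mount_attrs(uint32_t cur, uint32_t attr_set,
					 uint32_t attr_clr)
{
	cur &= ~attr_clr;	/* lower flags the caller wants cleared */
	cur |= attr_set;	/* raise flags the caller wants set */
	return cur;
}
```

Applying the same (attr_set, attr_clr) pair to the result a second time
leaves the flag word unchanged, which is what makes repeated calls safe.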

Flags to modify path resolution behavior are specified in the @flags
argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
restrict path resolution as introduced with openat2() might be supported
in the future.
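
A minimal sketch of how a caller might invoke the syscall through
syscall(2), assuming the syscall number and the 24-byte struct layout
proposed in this patch (both may differ in whatever is eventually
merged). The names mount_attr_sketch, SKETCH_ATTR_*,
sketch_mount_setattr(), make_ro_exec_request() and make_tree_ro_exec()
are made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: the number this patch proposes for most architectures. */
#ifndef __NR_mount_setattr
#define __NR_mount_setattr 441
#endif

#ifndef AT_RECURSIVE
#define AT_RECURSIVE 0x8000
#endif

#define SKETCH_ATTR_RDONLY 0x00000001	/* mirrors MOUNT_ATTR_RDONLY */
#define SKETCH_ATTR_NOEXEC 0x00000008	/* mirrors MOUNT_ATTR_NOEXEC */

/* Local copy of the proposed uapi layout: three 64-bit fields. */
struct mount_attr_sketch {
	uint64_t attr_set;
	uint64_t attr_clr;
	uint64_t propagation;
};

static inline int sketch_mount_setattr(int dfd, const char *path,
				       unsigned int flags,
				       struct mount_attr_sketch *attr,
				       size_t size)
{
	return (int)syscall(__NR_mount_setattr, dfd, path, flags, attr, size);
}

/* Request: raise read-only, lower noexec, leave propagation unchanged. */
static inline struct mount_attr_sketch make_ro_exec_request(void)
{
	struct mount_attr_sketch attr = {
		.attr_set = SKETCH_ATTR_RDONLY,
		.attr_clr = SKETCH_ATTR_NOEXEC,
	};

	return attr;
}

/* Apply the request to the mount at @path and every mount below it. */
static inline int make_tree_ro_exec(const char *path)
{
	struct mount_attr_sketch attr = make_ro_exec_request();

	return sketch_mount_setattr(AT_FDCWD, path, AT_RECURSIVE,
				    &attr, sizeof(attr));
}
```

The wrapper returns 0 on success and -1 with errno set otherwise, as
syscall(2) does. Passing sizeof(attr) is what lets the kernel tell which
version of the struct the caller was built against.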

mount_setattr() can be expected to grow over time and is designed with
extensibility in mind. It follows the extensible syscall pattern we have
used with other syscalls such as openat2(), clone3(),
sched_{set,get}attr(), and others.
The set of mount options is passed in the uapi struct mount_attr which
currently has the following layout:

struct mount_attr {
	__u64 attr_set;
	__u64 attr_clr;
	__u64 propagation;
};
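
The size-versioned copy behind the extensible syscall pattern can be
sketched in userspace as follows. This models what
copy_struct_from_user() is understood to do; copy_struct_sketch() is a
hypothetical stand-in and not kernel code.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * @dst/@ksize: the struct and size the "kernel" side knows about.
 * @src/@usize: the struct and size the caller passed in.
 */
static inline int copy_struct_sketch(void *dst, size_t ksize,
				     const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *p = src;

	/* Newer caller, older kernel: unknown trailing fields must be zero. */
	for (size_t i = ksize; i < usize; i++)
		if (p[i] != 0)
			return -E2BIG;

	/* Older caller, newer kernel: fields the caller omitted become zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

With this scheme a future field can be appended to struct mount_attr
without breaking callers built against the 24-byte version, as long as
zero means "no change" for the new field.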

The @attr_set and @attr_clr members are used to set and clear mount
options. This way a user can e.g. request that a set of flags is to be
raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
@attr_set while at the same time requesting that another set of flags is
to be lowered such as removing noexec from a mount tree by specifying
MOUNT_ATTR_NOEXEC in @attr_clr.
Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
not a bitmap, users wanting to transition to a different atime setting
cannot simply specify the atime setting in @attr_set, but must also
specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
@attr_clr.
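
The atime rules above can be restated as a small standalone check. This
is a sketch of the validation only; the SKETCH_ATTR_* constants mirror
the MOUNT_ATTR_* values used in this series and check_atime_request() is
a hypothetical helper.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_ATTR__ATIME	0x00000070	/* mask covering the atime enum */
#define SKETCH_ATTR_RELATIME	0x00000000
#define SKETCH_ATTR_NOATIME	0x00000010
#define SKETCH_ATTR_STRICTATIME	0x00000020

static inline int check_atime_request(uint64_t attr_set, uint64_t attr_clr)
{
	uint64_t clr = attr_clr & SKETCH_ATTR__ATIME;
	uint64_t set = attr_set & SKETCH_ATTR__ATIME;

	/* The atime mask must be cleared completely or not at all. */
	if (clr && clr != SKETCH_ATTR__ATIME)
		return -EINVAL;

	/* A new atime mode is only accepted together with clearing the old one. */
	if (!clr && set)
		return -EINVAL;

	/* With the mask cleared, the new value must be a known mode. */
	if (clr && set != SKETCH_ATTR_RELATIME &&
	    set != SKETCH_ATTR_NOATIME &&
	    set != SKETCH_ATTR_STRICTATIME)
		return -EINVAL;

	return 0;
}
```

For example, switching to noatime means passing SKETCH_ATTR_NOATIME in
@attr_set and SKETCH_ATTR__ATIME in @attr_clr; passing only the former
is rejected.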

The @propagation field lets callers specify the propagation type of a
mount tree. Propagation is a single property that has four different
settings and as such is not really a flag argument but an enum.
Specifically, it would be unclear what setting and clearing propagation
settings in combination would amount to. The legacy mount() syscall thus
forbids the combination of multiple propagation settings too. The goal
is to keep the semantics of mount propagation somewhat simple as they
are overly complex as it is.
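
Because propagation is an enum and not a bitmask, translating it comes
down to a one-to-one mapping onto the legacy MS_* propagation flags. The
sketch below mirrors that mapping in userspace; the PROP_* names and
propagation_to_ms_flag() are hypothetical, and the MS_* fallback values
are the ones from the existing mount uapi.

```c
#include <errno.h>

#ifndef MS_UNBINDABLE
#define MS_UNBINDABLE	(1 << 17)
#endif
#ifndef MS_PRIVATE
#define MS_PRIVATE	(1 << 18)
#endif
#ifndef MS_SLAVE
#define MS_SLAVE	(1 << 19)
#endif
#ifndef MS_SHARED
#define MS_SHARED	(1 << 20)
#endif

/* Mirrors the enum proposed in this patch, under hypothetical names. */
enum propagation_sketch {
	PROP_UNCHANGED	= 0,	/* don't change mount propagation */
	PROP_UNBINDABLE	= 1,
	PROP_PRIVATE	= 2,
	PROP_DEPENDENT	= 3,
	PROP_SHARED	= 4,
};

/* Exactly one setting maps to exactly one flag; anything else is invalid. */
static inline int propagation_to_ms_flag(unsigned long long type,
					 unsigned int *ms_flag)
{
	switch (type) {
	case PROP_UNCHANGED:
		*ms_flag = 0;
		return 0;
	case PROP_UNBINDABLE:
		*ms_flag = MS_UNBINDABLE;
		return 0;
	case PROP_PRIVATE:
		*ms_flag = MS_PRIVATE;
		return 0;
	case PROP_DEPENDENT:
		*ms_flag = MS_SLAVE;
		return 0;
	case PROP_SHARED:
		*ms_flag = MS_SHARED;
		return 0;
	default:
		return -EINVAL;
	}
}
```

Since the input is a single value, there is no way to express
"shared and private" at once, which sidesteps the question of what such
a combination would mean.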

Changing mount options has quite a few moving parts and the locking is
quite intricate so it is not unlikely that I got subtleties wrong.

[1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
Cc: David Howells <dhowells@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
 arch/arm/tools/syscall.tbl                  |   1 +
 arch/arm64/include/asm/unistd32.h           |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
 arch/s390/kernel/syscalls/syscall.tbl       |   1 +
 arch/sh/kernel/syscalls/syscall.tbl         |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
 fs/internal.h                               |   8 +
 fs/namespace.c                              | 286 ++++++++++++++++++--
 include/linux/syscalls.h                    |   3 +
 include/uapi/asm-generic/unistd.h           |   4 +-
 include/uapi/linux/mount.h                  |  22 ++
 tools/include/uapi/asm-generic/unistd.h     |   4 +-
 23 files changed, 319 insertions(+), 26 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index ee7b01bb7346..24d8709624b8 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -480,3 +480,4 @@
 548	common	pidfd_getfd			sys_pidfd_getfd
 549	common	faccessat2			sys_faccessat2
 550	common	process_madvise			sys_process_madvise
+551	common	mount_setattr			sys_mount_setattr
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index d056a548358e..e3785513d445 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -454,3 +454,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mount_setattr			sys_mount_setattr
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 107f08e03b9f..78af754e070a 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -889,6 +889,8 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
 #define __NR_process_madvise 440
 __SYSCALL(__NR_process_madvise, sys_process_madvise)
+#define __NR_mount_setattr 441
+__SYSCALL(__NR_mount_setattr, sys_mount_setattr)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index b96ed8b8a508..f7d4b1f55be0 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -361,3 +361,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mount_setattr			sys_mount_setattr
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 625fb6d32842..e96e9c6a6ffa 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mount_setattr			sys_mount_setattr
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index aae729c95cf9..6538f075a18e 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -446,3 +446,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mount_setattr			sys_mount_setattr
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 32817c954435..64d129db1aa7 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -379,3 +379,4 @@
 438	n32	pidfd_getfd			sys_pidfd_getfd
 439	n32	faccessat2			sys_faccessat2
 440	n32	process_madvise			sys_process_madvise
+441	n32	mount_setattr			sys_mount_setattr
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 9e4ea3c31b1c..94b24e6b2608 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -355,3 +355,4 @@
 438	n64	pidfd_getfd			sys_pidfd_getfd
 439	n64	faccessat2			sys_faccessat2
 440	n64	process_madvise			sys_process_madvise
+441	n64	mount_setattr			sys_mount_setattr
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 29f5f28cf5ce..eae522306767 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -428,3 +428,4 @@
 438	o32	pidfd_getfd			sys_pidfd_getfd
 439	o32	faccessat2			sys_faccessat2
 440	o32	process_madvise			sys_process_madvise
+441	o32	mount_setattr			sys_mount_setattr
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index f375ea528e59..c7e25f1d219f 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mount_setattr			sys_mount_setattr
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 1275daec7fec..0b309ef64e91 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -530,3 +530,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mount_setattr			sys_mount_setattr
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 28c168000483..0b30398fee42 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -443,3 +443,4 @@
 438  common	pidfd_getfd		sys_pidfd_getfd			sys_pidfd_getfd
 439  common	faccessat2		sys_faccessat2			sys_faccessat2
 440  common	process_madvise		sys_process_madvise		sys_process_madvise
+441  common	mount_setattr		sys_mount_setattr		sys_mount_setattr
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 783738448ff5..8e4949c5b740 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -443,3 +443,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mount_setattr			sys_mount_setattr
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 78160260991b..409f21a650b8 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -486,3 +486,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mount_setattr			sys_mount_setattr
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 0d0667a9fbd7..2a694420f6cd 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -445,3 +445,4 @@
 438	i386	pidfd_getfd		sys_pidfd_getfd
 439	i386	faccessat2		sys_faccessat2
 440	i386	process_madvise		sys_process_madvise
+441	i386	mount_setattr		sys_mount_setattr
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1f47e24fb65c..773058fca7f8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -362,6 +362,7 @@
 438	common	pidfd_getfd		sys_pidfd_getfd
 439	common	faccessat2		sys_faccessat2
 440	common	process_madvise		sys_process_madvise
+441	common	mount_setattr		sys_mount_setattr
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index b070f272995d..a650dc05593d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -411,3 +411,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mount_setattr			sys_mount_setattr
diff --git a/fs/internal.h b/fs/internal.h
index a7cd0f64faa4..a5a6c470dc07 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -82,6 +82,14 @@ int may_linkat(struct path *link);
 /*
  * namespace.c
  */
+struct mount_kattr {
+	unsigned int attr_set;
+	unsigned int attr_clr;
+	unsigned int propagation;
+	unsigned int lookup_flags;
+	bool recurse;
+};
+
 extern struct vfsmount *lookup_mnt(const struct path *);
 extern int finish_automount(struct vfsmount *, struct path *);
 
diff --git a/fs/namespace.c b/fs/namespace.c
index 6e0d5bb63197..e9c515b012a4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -459,10 +459,8 @@ void mnt_drop_write_file(struct file *file)
 }
 EXPORT_SYMBOL(mnt_drop_write_file);
 
-static int mnt_make_readonly(struct mount *mnt)
+static inline int mnt_hold_writers(struct mount *mnt)
 {
-	int ret = 0;
-
 	mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
 	/*
 	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -487,15 +485,29 @@ static int mnt_make_readonly(struct mount *mnt)
 	 * we're counting up here.
 	 */
 	if (mnt_get_writers(mnt) > 0)
-		ret = -EBUSY;
-	else
-		mnt->mnt.mnt_flags |= MNT_READONLY;
+		return -EBUSY;
+
+	return 0;
+}
+
+static inline void mnt_unhold_writers(struct mount *mnt)
+{
 	/*
 	 * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
 	 * that become unheld will see MNT_READONLY.
 	 */
 	smp_wmb();
 	mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
+}
+
+static int mnt_make_readonly(struct mount *mnt)
+{
+	int ret;
+
+	ret = mnt_hold_writers(mnt);
+	if (!ret)
+		mnt->mnt.mnt_flags |= MNT_READONLY;
+	mnt_unhold_writers(mnt);
 	return ret;
 }
 
@@ -3433,6 +3445,33 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
 	return ret;
 }
 
+static int build_attr_flags(unsigned int attr_flags, unsigned int *flags)
+{
+	unsigned int aflags = 0;
+
+	if (attr_flags & ~(MOUNT_ATTR_RDONLY |
+			   MOUNT_ATTR_NOSUID |
+			   MOUNT_ATTR_NODEV |
+			   MOUNT_ATTR_NOEXEC |
+			   MOUNT_ATTR__ATIME |
+			   MOUNT_ATTR_NODIRATIME))
+		return -EINVAL;
+
+	if (attr_flags & MOUNT_ATTR_RDONLY)
+		aflags |= MNT_READONLY;
+	if (attr_flags & MOUNT_ATTR_NOSUID)
+		aflags |= MNT_NOSUID;
+	if (attr_flags & MOUNT_ATTR_NODEV)
+		aflags |= MNT_NODEV;
+	if (attr_flags & MOUNT_ATTR_NOEXEC)
+		aflags |= MNT_NOEXEC;
+	if (attr_flags & MOUNT_ATTR_NODIRATIME)
+		aflags |= MNT_NODIRATIME;
+
+	*flags = aflags;
+	return 0;
+}
+
 /*
  * Create a kernel mount representation for a new, prepared superblock
  * (specified by fs_fd) and attach to an open_tree-like file descriptor.
@@ -3455,24 +3494,9 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags,
 	if ((flags & ~(FSMOUNT_CLOEXEC)) != 0)
 		return -EINVAL;
 
-	if (attr_flags & ~(MOUNT_ATTR_RDONLY |
-			   MOUNT_ATTR_NOSUID |
-			   MOUNT_ATTR_NODEV |
-			   MOUNT_ATTR_NOEXEC |
-			   MOUNT_ATTR__ATIME |
-			   MOUNT_ATTR_NODIRATIME))
-		return -EINVAL;
-
-	if (attr_flags & MOUNT_ATTR_RDONLY)
-		mnt_flags |= MNT_READONLY;
-	if (attr_flags & MOUNT_ATTR_NOSUID)
-		mnt_flags |= MNT_NOSUID;
-	if (attr_flags & MOUNT_ATTR_NODEV)
-		mnt_flags |= MNT_NODEV;
-	if (attr_flags & MOUNT_ATTR_NOEXEC)
-		mnt_flags |= MNT_NOEXEC;
-	if (attr_flags & MOUNT_ATTR_NODIRATIME)
-		mnt_flags |= MNT_NODIRATIME;
+	ret = build_attr_flags(attr_flags, &mnt_flags);
+	if (ret)
+		return ret;
 
 	switch (attr_flags & MOUNT_ATTR__ATIME) {
 	case MOUNT_ATTR_STRICTATIME:
@@ -3780,6 +3804,220 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
 	return error;
 }
 
+static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
+{
+	struct mount *mnt = real_mount(path->mnt), *m = mnt, *last = NULL;
+	unsigned int all_raised = kattr->attr_set | kattr->attr_clr;
+	bool rdonly_set = kattr->attr_set & MNT_READONLY;
+	int err = 0;
+
+	if (!check_mnt(m))
+		return -EINVAL;
+
+	if (path->dentry != m->mnt.mnt_root)
+		return -EINVAL;
+
+	if (kattr->propagation) {
+		/* Only take namespace_lock() if we're actually changing propagation. */
+		namespace_lock();
+		if (kattr->propagation == MS_SHARED) {
+			err = invent_group_ids(m, kattr->recurse);
+			if (err) {
+				namespace_unlock();
+				return err;
+			}
+		}
+	}
+
+	lock_mount_hash();
+	/*
+	 * Get the mount tree in a shape where we can change mount properties
+	 * without failure.
+	 */
+	m = mnt;
+	do {
+		last = m;
+
+		if (!can_change_locked_flags(m, all_raised)) {
+			err = -EPERM;
+			break;
+		}
+
+		if (rdonly_set && !(m->mnt.mnt_flags & MNT_READONLY)) {
+			err = mnt_hold_writers(m);
+			if (err)
+				break;
+		}
+	} while (kattr->recurse && (m = next_mnt(m, mnt)));
+
+	m = mnt;
+	do {
+		if (!err) {
+			unsigned int new_flags;
+
+			new_flags = m->mnt.mnt_flags;
+			/* Lower flags user wants us to clear. */
+			new_flags &= ~kattr->attr_clr;
+			/* Raise flags user wants us to set. */
+			new_flags |= kattr->attr_set;
+			WRITE_ONCE(m->mnt.mnt_flags, new_flags);
+		}
+
+		/*
+		 * We either set MNT_READONLY above so make it visible
+		 * before ~MNT_WRITE_HOLD or we failed to recursively
+		 * apply mount options.
+		 */
+		if (rdonly_set && (m->mnt.mnt_flags & MNT_WRITE_HOLD))
+			mnt_unhold_writers(m);
+
+		if (!err && kattr->propagation)
+			change_mnt_propagation(m, kattr->propagation);
+
+		/*
+		 * On failure, only cleanup until we found the first mount we
+		 * failed to handle.
+		 */
+		if (err && m == last)
+			break;
+	} while (kattr->recurse && (m = next_mnt(m, mnt)));
+
+	if (!err)
+		touch_mnt_namespace(mnt->mnt_ns);
+
+	unlock_mount_hash();
+
+	if (kattr->propagation) {
+		namespace_unlock();
+		if (err)
+			cleanup_group_ids(mnt, NULL);
+	}
+
+	return err;
+}
+
+static int build_mount_kattr(const struct mount_attr *attr,
+			     struct mount_kattr *kattr, unsigned int flags)
+{
+	unsigned int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
+
+	if (flags & AT_NO_AUTOMOUNT)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (flags & AT_SYMLINK_NOFOLLOW)
+		lookup_flags &= ~LOOKUP_FOLLOW;
+	if (flags & AT_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+
+	*kattr = (struct mount_kattr){
+		.lookup_flags	= lookup_flags,
+		.recurse	= !!(flags & AT_RECURSIVE),
+	};
+
+	switch (attr->propagation) {
+	case MAKE_PROPAGATION_UNCHANGED:
+		kattr->propagation = 0;
+		break;
+	case MAKE_PROPAGATION_UNBINDABLE:
+		kattr->propagation = MS_UNBINDABLE;
+		break;
+	case MAKE_PROPAGATION_PRIVATE:
+		kattr->propagation = MS_PRIVATE;
+		break;
+	case MAKE_PROPAGATION_DEPENDENT:
+		kattr->propagation = MS_SLAVE;
+		break;
+	case MAKE_PROPAGATION_SHARED:
+		kattr->propagation = MS_SHARED;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (upper_32_bits(attr->attr_set))
+		return -EINVAL;
+	if (build_attr_flags(lower_32_bits(attr->attr_set), &kattr->attr_set))
+		return -EINVAL;
+
+	if (upper_32_bits(attr->attr_clr))
+		return -EINVAL;
+	if (build_attr_flags(lower_32_bits(attr->attr_clr), &kattr->attr_clr))
+		return -EINVAL;
+
+	/*
+	 * Since the MOUNT_ATTR_<atime> values are an enum, not a bitmap, users
+	 * wanting to transition to a different atime setting cannot simply
+	 * specify the atime setting in @attr_set, but must also specify
+	 * MOUNT_ATTR__ATIME in the @attr_clr field.
+	 * So ensure that MOUNT_ATTR__ATIME can't be partially set in
+	 * @attr_clr and that @attr_set can't have any atime bits set if
+	 * MOUNT_ATTR__ATIME isn't set in @attr_clr.
+	 */
+	if (!(attr->attr_clr & MOUNT_ATTR__ATIME) && (attr->attr_set & MOUNT_ATTR__ATIME))
+		return -EINVAL;
+	else if ((attr->attr_clr & MOUNT_ATTR__ATIME) &&
+		 ((attr->attr_clr & MOUNT_ATTR__ATIME) != MOUNT_ATTR__ATIME))
+		return -EINVAL;
+
+	if (attr->attr_clr & MOUNT_ATTR__ATIME) {
+		/* Clear all previous time settings as they are mutually exclusive. */
+		kattr->attr_clr |= MNT_RELATIME | MNT_NOATIME;
+		switch (attr->attr_set & MOUNT_ATTR__ATIME) {
+		case MOUNT_ATTR_RELATIME:
+			kattr->attr_set |= MNT_RELATIME;
+			break;
+		case MOUNT_ATTR_NOATIME:
+			kattr->attr_set |= MNT_NOATIME;
+			break;
+		case MOUNT_ATTR_STRICTATIME:
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+SYSCALL_DEFINE5(mount_setattr, int, dfd, const char __user *, path, unsigned int, flags,
+		struct mount_attr __user *, uattr, size_t, usize)
+{
+	int err;
+	struct path target;
+	struct mount_attr attr;
+	struct mount_kattr kattr;
+
+	BUILD_BUG_ON(sizeof(struct mount_attr) < MOUNT_ATTR_SIZE_VER0);
+	BUILD_BUG_ON(sizeof(struct mount_attr) != MOUNT_ATTR_SIZE_LATEST);
+
+	if (flags & ~(AT_EMPTY_PATH | AT_RECURSIVE | AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT))
+		return -EINVAL;
+
+	if (unlikely(usize < MOUNT_ATTR_SIZE_VER0))
+		return -EINVAL;
+
+	if (!may_mount())
+		return -EPERM;
+
+	err = copy_struct_from_user(&attr, sizeof(attr), uattr, usize);
+	if (err)
+		return err;
+
+	if (attr.attr_set == 0 && attr.attr_clr == 0 && attr.propagation == 0)
+		return 0;
+
+	err = build_mount_kattr(&attr, &kattr, flags);
+	if (err)
+		return err;
+
+	err = user_path_at(dfd, path, kattr.lookup_flags, &target);
+	if (err)
+		return err;
+
+	err = do_mount_setattr(&target, &kattr);
+	path_put(&target);
+	return err;
+}
+
 static void __init init_mount_tree(void)
 {
 	struct vfsmount *mnt;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 37bea07c12f2..a62d5904fb6a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -68,6 +68,7 @@ union bpf_attr;
 struct io_uring_params;
 struct clone_args;
 struct open_how;
+struct mount_attr;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -999,6 +1000,8 @@ asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
 asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
 			       int to_dfd, const char __user *to_path,
 			       unsigned int ms_flags);
+asmlinkage long sys_mount_setattr(int dfd, const char __user *path, unsigned int flags,
+				  struct mount_attr __user *uattr, size_t usize);
 asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
 asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
 			     const void __user *value, int aux);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 2056318988f7..0517f36fe783 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -859,9 +859,11 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
 #define __NR_process_madvise 440
 __SYSCALL(__NR_process_madvise, sys_process_madvise)
+#define __NR_mount_setattr 441
+__SYSCALL(__NR_mount_setattr, sys_mount_setattr)
 
 #undef __NR_syscalls
-#define __NR_syscalls 441
+#define __NR_syscalls 442
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index dd8306ea336c..fb3ad26fdebf 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -118,4 +118,26 @@ enum fsconfig_command {
 #define MOUNT_ATTR_STRICTATIME	0x00000020 /* - Always perform atime updates */
 #define MOUNT_ATTR_NODIRATIME	0x00000080 /* Do not update directory access times */
 
+/*
+ * mount_setattr()
+ */
+struct mount_attr {
+	__u64 attr_set;
+	__u64 attr_clr;
+	__u64 propagation;
+};
+
+/* Change propagation through mount_setattr(). */
+enum propagation_type {
+	MAKE_PROPAGATION_UNCHANGED	= 0, /* Don't change mount propagation (default). */
+	MAKE_PROPAGATION_UNBINDABLE	= 1, /* Make unbindable. */
+	MAKE_PROPAGATION_PRIVATE	= 2, /* Do not receive or send mount events. */
+	MAKE_PROPAGATION_DEPENDENT	= 3, /* Only receive mount events. */
+	MAKE_PROPAGATION_SHARED		= 4, /* Send and receive mount events. */
+};
+
+/* List of all mount_attr versions. */
+#define MOUNT_ATTR_SIZE_VER0	24 /* sizeof first published struct */
+#define MOUNT_ATTR_SIZE_LATEST	MOUNT_ATTR_SIZE_VER0
+
 #endif /* _UAPI_LINUX_MOUNT_H */
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index f2b5d72a46c2..1cbe8bbda5fa 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -857,9 +857,11 @@ __SYSCALL(__NR_openat2, sys_openat2)
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 #define __NR_faccessat2 439
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_mount_setattr 441
+__SYSCALL(__NR_mount_setattr, sys_mount_setattr)
 
 #undef __NR_syscalls
-#define __NR_syscalls 440
+#define __NR_syscalls 442
 
 /*
  * 32 bit systems traditionally used different
-- 
2.29.0


* [PATCH 04/34] tests: add mount_setattr() selftests
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (2 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 03/34] fs: add mount_setattr() Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 05/34] fs: introduce MOUNT_ATTR_IDMAP Christian Brauner
                   ` (33 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

Add a range of selftests for the new mount_setattr() syscall to verify
that it works as expected. This tests that:
- no invalid flags can be specified
- changing properties of a single mount works and leaves other mounts in
  the mount tree unchanged
- changing a mount tree to read-only when one of the mounts has writers
  fails and leaves the whole mount tree unchanged
- changing mount properties from multiple threads works
- changing atime settings works
- changing mount propagation works
- changing the mount options of a mount tree where the individual mounts
  in the tree have different mount options only changes the flags that
  were requested to change
- changing mount options from another mount namespace fails
- changing mount options from another user namespace fails
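
One way a test can observe whether a mount ended up read-only is to
query its flags with statvfs(3). Whether the selftests use exactly this
check is an assumption; mount_is_readonly() is a hypothetical helper for
illustration.

```c
#include <sys/statvfs.h>

/* Returns 1 if @path is on a read-only mount, 0 if not, -1 on error. */
static inline int mount_is_readonly(const char *path)
{
	struct statvfs st;

	if (statvfs(path, &st) < 0)
		return -1;

	return (st.f_flag & ST_RDONLY) ? 1 : 0;
}
```

A test would call this before and after mount_setattr() and compare the
results for each mount in the tree.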

[==========] Running 9 tests from 2 test cases.
[ RUN      ] mount_setattr.invalid_attributes
[       OK ] mount_setattr.invalid_attributes
[ RUN      ] mount_setattr.basic
[       OK ] mount_setattr.basic
[ RUN      ] mount_setattr.basic_recursive
[       OK ] mount_setattr.basic_recursive
[ RUN      ] mount_setattr.mount_has_writers
[       OK ] mount_setattr.mount_has_writers
[ RUN      ] mount_setattr.mixed_mount_options
[       OK ] mount_setattr.mixed_mount_options
[ RUN      ] mount_setattr.time_changes
[       OK ] mount_setattr.time_changes
[ RUN      ] mount_setattr.multi_threaded
[       OK ] mount_setattr.multi_threaded
[ RUN      ] mount_setattr.wrong_user_namespace
[       OK ] mount_setattr.wrong_user_namespace
[ RUN      ] mount_setattr.wrong_mount_namespace
[       OK ] mount_setattr.wrong_mount_namespace
[==========] 9 / 9 tests passed.
[  PASSED  ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/mount_setattr/.gitignore        |   1 +
 .../testing/selftests/mount_setattr/Makefile  |   7 +
 tools/testing/selftests/mount_setattr/config  |   1 +
 .../mount_setattr/mount_setattr_test.c        | 888 ++++++++++++++++++
 5 files changed, 898 insertions(+)
 create mode 100644 tools/testing/selftests/mount_setattr/.gitignore
 create mode 100644 tools/testing/selftests/mount_setattr/Makefile
 create mode 100644 tools/testing/selftests/mount_setattr/config
 create mode 100644 tools/testing/selftests/mount_setattr/mount_setattr_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index d9c283503159..87b7107dd9a6 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -34,6 +34,7 @@ TARGETS += memfd
 TARGETS += memory-hotplug
 TARGETS += mincore
 TARGETS += mount
+TARGETS += mount_setattr
 TARGETS += mqueue
 TARGETS += net
 TARGETS += net/forwarding
diff --git a/tools/testing/selftests/mount_setattr/.gitignore b/tools/testing/selftests/mount_setattr/.gitignore
new file mode 100644
index 000000000000..5f74d8488472
--- /dev/null
+++ b/tools/testing/selftests/mount_setattr/.gitignore
@@ -0,0 +1 @@
+mount_setattr_test
diff --git a/tools/testing/selftests/mount_setattr/Makefile b/tools/testing/selftests/mount_setattr/Makefile
new file mode 100644
index 000000000000..2250f7dcb81e
--- /dev/null
+++ b/tools/testing/selftests/mount_setattr/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for mount selftests.
+CFLAGS = -g -I../../../../usr/include/ -Wall -O2 -pthread
+
+TEST_GEN_FILES += mount_setattr_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/mount_setattr/config b/tools/testing/selftests/mount_setattr/config
new file mode 100644
index 000000000000..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/mount_setattr/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/mount_setattr/mount_setattr_test.c b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
new file mode 100644
index 000000000000..7d320cfa7d3b
--- /dev/null
+++ b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
@@ -0,0 +1,888 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdio.h>
+#include <errno.h>
+#include <pthread.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/wait.h>
+#include <sys/vfs.h>
+#include <sys/statvfs.h>
+#include <sys/sysinfo.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <grp.h>
+#include <stdbool.h>
+#include <stdarg.h>
+
+#include "../kselftest_harness.h"
+
+#ifndef CLONE_NEWNS
+#define CLONE_NEWNS 0x00020000
+#endif
+
+#ifndef CLONE_NEWUSER
+#define CLONE_NEWUSER 0x10000000
+#endif
+
+#ifndef MS_REC
+#define MS_REC 16384
+#endif
+
+#ifndef MS_RELATIME
+#define MS_RELATIME (1 << 21)
+#endif
+
+#ifndef MS_STRICTATIME
+#define MS_STRICTATIME (1 << 24)
+#endif
+
+#ifndef MOUNT_ATTR_RDONLY
+#define MOUNT_ATTR_RDONLY 0x00000001
+#endif
+
+#ifndef MOUNT_ATTR_NOSUID
+#define MOUNT_ATTR_NOSUID 0x00000002
+#endif
+
+#ifndef MOUNT_ATTR_NOEXEC
+#define MOUNT_ATTR_NOEXEC 0x00000008
+#endif
+
+#ifndef MOUNT_ATTR_NODIRATIME
+#define MOUNT_ATTR_NODIRATIME 0x00000080
+#endif
+
+#ifndef MOUNT_ATTR__ATIME
+#define MOUNT_ATTR__ATIME 0x00000070
+#endif
+
+#ifndef MOUNT_ATTR_RELATIME
+#define MOUNT_ATTR_RELATIME 0x00000000
+#endif
+
+#ifndef MOUNT_ATTR_NOATIME
+#define MOUNT_ATTR_NOATIME 0x00000010
+#endif
+
+#ifndef MOUNT_ATTR_STRICTATIME
+#define MOUNT_ATTR_STRICTATIME 0x00000020
+#endif
+
+#ifndef AT_RECURSIVE
+#define AT_RECURSIVE 0x8000
+#endif
+
+#ifndef MAKE_PROPAGATION_SHARED
+#define MAKE_PROPAGATION_SHARED 4
+#endif
+
+#define DEFAULT_THREADS 4
+#define ptr_to_int(p) ((int)((intptr_t)(p)))
+#define int_to_ptr(u) ((void *)((intptr_t)(u)))
+
+#ifndef __NR_mount_setattr
+	#if defined __alpha__
+		#define __NR_mount_setattr 550
+	#elif defined _MIPS_SIM
+		#if _MIPS_SIM == _MIPS_SIM_ABI32	/* o32 */
+			#define __NR_mount_setattr 4440
+		#endif
+		#if _MIPS_SIM == _MIPS_SIM_NABI32	/* n32 */
+			#define __NR_mount_setattr 6440
+		#endif
+		#if _MIPS_SIM == _MIPS_SIM_ABI64	/* n64 */
+			#define __NR_mount_setattr 5440
+		#endif
+	#elif defined __ia64__
+		#define __NR_mount_setattr (440 + 1024)
+	#else
+		#define __NR_mount_setattr 440
+#endif
+
+struct mount_attr {
+	__u64 attr_set;
+	__u64 attr_clr;
+	__u64 propagation;
+};
+#endif
+
+static inline int sys_mount_setattr(int dfd, const char *path, unsigned int flags,
+				    struct mount_attr *attr, size_t size)
+{
+	return syscall(__NR_mount_setattr, dfd, path, flags, attr, size);
+}
+
+static ssize_t write_nointr(int fd, const void *buf, size_t count)
+{
+	ssize_t ret;
+
+	do {
+		ret = write(fd, buf, count);
+	} while (ret < 0 && errno == EINTR);
+
+	return ret;
+}
+
+static int write_file(const char *path, const void *buf, size_t count)
+{
+	int fd;
+	ssize_t ret;
+
+	fd = open(path, O_WRONLY | O_CLOEXEC | O_NOCTTY | O_NOFOLLOW);
+	if (fd < 0)
+		return -1;
+
+	ret = write_nointr(fd, buf, count);
+	close(fd);
+	if (ret < 0 || (size_t)ret != count)
+		return -1;
+
+	return 0;
+}
+
+static int create_and_enter_userns(void)
+{
+	uid_t uid;
+	gid_t gid;
+	char map[100];
+
+	uid = getuid();
+	gid = getgid();
+
+	if (unshare(CLONE_NEWUSER))
+		return -1;
+
+	if (write_file("/proc/self/setgroups", "deny", sizeof("deny") - 1) &&
+	    errno != ENOENT)
+		return -1;
+
+	snprintf(map, sizeof(map), "0 %d 1", uid);
+	if (write_file("/proc/self/uid_map", map, strlen(map)))
+		return -1;
+
+
+	snprintf(map, sizeof(map), "0 %d 1", gid);
+	if (write_file("/proc/self/gid_map", map, strlen(map)))
+		return -1;
+
+	if (setgid(0))
+		return -1;
+
+	if (setuid(0))
+		return -1;
+
+	return 0;
+}
+
+static int prepare_unpriv_mountns(void)
+{
+	if (create_and_enter_userns())
+		return -1;
+
+	if (unshare(CLONE_NEWNS))
+		return -1;
+
+	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, 0))
+		return -1;
+
+	return 0;
+}
+
+static int read_mnt_flags(const char *path)
+{
+	int ret;
+	struct statvfs stat;
+	unsigned int mnt_flags;
+
+	ret = statvfs(path, &stat);
+	if (ret != 0)
+		return -EINVAL;
+
+	if (stat.f_flag &
+	    ~(ST_RDONLY | ST_NOSUID | ST_NODEV | ST_NOEXEC | ST_NOATIME |
+	      ST_NODIRATIME | ST_RELATIME | ST_SYNCHRONOUS | ST_MANDLOCK))
+		return -EINVAL;
+
+	mnt_flags = 0;
+	if (stat.f_flag & ST_RDONLY)
+		mnt_flags |= MS_RDONLY;
+	if (stat.f_flag & ST_NOSUID)
+		mnt_flags |= MS_NOSUID;
+	if (stat.f_flag & ST_NODEV)
+		mnt_flags |= MS_NODEV;
+	if (stat.f_flag & ST_NOEXEC)
+		mnt_flags |= MS_NOEXEC;
+	if (stat.f_flag & ST_NOATIME)
+		mnt_flags |= MS_NOATIME;
+	if (stat.f_flag & ST_NODIRATIME)
+		mnt_flags |= MS_NODIRATIME;
+	if (stat.f_flag & ST_RELATIME)
+		mnt_flags |= MS_RELATIME;
+	if (stat.f_flag & ST_SYNCHRONOUS)
+		mnt_flags |= MS_SYNCHRONOUS;
+	if (stat.f_flag & ST_MANDLOCK)
+		mnt_flags |= MS_MANDLOCK;
+
+	return mnt_flags;
+}
+
+static char *get_field(char *src, int nfields)
+{
+	int i;
+	char *p = src;
+
+	for (i = 0; i < nfields; i++) {
+		while (*p && *p != ' ' && *p != '\t')
+			p++;
+
+		if (!*p)
+			break;
+
+		p++;
+	}
+
+	return p;
+}
+
+static void null_endofword(char *word)
+{
+	while (*word && *word != ' ' && *word != '\t')
+		word++;
+	*word = '\0';
+}
+
+static bool is_shared_mount(const char *path)
+{
+	size_t len = 0;
+	char *line = NULL;
+	FILE *f = NULL;
+
+	f = fopen("/proc/self/mountinfo", "re");
+	if (!f)
+		return false;
+
+	while (getline(&line, &len, f) != -1) {
+		char *opts, *target;
+
+		target = get_field(line, 4);
+		if (!target)
+			continue;
+
+		opts = get_field(target, 2);
+		if (!opts)
+			continue;
+
+		null_endofword(target);
+
+		if (strcmp(target, path) != 0)
+			continue;
+
+		null_endofword(opts);
+		if (strstr(opts, "shared:"))
+			return true;
+	}
+
+	free(line);
+	fclose(f);
+
+	return false;
+}
+
+static void *mount_setattr_thread(void *data)
+{
+	struct mount_attr attr = {
+		.attr_set	= MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID,
+		.attr_clr	= 0,
+		.propagation	= MAKE_PROPAGATION_SHARED,
+	};
+
+	if (sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)))
+		pthread_exit(int_to_ptr(-1));
+
+	pthread_exit(int_to_ptr(0));
+}
+
+FIXTURE(mount_setattr) {
+};
+
+FIXTURE_SETUP(mount_setattr)
+{
+	ASSERT_EQ(prepare_unpriv_mountns(), 0);
+
+	(void)umount2("/mnt", MNT_DETACH);
+	(void)umount2("/tmp", MNT_DETACH);
+
+	ASSERT_EQ(mount("testing", "/tmp", "tmpfs", MS_NOATIME | MS_NODEV,
+			"size=100000,mode=700"), 0);
+
+	ASSERT_EQ(mkdir("/tmp/B", 0777), 0);
+
+	ASSERT_EQ(mount("testing", "/tmp/B", "tmpfs", MS_NOATIME | MS_NODEV,
+			"size=100000,mode=700"), 0);
+
+	ASSERT_EQ(mkdir("/tmp/B/BB", 0777), 0);
+
+	ASSERT_EQ(mount("testing", "/tmp/B/BB", "tmpfs", MS_NOATIME | MS_NODEV,
+			"size=100000,mode=700"), 0);
+
+	ASSERT_EQ(mount("testing", "/mnt", "tmpfs", MS_NOATIME | MS_NODEV,
+			"size=100000,mode=700"), 0);
+
+	ASSERT_EQ(mkdir("/mnt/A", 0777), 0);
+
+	ASSERT_EQ(mount("testing", "/mnt/A", "tmpfs", MS_NOATIME | MS_NODEV,
+			"size=100000,mode=700"), 0);
+
+	ASSERT_EQ(mkdir("/mnt/A/AA", 0777), 0);
+
+	ASSERT_EQ(mount("/tmp", "/mnt/A/AA", NULL, MS_BIND | MS_REC, NULL), 0);
+
+	ASSERT_EQ(mkdir("/mnt/B", 0777), 0);
+
+	ASSERT_EQ(mount("testing", "/mnt/B", "ramfs",
+			MS_NOATIME | MS_NODEV | MS_NOSUID, 0), 0);
+
+	ASSERT_EQ(mkdir("/mnt/B/BB", 0777), 0);
+
+	ASSERT_EQ(mount("testing", "/tmp/B/BB", "devpts",
+			MS_RELATIME | MS_NOEXEC | MS_RDONLY, 0), 0);
+}
+
+FIXTURE_TEARDOWN(mount_setattr)
+{
+	(void)umount2("/mnt/A", MNT_DETACH);
+	(void)umount2("/tmp", MNT_DETACH);
+}
+
+TEST_F(mount_setattr, invalid_attributes)
+{
+	struct mount_attr invalid_attr = {
+		.attr_set = (1U << 31),
+	};
+
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+				    sizeof(invalid_attr)), 0);
+
+	invalid_attr.attr_set	= 0;
+	invalid_attr.attr_clr	= (1U << 31);
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+				    sizeof(invalid_attr)), 0);
+
+	invalid_attr.attr_clr		= 0;
+	invalid_attr.propagation	= (1U << 31);
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+				    sizeof(invalid_attr)), 0);
+
+	invalid_attr.attr_set		= (1U << 31);
+	invalid_attr.attr_clr		= (1U << 31);
+	invalid_attr.propagation	= (1U << 31);
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+				    sizeof(invalid_attr)), 0);
+
+	ASSERT_NE(sys_mount_setattr(-1, "mnt/A", AT_RECURSIVE, &invalid_attr,
+				    sizeof(invalid_attr)), 0);
+}
+
+TEST_F(mount_setattr, extensibility)
+{
+	unsigned int old_flags = 0, new_flags = 0, expected_flags = 0;
+	char *s = "dummy";
+	struct mount_attr invalid_attr = {};
+	struct mount_attr_large {
+		struct mount_attr attr1;
+		struct mount_attr attr2;
+	} large_attr = {};
+
+	old_flags = read_mnt_flags("/mnt/A");
+	ASSERT_GT(old_flags, 0);
+
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, NULL,
+				    sizeof(invalid_attr)), 0);
+	ASSERT_EQ(errno, EFAULT);
+
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, (void *)s,
+				    sizeof(invalid_attr)), 0);
+	ASSERT_EQ(errno, EINVAL);
+
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr, 0), 0);
+	ASSERT_EQ(errno, EINVAL);
+
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+				    sizeof(invalid_attr) / 2), 0);
+	ASSERT_EQ(errno, EINVAL);
+
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+				    sizeof(invalid_attr) / 2), 0);
+	ASSERT_EQ(errno, EINVAL);
+
+	ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE,
+				    (void *)&large_attr, sizeof(large_attr)), 0);
+
+	large_attr.attr2.attr_set = MOUNT_ATTR_RDONLY;
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE,
+				    (void *)&large_attr, sizeof(large_attr)), 0);
+
+	large_attr.attr2.attr_set = 0;
+	large_attr.attr1.attr_set = MOUNT_ATTR_RDONLY;
+	ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE,
+				    (void *)&large_attr, sizeof(large_attr)), 0);
+
+	expected_flags = old_flags;
+	expected_flags |= MS_RDONLY;
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+}
+
+TEST_F(mount_setattr, basic)
+{
+	unsigned int old_flags = 0, new_flags = 0, expected_flags = 0;
+	struct mount_attr attr = {
+		.attr_set	= MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOEXEC | MOUNT_ATTR_RELATIME,
+		.attr_clr	= MOUNT_ATTR__ATIME,
+	};
+
+	old_flags = read_mnt_flags("/mnt/A");
+	ASSERT_GT(old_flags, 0);
+
+	ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", 0, &attr, sizeof(attr)), 0);
+
+	expected_flags = old_flags;
+	expected_flags |= MS_RDONLY;
+	expected_flags |= MS_NOEXEC;
+	expected_flags &= ~MS_NOATIME;
+	expected_flags |= MS_RELATIME;
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, old_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, old_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, old_flags);
+}
+
+TEST_F(mount_setattr, basic_recursive)
+{
+	int fd;
+	unsigned int old_flags = 0, new_flags = 0, expected_flags = 0;
+	struct mount_attr attr = {
+		.attr_set	= MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOEXEC | MOUNT_ATTR_RELATIME,
+		.attr_clr	= MOUNT_ATTR__ATIME,
+	};
+
+	old_flags = read_mnt_flags("/mnt/A");
+	ASSERT_GT(old_flags, 0);
+
+	ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags = old_flags;
+	expected_flags |= MS_RDONLY;
+	expected_flags |= MS_NOEXEC;
+	expected_flags &= ~MS_NOATIME;
+	expected_flags |= MS_RELATIME;
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	memset(&attr, 0, sizeof(attr));
+	attr.attr_clr = MOUNT_ATTR_RDONLY;
+	attr.propagation = MAKE_PROPAGATION_SHARED;
+	ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags &= ~MS_RDONLY;
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA/B"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA/B/BB"), true);
+
+	fd = open("/mnt/A/AA/B/b", O_RDWR | O_CLOEXEC | O_CREAT | O_EXCL, 0777);
+	ASSERT_GE(fd, 0);
+
+	/*
+	 * We're holding a fd open for writing so this needs to fail somewhere
+	 * in the middle and the mount options need to be unchanged.
+	 */
+	attr.attr_set = MOUNT_ATTR_RDONLY;
+	ASSERT_LT(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA/B"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA/B/BB"), true);
+
+	EXPECT_EQ(close(fd), 0);
+}
+
+TEST_F(mount_setattr, mount_has_writers)
+{
+	int fd, dfd;
+	unsigned int old_flags = 0, new_flags = 0;
+	struct mount_attr attr = {
+		.attr_set	= MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOEXEC | MOUNT_ATTR_RELATIME,
+		.attr_clr	= MOUNT_ATTR__ATIME,
+		.propagation	= MAKE_PROPAGATION_SHARED,
+	};
+
+	old_flags = read_mnt_flags("/mnt/A");
+	ASSERT_GT(old_flags, 0);
+
+	fd = open("/mnt/A/AA/B/b", O_RDWR | O_CLOEXEC | O_CREAT | O_EXCL, 0777);
+	ASSERT_GE(fd, 0);
+
+	/*
+	 * We're holding a fd open to a mount somewhere in the middle so this
+	 * needs to fail somewhere in the middle. After this the mount options
+	 * need to be unchanged.
+	 */
+	ASSERT_LT(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, old_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A"), false);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, old_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA"), false);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, old_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA/B"), false);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, old_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA/B/BB"), false);
+
+	dfd = open("/mnt/A/AA/B", O_DIRECTORY | O_CLOEXEC);
+	ASSERT_GE(dfd, 0);
+	EXPECT_EQ(fsync(dfd), 0);
+	EXPECT_EQ(close(dfd), 0);
+
+	EXPECT_EQ(fsync(fd), 0);
+	EXPECT_EQ(close(fd), 0);
+
+	/* All writers are gone so this should succeed. */
+	ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+}
+
+TEST_F(mount_setattr, mixed_mount_options)
+{
+	unsigned int old_flags1 = 0, old_flags2 = 0, new_flags = 0, expected_flags = 0;
+	struct mount_attr attr = {
+		.attr_clr = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID | MOUNT_ATTR_NOEXEC | MOUNT_ATTR__ATIME,
+		.attr_set = MOUNT_ATTR_RELATIME,
+	};
+
+	old_flags1 = read_mnt_flags("/mnt/B");
+	ASSERT_GT(old_flags1, 0);
+
+	old_flags2 = read_mnt_flags("/mnt/B/BB");
+	ASSERT_GT(old_flags2, 0);
+
+	ASSERT_EQ(sys_mount_setattr(-1, "/mnt/B", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags = old_flags2;
+	expected_flags &= ~(MS_RDONLY | MS_NOEXEC | MS_NOATIME | MS_NOSUID);
+	expected_flags |= MS_RELATIME;
+
+	new_flags = read_mnt_flags("/mnt/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	expected_flags = old_flags2;
+	expected_flags &= ~(MS_RDONLY | MS_NOEXEC | MS_NOATIME | MS_NOSUID);
+	expected_flags |= MS_RELATIME;
+
+	new_flags = read_mnt_flags("/mnt/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+}
+
+TEST_F(mount_setattr, time_changes)
+{
+	unsigned int old_flags = 0, new_flags = 0, expected_flags = 0;
+	struct mount_attr attr = {
+		.attr_set	= MOUNT_ATTR_NODIRATIME | MOUNT_ATTR_NOATIME,
+	};
+
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	attr.attr_set = MOUNT_ATTR_STRICTATIME;
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	attr.attr_set = MOUNT_ATTR_STRICTATIME | MOUNT_ATTR_NOATIME;
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	attr.attr_set = MOUNT_ATTR_STRICTATIME | MOUNT_ATTR_NOATIME;
+	attr.attr_clr = MOUNT_ATTR__ATIME;
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	attr.attr_set = 0;
+	attr.attr_clr = MOUNT_ATTR_STRICTATIME;
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	attr.attr_clr = MOUNT_ATTR_NOATIME;
+	ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	old_flags = read_mnt_flags("/mnt/A");
+	ASSERT_GT(old_flags, 0);
+
+	attr.attr_set = MOUNT_ATTR_NODIRATIME | MOUNT_ATTR_NOATIME;
+	attr.attr_clr = MOUNT_ATTR__ATIME;
+	ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags = old_flags;
+	expected_flags |= MS_NOATIME;
+	expected_flags |= MS_NODIRATIME;
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	memset(&attr, 0, sizeof(attr));
+	attr.attr_set &= ~MOUNT_ATTR_NOATIME;
+	attr.attr_set |= MOUNT_ATTR_RELATIME;
+	attr.attr_clr |= MOUNT_ATTR__ATIME;
+	ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags &= ~MS_NOATIME;
+	expected_flags |= MS_RELATIME;
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	memset(&attr, 0, sizeof(attr));
+	attr.attr_set &= ~MOUNT_ATTR_RELATIME;
+	attr.attr_set |= MOUNT_ATTR_STRICTATIME;
+	attr.attr_clr |= MOUNT_ATTR__ATIME;
+	ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags &= ~MS_RELATIME;
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	memset(&attr, 0, sizeof(attr));
+	attr.attr_set &= ~MOUNT_ATTR_STRICTATIME;
+	attr.attr_set |= MOUNT_ATTR_NOATIME;
+	attr.attr_clr |= MOUNT_ATTR__ATIME;
+	ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags |= MS_NOATIME;
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	memset(&attr, 0, sizeof(attr));
+	ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	memset(&attr, 0, sizeof(attr));
+	attr.attr_clr = MOUNT_ATTR_NODIRATIME;
+	ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags &= ~MS_NODIRATIME;
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+}
+
+TEST_F(mount_setattr, multi_threaded)
+{
+	int i, j, nthreads, ret = 0;
+	unsigned int old_flags = 0, new_flags = 0, expected_flags = 0;
+	pthread_attr_t pattr;
+	pthread_t threads[DEFAULT_THREADS];
+
+	old_flags = read_mnt_flags("/mnt/A");
+	ASSERT_GT(old_flags, 0);
+
+	/* Try to change mount options from multiple threads. */
+	nthreads = get_nprocs_conf();
+	if (nthreads > DEFAULT_THREADS)
+		nthreads = DEFAULT_THREADS;
+
+	pthread_attr_init(&pattr);
+	for (i = 0; i < nthreads; i++)
+		ASSERT_EQ(pthread_create(&threads[i], &pattr, mount_setattr_thread, NULL), 0);
+
+	for (j = 0; j < i; j++) {
+		void *retptr = NULL;
+
+		EXPECT_EQ(pthread_join(threads[j], &retptr), 0);
+
+		ret += ptr_to_int(retptr);
+		EXPECT_EQ(ret, 0);
+	}
+	pthread_attr_destroy(&pattr);
+
+	ASSERT_EQ(ret, 0);
+
+	expected_flags = old_flags;
+	expected_flags |= MS_RDONLY;
+	expected_flags |= MS_NOSUID;
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA/B"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA/B/BB"), true);
+}
+
+TEST_F(mount_setattr, wrong_user_namespace)
+{
+	int ret;
+	struct mount_attr attr = {
+		.attr_set = MOUNT_ATTR_RDONLY,
+	};
+
+	EXPECT_EQ(create_and_enter_userns(), 0);
+	ret = sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr));
+	ASSERT_LT(ret, 0);
+	ASSERT_EQ(errno, EPERM);
+}
+
+TEST_F(mount_setattr, wrong_mount_namespace)
+{
+	int fd, ret;
+	struct mount_attr attr = {
+		.attr_set = MOUNT_ATTR_RDONLY,
+	};
+
+	fd = open("/mnt/A", O_DIRECTORY | O_CLOEXEC);
+	ASSERT_GE(fd, 0);
+
+	ASSERT_EQ(unshare(CLONE_NEWNS), 0);
+
+	ret = sys_mount_setattr(fd, "", AT_EMPTY_PATH | AT_RECURSIVE, &attr, sizeof(attr));
+	ASSERT_LT(ret, 0);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+TEST_HARNESS_MAIN
-- 
2.29.0



* [PATCH 05/34] fs: introduce MOUNT_ATTR_IDMAP
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (3 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 04/34] tests: add mount_setattr() selftests Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-11-01 14:45   ` Christoph Hellwig
  2020-10-29  0:32 ` [PATCH 06/34] fs: add id translation helpers Christian Brauner
                   ` (32 subsequent siblings)
  37 siblings, 1 reply; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

Introduce a new bind mount property to allow idmapping mounts. The
MOUNT_ATTR_IDMAP flag can be set via the new mount_setattr() syscall
together with a file descriptor referring to a user namespace.

The user namespace referenced by the namespace file descriptor will be
attached to the bind mount. All interactions with the filesystem going
through that mount will be shifted according to the mapping specified in that
user namespace.

Using user namespaces to mark mounts means we can reuse the existing kernel
infrastructure for handling idmappings and can also use it for permission
checking, which allows unprivileged users to create idmapped mounts.

Idmapping a mount is decoupled from the caller's user and mount namespace.
This means idmapped mounts can be created in the initial user namespace,
which is an important use-case for e.g. systemd-homed, usb-sticks that are
moved between systems, and other use-cases that have been brought up. For
example, assume a home directory where all files are owned by uid and gid
1000 and the home directory is brought to a new laptop where the user has id
12345. The system administrator can simply create a mount of this home
directory with a mapping of 1000:12345:1, plus other mappings to indicate
which ids should be kept.
(With this it is e.g. also possible to create idmapped mounts on the host with
 an identity mapping 1:1:100000 where the root user is not mapped. A user with
 root access that e.g. has been pivot rooted into such a mount on the host will
 not be able to execute, read, write, or create files as root.)
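
To make the two mappings above concrete, here is a small userspace sketch of
how an id could be looked up in a list of mapping extents. It is not the
kernel's implementation. The struct id_extent type, the map_id() helper and
the ID_UNMAPPED sentinel are hypothetical names, and reading a triple a:b:n as
"ids a..a+n-1 are translated to b..b+n-1" is an assumption made for
illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "this id has no mapping". */
#define ID_UNMAPPED UINT32_MAX

/*
 * Hypothetical extent: ids in [from, from + count) are translated to
 * [to, to + count). The triple 1000:12345:1 becomes { 1000, 12345, 1 }.
 */
struct id_extent {
	uint32_t from;
	uint32_t to;
	uint32_t count;
};

/* Translate id through the first extent that covers it. */
static uint32_t map_id(const struct id_extent *ext, size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++) {
		/* The first comparison keeps the subtraction from wrapping. */
		if (id >= ext[i].from && id - ext[i].from < ext[i].count)
			return ext[i].to + (id - ext[i].from);
	}

	return ID_UNMAPPED;
}
```

With the single extent { 1000, 12345, 1 } the helper translates 1000 to 12345
and reports every other id as unmapped, which is why additional extents are
needed for ids that should be kept. With { 1, 1, 100000 } id 0 has no mapping,
which matches the "root user is not mapped" case described above.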

Given that idmapping a mount is decoupled from the caller's user namespace,
a sufficiently privileged process such as a container manager can set up a
shifted mount for the container and the container can simply pivot root to
it. There's no need for the container to do anything. The mount will appear
correctly mapped independent of the user namespace the container uses. This
means we don't need to mark a mount as idmappable.

In order to create an idmapped mount the following conditions must be
fulfilled. The caller must either be privileged in the user namespace of
the superblock the mount belongs to or the mount must have already been
shifted before and the caller must be privileged in the user namespace that
this mount has been shifted to. The latter case means that shifted mounts
can e.g. be created by unprivileged users provided that the underlying
mount has already been idmapped to a user namespace they have privilege
over.

Once a mount has been idmapped its idmapping cannot be changed. This is to
keep things simple. Callers that want another idmapping can simply create
another detached mount and idmap it.
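
The permission rules from the two paragraphs above can be summarized as a
pure decision function. The following is only a userspace sketch that follows
the order of the checks in the do_mount_setattr() hunk of this patch. The
struct idmap_request type and the may_idmap() helper are hypothetical, and
each boolean stands in for a kernel-side check (MNT_IDMAPPED, is_anon_ns(),
FS_ALLOW_IDMAP, ns_capable()) that userspace cannot evaluate itself.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical snapshot of the facts the kernel checks for one mount. */
struct idmap_request {
	bool already_idmapped;      /* MNT_IDMAPPED is already set */
	bool in_anon_ns;            /* mount lives in an anonymous mount namespace */
	bool fs_allows_idmap;       /* filesystem type sets FS_ALLOW_IDMAP */
	bool capable_in_sb_userns;  /* CAP_SYS_ADMIN in the superblock's userns */
	bool capable_in_mnt_userns; /* CAP_SYS_ADMIN in the mount's current userns */
};

/* Return 0 if the mount may be idmapped, a negative errno value otherwise. */
static int may_idmap(const struct idmap_request *r)
{
	/* An attached mount that is already idmapped keeps its mapping. */
	if (r->already_idmapped && !r->in_anon_ns)
		return -EPERM;

	/* The underlying filesystem has not opted in. */
	if (!r->fs_allows_idmap)
		return -EINVAL;

	/* The caller controls the superblock. */
	if (r->capable_in_sb_userns)
		return 0;

	/* The mount is already shifted to a userns the caller controls. */
	if (r->already_idmapped && r->capable_in_mnt_userns)
		return 0;

	return -EPERM;
}
```

An unprivileged caller therefore only succeeds on a detached mount that was
already idmapped to a user namespace it has privilege over. Everything else
falls through to -EPERM.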

The new CONFIG_IDMAP_MOUNTS option can be used to compile the kernel
with idmapped mount support. It will default to off for quite some
time. Let's not be overconfident.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/Kconfig                 |   6 ++
 fs/internal.h              |   1 +
 fs/namespace.c             | 157 ++++++++++++++++++++++++++++++++++++-
 include/linux/fs.h         |   1 +
 include/linux/mount.h      |  20 ++++-
 include/uapi/linux/mount.h |   6 +-
 6 files changed, 186 insertions(+), 5 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index aa4c12282301..2d45ec3c7e04 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -15,6 +15,12 @@ config VALIDATE_FS_PARSER
 	  Enable this to perform validation of the parameter description for a
 	  filesystem when it is registered.
 
+config IDMAP_MOUNTS
+	bool "Support id mappings per mount"
+	default n
+	help
+	  This allows the vfs to create idmappings per vfsmount.
+
 if BLOCK
 
 config FS_IOMAP
diff --git a/fs/internal.h b/fs/internal.h
index a5a6c470dc07..b6046b5186cd 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -88,6 +88,7 @@ struct mount_kattr {
 	unsigned int propagation;
 	unsigned int lookup_flags;
 	bool recurse;
+	struct user_namespace *userns;
 };
 
 extern struct vfsmount *lookup_mnt(const struct path *);
diff --git a/fs/namespace.c b/fs/namespace.c
index e9c515b012a4..aef39fc74afa 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -25,6 +25,7 @@
 #include <linux/proc_ns.h>
 #include <linux/magic.h>
 #include <linux/memblock.h>
+#include <linux/proc_fs.h>
 #include <linux/task_work.h>
 #include <linux/sched/task.h>
 #include <uapi/linux/mount.h>
@@ -210,6 +211,9 @@ static struct mount *alloc_vfsmnt(const char *name)
 		INIT_HLIST_NODE(&mnt->mnt_mp_list);
 		INIT_LIST_HEAD(&mnt->mnt_umounting);
 		INIT_HLIST_HEAD(&mnt->mnt_stuck_children);
+#ifdef CONFIG_IDMAP_MOUNTS
+		mnt->mnt.mnt_user_ns = &init_user_ns;
+#endif
 	}
 	return mnt;
 
@@ -555,6 +559,13 @@ int sb_prepare_remount_readonly(struct super_block *sb)
 
 static void free_vfsmnt(struct mount *mnt)
 {
+#ifdef CONFIG_IDMAP_MOUNTS
+	if ((mnt->mnt.mnt_flags & MNT_IDMAPPED) &&
+	    mnt_user_ns(&mnt->mnt) != &init_user_ns) {
+		put_user_ns(mnt->mnt.mnt_user_ns);
+		mnt->mnt.mnt_user_ns = NULL;
+	}
+#endif
 	kfree_const(mnt->mnt_devname);
 #ifdef CONFIG_SMP
 	free_percpu(mnt->mnt_pcp);
@@ -1063,6 +1074,11 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);
 
 	atomic_inc(&sb->s_active);
+#ifdef CONFIG_IDMAP_MOUNTS
+	mnt->mnt.mnt_user_ns = old->mnt.mnt_user_ns;
+	if (mnt_user_ns(&old->mnt) != &init_user_ns)
+		mnt->mnt.mnt_user_ns = get_user_ns(mnt->mnt.mnt_user_ns);
+#endif
 	mnt->mnt.mnt_sb = sb;
 	mnt->mnt.mnt_root = dget(root);
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
@@ -3454,7 +3470,8 @@ static int build_attr_flags(unsigned int attr_flags, unsigned int *flags)
 			   MOUNT_ATTR_NODEV |
 			   MOUNT_ATTR_NOEXEC |
 			   MOUNT_ATTR__ATIME |
-			   MOUNT_ATTR_NODIRATIME))
+			   MOUNT_ATTR_NODIRATIME |
+			   MOUNT_ATTR_IDMAP))
 		return -EINVAL;
 
 	if (attr_flags & MOUNT_ATTR_RDONLY)
@@ -3467,6 +3484,8 @@ static int build_attr_flags(unsigned int attr_flags, unsigned int *flags)
 		aflags |= MNT_NOEXEC;
 	if (attr_flags & MOUNT_ATTR_NODIRATIME)
 		aflags |= MNT_NODIRATIME;
+	if (attr_flags & MOUNT_ATTR_IDMAP)
+		aflags |= MNT_IDMAPPED;
 
 	*flags = aflags;
 	return 0;
@@ -3494,6 +3513,14 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags,
 	if ((flags & ~(FSMOUNT_CLOEXEC)) != 0)
 		return -EINVAL;
 
+	if (attr_flags & ~(MOUNT_ATTR_RDONLY |
+			   MOUNT_ATTR_NOSUID |
+			   MOUNT_ATTR_NODEV |
+			   MOUNT_ATTR_NOEXEC |
+			   MOUNT_ATTR__ATIME |
+			   MOUNT_ATTR_NODIRATIME))
+		return -EINVAL;
+
 	ret = build_attr_flags(attr_flags, &mnt_flags);
 	if (ret)
 		return ret;
@@ -3836,6 +3863,7 @@ static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
 	 */
 	m = mnt;
 	do {
+		unsigned int old_flags;
 		last = m;
 
 		if (!can_change_locked_flags(m, all_raised)) {
@@ -3843,11 +3871,61 @@ static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
 			break;
 		}
 
-		if (rdonly_set && !(m->mnt.mnt_flags & MNT_READONLY)) {
+		old_flags = READ_ONCE(m->mnt.mnt_flags);
+		if (rdonly_set && !(old_flags & MNT_READONLY)) {
 			err = mnt_hold_writers(m);
 			if (err)
 				break;
 		}
+
+#ifdef CONFIG_IDMAP_MOUNTS
+		if (kattr->attr_set & MNT_IDMAPPED) {
+			struct user_namespace *user_ns;
+			struct vfsmount *vmnt;
+
+			/*
+			 * Once a mount has been idmapped we don't allow it to
+			 * change its mapping. It makes things simpler and
+			 * callers can just create a detached mount they can
+			 * idmap. So make sure that this mount is the root of
+			 * an anon namespace.
+			 */
+			if ((old_flags & MNT_IDMAPPED) && !is_anon_ns(m->mnt_ns)) {
+				err = -EPERM;
+				break;
+			}
+
+			/*
+			 * The underlying filesystem doesn't support idmapped
+			 * mounts yet.
+			 */
+			vmnt = &m->mnt;
+			if (!(vmnt->mnt_sb->s_type->fs_flags & FS_ALLOW_IDMAP)) {
+				err = -EINVAL;
+				break;
+			}
+
+			/* We're controlling the superblock. */
+			if (ns_capable(vmnt->mnt_sb->s_user_ns, CAP_SYS_ADMIN)) {
+				err = 0;
+				continue;
+			}
+
+			/*
+			 * The mount is already shifted to a user namespace
+			 * that we have control over. (We already verified that
+			 * this is the root of an anon namespace above.)
+			 */
+			user_ns = READ_ONCE(vmnt->mnt_user_ns);
+			if ((old_flags & MNT_IDMAPPED) && ns_capable(user_ns, CAP_SYS_ADMIN)) {
+				err = 0;
+				continue;
+			}
+
+			err = -EPERM;
+			break;
+		}
+#endif
 	} while (kattr->recurse && (m = next_mnt(m, mnt)));
 
 	m = mnt;
@@ -3860,6 +3938,20 @@ static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
 			new_flags &= ~kattr->attr_clr;
 			/* Raise flags user wants us to set. */
 			new_flags |= kattr->attr_set;
+
+			/*
+			 * The MNT_IDMAPPED flag should be seen _after_ the
+			 * user_ns pointer in struct vfsmount is valid.
+			 */
+#ifdef CONFIG_IDMAP_MOUNTS
+			if (kattr->attr_set & MNT_IDMAPPED) {
+				struct user_namespace *user_ns = READ_ONCE(m->mnt.mnt_user_ns);
+				WRITE_ONCE(m->mnt.mnt_user_ns, get_user_ns(kattr->userns));
+				if (user_ns != &init_user_ns)
+					put_user_ns(user_ns);
+			}
+			smp_wmb();
+#endif
 			WRITE_ONCE(m->mnt.mnt_flags, new_flags);
 		}
 
@@ -3893,12 +3985,20 @@ static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
 			cleanup_group_ids(mnt, NULL);
 	}
 
+#ifdef CONFIG_IDMAP_MOUNTS
+	if (kattr->attr_set & MNT_IDMAPPED) {
+		put_user_ns(kattr->userns);
+		kattr->userns = NULL;
+	}
+#endif
+
 	return err;
 }
 
 static int build_mount_kattr(const struct mount_attr *attr,
 			     struct mount_kattr *kattr, unsigned int flags)
 {
+	int err = 0;
 	unsigned int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
 
 	if (flags & AT_NO_AUTOMOUNT)
@@ -3975,7 +4075,58 @@ static int build_mount_kattr(const struct mount_attr *attr,
 		}
 	}
 
-	return 0;
+#ifndef CONFIG_IDMAP_MOUNTS
+	if ((attr->attr_set | attr->attr_clr) & MOUNT_ATTR_IDMAP)
+		return -EINVAL;
+#else
+	if ((attr->attr_set & MOUNT_ATTR_IDMAP) && (attr->userns > INT_MAX))
+		return -EINVAL;
+
+	/* TODO: Implement MNT_IDMAPPED clearing. */
+	if (attr->attr_clr & MOUNT_ATTR_IDMAP)
+		return -EINVAL;
+
+	if (attr->attr_set & MOUNT_ATTR_IDMAP) {
+		struct ns_common *ns;
+		struct user_namespace *user_ns;
+		struct file *file;
+
+		file = fget(attr->userns);
+		if (!file)
+			return -EBADF;
+
+		if (!proc_ns_file(file)) {
+			err = -EINVAL;
+			goto out_fput;
+		}
+
+		ns = get_proc_ns(file_inode(file));
+		if (ns->ops->type != CLONE_NEWUSER) {
+			err = -EINVAL;
+			goto out_fput;
+		}
+		user_ns = container_of(ns, struct user_namespace, ns);
+
+		/*
+		 * The init_user_ns is used to indicate that a vfsmount is not
+		 * idmapped. This is simpler than just having to treat NULL as
+		 * unmapped. Users wanting to idmap a mount to init_user_ns can
+		 * just use a namespace with an identity mapping.
+		 */
+		if (user_ns == &init_user_ns) {
+			err = -EPERM;
+			goto out_fput;
+		}
+
+		kattr->userns = get_user_ns(user_ns);
+		err = 0;
+	out_fput:
+		fput(file);
+	}
+
+#endif /* CONFIG_IDMAP_MOUNTS */
+
+	return err;
 }
 
 SYSCALL_DEFINE5(mount_setattr, int, dfd, const char __user *, path, unsigned int, flags,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0bd126418bb6..8314cd351673 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2217,6 +2217,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
+#define FS_ALLOW_IDMAP		32	/* FS has been updated to handle vfs idmappings. */
 #define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	int (*init_fs_context)(struct fs_context *);
diff --git a/include/linux/mount.h b/include/linux/mount.h
index aaf343b38671..d4ae170b2c03 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -31,6 +31,7 @@ struct fs_context;
 #define MNT_RELATIME	0x20
 #define MNT_READONLY	0x40	/* does the user want this to be r/o? */
 #define MNT_NOSYMFOLLOW	0x80
+#define MNT_IDMAPPED	0x400
 
 #define MNT_SHRINKABLE	0x100
 #define MNT_WRITE_HOLD	0x200
@@ -47,7 +48,7 @@ struct fs_context;
 #define MNT_SHARED_MASK	(MNT_UNBINDABLE)
 #define MNT_USER_SETTABLE_MASK  (MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
 				 | MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
-				 | MNT_READONLY | MNT_NOSYMFOLLOW)
+				 | MNT_READONLY | MNT_NOSYMFOLLOW | MNT_IDMAPPED)
 #define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
 
 #define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
@@ -72,8 +73,25 @@ struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
 	struct super_block *mnt_sb;	/* pointer to superblock */
 	int mnt_flags;
+#ifdef CONFIG_IDMAP_MOUNTS
+	struct user_namespace *mnt_user_ns;
+#endif
 } __randomize_layout;
 
+static inline bool mnt_idmapped(const struct vfsmount *mnt)
+{
+	return READ_ONCE(mnt->mnt_flags) & MNT_IDMAPPED;
+}
+
+static inline struct user_namespace *mnt_user_ns(const struct vfsmount *mnt)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	return READ_ONCE(mnt->mnt_user_ns);
+#else
+	return &init_user_ns;
+#endif
+}
+
 struct file; /* forward dec */
 struct path;
 
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index fb3ad26fdebf..672c58c619ed 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -117,6 +117,7 @@ enum fsconfig_command {
 #define MOUNT_ATTR_NOATIME	0x00000010 /* - Do not update access times. */
 #define MOUNT_ATTR_STRICTATIME	0x00000020 /* - Always perform atime updates */
 #define MOUNT_ATTR_NODIRATIME	0x00000080 /* Do not update directory access times */
+#define MOUNT_ATTR_IDMAP	0x00100000 /* Idmap this mount to @userns in mount_attr. */
 
 /*
  * mount_setattr()
@@ -125,6 +126,8 @@ struct mount_attr {
 	__u64 attr_set;
 	__u64 attr_clr;
 	__u64 propagation;
+	__u32 userns;
+	__u32 reserved[0];
 };
 
 /* Change propagation through mount_setattr(). */
@@ -138,6 +141,7 @@ enum propagation_type {
 
 /* List of all mount_attr versions. */
 #define MOUNT_ATTR_SIZE_VER0	24 /* sizeof first published struct */
-#define MOUNT_ATTR_SIZE_LATEST	MOUNT_ATTR_SIZE_VER0
+#define MOUNT_ATTR_SIZE_VER1	32 /* sizeof second published struct */
+#define MOUNT_ATTR_SIZE_LATEST	MOUNT_ATTR_SIZE_VER1
 
 #endif /* _UAPI_LINUX_MOUNT_H */
-- 
2.29.0


* [PATCH 06/34] fs: add id translation helpers
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (4 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 05/34] fs: introduce MOUNT_ATTR_IDMAP Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-11-01 14:46   ` Christoph Hellwig
  2020-10-29  0:32 ` [PATCH 07/34] capability: handle idmapped mounts Christian Brauner
                   ` (31 subsequent siblings)
  37 siblings, 1 reply; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

Add simple helpers to make it easy to map kuids into and from idmapped
mounts. We provide simple wrappers that filesystems can use to
e.g. initialize inodes similar to i_{uid,gid}_read() and
i_{uid,gid}_write(). Accessing an inode through an idmapped mount will
require the inode to be mapped according to the mount's user namespace.
If the caller's fsuid and fsgid are compared against inodes or used to
initialize inodes they must first be mapped from the mount's user namespace. Passing
the initial user namespace to these helpers makes them a nop and so any
non-idmapped paths will not be impacted.
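
To make the direction of the two translations concrete, here is a minimal
userspace sketch. It is not kernel code: struct idmap_model, the model_*
functions and the single-extent mapping (ids 0..65535 in the namespace backed
by kernel ids 100000..165535) are assumptions made up for illustration. Only
the shape of kuid_into_mnt() (make_kuid() on the raw value) and
kuid_from_mnt() (from_kuid() wrapped back into a kuid) follows the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID_ID UINT32_MAX	/* stands in for INVALID_UID */

/*
 * Hypothetical one-extent id mapping: ids [first, first + count) inside the
 * namespace correspond to kernel ids [lower_first, lower_first + count).
 */
struct idmap_model {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Models make_kuid(): namespace id -> kernel id. */
static uint32_t model_make_kuid(const struct idmap_model *ns, uint32_t uid)
{
	assert(ns != NULL);
	if (uid < ns->first || uid - ns->first >= ns->count)
		return MODEL_INVALID_ID;
	return ns->lower_first + (uid - ns->first);
}

/* Models from_kuid(): kernel id -> namespace id. */
static uint32_t model_from_kuid(const struct idmap_model *ns, uint32_t kuid)
{
	assert(ns != NULL);
	if (kuid < ns->lower_first || kuid - ns->lower_first >= ns->count)
		return MODEL_INVALID_ID;
	return ns->first + (kuid - ns->lower_first);
}

/*
 * Mirrors kuid_into_mnt(): the raw kuid value is reinterpreted as an id in
 * the mount's namespace and mapped down.
 */
static uint32_t model_kuid_into_mnt(const struct idmap_model *mnt, uint32_t kuid)
{
	return model_make_kuid(mnt, kuid);
}

/*
 * Mirrors kuid_from_mnt(): the kuid is mapped up into the mount's namespace
 * and the result is used as a raw kuid value again.
 */
static uint32_t model_kuid_from_mnt(const struct idmap_model *mnt, uint32_t kuid)
{
	return model_from_kuid(mnt, kuid);
}

/* Identity mapping, playing the role of init_user_ns. */
static const struct idmap_model model_init_ns = { 0, 0, UINT32_MAX };
/* Assumed container mapping: 0 -> 100000, 65536 ids. */
static const struct idmap_model model_container_ns = { 0, 100000, 65536 };
```

With the assumed container mapping an inode owned by id 0 is seen as 100000
through the mount, and a caller with fsuid 100000 creates files owned by 0.
With the identity mapping both directions return their input unchanged, which
is the nop behaviour described above.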

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 include/linux/fs.h | 75 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8314cd351673..8a891b80d0b4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -39,6 +39,7 @@
 #include <linux/fs_types.h>
 #include <linux/build_bug.h>
 #include <linux/stddef.h>
+#include <linux/cred.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -1574,6 +1575,80 @@ static inline void i_gid_write(struct inode *inode, gid_t gid)
 	inode->i_gid = make_kgid(inode->i_sb->s_user_ns, gid);
 }
 
+static inline kuid_t kuid_into_mnt(struct user_namespace *to, kuid_t kuid)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	return make_kuid(to, __kuid_val(kuid));
+#else
+	return kuid;
+#endif
+}
+
+static inline kgid_t kgid_into_mnt(struct user_namespace *to, kgid_t kgid)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	return make_kgid(to, __kgid_val(kgid));
+#else
+	return kgid;
+#endif
+}
+
+static inline kuid_t i_uid_into_mnt(struct user_namespace *to,
+				    const struct inode *inode)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	return kuid_into_mnt(to, inode->i_uid);
+#else
+	return inode->i_uid;
+#endif
+}
+
+static inline kgid_t i_gid_into_mnt(struct user_namespace *to,
+				    const struct inode *inode)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	return kgid_into_mnt(to, inode->i_gid);
+#else
+	return inode->i_gid;
+#endif
+}
+
+static inline kuid_t kuid_from_mnt(struct user_namespace *to, kuid_t kuid)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	return KUIDT_INIT(from_kuid(to, kuid));
+#else
+	return kuid;
+#endif
+}
+
+static inline kgid_t kgid_from_mnt(struct user_namespace *to, kgid_t kgid)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	return KGIDT_INIT(from_kgid(to, kgid));
+#else
+	return kgid;
+#endif
+}
+
+static inline kuid_t fsuid_into_mnt(struct user_namespace *to)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	return kuid_from_mnt(to, current_fsuid());
+#else
+	return current_fsuid();
+#endif
+}
+
+static inline kgid_t fsgid_into_mnt(struct user_namespace *to)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	return kgid_from_mnt(to, current_fsgid());
+#else
+	return current_fsgid();
+#endif
+}
+
 extern struct timespec64 current_time(struct inode *inode);
 
 /*
-- 
2.29.0


* [PATCH 07/34] capability: handle idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (5 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 06/34] fs: add id translation helpers Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-11-01 14:48   ` Christoph Hellwig
  2020-10-29  0:32 ` [PATCH 08/34] namei: add idmapped mount aware permission helpers Christian Brauner
                   ` (30 subsequent siblings)
  37 siblings, 1 reply; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

In order to determine whether a caller holds privilege over a given
inode the capability framework exposes the two helpers
privileged_wrt_inode_uidgid() and capable_wrt_inode_uidgid(). The former
verifies that the inode has a mapping in the caller's user namespace and
the latter additionally verifies that the caller has the requested
capability in their current user namespace. If the inode is accessed
through an idmapped mount we first need to map it according to the
mount's user namespace. Afterwards the checks are identical to
non-idmapped inodes. If the initial user namespace is passed all
operations are a nop so non-idmapped mounts will not see a change in
behavior and will also not see any performance impact.

Since the privileged_wrt_inode_uidgid() helper only has one caller it
makes more sense to simply add an additional user namespace argument and
adapt the single callsite it is used in. The capable_wrt_inode_uidgid()
helper is used in more places so we introduce a new
capable_wrt_mapped_inode_uidgid() helper which can be used by the vfs.
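
As a rough illustration of the order of operations (map the inode according
to the mount first, then run the usual checks), here is a small userspace
model. Everything named *_model or model_* is hypothetical, the namespaces use
a single made-up extent, and the capability check is reduced to a boolean
passed in by the caller; only the structure follows the functions in this
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID_ID UINT32_MAX	/* stands in for INVALID_UID/INVALID_GID */

/* Hypothetical one-extent mapping: [first, +count) -> [lower_first, +count). */
struct ns_model {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

struct inode_model {
	uint32_t uid;
	uint32_t gid;
};

/* Models make_kuid()/make_kgid(), i.e. what i_{uid,gid}_into_mnt() does. */
static uint32_t model_map_down(const struct ns_model *ns, uint32_t id)
{
	if (id < ns->first || id - ns->first >= ns->count)
		return MODEL_INVALID_ID;
	return ns->lower_first + (id - ns->first);
}

/* Models kuid_has_mapping(): is the kernel id representable in @ns? */
static bool model_has_mapping(const struct ns_model *ns, uint32_t kid)
{
	return kid != MODEL_INVALID_ID && kid >= ns->lower_first &&
	       kid - ns->lower_first < ns->count;
}

/* Same shape as the new three-argument privileged_wrt_inode_uidgid(). */
static bool model_privileged_wrt_inode(const struct ns_model *ns,
				       const struct ns_model *mnt_ns,
				       const struct inode_model *inode)
{
	assert(ns != NULL && mnt_ns != NULL && inode != NULL);
	return model_has_mapping(ns, model_map_down(mnt_ns, inode->uid)) &&
	       model_has_mapping(ns, model_map_down(mnt_ns, inode->gid));
}

/* Same shape as capable_wrt_mapped_inode_uidgid(); ns_capable() is a flag. */
static bool model_capable_wrt_mapped_inode(const struct ns_model *caller_ns,
					   bool caller_has_cap,
					   const struct ns_model *mnt_ns,
					   const struct inode_model *inode)
{
	return caller_has_cap &&
	       model_privileged_wrt_inode(caller_ns, mnt_ns, inode);
}

static const struct ns_model model_init_ns = { 0, 0, UINT32_MAX };
static const struct ns_model model_container_ns = { 0, 100000, 65536 };
static const struct inode_model model_root_owned = { 0, 0 };
```

In this model a caller in the container namespace is not privileged over an
inode owned by 0:0 when the mount is not idmapped, but is privileged once the
mount is idmapped with the container's mapping. Passing the identity mapping
as the mount namespace reproduces the old two-argument behaviour.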

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/exec.c                  |  2 +-
 include/linux/capability.h |  6 +++++-
 kernel/capability.c        | 22 ++++++++++++++++------
 3 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 547a2390baf5..8e75d7a33514 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1398,7 +1398,7 @@ void would_dump(struct linux_binprm *bprm, struct file *file)
 		/* Ensure mm->user_ns contains the executable */
 		user_ns = old = bprm->mm->user_ns;
 		while ((user_ns != &init_user_ns) &&
-		       !privileged_wrt_inode_uidgid(user_ns, inode))
+		       !privileged_wrt_inode_uidgid(user_ns, &init_user_ns, inode))
 			user_ns = user_ns->parent;
 
 		if (old != user_ns) {
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 1e7fe311cabe..308d88096745 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -247,8 +247,12 @@ static inline bool ns_capable_setid(struct user_namespace *ns, int cap)
 	return true;
 }
 #endif /* CONFIG_MULTIUSER */
-extern bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *inode);
+extern bool privileged_wrt_inode_uidgid(struct user_namespace *ns,
+					struct user_namespace *mnt_user_ns,
+					const struct inode *inode);
 extern bool capable_wrt_inode_uidgid(const struct inode *inode, int cap);
+extern bool capable_wrt_mapped_inode_uidgid(struct user_namespace *mnt_user_ns,
+					const struct inode *inode, int cap);
 extern bool file_ns_capable(const struct file *file, struct user_namespace *ns, int cap);
 extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
 static inline bool perfmon_capable(void)
diff --git a/kernel/capability.c b/kernel/capability.c
index de7eac903a2a..427776414487 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -484,12 +484,24 @@ EXPORT_SYMBOL(file_ns_capable);
  *
  * Return true if the inode uid and gid are within the namespace.
  */
-bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *inode)
+bool privileged_wrt_inode_uidgid(struct user_namespace *ns,
+				 struct user_namespace *mnt_user_ns,
+				 const struct inode *inode)
 {
-	return kuid_has_mapping(ns, inode->i_uid) &&
-		kgid_has_mapping(ns, inode->i_gid);
+	return kuid_has_mapping(ns, i_uid_into_mnt(mnt_user_ns, inode)) &&
+	       kgid_has_mapping(ns, i_gid_into_mnt(mnt_user_ns, inode));
 }
 
+bool capable_wrt_mapped_inode_uidgid(struct user_namespace *mnt_user_ns,
+				 const struct inode *inode, int cap)
+{
+	struct user_namespace *ns = current_user_ns();
+
+	return ns_capable(ns, cap) &&
+	       privileged_wrt_inode_uidgid(ns, mnt_user_ns, inode);
+}
+EXPORT_SYMBOL(capable_wrt_mapped_inode_uidgid);
+
 /**
  * capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped
  * @inode: The inode in question
@@ -501,9 +513,7 @@ bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *
  */
 bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
 {
-	struct user_namespace *ns = current_user_ns();
-
-	return ns_capable(ns, cap) && privileged_wrt_inode_uidgid(ns, inode);
+	return capable_wrt_mapped_inode_uidgid(&init_user_ns, inode, cap);
 }
 EXPORT_SYMBOL(capable_wrt_inode_uidgid);
 
-- 
2.29.0


* [PATCH 08/34] namei: add idmapped mount aware permission helpers
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (6 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 07/34] capability: handle idmapped mounts Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 09/34] inode: add idmapped mount aware init and " Christian Brauner
                   ` (29 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts we
add the two helpers mapped_inode_permission() and
mapped_generic_permission() which take a user namespace argument. On
idmapped mounts the two new helpers will make sure to map the inode
according to the mount's user namespace and then perform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed mapped_inode_permission() and
mapped_generic_permission() are identical to inode_permission() and
generic_permission() so there will be no performance impact on
non-idmapped mounts. This also means that the inode_permission() and
generic_permission() helpers can be implemented on top of
mapped_inode_permission() and mapped_generic_permission() respectively
by just passing in the initial user namespace so no code is
unnecessarily duplicated.
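
The wrapper pattern can be sketched in userspace as follows. This is a
simplified model, not the kernel implementation: it only covers the owner and
"other" mode bits (no groups, ACLs or capabilities), and struct ns_model,
struct inode_model, the MODEL_MAY_* constants and the model_* functions are
all hypothetical names introduced for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID_ID UINT32_MAX
#define MODEL_MAY_EXEC	 1
#define MODEL_MAY_WRITE	 2
#define MODEL_MAY_READ	 4

/* Hypothetical one-extent mapping: [first, +count) -> [lower_first, +count). */
struct ns_model {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

struct inode_model {
	uint32_t uid;
	unsigned int mode;
};

/* Models i_uid_into_mnt(): map the stored owner according to the mount. */
static uint32_t model_uid_into_mnt(const struct ns_model *mnt_ns, uint32_t uid)
{
	if (uid < mnt_ns->first || uid - mnt_ns->first >= mnt_ns->count)
		return MODEL_INVALID_ID;
	return mnt_ns->lower_first + (uid - mnt_ns->first);
}

/* Namespace-aware check: owner bits if the mapped owner matches @fsuid. */
static int model_mapped_permission(const struct ns_model *mnt_ns,
				   uint32_t fsuid,
				   const struct inode_model *inode, int mask)
{
	unsigned int mode;

	assert(mnt_ns != NULL && inode != NULL);
	mode = inode->mode;
	mask &= 7;
	if (model_uid_into_mnt(mnt_ns, inode->uid) == fsuid)
		mode >>= 6;
	return (mask & ~mode) ? -EACCES : 0;
}

static const struct ns_model model_init_ns = { 0, 0, UINT32_MAX };
static const struct ns_model model_container_ns = { 0, 100000, 65536 };
/* Owned by id 0, mode 0600. */
static const struct inode_model model_inode = { 0, 0600 };

/* The legacy entry point is just the mapped one with the identity mapping. */
static int model_permission(uint32_t fsuid, const struct inode_model *inode,
			    int mask)
{
	return model_mapped_permission(&model_init_ns, fsuid, inode, mask);
}
```

A caller with fsuid 100000 is denied read access through the plain entry
point, because the file belongs to 0 and the "other" bits are empty, but is
treated as the owner when the same inode is reached through a mount idmapped
with the assumed container mapping.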

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/namei.c                | 71 ++++++++++++++++++++++++++++-----------
 fs/posix_acl.c            | 16 ++++++---
 include/linux/fs.h        |  2 ++
 include/linux/posix_acl.h |  4 ++-
 4 files changed, 67 insertions(+), 26 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index d4a6dd772303..2635f6a57de5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -259,7 +259,7 @@ void putname(struct filename *name)
 		__putname(name);
 }
 
-static int check_acl(struct inode *inode, int mask)
+static int check_acl(struct user_namespace *user_ns, struct inode *inode, int mask)
 {
 #ifdef CONFIG_FS_POSIX_ACL
 	struct posix_acl *acl;
@@ -271,14 +271,14 @@ static int check_acl(struct inode *inode, int mask)
 		/* no ->get_acl() calls in RCU mode... */
 		if (is_uncached_acl(acl))
 			return -ECHILD;
-	        return posix_acl_permission(inode, acl, mask);
+	        return posix_acl_permission(user_ns, inode, acl, mask);
 	}
 
 	acl = get_acl(inode, ACL_TYPE_ACCESS);
 	if (IS_ERR(acl))
 		return PTR_ERR(acl);
 	if (acl) {
-	        int error = posix_acl_permission(inode, acl, mask);
+	        int error = posix_acl_permission(user_ns, inode, acl, mask);
 	        posix_acl_release(acl);
 	        return error;
 	}
@@ -293,12 +293,14 @@ static int check_acl(struct inode *inode, int mask)
  * Note that the POSIX ACL check cares about the MAY_NOT_BLOCK bit,
  * for RCU walking.
  */
-static int acl_permission_check(struct inode *inode, int mask)
+static int acl_permission_check(struct user_namespace *user_ns, struct inode *inode, int mask)
 {
 	unsigned int mode = inode->i_mode;
+	kuid_t i_uid;
 
 	/* Are we the owner? If so, ACL's don't matter */
-	if (likely(uid_eq(current_fsuid(), inode->i_uid))) {
+	i_uid = i_uid_into_mnt(user_ns, inode);
+	if (likely(uid_eq(current_fsuid(), i_uid))) {
 		mask &= 7;
 		mode >>= 6;
 		return (mask & ~mode) ? -EACCES : 0;
@@ -306,7 +308,7 @@ static int acl_permission_check(struct inode *inode, int mask)
 
 	/* Do we have ACL's? */
 	if (IS_POSIXACL(inode) && (mode & S_IRWXG)) {
-		int error = check_acl(inode, mask);
+		int error = check_acl(user_ns, inode, mask);
 		if (error != -EAGAIN)
 			return error;
 	}
@@ -320,7 +322,8 @@ static int acl_permission_check(struct inode *inode, int mask)
 	 * about? Need to check group ownership if so.
 	 */
 	if (mask & (mode ^ (mode >> 3))) {
-		if (in_group_p(inode->i_gid))
+		kgid_t kgid = i_gid_into_mnt(user_ns, inode);
+		if (in_group_p(kgid))
 			mode >>= 3;
 	}
 
@@ -329,7 +332,7 @@ static int acl_permission_check(struct inode *inode, int mask)
 }
 
 /**
- * generic_permission -  check for access rights on a Posix-like filesystem
+ * mapped_generic_permission -  check for access rights on a Posix-like filesystem
  * @inode:	inode to check access rights for
  * @mask:	right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC,
  *		%MAY_NOT_BLOCK ...)
@@ -343,24 +346,25 @@ static int acl_permission_check(struct inode *inode, int mask)
  * request cannot be satisfied (eg. requires blocking or too much complexity).
  * It would then be called again in ref-walk mode.
  */
-int generic_permission(struct inode *inode, int mask)
+int mapped_generic_permission(struct user_namespace *user_ns, struct inode *inode,
+			  int mask)
 {
 	int ret;
 
 	/*
 	 * Do the basic permission checks.
 	 */
-	ret = acl_permission_check(inode, mask);
+	ret = acl_permission_check(user_ns, inode, mask);
 	if (ret != -EACCES)
 		return ret;
 
 	if (S_ISDIR(inode->i_mode)) {
 		/* DACs are overridable for directories */
 		if (!(mask & MAY_WRITE))
-			if (capable_wrt_inode_uidgid(inode,
-						     CAP_DAC_READ_SEARCH))
+			if (capable_wrt_mapped_inode_uidgid(user_ns, inode,
+							    CAP_DAC_READ_SEARCH))
 				return 0;
-		if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
+		if (capable_wrt_mapped_inode_uidgid(user_ns, inode, CAP_DAC_OVERRIDE))
 			return 0;
 		return -EACCES;
 	}
@@ -370,7 +374,8 @@ int generic_permission(struct inode *inode, int mask)
 	 */
 	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
 	if (mask == MAY_READ)
-		if (capable_wrt_inode_uidgid(inode, CAP_DAC_READ_SEARCH))
+		if (capable_wrt_mapped_inode_uidgid(user_ns, inode,
+						    CAP_DAC_READ_SEARCH))
 			return 0;
 	/*
 	 * Read/write DACs are always overridable.
@@ -378,11 +383,18 @@ int generic_permission(struct inode *inode, int mask)
 	 * at least one exec bit set.
 	 */
 	if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO))
-		if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
+		if (capable_wrt_mapped_inode_uidgid(user_ns, inode,
+						    CAP_DAC_OVERRIDE))
 			return 0;
 
 	return -EACCES;
 }
+EXPORT_SYMBOL(mapped_generic_permission);
+
+int generic_permission(struct inode *inode, int mask)
+{
+	return mapped_generic_permission(&init_user_ns, inode, mask);
+}
 EXPORT_SYMBOL(generic_permission);
 
 /*
@@ -391,7 +403,7 @@ EXPORT_SYMBOL(generic_permission);
  * flag in inode->i_opflags, that says "this has not special
  * permission function, use the fast case".
  */
-static inline int do_inode_permission(struct inode *inode, int mask)
+static inline int do_inode_permission(struct user_namespace *user_ns, struct inode *inode, int mask)
 {
 	if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
 		if (likely(inode->i_op->permission))
@@ -402,7 +414,7 @@ static inline int do_inode_permission(struct inode *inode, int mask)
 		inode->i_opflags |= IOP_FASTPERM;
 		spin_unlock(&inode->i_lock);
 	}
-	return generic_permission(inode, mask);
+	return mapped_generic_permission(user_ns, inode, mask);
 }
 
 /**
@@ -426,7 +438,9 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
 }
 
 /**
- * inode_permission - Check for access rights to a given inode
+ * mapped_inode_permission - Check for access rights to a given inode as seen from
+ *			 a given user namespace
+ * @user_ns: The user namespace the inode is seen from
  * @inode: Inode to check permission on
  * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
  *
@@ -436,7 +450,7 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
  *
  * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
  */
-int inode_permission(struct inode *inode, int mask)
+int mapped_inode_permission(struct user_namespace *user_ns, struct inode *inode, int mask)
 {
 	int retval;
 
@@ -460,7 +474,7 @@ int inode_permission(struct inode *inode, int mask)
 			return -EACCES;
 	}
 
-	retval = do_inode_permission(inode, mask);
+	retval = do_inode_permission(user_ns, inode, mask);
 	if (retval)
 		return retval;
 
@@ -470,6 +484,23 @@ int inode_permission(struct inode *inode, int mask)
 
 	return security_inode_permission(inode, mask);
 }
+EXPORT_SYMBOL(mapped_inode_permission);
+
+/**
+ * inode_permission - Check for access rights to a given inode
+ * @inode: Inode to check permission on
+ * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * Check for read/write/execute permissions on an inode.  We use fs[ug]id for
+ * this, letting us set arbitrary permissions for filesystem access without
+ * changing the "normal" UIDs which are used for other things.
+ *
+ * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
+ */
+int inode_permission(struct inode *inode, int mask)
+{
+	return mapped_inode_permission(&init_user_ns, inode, mask);
+}
 EXPORT_SYMBOL(inode_permission);
 
 /**
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 95882b3f5f62..f15b6ad35ec3 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -345,10 +345,12 @@ EXPORT_SYMBOL(posix_acl_from_mode);
  * by the acl. Returns -E... otherwise.
  */
 int
-posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
+posix_acl_permission(struct user_namespace *user_ns, struct inode *inode, const struct posix_acl *acl, int want)
 {
 	const struct posix_acl_entry *pa, *pe, *mask_obj;
 	int found = 0;
+	kuid_t uid;
+	kgid_t gid;
 
 	want &= MAY_READ | MAY_WRITE | MAY_EXEC;
 
@@ -356,22 +358,26 @@ posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
                 switch(pa->e_tag) {
                         case ACL_USER_OBJ:
 				/* (May have been checked already) */
-				if (uid_eq(inode->i_uid, current_fsuid()))
+				uid = i_uid_into_mnt(user_ns, inode);
+				if (uid_eq(uid, current_fsuid()))
                                         goto check_perm;
                                 break;
                         case ACL_USER:
-				if (uid_eq(pa->e_uid, current_fsuid()))
+				uid = kuid_into_mnt(user_ns, pa->e_uid);
+				if (uid_eq(uid, current_fsuid()))
                                         goto mask;
 				break;
                         case ACL_GROUP_OBJ:
-                                if (in_group_p(inode->i_gid)) {
+				gid = i_gid_into_mnt(user_ns, inode);
+                                if (in_group_p(gid)) {
 					found = 1;
 					if ((pa->e_perm & want) == want)
 						goto mask;
                                 }
 				break;
                         case ACL_GROUP:
-				if (in_group_p(pa->e_gid)) {
+				gid = kgid_into_mnt(user_ns, pa->e_gid);
+				if (in_group_p(gid)) {
 					found = 1;
 					if ((pa->e_perm & want) == want)
 						goto mask;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8a891b80d0b4..750ca4b3d89f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2820,7 +2820,9 @@ static inline int bmap(struct inode *inode,  sector_t *block)
 
 extern int notify_change(struct dentry *, struct iattr *, struct inode **);
 extern int inode_permission(struct inode *, int);
+extern int mapped_inode_permission(struct user_namespace *, struct inode *, int);
 extern int generic_permission(struct inode *, int);
+extern int mapped_generic_permission(struct user_namespace *, struct inode *, int);
 extern int __check_sticky(struct inode *dir, struct inode *inode);
 
 static inline bool execute_ok(struct inode *inode)
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 90797f1b421d..8276baefed13 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -15,6 +15,8 @@
 #include <linux/refcount.h>
 #include <uapi/linux/posix_acl.h>
 
+struct user_namespace;
+
 struct posix_acl_entry {
 	short			e_tag;
 	unsigned short		e_perm;
@@ -62,7 +64,7 @@ posix_acl_release(struct posix_acl *acl)
 extern void posix_acl_init(struct posix_acl *, int);
 extern struct posix_acl *posix_acl_alloc(int, gfp_t);
 extern int posix_acl_valid(struct user_namespace *, const struct posix_acl *);
-extern int posix_acl_permission(struct inode *, const struct posix_acl *, int);
+extern int posix_acl_permission(struct user_namespace *, struct inode *, const struct posix_acl *, int);
 extern struct posix_acl *posix_acl_from_mode(umode_t, gfp_t);
 extern int posix_acl_equiv_mode(const struct posix_acl *, umode_t *);
 extern int __posix_acl_create(struct posix_acl **, gfp_t, umode_t *);
-- 
2.29.0


* [PATCH 09/34] inode: add idmapped mount aware init and permission helpers
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (7 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 08/34] namei: add idmapped mount aware permission helpers Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 10/34] attr: handle idmapped mounts Christian Brauner
                   ` (28 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

The inode_owner_or_capable() helper determines whether the caller is the
owner of the inode or is capable with respect to that inode. Add a new
mapped_inode_owner_or_capable() helper to handle idmapped mounts. If the
inode is accessed through an idmapped mount we first need to map
it according to the mount's user namespace. Afterwards the checks are
identical to non-idmapped mounts. If the initial user namespace is
passed all operations are a nop so non-idmapped mounts will not see a
change in behavior and will also not see any performance impact. It also
means that the inode_owner_or_capable() helper can be implemented on top
of mapped_inode_owner_or_capable() by passing in the initial user
namespace.

Similarly, we add a new mapped_inode_init_owner() helper which
initializes a new inode on idmapped mounts by mapping the fsuid and
fsgid of the caller from the mount's user namespace. If the initial user
namespace is passed all operations are a nop so non-idmapped mounts will
not see a change in behavior and will also not see any performance
impact. It also means that the inode_init_owner() helper can be
implemented on top of mapped_inode_init_owner() by passing in the
initial user namespace.
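
The initialization side can be sketched the same way. Again this is a
userspace toy model with hypothetical names (toy_idmap, toy_inode,
toy_mapped_inode_init_owner, TOY_S_*), a single-range mapping, and without
the group membership and CAP_FSETID check that may clear S_ISGID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INVALID_ID	UINT32_MAX
#define TOY_S_IFMT	0170000u
#define TOY_S_IFDIR	0040000u
#define TOY_S_IFREG	0100000u
#define TOY_S_ISGID	0002000u

/* Hypothetical single-range mapping standing in for a mount's user namespace. */
struct toy_idmap {
	uint32_t disk_first;	/* first id as stored in the inode */
	uint32_t mnt_first;	/* first id as seen through the mount */
	uint32_t count;		/* length of the mapped range */
};

/* Identity mapping: plays the role of the initial user namespace. */
static const struct toy_idmap toy_init_map = { 0, 0, UINT32_MAX };

struct toy_inode {
	uint32_t uid;
	uint32_t gid;
	unsigned int mode;
};

/* Map a caller id seen through the mount back to the id to store. */
static uint32_t toy_id_from_mnt(const struct toy_idmap *map, uint32_t mnt_id)
{
	if (mnt_id < map->mnt_first || mnt_id - map->mnt_first >= map->count)
		return TOY_INVALID_ID;
	return map->disk_first + (mnt_id - map->mnt_first);
}

/* Set uid, gid and mode of a new inode; setgid directories pass on their gid. */
static void toy_mapped_inode_init_owner(struct toy_inode *inode,
					const struct toy_idmap *map,
					const struct toy_inode *dir,
					unsigned int mode,
					uint32_t fsuid, uint32_t fsgid)
{
	inode->uid = toy_id_from_mnt(map, fsuid);
	if (dir && (dir->mode & TOY_S_ISGID)) {
		inode->gid = dir->gid;
		/* New directories inherit the setgid bit as well. */
		if ((mode & TOY_S_IFMT) == TOY_S_IFDIR)
			mode |= TOY_S_ISGID;
	} else {
		inode->gid = toy_id_from_mnt(map, fsgid);
	}
	inode->mode = mode;
}
```

Passing toy_init_map stores the caller's ids unchanged, which is how the
unmapped helper can be layered on top of the mapped one.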

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/inode.c         | 53 ++++++++++++++++++++++++++++++++++++----------
 include/linux/fs.h |  4 ++++
 2 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9d78c37b00b8..22de3cb3b1f4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2130,15 +2130,17 @@ void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
 EXPORT_SYMBOL(init_special_inode);
 
 /**
- * inode_init_owner - Init uid,gid,mode for new inode according to posix standards
+ * mapped_inode_init_owner - Init uid,gid,mode for new inode according to posix
+ *                           standards on idmapped mounts
  * @inode: New inode
+ * @user_ns: User namespace the inode is accessed from
  * @dir: Directory inode
  * @mode: mode of the new inode
  */
-void inode_init_owner(struct inode *inode, const struct inode *dir,
-			umode_t mode)
+void mapped_inode_init_owner(struct inode *inode, struct user_namespace *user_ns,
+			 const struct inode *dir, umode_t mode)
 {
-	inode->i_uid = current_fsuid();
+	inode->i_uid = fsuid_into_mnt(user_ns);
 	if (dir && dir->i_mode & S_ISGID) {
 		inode->i_gid = dir->i_gid;
 
@@ -2146,34 +2148,63 @@ void inode_init_owner(struct inode *inode, const struct inode *dir,
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
 		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
-			 !in_group_p(inode->i_gid) &&
-			 !capable_wrt_inode_uidgid(dir, CAP_FSETID))
+			 !in_group_p(i_gid_into_mnt(user_ns, inode)) &&
+			 !capable_wrt_mapped_inode_uidgid(user_ns, dir, CAP_FSETID))
 			mode &= ~S_ISGID;
 	} else
-		inode->i_gid = current_fsgid();
+		inode->i_gid = fsgid_into_mnt(user_ns);
 	inode->i_mode = mode;
 }
+EXPORT_SYMBOL(mapped_inode_init_owner);
+
+/**
+ * inode_init_owner - Init uid,gid,mode for new inode according to posix standards
+ * @inode: New inode
+ * @dir: Directory inode
+ * @mode: mode of the new inode
+ */
+void inode_init_owner(struct inode *inode, const struct inode *dir,
+			umode_t mode)
+{
+	mapped_inode_init_owner(inode, &init_user_ns, dir, mode);
+}
 EXPORT_SYMBOL(inode_init_owner);
 
 /**
- * inode_owner_or_capable - check current task permissions to inode
+ * mapped_inode_owner_or_capable - check current task permissions to inode on idmapped mounts
+ * @user_ns: User namespace the inode is accessed from
  * @inode: inode being checked
  *
  * Return true if current either has CAP_FOWNER in a namespace with the
  * inode owner uid mapped, or owns the file.
  */
-bool inode_owner_or_capable(const struct inode *inode)
+bool mapped_inode_owner_or_capable(struct user_namespace *user_ns, const struct inode *inode)
 {
+	kuid_t i_uid;
 	struct user_namespace *ns;
 
-	if (uid_eq(current_fsuid(), inode->i_uid))
+	i_uid = i_uid_into_mnt(user_ns, inode);
+	if (uid_eq(current_fsuid(), i_uid))
 		return true;
 
 	ns = current_user_ns();
-	if (kuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER))
+	if (kuid_has_mapping(ns, i_uid) && ns_capable(ns, CAP_FOWNER))
 		return true;
 	return false;
 }
+EXPORT_SYMBOL(mapped_inode_owner_or_capable);
+
+/**
+ * inode_owner_or_capable - check current task permissions to inode
+ * @inode: inode being checked
+ *
+ * Return true if current either has CAP_FOWNER in a namespace with the
+ * inode owner uid mapped, or owns the file.
+ */
+bool inode_owner_or_capable(const struct inode *inode)
+{
+	return mapped_inode_owner_or_capable(&init_user_ns, inode);
+}
 EXPORT_SYMBOL(inode_owner_or_capable);
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 750ca4b3d89f..f9e2d292b7b6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1777,6 +1777,8 @@ static inline int sb_start_intwrite_trylock(struct super_block *sb)
 
 
 extern bool inode_owner_or_capable(const struct inode *inode);
+extern bool mapped_inode_owner_or_capable(struct user_namespace *ns,
+				      const struct inode *inode);
 
 /*
  * VFS helper functions..
@@ -1820,6 +1822,8 @@ extern long compat_ptr_ioctl(struct file *file, unsigned int cmd,
  */
 extern void inode_init_owner(struct inode *inode, const struct inode *dir,
 			umode_t mode);
+extern void mapped_inode_init_owner(struct inode *inode, struct user_namespace *user_ns,
+			 const struct inode *dir, umode_t mode);
 extern bool may_open_dev(const struct path *path);
 
 /*
-- 
2.29.0



* [PATCH 10/34] attr: handle idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (8 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 09/34] inode: add idmapped mount aware init and " Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 11/34] acl: " Christian Brauner
                   ` (27 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

When file attributes are changed, filesystems mostly rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Add the
setattr_mapped_prepare(), setattr_mapped_copy(), and
notify_mapped_change() helpers to handle idmapped mounts. If the inode
is accessed through an idmapped mount we need to map it according to the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. Helpers that perform checks on the ia_uid and ia_gid
fields in struct iattr assume that ia_uid and ia_gid are intended values
and so they won't be mapped according to the mount's user namespace. This
is more transparent to the caller and further aligns the permission
checks for notify_change() and notify_mapped_change().

If the initial user namespace is passed all operations are a nop so
non-idmapped mounts will not see a change in behavior and will also not
see any performance impact. It also means that the
setattr_prepare(), setattr_copy(), and notify_change() helpers can
simply be implemented on top of setattr_mapped_prepare(),
setattr_mapped_copy(), and notify_mapped_change() by passing in the
initial user namespace.
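
To make the asymmetry concrete (the inode's owner is mapped, the requested
ia_uid is taken as given), here is a userspace toy model of the ownership
part of a chown check. All names are hypothetical, the mapping is a single
range, has_cap_chown stands in for the capability check, and the branch for
inodes with an invalid uid is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_INVALID_ID UINT32_MAX

/* Hypothetical single-range mapping standing in for a mount's user namespace. */
struct toy_idmap {
	uint32_t disk_first;	/* first id as stored in the inode */
	uint32_t mnt_first;	/* first id as seen through the mount */
	uint32_t count;		/* length of the mapped range */
};

/* Identity mapping: plays the role of the initial user namespace. */
static const struct toy_idmap toy_init_map = { 0, 0, UINT32_MAX };

/* Map an id stored in the inode to the id seen through the mount. */
static uint32_t toy_id_into_mnt(const struct toy_idmap *map, uint32_t disk_id)
{
	if (disk_id < map->disk_first || disk_id - map->disk_first >= map->count)
		return TOY_INVALID_ID;
	return map->mnt_first + (disk_id - map->disk_first);
}

/*
 * The inode's uid is mapped through the mount before comparing. The
 * requested uid (new_uid, i.e. ia_uid) is compared as passed in.
 */
static bool toy_chown_ok(const struct toy_idmap *map, uint32_t inode_uid,
			 uint32_t caller_fsuid, uint32_t new_uid,
			 bool has_cap_chown)
{
	uint32_t mapped = toy_id_into_mnt(map, inode_uid);

	if (mapped != TOY_INVALID_ID &&
	    caller_fsuid == mapped && new_uid == mapped)
		return true;
	return has_cap_chown;
}
```

With a toy mapping of { 0, 10000, 1000 } and an inode stored with uid 0, an
unprivileged caller with fsuid 10000 may "chown" to 10000 but not to 0,
because the requested value is not mapped.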

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/attr.c          | 136 ++++++++++++++++++++++++++++++++++-----------
 include/linux/fs.h |   6 ++
 2 files changed, 110 insertions(+), 32 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index b4bbdbd4c8ca..f39c03ac85e0 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -18,34 +18,39 @@
 #include <linux/evm.h>
 #include <linux/ima.h>
 
-static bool chown_ok(const struct inode *inode, kuid_t uid)
+static bool chown_ok(struct user_namespace *user_ns,
+		     const struct inode *inode,
+		     kuid_t uid)
 {
-	if (uid_eq(current_fsuid(), inode->i_uid) &&
-	    uid_eq(uid, inode->i_uid))
+	kuid_t kuid = i_uid_into_mnt(user_ns, inode);
+	if (uid_eq(current_fsuid(), kuid) && uid_eq(uid, kuid))
 		return true;
-	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+	if (capable_wrt_mapped_inode_uidgid(user_ns, inode, CAP_CHOWN))
 		return true;
-	if (uid_eq(inode->i_uid, INVALID_UID) &&
+	if (uid_eq(kuid, INVALID_UID) &&
 	    ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
 		return true;
 	return false;
 }
 
-static bool chgrp_ok(const struct inode *inode, kgid_t gid)
+static bool chgrp_ok(struct user_namespace *user_ns,
+		     const struct inode *inode, kgid_t gid)
 {
-	if (uid_eq(current_fsuid(), inode->i_uid) &&
-	    (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
+	kgid_t kgid = i_gid_into_mnt(user_ns, inode);
+	if (uid_eq(current_fsuid(), i_uid_into_mnt(user_ns, inode)) &&
+	    (in_group_p(gid) || gid_eq(gid, kgid)))
 		return true;
-	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+	if (capable_wrt_mapped_inode_uidgid(user_ns, inode, CAP_CHOWN))
 		return true;
-	if (gid_eq(inode->i_gid, INVALID_GID) &&
+	if (gid_eq(kgid, INVALID_GID) &&
 	    ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
 		return true;
 	return false;
 }
 
 /**
- * setattr_prepare - check if attribute changes to a dentry are allowed
+ * setattr_mapped_prepare - check if attribute changes to a dentry are allowed
+ * @user_ns:	user namespace of the mount
  * @dentry:	dentry to check
  * @attr:	attributes to change
  *
@@ -58,7 +63,8 @@ static bool chgrp_ok(const struct inode *inode, kgid_t gid)
  * Should be called as the first thing in ->setattr implementations,
  * possibly after taking additional locks.
  */
-int setattr_prepare(struct dentry *dentry, struct iattr *attr)
+int setattr_mapped_prepare(struct user_namespace *user_ns,
+			   struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = d_inode(dentry);
 	unsigned int ia_valid = attr->ia_valid;
@@ -78,27 +84,27 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
 		goto kill_priv;
 
 	/* Make sure a caller can chown. */
-	if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
+	if ((ia_valid & ATTR_UID) && !chown_ok(user_ns, inode, attr->ia_uid))
 		return -EPERM;
 
 	/* Make sure caller can chgrp. */
-	if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
+	if ((ia_valid & ATTR_GID) && !chgrp_ok(user_ns, inode, attr->ia_gid))
 		return -EPERM;
 
 	/* Make sure a caller can chmod. */
 	if (ia_valid & ATTR_MODE) {
-		if (!inode_owner_or_capable(inode))
+		if (!mapped_inode_owner_or_capable(user_ns, inode))
 			return -EPERM;
 		/* Also check the setgid bit! */
-		if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
-				inode->i_gid) &&
-		    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
+		if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
+				i_gid_into_mnt(user_ns, inode)) &&
+		    !capable_wrt_mapped_inode_uidgid(user_ns, inode, CAP_FSETID))
 			attr->ia_mode &= ~S_ISGID;
 	}
 
 	/* Check for setting the inode time. */
 	if (ia_valid & (ATTR_MTIME_SET | ATTR_ATIME_SET | ATTR_TIMES_SET)) {
-		if (!inode_owner_or_capable(inode))
+		if (!mapped_inode_owner_or_capable(user_ns, inode))
 			return -EPERM;
 	}
 
@@ -114,6 +120,12 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
 
 	return 0;
 }
+EXPORT_SYMBOL(setattr_mapped_prepare);
+
+int setattr_prepare(struct dentry *dentry, struct iattr *attr)
+{
+	return setattr_mapped_prepare(&init_user_ns, dentry, attr);
+}
 EXPORT_SYMBOL(setattr_prepare);
 
 /**
@@ -161,21 +173,28 @@ int inode_newsize_ok(const struct inode *inode, loff_t offset)
 EXPORT_SYMBOL(inode_newsize_ok);
 
 /**
- * setattr_copy - copy simple metadata updates into the generic inode
+ * setattr_mapped_copy - copy simple metadata updates into the generic inode on idmapped mounts
+ * @user_ns:	the user namespace the inode is accessed from
  * @inode:	the inode to be updated
  * @attr:	the new attributes
  *
- * setattr_copy must be called with i_mutex held.
+ * setattr_mapped_copy must be called with i_mutex held.
  *
- * setattr_copy updates the inode's metadata with that specified
- * in attr. Noticeably missing is inode size update, which is more complex
+ * setattr_mapped_copy updates the inode's metadata with that specified
+ * in attr on idmapped mounts. If file ownership is changed setattr_mapped_copy
+ * doesn't map ia_uid and ia_gid. It will assume the caller has already
+ * provided the intended values. Necessary permission checks to determine
+ * whether or not the S_ISGID property needs to be removed are performed with
+ * the correct idmapped mount permission helpers.
+ * Noticeably missing is inode size update, which is more complex
  * as it requires pagecache updates.
  *
  * The inode is not marked as dirty after this operation. The rationale is
  * that for "simple" filesystems, the struct inode is the inode storage.
  * The caller is free to mark the inode dirty afterwards if needed.
  */
-void setattr_copy(struct inode *inode, const struct iattr *attr)
+void setattr_mapped_copy(struct user_namespace *user_ns, struct inode *inode,
+			 const struct iattr *attr)
 {
 	unsigned int ia_valid = attr->ia_valid;
 
@@ -191,36 +210,62 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
 		inode->i_ctime = attr->ia_ctime;
 	if (ia_valid & ATTR_MODE) {
 		umode_t mode = attr->ia_mode;
-
-		if (!in_group_p(inode->i_gid) &&
-		    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
+		kgid_t kgid = i_gid_into_mnt(user_ns, inode);
+		if (!in_group_p(kgid) &&
+		    !capable_wrt_mapped_inode_uidgid(user_ns, inode, CAP_FSETID))
 			mode &= ~S_ISGID;
 		inode->i_mode = mode;
 	}
 }
+EXPORT_SYMBOL(setattr_mapped_copy);
+
+/**
+ * setattr_copy - copy simple metadata updates into the generic inode
+ * @inode:	the inode to be updated
+ * @attr:	the new attributes
+ *
+ * setattr_copy must be called with i_mutex held.
+ *
+ * setattr_copy updates the inode's metadata with that specified
+ * in attr. Noticeably missing is inode size update, which is more complex
+ * as it requires pagecache updates.
+ *
+ * The inode is not marked as dirty after this operation. The rationale is
+ * that for "simple" filesystems, the struct inode is the inode storage.
+ * The caller is free to mark the inode dirty afterwards if needed.
+ */
+void setattr_copy(struct inode *inode, const struct iattr *attr)
+{
+	setattr_mapped_copy(&init_user_ns, inode, attr);
+}
 EXPORT_SYMBOL(setattr_copy);
 
 /**
- * notify_change - modify attributes of a filesytem object
+ * notify_mapped_change - modify attributes of a filesystem object on idmapped mounts
+ * @user_ns:	the user namespace of the mount
  * @dentry:	object affected
  * @attr:	new attributes
  * @delegated_inode: returns inode, if the inode is delegated
  *
  * The caller must hold the i_mutex on the affected object.
  *
- * If notify_change discovers a delegation in need of breaking,
+ * If notify_mapped_change discovers a delegation in need of breaking,
  * it will return -EWOULDBLOCK and return a reference to the inode in
  * delegated_inode.  The caller should then break the delegation and
  * retry.  Because breaking a delegation may take a long time, the
  * caller should drop the i_mutex before doing so.
  *
+ * If file ownership is changed notify_mapped_change() doesn't map ia_uid and
+ * ia_gid. It will assume the caller has already provided the intended values.
+ *
  * Alternatively, a caller may pass NULL for delegated_inode.  This may
  * be appropriate for callers that expect the underlying filesystem not
  * to be NFS exported.  Also, passing NULL is fine for callers holding
  * the file open for write, as there can be no conflicting delegation in
  * that case.
  */
-int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **delegated_inode)
+int notify_mapped_change(struct user_namespace *user_ns, struct dentry *dentry,
+			 struct iattr *attr, struct inode **delegated_inode)
 {
 	struct inode *inode = dentry->d_inode;
 	umode_t mode = inode->i_mode;
@@ -243,8 +288,8 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 		if (IS_IMMUTABLE(inode))
 			return -EPERM;
 
-		if (!inode_owner_or_capable(inode)) {
-			error = inode_permission(inode, MAY_WRITE);
+		if (!mapped_inode_owner_or_capable(user_ns, inode)) {
+			error = mapped_inode_permission(user_ns, inode, MAY_WRITE);
 			if (error)
 				return error;
 		}
@@ -345,4 +390,31 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 
 	return error;
 }
+EXPORT_SYMBOL(notify_mapped_change);
+
+/**
+ * notify_change - modify attributes of a filesystem object
+ * @dentry:	object affected
+ * @attr:	new attributes
+ * @delegated_inode: returns inode, if the inode is delegated
+ *
+ * The caller must hold the i_mutex on the affected object.
+ *
+ * If notify_change discovers a delegation in need of breaking,
+ * it will return -EWOULDBLOCK and return a reference to the inode in
+ * delegated_inode.  The caller should then break the delegation and
+ * retry.  Because breaking a delegation may take a long time, the
+ * caller should drop the i_mutex before doing so.
+ *
+ * Alternatively, a caller may pass NULL for delegated_inode.  This may
+ * be appropriate for callers that expect the underlying filesystem not
+ * to be NFS exported.  Also, passing NULL is fine for callers holding
+ * the file open for write, as there can be no conflicting delegation in
+ * that case.
+ */
+int notify_change(struct dentry *dentry, struct iattr *attr,
+		  struct inode **delegated_inode)
+{
+	return notify_mapped_change(&init_user_ns, dentry, attr, delegated_inode);
+}
 EXPORT_SYMBOL(notify_change);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f9e2d292b7b6..f41d93b0e6d7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2823,6 +2823,8 @@ static inline int bmap(struct inode *inode,  sector_t *block)
 #endif
 
 extern int notify_change(struct dentry *, struct iattr *, struct inode **);
+extern int notify_mapped_change(struct user_namespace *, struct dentry *,
+			    struct iattr *, struct inode **);
 extern int inode_permission(struct inode *, int);
 extern int mapped_inode_permission(struct user_namespace *, struct inode *, int);
 extern int generic_permission(struct inode *, int);
@@ -3282,8 +3284,12 @@ extern int buffer_migrate_page_norefs(struct address_space *,
 #endif
 
 extern int setattr_prepare(struct dentry *, struct iattr *);
+extern int setattr_mapped_prepare(struct user_namespace *, struct dentry *,
+			      struct iattr *);
 extern int inode_newsize_ok(const struct inode *, loff_t offset);
 extern void setattr_copy(struct inode *inode, const struct iattr *attr);
+extern void setattr_mapped_copy(struct user_namespace *user_ns, struct inode *inode,
+			    const struct iattr *attr);
 
 extern int file_update_time(struct file *file);
 
-- 
2.29.0



* [PATCH 11/34] acl: handle idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (9 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 10/34] attr: handle idmapped mounts Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 12/34] xattr: " Christian Brauner
                   ` (26 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

The posix acl permission checking helpers determine whether a caller is
privileged over an inode according to the acls associated with the
inode. Add helpers that make it possible to handle acls on idmapped
mounts.

The vfs and the filesystems targeted by this first iteration make use of
posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
translate basic posix access and default permissions such as the
ACL_USER and ACL_GROUP type according to the initial user namespace (or
the superblock's user namespace) to and from the caller's current user
namespace. Adapt these two helpers to handle idmapped mounts whereby we
either shift from or into the mount's user namespace depending on the
direction in which we're translating.
Similarly, cap_convert_nscap() is used by the vfs to translate user
namespace and non-user namespace aware filesystem capabilities from the
superblock's user namespace to the caller's user namespace. Enable it to
handle idmapped mounts by accounting for the mount's user namespace.

In addition the filesystems targeted in the first iteration of this patch
series make use of the posix_acl_chmod() and posix_acl_update_mode()
helpers. Both helpers perform permission checks on the target inode. Add
two new helpers posix_mapped_acl_chmod() and
posix_mapped_acl_update_mode() to handle idmapped mounts. These two
helpers are called when acls are set by the respective filesystems. To
handle this case we add a new ->set_mapped() method to struct
xattr_handler which passes the mount's user namespace down.

To this end the standard posix access and default attribute handlers
posix_acl_access_xattr_handler and posix_acl_default_xattr_handler gain
a new posix_acl_xattr_set_mapped() callback which serves as the
implementation of the newly added ->set_mapped() method in struct
xattr_handler. This callback maps the inode according to the mount's
user namespace but otherwise performs the same checks as its
non-idmapped aware counterpart.

If the initial user namespace is passed to any of the new helpers the
permission checking is identical to their non-idmapped aware
counterparts without any performance impact. This means that the
non-idmapped aware helpers can simply be implemented on top of their
idmapped-mount aware counterparts by passing the initial user namespace
without any change in behavior or performance.
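
The direction-dependent translation of ACL_USER and ACL_GROUP entries can
be sketched with a userspace toy model. The names below (toy_acl_entry,
toy_acl_fix, the TOY_ACL_* tags) are hypothetical, the mapping is a single
range, and the additional translation between the caller's and the initial
user namespace is assumed to be the identity and left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INVALID_ID UINT32_MAX

/* Toy tags; only USER and GROUP entries carry an id that needs mapping. */
enum toy_acl_tag { TOY_ACL_USER_OBJ, TOY_ACL_USER, TOY_ACL_GROUP, TOY_ACL_OTHER };

struct toy_acl_entry {
	enum toy_acl_tag tag;
	uint32_t id;
};

/* Hypothetical single-range mapping standing in for a mount's user namespace. */
struct toy_idmap {
	uint32_t disk_first;	/* first id as stored on the filesystem */
	uint32_t mnt_first;	/* first id as seen through the mount */
	uint32_t count;		/* length of the mapped range */
};

/* Stored id -> id seen through the mount (used when reading an acl). */
static uint32_t toy_id_into_mnt(const struct toy_idmap *map, uint32_t disk_id)
{
	if (disk_id < map->disk_first || disk_id - map->disk_first >= map->count)
		return TOY_INVALID_ID;
	return map->mnt_first + (disk_id - map->disk_first);
}

/* Id seen through the mount -> stored id (used when writing an acl). */
static uint32_t toy_id_from_mnt(const struct toy_idmap *map, uint32_t mnt_id)
{
	if (mnt_id < map->mnt_first || mnt_id - map->mnt_first >= map->count)
		return TOY_INVALID_ID;
	return map->disk_first + (mnt_id - map->mnt_first);
}

/* Translate all id-carrying entries in place; from_user selects the direction. */
static void toy_acl_fix(const struct toy_idmap *map,
			struct toy_acl_entry *entries, size_t n, bool from_user)
{
	for (size_t i = 0; i < n; i++) {
		switch (entries[i].tag) {
		case TOY_ACL_USER:
		case TOY_ACL_GROUP:
			entries[i].id = from_user ?
				toy_id_from_mnt(map, entries[i].id) :
				toy_id_into_mnt(map, entries[i].id);
			break;
		default:
			break;
		}
	}
}
```

Translating an acl with from_user set and then again with it cleared
returns the original ids for every entry inside the mapped range, which
mirrors the set/get round trip described above.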

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/posix_acl.c                  | 110 ++++++++++++++++++++++++++------
 fs/xattr.c                      |   6 +-
 include/linux/capability.h      |   3 +-
 include/linux/posix_acl.h       |  10 +++
 include/linux/posix_acl_xattr.h |  12 ++--
 include/linux/xattr.h           |   6 ++
 security/commoncap.c            |  15 +++--
 7 files changed, 128 insertions(+), 34 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index f15b6ad35ec3..665eb7921e1c 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -558,7 +558,7 @@ __posix_acl_chmod(struct posix_acl **acl, gfp_t gfp, umode_t mode)
 EXPORT_SYMBOL(__posix_acl_chmod);
 
 int
-posix_acl_chmod(struct inode *inode, umode_t mode)
+posix_mapped_acl_chmod(struct user_namespace *user_ns, struct inode *inode, umode_t mode)
 {
 	struct posix_acl *acl;
 	int ret = 0;
@@ -582,6 +582,12 @@ posix_acl_chmod(struct inode *inode, umode_t mode)
 	posix_acl_release(acl);
 	return ret;
 }
+
+int
+posix_acl_chmod(struct inode *inode, umode_t mode)
+{
+	return posix_mapped_acl_chmod(&init_user_ns, inode, mode);
+}
 EXPORT_SYMBOL(posix_acl_chmod);
 
 int
@@ -636,7 +642,8 @@ posix_acl_create(struct inode *dir, umode_t *mode,
 EXPORT_SYMBOL_GPL(posix_acl_create);
 
 /**
- * posix_acl_update_mode  -  update mode in set_acl
+ * posix_mapped_acl_update_mode  -  update mode in set_acl
+ * @user_ns: user namespace the inode is accessed from
  * @inode: target inode
  * @mode_p: mode (pointer) for update
  * @acl: acl pointer
@@ -650,8 +657,9 @@ EXPORT_SYMBOL_GPL(posix_acl_create);
  *
  * Called from set_acl inode operations.
  */
-int posix_acl_update_mode(struct inode *inode, umode_t *mode_p,
-			  struct posix_acl **acl)
+int posix_mapped_acl_update_mode(struct user_namespace *user_ns,
+			     struct inode *inode, umode_t *mode_p,
+			     struct posix_acl **acl)
 {
 	umode_t mode = inode->i_mode;
 	int error;
@@ -661,12 +669,34 @@ int posix_acl_update_mode(struct inode *inode, umode_t *mode_p,
 		return error;
 	if (error == 0)
 		*acl = NULL;
-	if (!in_group_p(inode->i_gid) &&
-	    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
+	if (!in_group_p(i_gid_into_mnt(user_ns, inode)) &&
+	    !capable_wrt_mapped_inode_uidgid(user_ns, inode, CAP_FSETID))
 		mode &= ~S_ISGID;
 	*mode_p = mode;
 	return 0;
 }
+EXPORT_SYMBOL(posix_mapped_acl_update_mode);
+
+/**
+ * posix_acl_update_mode  -  update mode in set_acl
+ * @inode: target inode
+ * @mode_p: mode (pointer) for update
+ * @acl: acl pointer
+ *
+ * Update the file mode when setting an ACL: compute the new file permission
+ * bits based on the ACL.  In addition, if the ACL is equivalent to the new
+ * file mode, set *@acl to NULL to indicate that no ACL should be set.
+ *
+ * As with chmod, clear the setgid bit if the caller is not in the owning group
+ * or capable of CAP_FSETID (see inode_change_ok).
+ *
+ * Called from set_acl inode operations.
+ */
+int posix_acl_update_mode(struct inode *inode, umode_t *mode_p,
+			  struct posix_acl **acl)
+{
+	return posix_mapped_acl_update_mode(&init_user_ns, inode, mode_p, acl);
+}
 EXPORT_SYMBOL(posix_acl_update_mode);
 
 /*
@@ -674,7 +704,8 @@ EXPORT_SYMBOL(posix_acl_update_mode);
  */
 static void posix_acl_fix_xattr_userns(
 	struct user_namespace *to, struct user_namespace *from,
-	void *value, size_t size)
+	struct user_namespace *mnt_user_ns,
+	void *value, size_t size, bool from_user)
 {
 	struct posix_acl_xattr_header *header = value;
 	struct posix_acl_xattr_entry *entry = (void *)(header + 1), *end;
@@ -699,10 +730,18 @@ static void posix_acl_fix_xattr_userns(
 		switch(le16_to_cpu(entry->e_tag)) {
 		case ACL_USER:
 			uid = make_kuid(from, le32_to_cpu(entry->e_id));
+			if (from_user)
+				uid = kuid_from_mnt(mnt_user_ns, uid);
+			else
+				uid = kuid_into_mnt(mnt_user_ns, uid);
 			entry->e_id = cpu_to_le32(from_kuid(to, uid));
 			break;
 		case ACL_GROUP:
 			gid = make_kgid(from, le32_to_cpu(entry->e_id));
+			if (from_user)
+				gid = kgid_from_mnt(mnt_user_ns, gid);
+			else
+				gid = kgid_into_mnt(mnt_user_ns, gid);
 			entry->e_id = cpu_to_le32(from_kgid(to, gid));
 			break;
 		default:
@@ -711,21 +750,25 @@ static void posix_acl_fix_xattr_userns(
 	}
 }
 
-void posix_acl_fix_xattr_from_user(void *value, size_t size)
+void posix_acl_fix_xattr_from_user(struct user_namespace *mnt_user_ns,
+				   void *value, size_t size)
 {
 	struct user_namespace *user_ns = current_user_ns();
-	if (user_ns == &init_user_ns)
+	if ((user_ns == &init_user_ns) && (mnt_user_ns == &init_user_ns))
 		return;
-	posix_acl_fix_xattr_userns(&init_user_ns, user_ns, value, size);
+	posix_acl_fix_xattr_userns(&init_user_ns, user_ns, mnt_user_ns, value, size, true);
 }
+EXPORT_SYMBOL(posix_acl_fix_xattr_from_user);
 
-void posix_acl_fix_xattr_to_user(void *value, size_t size)
+void posix_acl_fix_xattr_to_user(struct user_namespace *mnt_user_ns,
+				 void *value, size_t size)
 {
 	struct user_namespace *user_ns = current_user_ns();
-	if (user_ns == &init_user_ns)
+	if ((user_ns == &init_user_ns) && (mnt_user_ns == &init_user_ns))
 		return;
-	posix_acl_fix_xattr_userns(user_ns, &init_user_ns, value, size);
+	posix_acl_fix_xattr_userns(user_ns, &init_user_ns, mnt_user_ns, value, size, false);
 }
+EXPORT_SYMBOL(posix_acl_fix_xattr_to_user);
 
 /*
  * Convert from extended attribute to in-memory representation.
@@ -863,8 +906,9 @@ posix_acl_xattr_get(const struct xattr_handler *handler,
 	return error;
 }
 
-int
-set_posix_acl(struct inode *inode, int type, struct posix_acl *acl)
+static int
+set_posix_mapped_acl(struct user_namespace *user_ns, struct inode *inode,
+		 int type, struct posix_acl *acl)
 {
 	if (!IS_POSIXACL(inode))
 		return -EOPNOTSUPP;
@@ -873,7 +917,7 @@ set_posix_acl(struct inode *inode, int type, struct posix_acl *acl)
 
 	if (type == ACL_TYPE_DEFAULT && !S_ISDIR(inode->i_mode))
 		return acl ? -EACCES : 0;
-	if (!inode_owner_or_capable(inode))
+	if (!mapped_inode_owner_or_capable(user_ns, inode))
 		return -EPERM;
 
 	if (acl) {
@@ -883,13 +927,21 @@ set_posix_acl(struct inode *inode, int type, struct posix_acl *acl)
 	}
 	return inode->i_op->set_acl(inode, acl, type);
 }
+
+int
+set_posix_acl(struct inode *inode, int type, struct posix_acl *acl)
+{
+
+	return set_posix_mapped_acl(&init_user_ns, inode, type, acl);
+}
 EXPORT_SYMBOL(set_posix_acl);
 
 static int
-posix_acl_xattr_set(const struct xattr_handler *handler,
-		    struct dentry *unused, struct inode *inode,
-		    const char *name, const void *value,
-		    size_t size, int flags)
+posix_acl_xattr_set_mapped(const struct xattr_handler *handler,
+			   struct user_namespace *user_ns,
+			   struct dentry *unused, struct inode *inode,
+			   const char *name, const void *value, size_t size,
+			   int flags)
 {
 	struct posix_acl *acl = NULL;
 	int ret;
@@ -899,11 +951,21 @@ posix_acl_xattr_set(const struct xattr_handler *handler,
 		if (IS_ERR(acl))
 			return PTR_ERR(acl);
 	}
-	ret = set_posix_acl(inode, handler->flags, acl);
+	ret = set_posix_mapped_acl(user_ns, inode, handler->flags, acl);
 	posix_acl_release(acl);
 	return ret;
 }
 
+static int
+posix_acl_xattr_set(const struct xattr_handler *handler,
+		    struct dentry *unused, struct inode *inode,
+		    const char *name, const void *value,
+		    size_t size, int flags)
+{
+	return posix_acl_xattr_set_mapped(handler, &init_user_ns, unused, inode,
+					  name, value, size, flags);
+}
+
 static bool
 posix_acl_xattr_list(struct dentry *dentry)
 {
@@ -916,6 +978,9 @@ const struct xattr_handler posix_acl_access_xattr_handler = {
 	.list = posix_acl_xattr_list,
 	.get = posix_acl_xattr_get,
 	.set = posix_acl_xattr_set,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.set_mapped = posix_acl_xattr_set_mapped,
+#endif
 };
 EXPORT_SYMBOL_GPL(posix_acl_access_xattr_handler);
 
@@ -925,6 +990,9 @@ const struct xattr_handler posix_acl_default_xattr_handler = {
 	.list = posix_acl_xattr_list,
 	.get = posix_acl_xattr_get,
 	.set = posix_acl_xattr_set,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.set_mapped = posix_acl_xattr_set_mapped,
+#endif
 };
 EXPORT_SYMBOL_GPL(posix_acl_default_xattr_handler);
 
diff --git a/fs/xattr.c b/fs/xattr.c
index cd7a563e8bcd..96ff53b42251 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -536,9 +536,9 @@ setxattr(struct dentry *d, const char __user *name, const void __user *value,
 		}
 		if ((strcmp(kname, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
 		    (strcmp(kname, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
-			posix_acl_fix_xattr_from_user(kvalue, size);
+			posix_acl_fix_xattr_from_user(&init_user_ns, kvalue, size);
 		else if (strcmp(kname, XATTR_NAME_CAPS) == 0) {
-			error = cap_convert_nscap(d, &kvalue, size);
+			error = cap_convert_nscap(&init_user_ns, d, &kvalue, size);
 			if (error < 0)
 				goto out;
 			size = error;
@@ -636,7 +636,7 @@ getxattr(struct dentry *d, const char __user *name, void __user *value,
 	if (error > 0) {
 		if ((strcmp(kname, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
 		    (strcmp(kname, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
-			posix_acl_fix_xattr_to_user(kvalue, error);
+			posix_acl_fix_xattr_to_user(&init_user_ns, kvalue, error);
 		if (size && copy_to_user(value, kvalue, error))
 			error = -EFAULT;
 	} else if (error == -ERANGE && size >= XATTR_SIZE_MAX) {
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 308d88096745..7e9277d64024 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -274,6 +274,7 @@ static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
-extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size);
+extern int cap_convert_nscap(struct user_namespace *user_ns,
+			     struct dentry *dentry, void **ivalue, size_t size);
 
 #endif /* !_LINUX_CAPABILITY_H */
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 8276baefed13..1ab19ded39fd 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -75,9 +75,13 @@ extern int set_posix_acl(struct inode *, int, struct posix_acl *);
 
 #ifdef CONFIG_FS_POSIX_ACL
 extern int posix_acl_chmod(struct inode *, umode_t);
+extern int posix_mapped_acl_chmod(struct user_namespace *, struct inode *, umode_t);
 extern int posix_acl_create(struct inode *, umode_t *, struct posix_acl **,
 		struct posix_acl **);
 extern int posix_acl_update_mode(struct inode *, umode_t *, struct posix_acl **);
+extern int posix_mapped_acl_update_mode(struct user_namespace *user_ns,
+				    struct inode *inode, umode_t *mode_p,
+				    struct posix_acl **acl);
 
 extern int simple_set_acl(struct inode *, struct posix_acl *, int);
 extern int simple_acl_create(struct inode *, struct inode *);
@@ -99,6 +103,12 @@ static inline int posix_acl_chmod(struct inode *inode, umode_t mode)
 	return 0;
 }
 
+static inline int posix_mapped_acl_chmod(struct user_namespace *user_ns,
+				     struct inode *inode, umode_t mode)
+{
+	return 0;
+}
+
 #define simple_set_acl		NULL
 
 static inline int simple_acl_create(struct inode *dir, struct inode *inode)
diff --git a/include/linux/posix_acl_xattr.h b/include/linux/posix_acl_xattr.h
index 2387709991b5..9fdac573e1cb 100644
--- a/include/linux/posix_acl_xattr.h
+++ b/include/linux/posix_acl_xattr.h
@@ -33,13 +33,17 @@ posix_acl_xattr_count(size_t size)
 }
 
 #ifdef CONFIG_FS_POSIX_ACL
-void posix_acl_fix_xattr_from_user(void *value, size_t size);
-void posix_acl_fix_xattr_to_user(void *value, size_t size);
+void posix_acl_fix_xattr_from_user(struct user_namespace *mnt_user_ns,
+				   void *value, size_t size);
+void posix_acl_fix_xattr_to_user(struct user_namespace *mnt_user_ns,
+				 void *value, size_t size);
 #else
-static inline void posix_acl_fix_xattr_from_user(void *value, size_t size)
+static inline void posix_acl_fix_xattr_from_user(struct user_namespace *mnt_user_ns,
+						 void *value, size_t size)
 {
 }
-static inline void posix_acl_fix_xattr_to_user(void *value, size_t size)
+static inline void posix_acl_fix_xattr_to_user(struct user_namespace *mnt_user_ns,
+					       void *value, size_t size)
 {
 }
 #endif
diff --git a/include/linux/xattr.h b/include/linux/xattr.h
index 10b4dc2709f0..908441e74f51 100644
--- a/include/linux/xattr.h
+++ b/include/linux/xattr.h
@@ -37,6 +37,12 @@ struct xattr_handler {
 	int (*set)(const struct xattr_handler *, struct dentry *dentry,
 		   struct inode *inode, const char *name, const void *buffer,
 		   size_t size, int flags);
+#ifdef CONFIG_IDMAP_MOUNTS
+	int (*set_mapped)(const struct xattr_handler *,
+			  struct user_namespace *user_ns, struct dentry *dentry,
+			  struct inode *inode, const char *name,
+			  const void *buffer, size_t size, int flags);
+#endif
 };
 
 const char *xattr_full_name(const struct xattr_handler *, const char *);
diff --git a/security/commoncap.c b/security/commoncap.c
index 59bf3c1674c8..21f2ff7c346b 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -451,15 +451,18 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
 }
 
 static kuid_t rootid_from_xattr(const void *value, size_t size,
-				struct user_namespace *task_ns)
+				struct user_namespace *task_ns,
+				struct user_namespace *user_ns)
 {
 	const struct vfs_ns_cap_data *nscap = value;
+	kuid_t rootkid;
 	uid_t rootid = 0;
 
 	if (size == XATTR_CAPS_SZ_3)
 		rootid = le32_to_cpu(nscap->rootid);
 
-	return make_kuid(task_ns, rootid);
+	rootkid = make_kuid(task_ns, rootid);
+	return kuid_from_mnt(user_ns, rootkid);
 }
 
 static bool validheader(size_t size, const struct vfs_cap_data *cap)
@@ -473,7 +476,8 @@ static bool validheader(size_t size, const struct vfs_cap_data *cap)
  *
  * If all is ok, we return the new size, on error return < 0.
  */
-int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size)
+int cap_convert_nscap(struct user_namespace *user_ns, struct dentry *dentry,
+		      void **ivalue, size_t size)
 {
 	struct vfs_ns_cap_data *nscap;
 	uid_t nsrootid;
@@ -489,14 +493,14 @@ int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size)
 		return -EINVAL;
 	if (!validheader(size, cap))
 		return -EINVAL;
-	if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
+	if (!capable_wrt_mapped_inode_uidgid(user_ns, inode, CAP_SETFCAP))
 		return -EPERM;
 	if (size == XATTR_CAPS_SZ_2)
 		if (ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
 			/* user is privileged, just write the v2 */
 			return size;
 
-	rootid = rootid_from_xattr(*ivalue, size, task_ns);
+	rootid = rootid_from_xattr(*ivalue, size, task_ns, user_ns);
 	if (!uid_valid(rootid))
 		return -EINVAL;
 
@@ -520,6 +524,7 @@ int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size)
 	*ivalue = nscap;
 	return newsize;
 }
+EXPORT_SYMBOL(cap_convert_nscap);
 
 /*
  * Calculate the new process capability sets from the capability sets attached
-- 
2.29.0


* [PATCH 12/34] xattr: handle idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (10 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 11/34] acl: " Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 13/34] selftests: add idmapped mounts xattr selftest Christian Brauner
                   ` (25 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Tycho Andersen,
	Christian Brauner

From: Tycho Andersen <tycho@tycho.pizza>

When interacting with extended attributes the vfs verifies that the
caller is privileged over the inode with which the extended attribute is
associated. Add helpers to handle extended attributes on idmapped
mounts. If the inode is accessed through an idmapped mount we need to
map it according to the mount's user namespace. Afterwards the checks
are identical to non-idmapped mounts.

This patch adds helpers to get, set, and remove extended attributes on
idmapped mounts. The four helpers vfs_mapped_getxattr(),
vfs_mapped_setxattr(), __vfs_mapped_removexattr(), and
vfs_mapped_removexattr() are used either directly by the vfs (e.g.
vfs_mapped_getxattr_alloc()) or by the filesystems targeted in this
first iteration.

If the initial user namespace is passed, the mapping is a no-op, so
non-idmapped mounts will not see a change in behavior and will also not
see any performance impact. It also means that the non-idmapped-mount
aware helpers can be implemented on top of their idmapped-mount aware
counterparts by passing the initial user namespace.
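
As a rough userspace sketch of this layering (not the kernel code
itself): the legacy entry point is a thin wrapper that passes an
identity namespace, and the dispatch helper prefers a namespace-aware
callback when the handler provides one. All toy_* and demo_* names are
hypothetical, and the "shift" field is an assumption standing in for a
real ID mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace. */
struct toy_ns {
	unsigned int shift;	/* 0 means identity mapping */
};

/* Plays the role of &init_user_ns: the identity mapping. */
struct toy_ns toy_init_ns = { .shift = 0 };

struct toy_handler {
	/* Legacy callback, unaware of idmapped mounts. */
	int (*set)(const char *name, unsigned int owner);
	/* Optional callback that also receives the mount's namespace. */
	int (*set_mapped)(const struct toy_ns *ns, const char *name,
			  unsigned int owner);
};

/* Prefer ->set_mapped when present, otherwise fall back to ->set. */
int toy_handler_set(const struct toy_handler *h, const struct toy_ns *ns,
		    const char *name, unsigned int owner)
{
	assert(h != NULL && ns != NULL);
	if (h->set_mapped)
		return h->set_mapped(ns, name, owner);
	if (!h->set)
		return -1;
	return h->set(name, owner);
}

/* Namespace-aware entry point. */
int toy_mapped_setxattr(const struct toy_handler *h, const struct toy_ns *ns,
			const char *name, unsigned int owner)
{
	return toy_handler_set(h, ns, name, owner);
}

/* Legacy entry point: same code path, identity namespace. */
int toy_setxattr(const struct toy_handler *h, const char *name,
		 unsigned int owner)
{
	return toy_mapped_setxattr(h, &toy_init_ns, name, owner);
}

/* Demo callbacks: return the owner ID they would act on. */
int demo_legacy_set(const char *name, unsigned int owner)
{
	(void)name;
	return (int)owner;
}

int demo_mapped_set(const struct toy_ns *ns, const char *name,
		    unsigned int owner)
{
	(void)name;
	return (int)(owner + ns->shift);
}
```

Because the legacy wrapper only forwards the identity namespace, callers
of toy_setxattr() see the same result whether or not the handler
implements the namespace-aware callback.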

Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/xattr.c            | 252 +++++++++++++++++++++++++++++-------------
 include/linux/xattr.h |  23 ++++
 2 files changed, 196 insertions(+), 79 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 96ff53b42251..cdda2baeb9f7 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -83,7 +83,8 @@ xattr_resolve_name(struct inode *inode, const char **name)
  * because different namespaces have very different rules.
  */
 static int
-xattr_permission(struct inode *inode, const char *name, int mask)
+xattr_permission(struct user_namespace *user_ns, struct inode *inode,
+		 const char *name, int mask)
 {
 	/*
 	 * We can never set or remove an extended attribute on a read-only
@@ -127,11 +128,11 @@ xattr_permission(struct inode *inode, const char *name, int mask)
 		if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
 			return (mask & MAY_WRITE) ? -EPERM : -ENODATA;
 		if (S_ISDIR(inode->i_mode) && (inode->i_mode & S_ISVTX) &&
-		    (mask & MAY_WRITE) && !inode_owner_or_capable(inode))
+		    (mask & MAY_WRITE) && !mapped_inode_owner_or_capable(user_ns, inode))
 			return -EPERM;
 	}
 
-	return inode_permission(inode, mask);
+	return mapped_inode_permission(user_ns, inode, mask);
 }
 
 /*
@@ -161,9 +162,10 @@ xattr_supported_namespace(struct inode *inode, const char *prefix)
 }
 EXPORT_SYMBOL(xattr_supported_namespace);
 
-int
-__vfs_setxattr(struct dentry *dentry, struct inode *inode, const char *name,
-	       const void *value, size_t size, int flags)
+static int
+__vfs_mapped_setxattr(struct user_namespace *user_ns, struct dentry *dentry,
+		  struct inode *inode, const char *name, const void *value,
+		  size_t size, int flags)
 {
 	const struct xattr_handler *handler;
 
@@ -174,7 +176,14 @@ __vfs_setxattr(struct dentry *dentry, struct inode *inode, const char *name,
 		return -EOPNOTSUPP;
 	if (size == 0)
 		value = "";  /* empty EA, do not remove */
-	return handler->set(handler, dentry, inode, name, value, size, flags);
+	return xattr_handler_set(handler, user_ns, dentry, inode, name, value, size, flags);
+}
+
+int
+__vfs_setxattr(struct dentry *dentry, struct inode *inode, const char *name,
+	       const void *value, size_t size, int flags)
+{
+	return __vfs_mapped_setxattr(&init_user_ns, dentry, inode, name, value, size, flags);
 }
 EXPORT_SYMBOL(__vfs_setxattr);
 
@@ -182,6 +191,7 @@ EXPORT_SYMBOL(__vfs_setxattr);
  *  __vfs_setxattr_noperm - perform setxattr operation without performing
  *  permission checks.
  *
+ *  @user_ns - user namespace of the mount
  *  @dentry - object to perform setxattr on
  *  @name - xattr name to set
  *  @value - value to set @name to
@@ -194,8 +204,10 @@ EXPORT_SYMBOL(__vfs_setxattr);
  *  is executed. It also assumes that the caller will make the appropriate
  *  permission checks.
  */
-int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
-		const void *value, size_t size, int flags)
+static int
+__vfs_mapped_setxattr_noperm(struct user_namespace *user_ns,
+			     struct dentry *dentry, const char *name,
+			     const void *value, size_t size, int flags)
 {
 	struct inode *inode = dentry->d_inode;
 	int error = -EAGAIN;
@@ -205,7 +217,7 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
 	if (issec)
 		inode->i_flags &= ~S_NOSEC;
 	if (inode->i_opflags & IOP_XATTR) {
-		error = __vfs_setxattr(dentry, inode, name, value, size, flags);
+		error = __vfs_mapped_setxattr(user_ns, dentry, inode, name, value, size, flags);
 		if (!error) {
 			fsnotify_xattr(dentry);
 			security_inode_post_setxattr(dentry, name, value,
@@ -231,27 +243,23 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
 	return error;
 }
 
-/**
- * __vfs_setxattr_locked - set an extended attribute while holding the inode
- * lock
- *
- *  @dentry: object to perform setxattr on
- *  @name: xattr name to set
- *  @value: value to set @name to
- *  @size: size of @value
- *  @flags: flags to pass into filesystem operations
- *  @delegated_inode: on return, will contain an inode pointer that
- *  a delegation was broken on, NULL if none.
- */
-int
-__vfs_setxattr_locked(struct dentry *dentry, const char *name,
-		const void *value, size_t size, int flags,
-		struct inode **delegated_inode)
+int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
+		const void *value, size_t size, int flags)
+{
+	return __vfs_mapped_setxattr_noperm(&init_user_ns, dentry, name, value,
+					    size, flags);
+}
+
+static int
+__vfs_mapped_setxattr_locked(struct user_namespace *user_ns,
+			     struct dentry *dentry, const char *name,
+			     const void *value, size_t size, int flags,
+			     struct inode **delegated_inode)
 {
 	struct inode *inode = dentry->d_inode;
 	int error;
 
-	error = xattr_permission(inode, name, MAY_WRITE);
+	error = xattr_permission(user_ns, inode, name, MAY_WRITE);
 	if (error)
 		return error;
 
@@ -263,16 +271,37 @@ __vfs_setxattr_locked(struct dentry *dentry, const char *name,
 	if (error)
 		goto out;
 
-	error = __vfs_setxattr_noperm(dentry, name, value, size, flags);
+	error = __vfs_mapped_setxattr_noperm(user_ns, dentry, name, value, size, flags);
 
 out:
 	return error;
 }
+
+/**
+ * __vfs_setxattr_locked - set an extended attribute while holding the inode
+ * lock
+ *
+ *  @dentry: object to perform setxattr on
+ *  @name: xattr name to set
+ *  @value: value to set @name to
+ *  @size: size of @value
+ *  @flags: flags to pass into filesystem operations
+ *  @delegated_inode: on return, will contain an inode pointer that
+ *  a delegation was broken on, NULL if none.
+ */
+int
+__vfs_setxattr_locked(struct dentry *dentry, const char *name,
+		const void *value, size_t size, int flags,
+		struct inode **delegated_inode)
+{
+	return __vfs_mapped_setxattr_locked(&init_user_ns, dentry, name, value,
+					    size, flags, delegated_inode);
+}
 EXPORT_SYMBOL_GPL(__vfs_setxattr_locked);
 
 int
-vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
-		size_t size, int flags)
+vfs_mapped_setxattr(struct user_namespace *user_ns, struct dentry *dentry,
+		const char *name, const void *value, size_t size, int flags)
 {
 	struct inode *inode = dentry->d_inode;
 	struct inode *delegated_inode = NULL;
@@ -280,8 +309,8 @@ vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
 
 retry_deleg:
 	inode_lock(inode);
-	error = __vfs_setxattr_locked(dentry, name, value, size, flags,
-	    &delegated_inode);
+	error = __vfs_mapped_setxattr_locked(user_ns, dentry, name, value, size,
+					     flags, &delegated_inode);
 	inode_unlock(inode);
 
 	if (delegated_inode) {
@@ -291,6 +320,14 @@ vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
 	}
 	return error;
 }
+EXPORT_SYMBOL_GPL(vfs_mapped_setxattr);
+
+int
+vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
+		size_t size, int flags)
+{
+	return vfs_mapped_setxattr(&init_user_ns, dentry, name, value, size, flags);
+}
 EXPORT_SYMBOL_GPL(vfs_setxattr);
 
 static ssize_t
@@ -319,24 +356,17 @@ xattr_getsecurity(struct inode *inode, const char *name, void *value,
 	return len;
 }
 
-/*
- * vfs_getxattr_alloc - allocate memory, if necessary, before calling getxattr
- *
- * Allocate memory, if not already allocated, or re-allocate correct size,
- * before retrieving the extended attribute.
- *
- * Returns the result of alloc, if failed, or the getxattr operation.
- */
 ssize_t
-vfs_getxattr_alloc(struct dentry *dentry, const char *name, char **xattr_value,
-		   size_t xattr_size, gfp_t flags)
+vfs_mapped_getxattr_alloc(struct user_namespace *user_ns, struct dentry *dentry,
+		      const char *name, char **xattr_value, size_t xattr_size,
+		      gfp_t flags)
 {
 	const struct xattr_handler *handler;
 	struct inode *inode = dentry->d_inode;
 	char *value = *xattr_value;
 	int error;
 
-	error = xattr_permission(inode, name, MAY_READ);
+	error = xattr_permission(user_ns, inode, name, MAY_READ);
 	if (error)
 		return error;
 
@@ -361,6 +391,22 @@ vfs_getxattr_alloc(struct dentry *dentry, const char *name, char **xattr_value,
 	return error;
 }
 
+/*
+ * vfs_getxattr_alloc - allocate memory, if necessary, before calling getxattr
+ *
+ * Allocate memory, if not already allocated, or re-allocate correct size,
+ * before retrieving the extended attribute.
+ *
+ * Returns the result of alloc, if failed, or the getxattr operation.
+ */
+ssize_t
+vfs_getxattr_alloc(struct dentry *dentry, const char *name, char **xattr_value,
+		   size_t xattr_size, gfp_t flags)
+{
+	return vfs_mapped_getxattr_alloc(&init_user_ns, dentry, name, xattr_value,
+				     xattr_size, flags);
+}
+
 ssize_t
 __vfs_getxattr(struct dentry *dentry, struct inode *inode, const char *name,
 	       void *value, size_t size)
@@ -377,12 +423,13 @@ __vfs_getxattr(struct dentry *dentry, struct inode *inode, const char *name,
 EXPORT_SYMBOL(__vfs_getxattr);
 
 ssize_t
-vfs_getxattr(struct dentry *dentry, const char *name, void *value, size_t size)
+vfs_mapped_getxattr(struct user_namespace *user_ns, struct dentry *dentry,
+		const char *name, void *value, size_t size)
 {
 	struct inode *inode = dentry->d_inode;
 	int error;
 
-	error = xattr_permission(inode, name, MAY_READ);
+	error = xattr_permission(user_ns, inode, name, MAY_READ);
 	if (error)
 		return error;
 
@@ -405,6 +452,13 @@ vfs_getxattr(struct dentry *dentry, const char *name, void *value, size_t size)
 nolsm:
 	return __vfs_getxattr(dentry, inode, name, value, size);
 }
+EXPORT_SYMBOL_GPL(vfs_mapped_getxattr);
+
+ssize_t
+vfs_getxattr(struct dentry *dentry, const char *name, void *value, size_t size)
+{
+	return vfs_mapped_getxattr(&init_user_ns, dentry, name, value, size);
+}
 EXPORT_SYMBOL_GPL(vfs_getxattr);
 
 ssize_t
@@ -428,7 +482,7 @@ vfs_listxattr(struct dentry *dentry, char *list, size_t size)
 EXPORT_SYMBOL_GPL(vfs_listxattr);
 
 int
-__vfs_removexattr(struct dentry *dentry, const char *name)
+__vfs_mapped_removexattr(struct user_namespace *user_ns, struct dentry *dentry, const char *name)
 {
 	struct inode *inode = d_inode(dentry);
 	const struct xattr_handler *handler;
@@ -438,27 +492,26 @@ __vfs_removexattr(struct dentry *dentry, const char *name)
 		return PTR_ERR(handler);
 	if (!handler->set)
 		return -EOPNOTSUPP;
-	return handler->set(handler, dentry, inode, name, NULL, 0, XATTR_REPLACE);
+	return xattr_handler_set(handler, user_ns, dentry, inode, name, NULL, 0, XATTR_REPLACE);
 }
-EXPORT_SYMBOL(__vfs_removexattr);
+EXPORT_SYMBOL(__vfs_mapped_removexattr);
 
-/**
- * __vfs_removexattr_locked - set an extended attribute while holding the inode
- * lock
- *
- *  @dentry: object to perform setxattr on
- *  @name: name of xattr to remove
- *  @delegated_inode: on return, will contain an inode pointer that
- *  a delegation was broken on, NULL if none.
- */
 int
-__vfs_removexattr_locked(struct dentry *dentry, const char *name,
-		struct inode **delegated_inode)
+__vfs_removexattr(struct dentry *dentry, const char *name)
+{
+	return __vfs_mapped_removexattr(&init_user_ns, dentry, name);
+}
+EXPORT_SYMBOL(__vfs_removexattr);
+
+static int
+__vfs_mapped_removexattr_locked(struct user_namespace *user_ns,
+				struct dentry *dentry, const char *name,
+				struct inode **delegated_inode)
 {
 	struct inode *inode = dentry->d_inode;
 	int error;
 
-	error = xattr_permission(inode, name, MAY_WRITE);
+	error = xattr_permission(user_ns, inode, name, MAY_WRITE);
 	if (error)
 		return error;
 
@@ -470,7 +523,7 @@ __vfs_removexattr_locked(struct dentry *dentry, const char *name,
 	if (error)
 		goto out;
 
-	error = __vfs_removexattr(dentry, name);
+	error = __vfs_mapped_removexattr(user_ns, dentry, name);
 
 	if (!error) {
 		fsnotify_xattr(dentry);
@@ -480,10 +533,27 @@ __vfs_removexattr_locked(struct dentry *dentry, const char *name,
 out:
 	return error;
 }
+
+/**
+ * __vfs_removexattr_locked - remove an extended attribute while holding the
+ * inode lock
+ *
+ *  @dentry: object to perform removexattr on
+ *  @name: name of xattr to remove
+ *  @delegated_inode: on return, will contain an inode pointer that
+ *  a delegation was broken on, NULL if none.
+ */
+int
+__vfs_removexattr_locked(struct dentry *dentry, const char *name,
+			 struct inode **delegated_inode)
+{
+	return __vfs_mapped_removexattr_locked(&init_user_ns, dentry, name, delegated_inode);
+}
 EXPORT_SYMBOL_GPL(__vfs_removexattr_locked);
 
 int
-vfs_removexattr(struct dentry *dentry, const char *name)
+vfs_mapped_removexattr(struct user_namespace *user_ns, struct dentry *dentry,
+		       const char *name)
 {
 	struct inode *inode = dentry->d_inode;
 	struct inode *delegated_inode = NULL;
@@ -491,7 +561,7 @@ vfs_removexattr(struct dentry *dentry, const char *name)
 
 retry_deleg:
 	inode_lock(inode);
-	error = __vfs_removexattr_locked(dentry, name, &delegated_inode);
+	error = __vfs_mapped_removexattr_locked(user_ns, dentry, name, &delegated_inode);
 	inode_unlock(inode);
 
 	if (delegated_inode) {
@@ -502,14 +572,22 @@ vfs_removexattr(struct dentry *dentry, const char *name)
 
 	return error;
 }
+EXPORT_SYMBOL_GPL(vfs_mapped_removexattr);
+
+int
+vfs_removexattr(struct dentry *dentry, const char *name)
+{
+	return vfs_mapped_removexattr(&init_user_ns, dentry, name);
+}
 EXPORT_SYMBOL_GPL(vfs_removexattr);
 
 /*
  * Extended attribute SET operations
  */
 static long
-setxattr(struct dentry *d, const char __user *name, const void __user *value,
-	 size_t size, int flags)
+setxattr(struct user_namespace *user_ns, struct dentry *d,
+	 const char __user *name, const void __user *value, size_t size,
+	 int flags)
 {
 	int error;
 	void *kvalue = NULL;
@@ -536,16 +614,16 @@ setxattr(struct dentry *d, const char __user *name, const void __user *value,
 		}
 		if ((strcmp(kname, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
 		    (strcmp(kname, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
-			posix_acl_fix_xattr_from_user(&init_user_ns, kvalue, size);
+			posix_acl_fix_xattr_from_user(user_ns, kvalue, size);
 		else if (strcmp(kname, XATTR_NAME_CAPS) == 0) {
-			error = cap_convert_nscap(&init_user_ns, d, &kvalue, size);
+			error = cap_convert_nscap(user_ns, d, &kvalue, size);
 			if (error < 0)
 				goto out;
 			size = error;
 		}
 	}
 
-	error = vfs_setxattr(d, kname, kvalue, size, flags);
+	error = vfs_mapped_setxattr(user_ns, d, kname, kvalue, size, flags);
 out:
 	kvfree(kvalue);
 
@@ -558,13 +636,17 @@ static int path_setxattr(const char __user *pathname,
 {
 	struct path path;
 	int error;
+
 retry:
 	error = user_path_at(AT_FDCWD, pathname, lookup_flags, &path);
 	if (error)
 		return error;
 	error = mnt_want_write(path.mnt);
 	if (!error) {
-		error = setxattr(path.dentry, name, value, size, flags);
+		struct user_namespace *user_ns;
+
+		user_ns = mnt_user_ns(path.mnt);
+		error = setxattr(user_ns, path.dentry, name, value, size, flags);
 		mnt_drop_write(path.mnt);
 	}
 	path_put(&path);
@@ -600,7 +682,11 @@ SYSCALL_DEFINE5(fsetxattr, int, fd, const char __user *, name,
 	audit_file(f.file);
 	error = mnt_want_write_file(f.file);
 	if (!error) {
-		error = setxattr(f.file->f_path.dentry, name, value, size, flags);
+		struct user_namespace *user_ns;
+
+		user_ns = mnt_user_ns(f.file->f_path.mnt);
+		error = setxattr(user_ns, f.file->f_path.dentry, name, value,
+				 size, flags);
 		mnt_drop_write_file(f.file);
 	}
 	fdput(f);
@@ -612,7 +698,7 @@ SYSCALL_DEFINE5(fsetxattr, int, fd, const char __user *, name,
  */
 static ssize_t
 getxattr(struct dentry *d, const char __user *name, void __user *value,
-	 size_t size)
+	 size_t size, struct user_namespace *user_ns)
 {
 	ssize_t error;
 	void *kvalue = NULL;
@@ -632,11 +718,11 @@ getxattr(struct dentry *d, const char __user *name, void __user *value,
 			return -ENOMEM;
 	}
 
-	error = vfs_getxattr(d, kname, kvalue, size);
+	error = vfs_mapped_getxattr(user_ns, d, kname, kvalue, size);
 	if (error > 0) {
 		if ((strcmp(kname, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
 		    (strcmp(kname, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
-			posix_acl_fix_xattr_to_user(&init_user_ns, kvalue, error);
+			posix_acl_fix_xattr_to_user(user_ns, kvalue, error);
 		if (size && copy_to_user(value, kvalue, error))
 			error = -EFAULT;
 	} else if (error == -ERANGE && size >= XATTR_SIZE_MAX) {
@@ -654,13 +740,15 @@ static ssize_t path_getxattr(const char __user *pathname,
 			     const char __user *name, void __user *value,
 			     size_t size, unsigned int lookup_flags)
 {
+	struct user_namespace *user_ns;
 	struct path path;
 	ssize_t error;
 retry:
 	error = user_path_at(AT_FDCWD, pathname, lookup_flags, &path);
 	if (error)
 		return error;
-	error = getxattr(path.dentry, name, value, size);
+	user_ns = mnt_user_ns(path.mnt);
+	error = getxattr(path.dentry, name, value, size, user_ns);
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -684,13 +772,15 @@ SYSCALL_DEFINE4(lgetxattr, const char __user *, pathname,
 SYSCALL_DEFINE4(fgetxattr, int, fd, const char __user *, name,
 		void __user *, value, size_t, size)
 {
+	struct user_namespace *user_ns;
 	struct fd f = fdget(fd);
 	ssize_t error = -EBADF;
 
 	if (!f.file)
 		return error;
 	audit_file(f.file);
-	error = getxattr(f.file->f_path.dentry, name, value, size);
+	user_ns = mnt_user_ns(f.file->f_path.mnt);
+	error = getxattr(f.file->f_path.dentry, name, value, size, user_ns);
 	fdput(f);
 	return error;
 }
@@ -774,7 +864,7 @@ SYSCALL_DEFINE3(flistxattr, int, fd, char __user *, list, size_t, size)
  * Extended attribute REMOVE operations
  */
 static long
-removexattr(struct dentry *d, const char __user *name)
+removexattr(struct user_namespace *user_ns, struct dentry *d, const char __user *name)
 {
 	int error;
 	char kname[XATTR_NAME_MAX + 1];
@@ -785,7 +875,7 @@ removexattr(struct dentry *d, const char __user *name)
 	if (error < 0)
 		return error;
 
-	return vfs_removexattr(d, kname);
+	return vfs_mapped_removexattr(user_ns, d, kname);
 }
 
 static int path_removexattr(const char __user *pathname,
@@ -799,7 +889,9 @@ static int path_removexattr(const char __user *pathname,
 		return error;
 	error = mnt_want_write(path.mnt);
 	if (!error) {
-		error = removexattr(path.dentry, name);
+		struct user_namespace *user_ns = mnt_user_ns(path.mnt);
+
+		error = removexattr(user_ns, path.dentry, name);
 		mnt_drop_write(path.mnt);
 	}
 	path_put(&path);
@@ -832,7 +924,9 @@ SYSCALL_DEFINE2(fremovexattr, int, fd, const char __user *, name)
 	audit_file(f.file);
 	error = mnt_want_write_file(f.file);
 	if (!error) {
-		error = removexattr(f.file->f_path.dentry, name);
+		struct user_namespace *user_ns = mnt_user_ns(f.file->f_path.mnt);
+
+		error = removexattr(user_ns, f.file->f_path.dentry, name);
 		mnt_drop_write_file(f.file);
 	}
 	fdput(f);
diff --git a/include/linux/xattr.h b/include/linux/xattr.h
index 908441e74f51..b2eeecdf6669 100644
--- a/include/linux/xattr.h
+++ b/include/linux/xattr.h
@@ -16,6 +16,7 @@
 #include <linux/types.h>
 #include <linux/spinlock.h>
 #include <linux/mm.h>
+#include <linux/user_namespace.h>
 #include <uapi/linux/xattr.h>
 
 struct inode;
@@ -45,6 +46,19 @@ struct xattr_handler {
 #endif
 };
 
+static inline int xattr_handler_set(const struct xattr_handler *handler,
+			  struct user_namespace *user_ns, struct dentry *dentry,
+			  struct inode *inode, const char *name,
+			  const void *buffer, size_t size, int flags)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	if (handler->set_mapped)
+		return handler->set_mapped(handler, user_ns, dentry, inode,
+					   name, buffer, size, flags);
+#endif
+	return handler->set(handler, dentry, inode, name, buffer, size, flags);
+}
+
 const char *xattr_full_name(const struct xattr_handler *, const char *);
 
 struct xattr {
@@ -55,18 +69,27 @@ struct xattr {
 
 ssize_t __vfs_getxattr(struct dentry *, struct inode *, const char *, void *, size_t);
 ssize_t vfs_getxattr(struct dentry *, const char *, void *, size_t);
+ssize_t vfs_mapped_getxattr(struct user_namespace *user_ns, struct dentry *dentry,
+			const char *name, void *value, size_t size);
 ssize_t vfs_listxattr(struct dentry *d, char *list, size_t size);
 int __vfs_setxattr(struct dentry *, struct inode *, const char *, const void *, size_t, int);
 int __vfs_setxattr_noperm(struct dentry *, const char *, const void *, size_t, int);
 int __vfs_setxattr_locked(struct dentry *, const char *, const void *, size_t, int, struct inode **);
+int vfs_mapped_setxattr(struct user_namespace *, struct dentry *, const char *, const void *, size_t, int);
 int vfs_setxattr(struct dentry *, const char *, const void *, size_t, int);
 int __vfs_removexattr(struct dentry *, const char *);
+int __vfs_mapped_removexattr(struct user_namespace *, struct dentry *, const char *);
 int __vfs_removexattr_locked(struct dentry *, const char *, struct inode **);
 int vfs_removexattr(struct dentry *, const char *);
+int vfs_mapped_removexattr(struct user_namespace *user_ns, struct dentry *, const char *);
 
 ssize_t generic_listxattr(struct dentry *dentry, char *buffer, size_t buffer_size);
 ssize_t vfs_getxattr_alloc(struct dentry *dentry, const char *name,
 			   char **xattr_value, size_t size, gfp_t flags);
+ssize_t vfs_mapped_getxattr_alloc(struct user_namespace *user_ns,
+			      struct dentry *dentry, const char *name,
+			      char **xattr_value, size_t xattr_size,
+			      gfp_t flags);
 
 int xattr_supported_namespace(struct inode *inode, const char *prefix);
 
-- 
2.29.0


* [PATCH 13/34] selftests: add idmapped mounts xattr selftest
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (11 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 12/34] xattr: " Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 14/34] commoncap: handle idmapped mounts Christian Brauner
                   ` (24 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Tycho Andersen,
	Christian Brauner

From: Tycho Andersen <tycho@tycho.pizza>

Add some tests for setting extended attributes on idmapped mounts.

Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 .../testing/selftests/idmap_mounts/.gitignore |   1 +
 tools/testing/selftests/idmap_mounts/Makefile |   8 +
 tools/testing/selftests/idmap_mounts/config   |   1 +
 tools/testing/selftests/idmap_mounts/xattr.c  | 389 ++++++++++++++++++
 4 files changed, 399 insertions(+)
 create mode 100644 tools/testing/selftests/idmap_mounts/.gitignore
 create mode 100644 tools/testing/selftests/idmap_mounts/Makefile
 create mode 100644 tools/testing/selftests/idmap_mounts/config
 create mode 100644 tools/testing/selftests/idmap_mounts/xattr.c

diff --git a/tools/testing/selftests/idmap_mounts/.gitignore b/tools/testing/selftests/idmap_mounts/.gitignore
new file mode 100644
index 000000000000..18c5e90522ad
--- /dev/null
+++ b/tools/testing/selftests/idmap_mounts/.gitignore
@@ -0,0 +1 @@
+xattr
diff --git a/tools/testing/selftests/idmap_mounts/Makefile b/tools/testing/selftests/idmap_mounts/Makefile
new file mode 100644
index 000000000000..ce0549b09b2a
--- /dev/null
+++ b/tools/testing/selftests/idmap_mounts/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for mount selftests.
+CFLAGS = -g -I../../../../usr/include/ -Wall -O2 -pthread
+
+TEST_GEN_FILES += xattr
+
+include ../lib.mk
+
diff --git a/tools/testing/selftests/idmap_mounts/config b/tools/testing/selftests/idmap_mounts/config
new file mode 100644
index 000000000000..80730abc534b
--- /dev/null
+++ b/tools/testing/selftests/idmap_mounts/config
@@ -0,0 +1 @@
+CONFIG_IDMAP_MOUNTS=y
diff --git a/tools/testing/selftests/idmap_mounts/xattr.c b/tools/testing/selftests/idmap_mounts/xattr.c
new file mode 100644
index 000000000000..a3d70294ce43
--- /dev/null
+++ b/tools/testing/selftests/idmap_mounts/xattr.c
@@ -0,0 +1,389 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <linux/limits.h>
+
+#include "../kselftest_harness.h"
+
+#ifndef __NR_mount_setattr
+	#if defined __alpha__
+		#define __NR_mount_setattr 551
+	#elif defined _MIPS_SIM
+		#if _MIPS_SIM == _MIPS_SIM_ABI32	/* o32 */
+			#define __NR_mount_setattr 4441
+		#endif
+		#if _MIPS_SIM == _MIPS_SIM_NABI32	/* n32 */
+			#define __NR_mount_setattr 6441
+		#endif
+		#if _MIPS_SIM == _MIPS_SIM_ABI64	/* n64 */
+			#define __NR_mount_setattr 5441
+		#endif
+	#elif defined __ia64__
+		#define __NR_mount_setattr (441 + 1024)
+	#else
+		#define __NR_mount_setattr 441
+	#endif
+
+#ifndef __NR_open_tree
+	#if defined __alpha__
+		#define __NR_open_tree 538
+	#elif defined _MIPS_SIM
+		#if _MIPS_SIM == _MIPS_SIM_ABI32	/* o32 */
+			#define __NR_open_tree 4428
+		#endif
+		#if _MIPS_SIM == _MIPS_SIM_NABI32	/* n32 */
+			#define __NR_open_tree 6428
+		#endif
+		#if _MIPS_SIM == _MIPS_SIM_ABI64	/* n64 */
+			#define __NR_open_tree 5428
+		#endif
+	#elif defined __ia64__
+		#define __NR_open_tree (428 + 1024)
+	#else
+		#define __NR_open_tree 428
+	#endif
+#endif
+
+#ifndef __NR_move_mount
+	#if defined __alpha__
+		#define __NR_move_mount 539
+	#elif defined _MIPS_SIM
+		#if _MIPS_SIM == _MIPS_SIM_ABI32	/* o32 */
+			#define __NR_move_mount 4429
+		#endif
+		#if _MIPS_SIM == _MIPS_SIM_NABI32	/* n32 */
+			#define __NR_move_mount 6429
+		#endif
+		#if _MIPS_SIM == _MIPS_SIM_ABI64	/* n64 */
+			#define __NR_move_mount 5429
+		#endif
+	#elif defined __ia64__
+		#define __NR_move_mount (429 + 1024)
+	#else
+		#define __NR_move_mount 429
+	#endif
+#endif
+
+
+struct mount_attr {
+	__u64 attr_set;
+	__u64 attr_clr;
+	__u64 propagation;
+	__u32 userns;
+};
+#endif
+
+#ifndef MOVE_MOUNT_F_EMPTY_PATH
+#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */
+#endif
+
+#ifndef MOUNT_ATTR_SHIFT
+#define MOUNT_ATTR_SHIFT 0x00100000
+#endif
+
+#ifndef OPEN_TREE_CLONE
+#define OPEN_TREE_CLONE 1
+#endif
+
+#ifndef OPEN_TREE_CLOEXEC
+#define OPEN_TREE_CLOEXEC O_CLOEXEC
+#endif
+
+#ifndef AT_RECURSIVE
+#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
+#endif
+
+static inline int sys_mount_setattr(int dfd, const char *path, unsigned int flags,
+				    struct mount_attr *attr, size_t size)
+{
+	return syscall(__NR_mount_setattr, dfd, path, flags, attr, size);
+}
+
+static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags)
+{
+	return syscall(__NR_open_tree, dfd, filename, flags);
+}
+
+static inline int sys_move_mount(int from_dfd, const char *from_pathname, int to_dfd,
+				 const char *to_pathname, unsigned int flags)
+{
+	return syscall(__NR_move_mount, from_dfd, from_pathname, to_dfd, to_pathname, flags);
+}
+
+
+static ssize_t write_nointr(int fd, const void *buf, size_t count)
+{
+	ssize_t ret;
+
+	do {
+		ret = write(fd, buf, count);
+	} while (ret < 0 && errno == EINTR);
+
+	return ret;
+}
+
+static int write_file(const char *path, const void *buf, size_t count)
+{
+	int fd;
+	ssize_t ret;
+
+	fd = open(path, O_WRONLY | O_CLOEXEC | O_NOCTTY | O_NOFOLLOW);
+	if (fd < 0)
+		return -1;
+
+	ret = write_nointr(fd, buf, count);
+	close(fd);
+	if (ret < 0 || (size_t)ret != count)
+		return -1;
+
+	return 0;
+}
+
+static int map_ids(pid_t pid, unsigned long nsid, unsigned long hostid,
+		   unsigned long range)
+{
+	char map[100], procfile[256];
+
+	snprintf(procfile, sizeof(procfile), "/proc/%d/setgroups", pid);
+	if (write_file(procfile, "deny", sizeof("deny") - 1) &&
+	    errno != ENOENT)
+		return -1;
+
+	snprintf(procfile, sizeof(procfile), "/proc/%d/uid_map", pid);
+	snprintf(map, sizeof(map), "%lu %lu %lu", nsid, hostid, range);
+	if (write_file(procfile, map, strlen(map)))
+		return -1;
+
+
+	snprintf(procfile, sizeof(procfile), "/proc/%d/gid_map", pid);
+	snprintf(map, sizeof(map), "%lu %lu %lu", nsid, hostid, range);
+	if (write_file(procfile, map, strlen(map)))
+		return -1;
+
+	return 0;
+}
+
+#define __STACK_SIZE (8 * 1024 * 1024)
+static pid_t do_clone(int (*fn)(void *), void *arg, int flags)
+{
+	void *stack;
+
+	stack = malloc(__STACK_SIZE);
+	if (!stack)
+		return -ENOMEM;
+
+#ifdef __ia64__
+	return __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, NULL);
+#else
+	return clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, NULL);
+#endif
+}
+
+static int get_userns_fd_cb(void *data)
+{
+	return kill(getpid(), SIGSTOP);
+}
+
+static int get_userns_fd(unsigned long nsid, unsigned long hostid,
+			 unsigned long range)
+{
+	int ret;
+	pid_t pid;
+	char path[256];
+
+	pid = do_clone(get_userns_fd_cb, NULL, CLONE_NEWUSER | CLONE_NEWNS);
+	if (pid < 0)
+		return -errno;
+
+	ret = map_ids(pid, nsid, hostid, range);
+	if (ret < 0)
+		return ret;
+
+	snprintf(path, sizeof(path), "/proc/%d/ns/user", pid);
+	ret = open(path, O_RDONLY | O_CLOEXEC);
+	kill(pid, SIGKILL);
+	return ret;
+}
+
+struct run_as_data {
+	int userns;
+	int (*f)(void *data);
+	void *data;
+};
+
+static int run_in_cb(void *data)
+{
+	struct run_as_data *rad = data;
+
+	if (setns(rad->userns, CLONE_NEWUSER) < 0) {
+		perror("setns");
+		return 1;
+	}
+
+	if (setuid(100010)) {
+		perror("setuid");
+		return 1;
+	}
+
+	if (setgid(100010)) {
+		perror("setgid");
+		return 1;
+	}
+
+	return rad->f(rad->data);
+}
+
+static int wait_for_pid(pid_t pid)
+{
+	int status, ret;
+
+again:
+	ret = waitpid(pid, &status, 0);
+	if (ret == -1) {
+		if (errno == EINTR)
+			goto again;
+
+		return -1;
+	}
+
+	if (!WIFEXITED(status))
+		return -1;
+
+	return WEXITSTATUS(status);
+}
+
+static int run_in(int userns, int (*f)(void *), void *f_data)
+{
+	pid_t pid;
+	struct run_as_data data;
+
+	data.userns = userns;
+	data.f = f;
+	data.data = f_data;
+	pid = do_clone(run_in_cb, &data, 0);
+	if (pid < 0)
+		return -errno;
+
+	return wait_for_pid(pid);
+}
+
+FIXTURE(ext4_xattr) {};
+
+FIXTURE_SETUP(ext4_xattr)
+{
+	int fd;
+
+	fd = open("/tmp/idmap_mounts.ext4", O_CREAT | O_WRONLY, 0600);
+	ASSERT_GE(fd, 0);
+	ASSERT_EQ(ftruncate(fd, 640 * 1024), 0);
+	ASSERT_EQ(close(fd), 0);
+	ASSERT_EQ(system("mkfs.ext4 /tmp/idmap_mounts.ext4"), 0);
+	ASSERT_EQ(mkdir("/tmp/ext4", 0777), 0);
+	ASSERT_EQ(system("mount -o loop -t ext4 /tmp/idmap_mounts.ext4 /tmp/ext4"), 0);
+}
+
+FIXTURE_TEARDOWN(ext4_xattr)
+{
+	umount("/tmp/ext4/dest");
+	umount("/tmp/ext4");
+	rmdir("/tmp/ext4");
+	unlink("/tmp/idmap_mounts.ext4");
+}
+
+struct getacl_should_be_data {
+	char path[256];
+	uid_t uid;
+};
+
+static int getacl_should_be_uid(void *data)
+{
+	struct getacl_should_be_data *ssb = data;
+	char cmd[512];
+	int ret;
+
+	snprintf(cmd, sizeof(cmd), "getfacl %s | grep user:%u:rwx", ssb->path, ssb->uid);
+	ret = system(cmd);
+	if (ret < 0) {
+		perror("system");
+		return -1;
+	}
+	if (!WIFEXITED(ret))
+		return -1;
+	return WEXITSTATUS(ret);
+}
+
+static int ls_path(void *data)
+{
+	char cmd[PATH_MAX];
+	char *path = data;
+	int ret;
+
+	snprintf(cmd, sizeof(cmd), "ls %s", path);
+	ret = system(cmd);
+	if (ret < 0) {
+		perror("system");
+		return -1;
+	}
+	if (!WIFEXITED(ret))
+		return -1;
+	return WEXITSTATUS(ret);
+}
+
+TEST_F(ext4_xattr, setattr_didnt_work)
+{
+	int mount_fd, ret;
+	struct mount_attr attr = {};
+	struct getacl_should_be_data ssb;
+
+	ASSERT_EQ(mkdir("/tmp/ext4/source", 0777), 0);
+	ASSERT_EQ(mkdir("/tmp/ext4/dest", 0777), 0);
+
+	mount_fd = sys_open_tree(-EBADF, "/tmp/ext4/source",
+				 OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC | AT_EMPTY_PATH);
+	ASSERT_GE(mount_fd, 0);
+
+	ASSERT_EQ(sys_move_mount(mount_fd, "", -EBADF, "/tmp/ext4/dest",
+				 MOVE_MOUNT_F_EMPTY_PATH), 0);
+
+	attr.attr_set = MOUNT_ATTR_SHIFT;
+	attr.userns = get_userns_fd(100010, 100020, 5);
+	ASSERT_GE(attr.userns, 0);
+	ret = sys_mount_setattr(mount_fd, "", AT_EMPTY_PATH | AT_RECURSIVE,
+				    &attr, sizeof(attr));
+	ASSERT_EQ(close(mount_fd), 0);
+	ASSERT_EQ(ret, 0);
+
+	ASSERT_EQ(mkdir("/tmp/ext4/source/foo", 0700), 0);
+	ASSERT_EQ(chown("/tmp/ext4/source/foo", 100010, 100010), 0);
+
+	ASSERT_EQ(system("setfacl -m u:100010:rwx /tmp/ext4/source/foo"), 0);
+	EXPECT_EQ(system("getfacl /tmp/ext4/source/foo | grep user:100010:rwx"), 0);
+	EXPECT_EQ(system("getfacl /tmp/ext4/dest/foo | grep user:100020:rwx"), 0);
+
+	snprintf(ssb.path, sizeof(ssb.path), "/tmp/ext4/source/foo");
+	ssb.uid = 4294967295;
+	EXPECT_EQ(run_in(attr.userns, getacl_should_be_uid, &ssb), 0);
+
+	snprintf(ssb.path, sizeof(ssb.path), "/tmp/ext4/dest/foo");
+	ssb.uid = 100010;
+	EXPECT_EQ(run_in(attr.userns, getacl_should_be_uid, &ssb), 0);
+
+	/*
+	 * now, dir is owned by someone else in the user namespace, but we can
+	 * still read it because of acls
+	 */
+	ASSERT_EQ(chown("/tmp/ext4/source/foo", 100012, 100012), 0);
+	EXPECT_EQ(run_in(attr.userns, ls_path, "/tmp/ext4/dest/foo"), 0);
+
+	/*
+	 * if we delete the acls, the ls should fail because it's 700.
+	 */
+	ASSERT_EQ(system("setfacl --remove-all /tmp/ext4/source/foo"), 0);
+	EXPECT_NE(run_in(attr.userns, ls_path, "/tmp/ext4/dest/foo"), 0);
+}
+
+TEST_HARNESS_MAIN
-- 
2.29.0



* [PATCH 14/34] commoncap: handle idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (12 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 13/34] selftests: add idmapped mounts xattr selftest Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 15/34] stat: add mapped_generic_fillattr() Christian Brauner
                   ` (23 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

When interacting with filesystem capabilities, whether user namespace
aware or not, the vfs performs various security checks to determine
whether the filesystem capabilities can be used by the caller (e.g.
during exec), or whether they need to be removed. The main
infrastructure for this resides in the capability codepaths, but they
are called through the LSM security infrastructure even though they are
technically neither an LSM nor optional. This patch extends the existing
security hooks security_inode_removexattr(), security_inode_killpriv(),
and security_inode_getsecurity() to pass down the mount's user namespace
and makes them aware of idmapped mounts.

In order to actually get filesystem capabilities from disk the
capability infrastructure exposes the get_vfs_caps_from_disk() helper.
For user namespace aware filesystem capabilities a root uid is stored
alongside the capabilities. In order to determine whether the caller can
make use of the filesystem capability or whether it needs to be ignored,
the root uid is translated according to the superblock's user namespace.
If it can be translated to uid 0 according to that id mapping, the
caller can use the filesystem capabilities stored on disk. If we are
accessing the inode that holds the filesystem capabilities through an
idmapped mount, we need to map the root uid according to the mount's
user namespace. Afterwards the checks are identical to non-idmapped
mounts.

Reading filesystem caps from disk enforces that the root uid associated
with the filesystem capability must have a mapping in the superblock's
user namespace and that the caller is either in the same user namespace
or in a descendant of the superblock's user namespace. For filesystems
that are mountable inside user namespaces the container can just mount
the filesystem and won't usually need to idmap it. If it does create an
idmapped mount, it can mark it with a user namespace it has created and
which is therefore a descendant of the s_user_ns. For filesystems that
are not mountable inside user namespaces the descendant rule is
trivially true because the s_user_ns will be the initial user namespace.

If the initial user namespace is passed all operations are a nop so
non-idmapped mounts will not see a change in behavior and will also not
see any performance impact. It also means that the non-idmapped-mount
aware helpers can be implemented on top of their idmapped-mount aware
counterparts by passing the initial user namespace.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/attr.c                     |  2 +-
 fs/xattr.c                    | 12 ++++++------
 include/linux/capability.h    |  3 +++
 include/linux/lsm_hook_defs.h | 10 ++++++----
 include/linux/lsm_hooks.h     |  1 +
 include/linux/security.h      | 36 +++++++++++++++++++++++------------
 kernel/auditsc.c              |  3 ++-
 security/commoncap.c          | 35 ++++++++++++++++++++++++----------
 security/security.c           | 18 +++++++++++-------
 security/selinux/hooks.c      | 13 ++++++++-----
 security/smack/smack_lsm.c    | 11 ++++++-----
 11 files changed, 93 insertions(+), 51 deletions(-)
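
The root uid translation described in the commit message can be modelled
in user space roughly as follows. This is a simplified sketch, not the
kernel implementation: id_extent, identity_map, ns_to_host(),
host_to_ns(), fscap_usable() and fscap_demo() are hypothetical names,
each namespace is reduced to a single mapping extent, the mount's
mapping is assumed to be applied in the namespace-to-host direction, and
the ownership check only considers root of the caller's own namespace
rather than walking its ancestors.

```c
/* Illustrative model of the fscaps root uid check; not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ID_UNMAPPED ((uint32_t)-1)	/* stand-in for an invalid kuid */

struct id_extent {
	uint32_t nsid;
	uint32_t hostid;
	uint32_t range;
};

/* Stand-in for the initial user namespace: every valid id maps to itself. */
static const struct id_extent identity_map = {
	.nsid = 0, .hostid = 0, .range = UINT32_MAX,
};

static uint32_t ns_to_host(const struct id_extent *e, uint32_t id)
{
	if (id < e->nsid || id - e->nsid >= e->range)
		return ID_UNMAPPED;
	return e->hostid + (id - e->nsid);
}

static uint32_t host_to_ns(const struct id_extent *e, uint32_t id)
{
	if (id < e->hostid || id - e->hostid >= e->range)
		return ID_UNMAPPED;
	return e->nsid + (id - e->hostid);
}

/*
 * rootid: root uid stored alongside the capability in the xattr
 * fs:     mapping of the superblock's user namespace
 * mnt:    mapping of the mount's user namespace (identity if not idmapped)
 * caller: mapping of the caller's user namespace
 */
static bool fscap_usable(uint32_t rootid, const struct id_extent *fs,
			 const struct id_extent *mnt,
			 const struct id_extent *caller)
{
	/* Translate the stored root uid via the superblock's namespace. */
	uint32_t kroot = ns_to_host(fs, rootid);

	if (kroot == ID_UNMAPPED)
		return false;

	/* Extra step for idmapped mounts; a nop for the identity mapping. */
	kroot = ns_to_host(mnt, kroot);
	if (kroot == ID_UNMAPPED)
		return false;

	/* Simplified ownership check: is this uid 0 for the caller? */
	return host_to_ns(caller, kroot) == 0;
}

void fscap_demo(void)
{
	/* Hypothetical container mapping "0 100000 65536". */
	const struct id_extent ct = {
		.nsid = 0, .hostid = 100000, .range = 65536,
	};

	/* Passing the identity mapping changes nothing. */
	assert(ns_to_host(&identity_map, 1234) == 1234);
	/* Root uid 0 on disk, plain mount: not usable by the container. */
	assert(!fscap_usable(0, &identity_map, &identity_map, &ct));
	/* Same file through a mount idmapped to the container: usable. */
	assert(fscap_usable(0, &identity_map, &ct, &ct));
}
```

The first assertion in fscap_demo() mirrors the statement that passing
the initial user namespace makes the translation a nop, which is why the
non-idmapped helper can simply call the idmapped one.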

diff --git a/fs/attr.c b/fs/attr.c
index f39c03ac85e0..4daf6ac6de6d 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -113,7 +113,7 @@ int setattr_mapped_prepare(struct user_namespace *user_ns,
 	if (ia_valid & ATTR_KILL_PRIV) {
 		int error;
 
-		error = security_inode_killpriv(dentry);
+		error = security_inode_killpriv(user_ns, dentry);
 		if (error)
 			return error;
 	}
diff --git a/fs/xattr.c b/fs/xattr.c
index cdda2baeb9f7..40b02227257e 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -331,18 +331,18 @@ vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
 EXPORT_SYMBOL_GPL(vfs_setxattr);
 
 static ssize_t
-xattr_getsecurity(struct inode *inode, const char *name, void *value,
-			size_t size)
+xattr_getsecurity(struct user_namespace *user_ns, struct inode *inode,
+		  const char *name, void *value, size_t size)
 {
 	void *buffer = NULL;
 	ssize_t len;
 
 	if (!value || !size) {
-		len = security_inode_getsecurity(inode, name, &buffer, false);
+		len = security_inode_getsecurity(user_ns, inode, name, &buffer, false);
 		goto out_noalloc;
 	}
 
-	len = security_inode_getsecurity(inode, name, &buffer, true);
+	len = security_inode_getsecurity(user_ns, inode, name, &buffer, true);
 	if (len < 0)
 		return len;
 	if (size < len) {
@@ -440,7 +440,7 @@ vfs_mapped_getxattr(struct user_namespace *user_ns, struct dentry *dentry,
 	if (!strncmp(name, XATTR_SECURITY_PREFIX,
 				XATTR_SECURITY_PREFIX_LEN)) {
 		const char *suffix = name + XATTR_SECURITY_PREFIX_LEN;
-		int ret = xattr_getsecurity(inode, suffix, value, size);
+		int ret = xattr_getsecurity(user_ns, inode, suffix, value, size);
 		/*
 		 * Only overwrite the return value if a security module
 		 * is actually active.
@@ -515,7 +515,7 @@ __vfs_mapped_removexattr_locked(struct user_namespace *user_ns,
 	if (error)
 		return error;
 
-	error = security_inode_removexattr(dentry, name);
+	error = security_inode_removexattr(user_ns, dentry, name);
 	if (error)
 		goto out;
 
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 7e9277d64024..630fcc60b955 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -273,6 +273,9 @@ static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
 
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
+extern int get_mapped_vfs_caps_from_disk(struct user_namespace *user_ns,
+				     const struct dentry *dentry,
+				     struct cpu_vfs_cap_data *cpu_caps);
 
 extern int cap_convert_nscap(struct user_namespace *user_ns,
 			     struct dentry *dentry, void **ivalue, size_t size);
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 32a940117e7a..980297f9028f 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -139,11 +139,13 @@ LSM_HOOK(void, LSM_RET_VOID, inode_post_setxattr, struct dentry *dentry,
 	 const char *name, const void *value, size_t size, int flags)
 LSM_HOOK(int, 0, inode_getxattr, struct dentry *dentry, const char *name)
 LSM_HOOK(int, 0, inode_listxattr, struct dentry *dentry)
-LSM_HOOK(int, 0, inode_removexattr, struct dentry *dentry, const char *name)
+LSM_HOOK(int, 0, inode_removexattr, struct user_namespace *user_ns,
+	 struct dentry *dentry, const char *name)
 LSM_HOOK(int, 0, inode_need_killpriv, struct dentry *dentry)
-LSM_HOOK(int, 0, inode_killpriv, struct dentry *dentry)
-LSM_HOOK(int, -EOPNOTSUPP, inode_getsecurity, struct inode *inode,
-	 const char *name, void **buffer, bool alloc)
+LSM_HOOK(int, 0, inode_killpriv, struct user_namespace *user_ns,
+	 struct dentry *dentry)
+LSM_HOOK(int, -EOPNOTSUPP, inode_getsecurity, struct user_namespace *user_ns,
+	 struct inode *inode, const char *name, void **buffer, bool alloc)
 LSM_HOOK(int, -EOPNOTSUPP, inode_setsecurity, struct inode *inode,
 	 const char *name, const void *value, size_t size, int flags)
 LSM_HOOK(int, 0, inode_listsecurity, struct inode *inode, char *buffer,
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index c503f7ab8afb..465c9c308922 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -444,6 +444,7 @@
  * @inode_killpriv:
  *	The setuid bit is being removed.  Remove similar security labels.
  *	Called with the dentry->d_inode->i_mutex held.
+ *	@user_ns the user namespace of the mount.
  *	@dentry is the dentry being changed.
  *	Return 0 on success.  If error is returned, then the operation
  *	causing setuid bit removal is failed.
diff --git a/include/linux/security.h b/include/linux/security.h
index bc2725491560..c148200041e8 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -146,10 +146,13 @@ extern int cap_capset(struct cred *new, const struct cred *old,
 extern int cap_bprm_creds_from_file(struct linux_binprm *bprm, struct file *file);
 extern int cap_inode_setxattr(struct dentry *dentry, const char *name,
 			      const void *value, size_t size, int flags);
-extern int cap_inode_removexattr(struct dentry *dentry, const char *name);
+extern int cap_inode_removexattr(struct user_namespace *user_ns,
+				 struct dentry *dentry, const char *name);
 extern int cap_inode_need_killpriv(struct dentry *dentry);
-extern int cap_inode_killpriv(struct dentry *dentry);
-extern int cap_inode_getsecurity(struct inode *inode, const char *name,
+extern int cap_inode_killpriv(struct user_namespace *user_ns,
+			      struct dentry *dentry);
+extern int cap_inode_getsecurity(struct user_namespace *user_ns,
+				 struct inode *inode, const char *name,
 				 void **buffer, bool alloc);
 extern int cap_mmap_addr(unsigned long addr);
 extern int cap_mmap_file(struct file *file, unsigned long reqprot,
@@ -350,10 +353,14 @@ void security_inode_post_setxattr(struct dentry *dentry, const char *name,
 				  const void *value, size_t size, int flags);
 int security_inode_getxattr(struct dentry *dentry, const char *name);
 int security_inode_listxattr(struct dentry *dentry);
-int security_inode_removexattr(struct dentry *dentry, const char *name);
+int security_inode_removexattr(struct user_namespace *user_ns,
+			       struct dentry *dentry, const char *name);
 int security_inode_need_killpriv(struct dentry *dentry);
-int security_inode_killpriv(struct dentry *dentry);
-int security_inode_getsecurity(struct inode *inode, const char *name, void **buffer, bool alloc);
+int security_inode_killpriv(struct user_namespace *user_ns,
+			    struct dentry *dentry);
+int security_inode_getsecurity(struct user_namespace *user_ns,
+			       struct inode *inode, const char *name,
+			       void **buffer, bool alloc);
 int security_inode_setsecurity(struct inode *inode, const char *name, const void *value, size_t size, int flags);
 int security_inode_listsecurity(struct inode *inode, char *buffer, size_t buffer_size);
 void security_inode_getsecid(struct inode *inode, u32 *secid);
@@ -851,10 +858,11 @@ static inline int security_inode_listxattr(struct dentry *dentry)
 	return 0;
 }
 
-static inline int security_inode_removexattr(struct dentry *dentry,
-			const char *name)
+static inline int security_inode_removexattr(struct user_namespace *user_ns,
+					     struct dentry *dentry,
+					     const char *name)
 {
-	return cap_inode_removexattr(dentry, name);
+	return cap_inode_removexattr(user_ns, dentry, name);
 }
 
 static inline int security_inode_need_killpriv(struct dentry *dentry)
@@ -862,12 +870,16 @@ static inline int security_inode_need_killpriv(struct dentry *dentry)
 	return cap_inode_need_killpriv(dentry);
 }
 
-static inline int security_inode_killpriv(struct dentry *dentry)
+static inline int security_inode_killpriv(struct user_namespace *user_ns,
+					  struct dentry *dentry)
 {
-	return cap_inode_killpriv(dentry);
+	return cap_inode_killpriv(user_ns, dentry);
 }
 
-static inline int security_inode_getsecurity(struct inode *inode, const char *name, void **buffer, bool alloc)
+static inline int security_inode_getsecurity(struct user_namespace *user_ns,
+					     struct inode *inode,
+					     const char *name, void **buffer,
+					     bool alloc)
 {
 	return -EOPNOTSUPP;
 }
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 8dba8f0983b5..9cdecdea61d9 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -2495,7 +2495,8 @@ int __audit_log_bprm_fcaps(struct linux_binprm *bprm,
 	ax->d.next = context->aux;
 	context->aux = (void *)ax;
 
-	get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
+	get_mapped_vfs_caps_from_disk(mnt_user_ns(bprm->file->f_path.mnt),
+				  bprm->file->f_path.dentry, &vcaps);
 
 	ax->fcap.permitted = vcaps.permitted;
 	ax->fcap.inheritable = vcaps.inheritable;
diff --git a/security/commoncap.c b/security/commoncap.c
index 21f2ff7c346b..4da61ed86b1d 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -303,17 +303,18 @@ int cap_inode_need_killpriv(struct dentry *dentry)
 
 /**
  * cap_inode_killpriv - Erase the security markings on an inode
+ * @user_ns: The user namespace of the mount
  * @dentry: The inode/dentry to alter
  *
  * Erase the privilege-enhancing security markings on an inode.
  *
  * Returns 0 if successful, -ve on error.
  */
-int cap_inode_killpriv(struct dentry *dentry)
+int cap_inode_killpriv(struct user_namespace *user_ns, struct dentry *dentry)
 {
 	int error;
 
-	error = __vfs_removexattr(dentry, XATTR_NAME_CAPS);
+	error = __vfs_mapped_removexattr(user_ns, dentry, XATTR_NAME_CAPS);
 	if (error == -EOPNOTSUPP)
 		error = 0;
 	return error;
@@ -366,8 +367,8 @@ static bool is_v3header(size_t size, const struct vfs_cap_data *cap)
  * by the integrity subsystem, which really wants the unconverted values -
  * so that's good.
  */
-int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
-			  bool alloc)
+int cap_inode_getsecurity(struct user_namespace *user_ns, struct inode *inode,
+			  const char *name, void **buffer, bool alloc)
 {
 	int size, ret;
 	kuid_t kroot;
@@ -386,8 +387,8 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
 		return -EINVAL;
 
 	size = sizeof(struct vfs_ns_cap_data);
-	ret = (int) vfs_getxattr_alloc(dentry, XATTR_NAME_CAPS,
-				 &tmpbuf, size, GFP_NOFS);
+	ret = (int)vfs_mapped_getxattr_alloc(user_ns, dentry, XATTR_NAME_CAPS,
+					 &tmpbuf, size, GFP_NOFS);
 	dput(dentry);
 
 	if (ret < 0)
@@ -412,6 +413,9 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
 	root = le32_to_cpu(nscap->rootid);
 	kroot = make_kuid(fs_ns, root);
 
+	/* If this is an idmapped mount shift the kuid. */
+	kroot = kuid_into_mnt(user_ns, kroot);
+
 	/* If the root kuid maps to a valid uid in current ns, then return
 	 * this as a nscap. */
 	mappedroot = from_kuid(current_user_ns(), kroot);
@@ -573,7 +577,9 @@ static inline int bprm_caps_from_vfs_caps(struct cpu_vfs_cap_data *caps,
 /*
  * Extract the on-exec-apply capability sets for an executable file.
  */
-int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
+int get_mapped_vfs_caps_from_disk(struct user_namespace *user_ns,
+			      const struct dentry *dentry,
+			      struct cpu_vfs_cap_data *cpu_caps)
 {
 	struct inode *inode = d_backing_inode(dentry);
 	__u32 magic_etc;
@@ -629,6 +635,7 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 	/* Limit the caps to the mounter of the filesystem
 	 * or the more limited uid specified in the xattr.
 	 */
+	rootkuid = kuid_into_mnt(user_ns, rootkuid);
 	if (!rootid_owns_currentns(rootkuid))
 		return -ENODATA;
 
@@ -647,6 +654,12 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 	return 0;
 }
 
+int get_vfs_caps_from_disk(const struct dentry *dentry,
+			   struct cpu_vfs_cap_data *cpu_caps)
+{
+	return get_mapped_vfs_caps_from_disk(&init_user_ns, dentry, cpu_caps);
+}
+
 /*
  * Attempt to get the on-exec apply capability sets for an executable file from
  * its xattrs and, if present, apply them to the proposed credentials being
@@ -674,7 +687,7 @@ static int get_file_caps(struct linux_binprm *bprm, struct file *file,
 	if (!current_in_userns(file->f_path.mnt->mnt_sb->s_user_ns))
 		return 0;
 
-	rc = get_vfs_caps_from_disk(file->f_path.dentry, &vcaps);
+	rc = get_mapped_vfs_caps_from_disk(mnt_user_ns(file->f_path.mnt), file->f_path.dentry, &vcaps);
 	if (rc < 0) {
 		if (rc == -EINVAL)
 			printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
@@ -939,6 +952,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
 
 /**
  * cap_inode_removexattr - Determine whether an xattr may be removed
+ * @user_ns: The user namespace of the vfsmount
  * @dentry: The inode/dentry being altered
  * @name: The name of the xattr to be changed
  *
@@ -948,7 +962,8 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
  * This is used to make sure security xattrs don't get removed by those who
  * aren't privileged to remove them.
  */
-int cap_inode_removexattr(struct dentry *dentry, const char *name)
+int cap_inode_removexattr(struct user_namespace *mnt_user_ns,
+			  struct dentry *dentry, const char *name)
 {
 	struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
 
@@ -962,7 +977,7 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
 		struct inode *inode = d_backing_inode(dentry);
 		if (!inode)
 			return -EINVAL;
-		if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
+		if (!capable_wrt_mapped_inode_uidgid(mnt_user_ns, inode, CAP_SETFCAP))
 			return -EPERM;
 		return 0;
 	}
diff --git a/security/security.c b/security/security.c
index a28045dc9e7f..b72940314ed5 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1326,7 +1326,8 @@ int security_inode_listxattr(struct dentry *dentry)
 	return call_int_hook(inode_listxattr, 0, dentry);
 }
 
-int security_inode_removexattr(struct dentry *dentry, const char *name)
+int security_inode_removexattr(struct user_namespace *user_ns,
+			       struct dentry *dentry, const char *name)
 {
 	int ret;
 
@@ -1336,9 +1337,9 @@ int security_inode_removexattr(struct dentry *dentry, const char *name)
 	 * SELinux and Smack integrate the cap call,
 	 * so assume that all LSMs supplying this call do so.
 	 */
-	ret = call_int_hook(inode_removexattr, 1, dentry, name);
+	ret = call_int_hook(inode_removexattr, 1, user_ns, dentry, name);
 	if (ret == 1)
-		ret = cap_inode_removexattr(dentry, name);
+		ret = cap_inode_removexattr(user_ns, dentry, name);
 	if (ret)
 		return ret;
 	ret = ima_inode_removexattr(dentry, name);
@@ -1352,12 +1353,15 @@ int security_inode_need_killpriv(struct dentry *dentry)
 	return call_int_hook(inode_need_killpriv, 0, dentry);
 }
 
-int security_inode_killpriv(struct dentry *dentry)
+int security_inode_killpriv(struct user_namespace *user_ns,
+			    struct dentry *dentry)
 {
-	return call_int_hook(inode_killpriv, 0, dentry);
+	return call_int_hook(inode_killpriv, 0, user_ns, dentry);
 }
 
-int security_inode_getsecurity(struct inode *inode, const char *name, void **buffer, bool alloc)
+int security_inode_getsecurity(struct user_namespace *user_ns,
+			       struct inode *inode, const char *name,
+			       void **buffer, bool alloc)
 {
 	struct security_hook_list *hp;
 	int rc;
@@ -1368,7 +1372,7 @@ int security_inode_getsecurity(struct inode *inode, const char *name, void **buf
 	 * Only one module will provide an attribute with a given name.
 	 */
 	hlist_for_each_entry(hp, &security_hook_heads.inode_getsecurity, list) {
-		rc = hp->hook.inode_getsecurity(inode, name, buffer, alloc);
+		rc = hp->hook.inode_getsecurity(user_ns, inode, name, buffer, alloc);
 		if (rc != LSM_RET_DEFAULT(inode_getsecurity))
 			return rc;
 	}
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 6b1826fc3658..3af7e8a39fd4 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3260,10 +3260,11 @@ static int selinux_inode_listxattr(struct dentry *dentry)
 	return dentry_has_perm(cred, dentry, FILE__GETATTR);
 }
 
-static int selinux_inode_removexattr(struct dentry *dentry, const char *name)
+static int selinux_inode_removexattr(struct user_namespace *user_ns,
+				     struct dentry *dentry, const char *name)
 {
 	if (strcmp(name, XATTR_NAME_SELINUX)) {
-		int rc = cap_inode_removexattr(dentry, name);
+		int rc = cap_inode_removexattr(user_ns, dentry, name);
 		if (rc)
 			return rc;
 
@@ -3329,7 +3330,9 @@ static int selinux_path_notify(const struct path *path, u64 mask,
  *
  * Permission check is handled by selinux_inode_getxattr hook.
  */
-static int selinux_inode_getsecurity(struct inode *inode, const char *name, void **buffer, bool alloc)
+static int selinux_inode_getsecurity(struct user_namespace *user_ns,
+				     struct inode *inode, const char *name,
+				     void **buffer, bool alloc)
 {
 	u32 size;
 	int error;
@@ -6524,8 +6527,8 @@ static int selinux_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen)
 static int selinux_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen)
 {
 	int len = 0;
-	len = selinux_inode_getsecurity(inode, XATTR_SELINUX_SUFFIX,
-						ctx, true);
+	len = selinux_inode_getsecurity(&init_user_ns, inode,
+					XATTR_SELINUX_SUFFIX, ctx, true);
 	if (len < 0)
 		return len;
 	*ctxlen = len;
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 5c90b9fa4d40..1baf2da5d3e3 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -1362,7 +1362,8 @@ static int smack_inode_getxattr(struct dentry *dentry, const char *name)
  *
  * Returns 0 if access is permitted, an error code otherwise
  */
-static int smack_inode_removexattr(struct dentry *dentry, const char *name)
+static int smack_inode_removexattr(struct user_namespace *user_ns,
+				   struct dentry *dentry, const char *name)
 {
 	struct inode_smack *isp;
 	struct smk_audit_info ad;
@@ -1377,7 +1378,7 @@ static int smack_inode_removexattr(struct dentry *dentry, const char *name)
 		if (!smack_privileged(CAP_MAC_ADMIN))
 			rc = -EPERM;
 	} else
-		rc = cap_inode_removexattr(dentry, name);
+		rc = cap_inode_removexattr(user_ns, dentry, name);
 
 	if (rc != 0)
 		return rc;
@@ -1420,9 +1421,9 @@ static int smack_inode_removexattr(struct dentry *dentry, const char *name)
  *
  * Returns the size of the attribute or an error code
  */
-static int smack_inode_getsecurity(struct inode *inode,
-				   const char *name, void **buffer,
-				   bool alloc)
+static int smack_inode_getsecurity(struct user_namespace *user_ns,
+				   struct inode *inode, const char *name,
+				   void **buffer, bool alloc)
 {
 	struct socket_smack *ssp;
 	struct socket *sock;
-- 
2.29.0



* [PATCH 15/34] stat: add mapped_generic_fillattr()
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (13 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 14/34] commoncap: handle idmapped mounts Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 16/34] namei: handle idmapped mounts in may_*() helpers Christian Brauner
                   ` (22 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

The generic_fillattr() helper fills in the basic attributes associated with an
inode. Add a mapped_generic_fillattr() helper to handle idmapped mounts. If the
inode is accessed through an idmapped mount, its i_uid and i_gid need to be
mapped according to the mount's user namespace. If the initial user namespace
is passed, the mapping is a nop, so non-idmapped mounts will not see a change
in behavior and will also not see any performance impact. This also means that
the non-idmapped-mount aware generic_fillattr() helper can be implemented on
top of the idmapped-mount aware mapped_generic_fillattr() helper.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/stat.c          | 18 +++++++++++++-----
 include/linux/fs.h |  1 +
 2 files changed, 14 insertions(+), 5 deletions(-)
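
For readers following along, the mapping step described above can be modelled
in userspace roughly as follows. This is only a sketch under stated
assumptions: the toy_* names, the single-range struct toy_userns and the
direction in which toy_id_into_mnt() translates ids are made up for
illustration and are not the kernel's types or helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these are kernel definitions. */
#define TOY_INVALID_ID UINT32_MAX

/* One contiguous id range: [first, first + count) maps to lower_first... */
struct toy_userns {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Stand-in for the initial user namespace: an identity mapping. */
static const struct toy_userns toy_init_ns = {
	.first = 0, .lower_first = 0, .count = UINT32_MAX,
};

struct toy_inode { uint32_t i_uid, i_gid; };
struct toy_stat  { uint32_t uid, gid; };

/* Assumed direction: an id stored on the inode is looked up in the mount's
 * range and translated to the id the caller sees. Ids outside the range
 * become invalid. */
static uint32_t toy_id_into_mnt(const struct toy_userns *ns, uint32_t id)
{
	if (id < ns->first || id - ns->first >= ns->count)
		return TOY_INVALID_ID;
	return ns->lower_first + (id - ns->first);
}

/* Mapping-aware variant: the only difference is how uid and gid are read. */
static void toy_mapped_fillattr(const struct toy_userns *ns,
				const struct toy_inode *inode,
				struct toy_stat *stat)
{
	assert(ns != NULL && inode != NULL && stat != NULL);
	stat->uid = toy_id_into_mnt(ns, inode->i_uid);
	stat->gid = toy_id_into_mnt(ns, inode->i_gid);
}

/* The plain variant is a thin wrapper that passes the identity mapping. */
static void toy_fillattr(const struct toy_inode *inode, struct toy_stat *stat)
{
	toy_mapped_fillattr(&toy_init_ns, inode, stat);
}
```

With the identity namespace the mapped helper hands back the inode's ids
unchanged, which is why the plain helper can simply wrap it.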

diff --git a/fs/stat.c b/fs/stat.c
index dacecdda2e79..ee6d92aec7ac 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -25,7 +25,8 @@
 #include "mount.h"
 
 /**
- * generic_fillattr - Fill in the basic attributes from the inode struct
+ * mapped_generic_fillattr - Fill in the basic attributes from the inode struct on idmapped mounts
+ * @mnt_user_ns: the user namespace from which we access this inode
  * @inode: Inode to use as the source
  * @stat: Where to fill in the attributes
  *
@@ -33,14 +34,15 @@
  * found on the VFS inode structure.  This is the default if no getattr inode
  * operation is supplied.
  */
-void generic_fillattr(struct inode *inode, struct kstat *stat)
+void mapped_generic_fillattr(struct user_namespace *mnt_user_ns,
+			 struct inode *inode, struct kstat *stat)
 {
 	stat->dev = inode->i_sb->s_dev;
 	stat->ino = inode->i_ino;
 	stat->mode = inode->i_mode;
 	stat->nlink = inode->i_nlink;
-	stat->uid = inode->i_uid;
-	stat->gid = inode->i_gid;
+	stat->uid = i_uid_into_mnt(mnt_user_ns, inode);
+	stat->gid = i_gid_into_mnt(mnt_user_ns, inode);
 	stat->rdev = inode->i_rdev;
 	stat->size = i_size_read(inode);
 	stat->atime = inode->i_atime;
@@ -49,6 +51,12 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
 	stat->blksize = i_blocksize(inode);
 	stat->blocks = inode->i_blocks;
 }
+EXPORT_SYMBOL(mapped_generic_fillattr);
+
+void generic_fillattr(struct inode *inode, struct kstat *stat)
+{
+	mapped_generic_fillattr(&init_user_ns, inode, stat);
+}
 EXPORT_SYMBOL(generic_fillattr);
 
 /**
@@ -87,7 +95,7 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 		return inode->i_op->getattr(path, stat, request_mask,
 					    query_flags);
 
-	generic_fillattr(inode, stat);
+	mapped_generic_fillattr(mnt_user_ns(path->mnt), inode, stat);
 	return 0;
 }
 EXPORT_SYMBOL(vfs_getattr_nosec);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f41d93b0e6d7..e66852dee65d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3162,6 +3162,7 @@ extern int page_symlink(struct inode *inode, const char *symname, int len);
 extern const struct inode_operations page_symlink_inode_operations;
 extern void kfree_link(void *);
 extern void generic_fillattr(struct inode *, struct kstat *);
+extern void mapped_generic_fillattr(struct user_namespace *, struct inode *, struct kstat *);
 extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
 extern int vfs_getattr(const struct path *, struct kstat *, u32, unsigned int);
 void __inode_add_bytes(struct inode *inode, loff_t bytes);
-- 
2.29.0



* [PATCH 16/34] namei: handle idmapped mounts in may_*() helpers
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (14 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 15/34] stat: add mapped_generic_fillattr() Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 17/34] namei: introduce struct renamedata Christian Brauner
                   ` (21 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

The may_follow_link(), may_linkat(), may_lookup(), may_open(), may_o_create(),
may_create_in_sticky(), may_delete(), and may_create() helpers determine
whether the caller is privileged enough to perform the associated operations.
Let them handle idmapped mounts by mapping the inode and fsids according to
the mount's user namespace. Afterwards the checks are identical to those for
non-idmapped mounts. If the initial user namespace is passed, all mapping
operations are a nop, so non-idmapped mounts will not see a change in behavior
and will also not see any performance impact.

Since the may_*() helpers are not exposed to other parts of the kernel, we can
simply extend them with an additional argument in case they don't already have
access to the mount's user namespace.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/namei.c | 106 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 63 insertions(+), 43 deletions(-)
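
As an illustration of "map first, then run the usual check", here is a
userspace sketch of a sticky-directory check in the spirit of
do_check_sticky() below. It is not the kernel code: struct toy_userns,
toy_uid_into_mnt(), toy_check_sticky() and the boolean standing in for
CAP_FOWNER are hypothetical, and the direction of the mapping is an
assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these are kernel definitions. */
#define TOY_INVALID_ID UINT32_MAX

struct toy_userns {
	uint32_t first;       /* first id as stored on the inode */
	uint32_t lower_first; /* id it is translated to */
	uint32_t count;       /* length of the range */
};

/* Translate an inode owner through the mount's range (assumed direction). */
static uint32_t toy_uid_into_mnt(const struct toy_userns *ns, uint32_t uid)
{
	assert(ns != NULL);
	if (uid < ns->first || uid - ns->first >= ns->count)
		return TOY_INVALID_ID;
	return ns->lower_first + (uid - ns->first);
}

/*
 * Returns 0 if the caller may delete the entry from a sticky directory and
 * nonzero otherwise. Both owners are mapped before being compared with the
 * caller's fsuid; the comparison itself is the same as without a mapping.
 */
static int toy_check_sticky(const struct toy_userns *ns, uint32_t dir_uid,
			    uint32_t inode_uid, uint32_t fsuid,
			    bool has_cap_fowner)
{
	if (toy_uid_into_mnt(ns, inode_uid) == fsuid)
		return 0;
	if (toy_uid_into_mnt(ns, dir_uid) == fsuid)
		return 0;
	return !has_cap_fowner;
}
```

In this model a caller with fsuid 101000 owns a file stored with uid 1000 when
the mount maps 0..65535 to 100000..165535, while a caller with fsuid 1000 does
not.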

diff --git a/fs/namei.c b/fs/namei.c
index 2635f6a57de5..76ee4d52bd5e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -985,11 +985,14 @@ int sysctl_protected_regular __read_mostly;
  */
 static inline int may_follow_link(struct nameidata *nd, const struct inode *inode)
 {
+	struct user_namespace *user_ns;
+
 	if (!sysctl_protected_symlinks)
 		return 0;
 
+	user_ns = mnt_user_ns(nd->path.mnt);
 	/* Allowed if owner and follower match. */
-	if (uid_eq(current_cred()->fsuid, inode->i_uid))
+	if (uid_eq(current_cred()->fsuid, i_uid_into_mnt(user_ns, inode)))
 		return 0;
 
 	/* Allowed if parent directory not sticky and world-writable. */
@@ -1020,7 +1023,7 @@ static inline int may_follow_link(struct nameidata *nd, const struct inode *inod
  *
  * Otherwise returns true.
  */
-static bool safe_hardlink_source(struct inode *inode)
+static bool safe_hardlink_source(struct user_namespace *user_ns, struct inode *inode)
 {
 	umode_t mode = inode->i_mode;
 
@@ -1037,7 +1040,7 @@ static bool safe_hardlink_source(struct inode *inode)
 		return false;
 
 	/* Hardlinking to unreadable or unwritable sources is dangerous. */
-	if (inode_permission(inode, MAY_READ | MAY_WRITE))
+	if (mapped_inode_permission(user_ns, inode, MAY_READ | MAY_WRITE))
 		return false;
 
 	return true;
@@ -1058,6 +1061,7 @@ static bool safe_hardlink_source(struct inode *inode)
 int may_linkat(struct path *link)
 {
 	struct inode *inode = link->dentry->d_inode;
+	struct user_namespace *user_ns;
 
 	/* Inode writeback is not safe when the uid or gid are invalid. */
 	if (!uid_valid(inode->i_uid) || !gid_valid(inode->i_gid))
@@ -1069,7 +1073,9 @@ int may_linkat(struct path *link)
 	/* Source inode owner (or CAP_FOWNER) can hardlink all they like,
 	 * otherwise, it must be a safe source.
 	 */
-	if (safe_hardlink_source(inode) || inode_owner_or_capable(inode))
+	user_ns = mnt_user_ns(link->mnt);
+	if (safe_hardlink_source(user_ns, inode) ||
+	    mapped_inode_owner_or_capable(user_ns, inode))
 		return 0;
 
 	audit_log_path_denied(AUDIT_ANOM_LINK, "linkat");
@@ -1097,14 +1103,17 @@ int may_linkat(struct path *link)
  *
  * Returns 0 if the open is allowed, -ve on error.
  */
-static int may_create_in_sticky(umode_t dir_mode, kuid_t dir_uid,
-				struct inode * const inode)
+static int may_create_in_sticky(struct nameidata *nd, struct inode *const inode)
 {
+	struct user_namespace *user_ns;
+	umode_t dir_mode = nd->dir_mode;
+	kuid_t dir_uid = nd->dir_uid;
+
+	user_ns = mnt_user_ns(nd->path.mnt);
 	if ((!sysctl_protected_fifos && S_ISFIFO(inode->i_mode)) ||
 	    (!sysctl_protected_regular && S_ISREG(inode->i_mode)) ||
-	    likely(!(dir_mode & S_ISVTX)) ||
-	    uid_eq(inode->i_uid, dir_uid) ||
-	    uid_eq(current_fsuid(), inode->i_uid))
+	    likely(!(dir_mode & S_ISVTX)) || uid_eq(inode->i_uid, dir_uid) ||
+	    uid_eq(current_fsuid(), i_uid_into_mnt(user_ns, inode)))
 		return 0;
 
 	if (likely(dir_mode & 0002) ||
@@ -1596,14 +1605,16 @@ static struct dentry *lookup_slow(const struct qstr *name,
 
 static inline int may_lookup(struct nameidata *nd)
 {
+	struct user_namespace *user_ns = mnt_user_ns(nd->path.mnt);
+
 	if (nd->flags & LOOKUP_RCU) {
-		int err = inode_permission(nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
+		int err = mapped_inode_permission(user_ns, nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
 		if (err != -ECHILD)
 			return err;
 		if (unlazy_walk(nd))
 			return -ECHILD;
 	}
-	return inode_permission(nd->inode, MAY_EXEC);
+	return mapped_inode_permission(user_ns, nd->inode, MAY_EXEC);
 }
 
 static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq)
@@ -2680,15 +2691,18 @@ int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
 }
 EXPORT_SYMBOL(user_path_at_empty);
 
-int __check_sticky(struct inode *dir, struct inode *inode)
+static int do_check_sticky(struct user_namespace *user_ns, struct inode *dir, struct inode *inode)
 {
-	kuid_t fsuid = current_fsuid();
-
-	if (uid_eq(inode->i_uid, fsuid))
+	if (uid_eq(i_uid_into_mnt(user_ns, inode), current_fsuid()))
 		return 0;
-	if (uid_eq(dir->i_uid, fsuid))
+	if (uid_eq(i_uid_into_mnt(user_ns, dir), current_fsuid()))
 		return 0;
-	return !capable_wrt_inode_uidgid(inode, CAP_FOWNER);
+	return !capable_wrt_mapped_inode_uidgid(user_ns, inode, CAP_FOWNER);
+}
+
+int __check_sticky(struct inode *dir, struct inode *inode)
+{
+	return do_check_sticky(&init_user_ns, dir, inode);
 }
 EXPORT_SYMBOL(__check_sticky);
 
@@ -2712,7 +2726,7 @@ EXPORT_SYMBOL(__check_sticky);
  * 11. We don't allow removal of NFS sillyrenamed files; it's handled by
  *     nfs_async_unlink().
  */
-static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
+static int may_delete(struct user_namespace *user_ns, struct inode *dir, struct dentry *victim, bool isdir)
 {
 	struct inode *inode = d_backing_inode(victim);
 	int error;
@@ -2729,13 +2743,13 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
 
 	audit_inode_child(dir, victim, AUDIT_TYPE_CHILD_DELETE);
 
-	error = inode_permission(dir, MAY_WRITE | MAY_EXEC);
+	error = mapped_inode_permission(user_ns, dir, MAY_WRITE | MAY_EXEC);
 	if (error)
 		return error;
 	if (IS_APPEND(dir))
 		return -EPERM;
 
-	if (check_sticky(dir, inode) || IS_APPEND(inode) ||
+	if (do_check_sticky(user_ns, dir, inode) || IS_APPEND(inode) ||
 	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode))
 		return -EPERM;
 	if (isdir) {
@@ -2761,7 +2775,8 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
  *  4. We should have write and exec permissions on dir
  *  5. We can't do it if dir is immutable (done in permission())
  */
-static inline int may_create(struct inode *dir, struct dentry *child)
+static inline int may_create(struct user_namespace *user_ns, struct inode *dir,
+			     struct dentry *child)
 {
 	struct user_namespace *s_user_ns;
 	audit_inode_child(dir, child, AUDIT_TYPE_CHILD_CREATE);
@@ -2770,10 +2785,10 @@ static inline int may_create(struct inode *dir, struct dentry *child)
 	if (IS_DEADDIR(dir))
 		return -ENOENT;
 	s_user_ns = dir->i_sb->s_user_ns;
-	if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
-	    !kgid_has_mapping(s_user_ns, current_fsgid()))
+	if (!kuid_has_mapping(s_user_ns, fsuid_into_mnt(user_ns)) ||
+	    !kgid_has_mapping(s_user_ns, fsgid_into_mnt(user_ns)))
 		return -EOVERFLOW;
-	return inode_permission(dir, MAY_WRITE | MAY_EXEC);
+	return mapped_inode_permission(user_ns, dir, MAY_WRITE | MAY_EXEC);
 }
 
 /*
@@ -2823,7 +2838,7 @@ EXPORT_SYMBOL(unlock_rename);
 int vfs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 		bool want_excl)
 {
-	int error = may_create(dir, dentry);
+	int error = may_create(&init_user_ns, dir, dentry);
 	if (error)
 		return error;
 
@@ -2846,7 +2861,7 @@ int vfs_mkobj(struct dentry *dentry, umode_t mode,
 		void *arg)
 {
 	struct inode *dir = dentry->d_parent->d_inode;
-	int error = may_create(dir, dentry);
+	int error = may_create(&init_user_ns, dir, dentry);
 	if (error)
 		return error;
 
@@ -2870,6 +2885,7 @@ bool may_open_dev(const struct path *path)
 
 static int may_open(const struct path *path, int acc_mode, int flag)
 {
+	struct user_namespace *user_ns;
 	struct dentry *dentry = path->dentry;
 	struct inode *inode = dentry->d_inode;
 	int error;
@@ -2903,7 +2919,8 @@ static int may_open(const struct path *path, int acc_mode, int flag)
 		break;
 	}
 
-	error = inode_permission(inode, MAY_OPEN | acc_mode);
+	user_ns = mnt_user_ns(path->mnt);
+	error = mapped_inode_permission(user_ns, inode, MAY_OPEN | acc_mode);
 	if (error)
 		return error;
 
@@ -2918,7 +2935,7 @@ static int may_open(const struct path *path, int acc_mode, int flag)
 	}
 
 	/* O_NOATIME can only be set by the owner or superuser */
-	if (flag & O_NOATIME && !inode_owner_or_capable(inode))
+	if (flag & O_NOATIME && !mapped_inode_owner_or_capable(user_ns, inode))
 		return -EPERM;
 
 	return 0;
@@ -2955,17 +2972,19 @@ static inline int open_to_namei_flags(int flag)
 
 static int may_o_create(const struct path *dir, struct dentry *dentry, umode_t mode)
 {
-	struct user_namespace *s_user_ns;
+	struct user_namespace *s_user_ns, *user_ns;
 	int error = security_path_mknod(dir, dentry, mode, 0);
 	if (error)
 		return error;
 
+	user_ns = mnt_user_ns(dir->mnt);
 	s_user_ns = dir->dentry->d_sb->s_user_ns;
-	if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
-	    !kgid_has_mapping(s_user_ns, current_fsgid()))
+	if (!kuid_has_mapping(s_user_ns, fsuid_into_mnt(user_ns)) ||
+	    !kgid_has_mapping(s_user_ns, fsgid_into_mnt(user_ns)))
 		return -EOVERFLOW;
 
-	error = inode_permission(dir->dentry->d_inode, MAY_WRITE | MAY_EXEC);
+	error = mapped_inode_permission(user_ns, dir->dentry->d_inode,
+				    MAY_WRITE | MAY_EXEC);
 	if (error)
 		return error;
 
@@ -3258,7 +3277,7 @@ static int do_open(struct nameidata *nd,
 			return -EEXIST;
 		if (d_is_dir(nd->path.dentry))
 			return -EISDIR;
-		error = may_create_in_sticky(nd->dir_mode, nd->dir_uid,
+		error = may_create_in_sticky(nd,
 					     d_backing_inode(nd->path.dentry));
 		if (unlikely(error))
 			return error;
@@ -3560,7 +3579,7 @@ EXPORT_SYMBOL(user_path_create);
 int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
 {
 	bool is_whiteout = S_ISCHR(mode) && dev == WHITEOUT_DEV;
-	int error = may_create(dir, dentry);
+	int error = may_create(&init_user_ns, dir, dentry);
 
 	if (error)
 		return error;
@@ -3661,7 +3680,7 @@ SYSCALL_DEFINE3(mknod, const char __user *, filename, umode_t, mode, unsigned, d
 
 int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 {
-	int error = may_create(dir, dentry);
+	int error = may_create(&init_user_ns, dir, dentry);
 	unsigned max_links = dir->i_sb->s_max_links;
 
 	if (error)
@@ -3722,7 +3741,7 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
 
 int vfs_rmdir(struct inode *dir, struct dentry *dentry)
 {
-	int error = may_delete(dir, dentry, 1);
+	int error = may_delete(&init_user_ns, dir, dentry, 1);
 
 	if (error)
 		return error;
@@ -3844,7 +3863,7 @@ SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
 int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegated_inode)
 {
 	struct inode *target = dentry->d_inode;
-	int error = may_delete(dir, dentry, 0);
+	int error = may_delete(&init_user_ns, dir, dentry, 0);
 
 	if (error)
 		return error;
@@ -3976,7 +3995,7 @@ SYSCALL_DEFINE1(unlink, const char __user *, pathname)
 
 int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
 {
-	int error = may_create(dir, dentry);
+	int error = may_create(&init_user_ns, dir, dentry);
 
 	if (error)
 		return error;
@@ -4065,7 +4084,7 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	if (!inode)
 		return -ENOENT;
 
-	error = may_create(dir, new_dentry);
+	error = may_create(&init_user_ns, dir, new_dentry);
 	if (error)
 		return error;
 
@@ -4257,6 +4276,7 @@ int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 	       struct inode **delegated_inode, unsigned int flags)
 {
 	int error;
+	struct user_namespace *user_ns = &init_user_ns;
 	bool is_dir = d_is_dir(old_dentry);
 	struct inode *source = old_dentry->d_inode;
 	struct inode *target = new_dentry->d_inode;
@@ -4267,19 +4287,19 @@ int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 	if (source == target)
 		return 0;
 
-	error = may_delete(old_dir, old_dentry, is_dir);
+	error = may_delete(user_ns, old_dir, old_dentry, is_dir);
 	if (error)
 		return error;
 
 	if (!target) {
-		error = may_create(new_dir, new_dentry);
+		error = may_create(user_ns, new_dir, new_dentry);
 	} else {
 		new_is_dir = d_is_dir(new_dentry);
 
 		if (!(flags & RENAME_EXCHANGE))
-			error = may_delete(new_dir, new_dentry, is_dir);
+			error = may_delete(user_ns, new_dir, new_dentry, is_dir);
 		else
-			error = may_delete(new_dir, new_dentry, new_is_dir);
+			error = may_delete(user_ns, new_dir, new_dentry, new_is_dir);
 	}
 	if (error)
 		return error;
-- 
2.29.0



* [PATCH 17/34] namei: introduce struct renamedata
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (15 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 16/34] namei: handle idmapped mounts in may_*() helpers Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 18/34] namei: prepare for idmapped mounts Christian Brauner
                   ` (20 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

In order to handle idmapped mounts we will extend the vfs rename helper
to take two new arguments in follow-up patches. Since this operation already
takes a bunch of arguments, add a simple struct renamedata (based on struct
nameidata) and make the current helper use it before we extend it.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/namei.c | 144 ++++++++++++++++++++++++++++++++---------------------
 1 file changed, 88 insertions(+), 56 deletions(-)
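
The parameter-struct pattern used here can be sketched in plain C as follows.
The names (struct toy_renamedata, toy_rename_impl(), toy_rename()) and the
string-based fields are hypothetical and exist only for this example. The
sketch also shows a property of designated initializers worth keeping in mind
with this pattern: members that are left out are zero-initialized, so a
forgotten pointer member arrives as NULL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical argument bundle, loosely modelled on the idea above. */
struct toy_renamedata {
	const char *old_dir;
	const char *old_name;
	const char *new_dir;
	const char *new_name;
	unsigned int flags;
};

/* The worker only sees the struct, so new members can be added later
 * without touching its signature. */
static int toy_rename_impl(const struct toy_renamedata *rd)
{
	assert(rd != NULL);
	if (!rd->old_dir || !rd->old_name || !rd->new_dir || !rd->new_name)
		return -EINVAL;
	/* A real implementation would perform the rename here. */
	return 0;
}

/* The old entry point keeps its signature and just fills in the struct. */
static int toy_rename(const char *old_dir, const char *old_name,
		      const char *new_dir, const char *new_name,
		      unsigned int flags)
{
	struct toy_renamedata rd = {
		.old_dir  = old_dir,
		.old_name = old_name,
		.new_dir  = new_dir,
		.new_name = new_name,
		.flags    = flags,
	};
	return toy_rename_impl(&rd);
}
```

Existing callers keep calling toy_rename() unchanged, while callers that need
additional arguments later can build the struct themselves.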

diff --git a/fs/namei.c b/fs/namei.c
index 76ee4d52bd5e..781f11795a22 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4221,62 +4221,24 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
 	return do_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
 }
 
-/**
- * vfs_rename - rename a filesystem object
- * @old_dir:	parent of source
- * @old_dentry:	source
- * @new_dir:	parent of destination
- * @new_dentry:	destination
- * @delegated_inode: returns an inode needing a delegation break
- * @flags:	rename flags
- *
- * The caller must hold multiple mutexes--see lock_rename()).
- *
- * If vfs_rename discovers a delegation in need of breaking at either
- * the source or destination, it will return -EWOULDBLOCK and return a
- * reference to the inode in delegated_inode.  The caller should then
- * break the delegation and retry.  Because breaking a delegation may
- * take a long time, the caller should drop all locks before doing
- * so.
- *
- * Alternatively, a caller may pass NULL for delegated_inode.  This may
- * be appropriate for callers that expect the underlying filesystem not
- * to be NFS exported.
- *
- * The worst of all namespace operations - renaming directory. "Perverted"
- * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
- * Problems:
- *
- *	a) we can get into loop creation.
- *	b) race potential - two innocent renames can create a loop together.
- *	   That's where 4.4 screws up. Current fix: serialization on
- *	   sb->s_vfs_rename_mutex. We might be more accurate, but that's another
- *	   story.
- *	c) we have to lock _four_ objects - parents and victim (if it exists),
- *	   and source (if it is not a directory).
- *	   And that - after we got ->i_mutex on parents (until then we don't know
- *	   whether the target exists).  Solution: try to be smart with locking
- *	   order for inodes.  We rely on the fact that tree topology may change
- *	   only under ->s_vfs_rename_mutex _and_ that parent of the object we
- *	   move will be locked.  Thus we can rank directories by the tree
- *	   (ancestors first) and rank all non-directories after them.
- *	   That works since everybody except rename does "lock parent, lookup,
- *	   lock child" and rename is under ->s_vfs_rename_mutex.
- *	   HOWEVER, it relies on the assumption that any object with ->lookup()
- *	   has no more than 1 dentry.  If "hybrid" objects will ever appear,
- *	   we'd better make sure that there's no link(2) for them.
- *	d) conversion from fhandle to dentry may come in the wrong moment - when
- *	   we are removing the target. Solution: we will have to grab ->i_mutex
- *	   in the fhandle_to_dentry code. [FIXME - current nfsfh.c relies on
- *	   ->i_mutex on parents, which works but leads to some truly excessive
- *	   locking].
- */
-int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
-	       struct inode *new_dir, struct dentry *new_dentry,
-	       struct inode **delegated_inode, unsigned int flags)
+struct renamedata {
+	struct inode *old_dir;
+	struct dentry *old_dentry;
+	struct inode *new_dir;
+	struct dentry *new_dentry;
+	struct inode **delegated_inode;
+	unsigned int flags;
+} __randomize_layout;
+
+static int __vfs_rename(struct renamedata *rd)
 {
 	int error;
 	struct user_namespace *user_ns = &init_user_ns;
+	struct inode *old_dir = rd->old_dir, *new_dir = rd->new_dir;
+	struct dentry *old_dentry = rd->old_dentry,
+		      *new_dentry = rd->new_dentry;
+	struct inode **delegated_inode = rd->delegated_inode;
+	unsigned int flags = rd->flags;
 	bool is_dir = d_is_dir(old_dentry);
 	struct inode *source = old_dentry->d_inode;
 	struct inode *target = new_dentry->d_inode;
@@ -4395,11 +4357,76 @@ int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 
 	return error;
 }
+
+/**
+ * vfs_rename - rename a filesystem object
+ * @old_dir:	parent of source
+ * @old_dentry:	source
+ * @new_dir:	parent of destination
+ * @new_dentry:	destination
+ * @delegated_inode: returns an inode needing a delegation break
+ * @flags:	rename flags
+ *
+ * The caller must hold multiple mutexes--see lock_rename()).
+ *
+ * If vfs_rename discovers a delegation in need of breaking at either
+ * the source or destination, it will return -EWOULDBLOCK and return a
+ * reference to the inode in delegated_inode.  The caller should then
+ * break the delegation and retry.  Because breaking a delegation may
+ * take a long time, the caller should drop all locks before doing
+ * so.
+ *
+ * Alternatively, a caller may pass NULL for delegated_inode.  This may
+ * be appropriate for callers that expect the underlying filesystem not
+ * to be NFS exported.
+ *
+ * The worst of all namespace operations - renaming directory. "Perverted"
+ * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
+ * Problems:
+ *
+ *	a) we can get into loop creation.
+ *	b) race potential - two innocent renames can create a loop together.
+ *	   That's where 4.4 screws up. Current fix: serialization on
+ *	   sb->s_vfs_rename_mutex. We might be more accurate, but that's another
+ *	   story.
+ *	c) we have to lock _four_ objects - parents and victim (if it exists),
+ *	   and source (if it is not a directory).
+ *	   And that - after we got ->i_mutex on parents (until then we don't know
+ *	   whether the target exists).  Solution: try to be smart with locking
+ *	   order for inodes.  We rely on the fact that tree topology may change
+ *	   only under ->s_vfs_rename_mutex _and_ that parent of the object we
+ *	   move will be locked.  Thus we can rank directories by the tree
+ *	   (ancestors first) and rank all non-directories after them.
+ *	   That works since everybody except rename does "lock parent, lookup,
+ *	   lock child" and rename is under ->s_vfs_rename_mutex.
+ *	   HOWEVER, it relies on the assumption that any object with ->lookup()
+ *	   has no more than 1 dentry.  If "hybrid" objects will ever appear,
+ *	   we'd better make sure that there's no link(2) for them.
+ *	d) conversion from fhandle to dentry may come in the wrong moment - when
+ *	   we are removing the target. Solution: we will have to grab ->i_mutex
+ *	   in the fhandle_to_dentry code. [FIXME - current nfsfh.c relies on
+ *	   ->i_mutex on parents, which works but leads to some truly excessive
+ *	   locking].
+ */
+int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+	       struct inode *new_dir, struct dentry *new_dentry,
+	       struct inode **delegated_inode, unsigned int flags)
+{
+	struct renamedata rd = {
+		.old_dir	 = old_dir,
+		.new_dir	 = new_dir,
+		.old_dentry	 = old_dentry, .new_dentry = new_dentry,
+		.delegated_inode = delegated_inode,
+		.flags		 = flags,
+	};
+	return __vfs_rename(&rd);
+}
 EXPORT_SYMBOL(vfs_rename);
 
 static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 			const char __user *newname, unsigned int flags)
 {
+	struct renamedata rd;
 	struct dentry *old_dentry, *new_dentry;
 	struct dentry *trap;
 	struct path old_path, new_path;
@@ -4505,9 +4532,14 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 				     &new_path, new_dentry, flags);
 	if (error)
 		goto exit5;
-	error = vfs_rename(old_path.dentry->d_inode, old_dentry,
-			   new_path.dentry->d_inode, new_dentry,
-			   &delegated_inode, flags);
+
+	rd.old_dir	   = old_path.dentry->d_inode;
+	rd.old_dentry	   = old_dentry;
+	rd.new_dir	   = new_path.dentry->d_inode;
+	rd.new_dentry	   = new_dentry;
+	rd.delegated_inode = &delegated_inode;
+	rd.flags	   = flags;
+	error = __vfs_rename(&rd);
 exit5:
 	dput(new_dentry);
 exit4:
-- 
2.29.0



* [PATCH 18/34] namei: prepare for idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (16 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 17/34] namei: introduce struct renamedata Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 19/34] namei: add lookup helpers with idmapped mounts aware permission checking Christian Brauner
                   ` (19 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

The various vfs_*() helpers are called by filesystems or by the vfs itself to
perform core operations such as create, link, mkdir, mknod, rename, rmdir,
tmpfile, and unlink. Add a set of helpers that handle idmapped mounts. If the
inode is accessed through an idmapped mount, it is mapped according to the
mount's user namespace. Afterwards the checks and operations are identical to
those for non-idmapped mounts. If the initial user namespace is passed, all
mapping operations are a nop, so non-idmapped mounts will not see a change in
behavior and will also not see any performance impact. It also means that the
non-idmapped-mount aware helpers can be implemented on top of their
idmapped-mount aware counterparts by passing the initial user namespace.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/namei.c               | 229 +++++++++++++++++++++++++++------------
 fs/overlayfs/overlayfs.h |   2 +-
 include/linux/fs.h       |  32 +++++-
 3 files changed, 192 insertions(+), 71 deletions(-)
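
To illustrate how a non-idmapped helper can sit on top of its idmapped
counterpart, here is a userspace sketch of a create-style check. It assumes
that the caller's fsuid is translated back through the mount's mapping before
it is checked, and that an untranslatable id is rejected with -EOVERFLOW as in
may_create(). Everything named toy_* is hypothetical, and returning the
resulting owner through an out parameter is purely illustrative; the helpers
in this patch only pass the namespace down to the permission checks.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these are kernel definitions. */
#define TOY_INVALID_ID UINT32_MAX

struct toy_userns {
	uint32_t first;       /* first id as stored on the filesystem */
	uint32_t lower_first; /* first id as seen by callers */
	uint32_t count;       /* length of the range */
};

/* Stand-in for the initial user namespace: an identity mapping. */
static const struct toy_userns toy_init_ns = {
	.first = 0, .lower_first = 0, .count = UINT32_MAX,
};

/* Assumed direction: caller id back to the id stored on the filesystem. */
static uint32_t toy_fsuid_into_mnt(const struct toy_userns *ns, uint32_t fsuid)
{
	if (fsuid < ns->lower_first || fsuid - ns->lower_first >= ns->count)
		return TOY_INVALID_ID;
	return ns->first + (fsuid - ns->lower_first);
}

/* Mapping-aware variant: refuse callers whose id has no mapping. */
static int toy_mapped_create(const struct toy_userns *ns, uint32_t fsuid,
			     uint32_t *owner)
{
	uint32_t mapped;

	assert(ns != NULL && owner != NULL);
	mapped = toy_fsuid_into_mnt(ns, fsuid);
	if (mapped == TOY_INVALID_ID)
		return -EOVERFLOW;
	*owner = mapped;
	return 0;
}

/* Plain variant: same code path, identity mapping. */
static int toy_create(uint32_t fsuid, uint32_t *owner)
{
	return toy_mapped_create(&toy_init_ns, fsuid, owner);
}
```

Callers that are unaware of idmapped mounts keep using the plain variant and
take exactly the same path with an identity mapping.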

diff --git a/fs/namei.c b/fs/namei.c
index 781f11795a22..a8a3de936cfc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2835,10 +2835,10 @@ void unlock_rename(struct dentry *p1, struct dentry *p2)
 }
 EXPORT_SYMBOL(unlock_rename);
 
-int vfs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
-		bool want_excl)
+int vfs_mapped_create(struct user_namespace *user_ns, struct inode *dir,
+		      struct dentry *dentry, umode_t mode, bool want_excl)
 {
-	int error = may_create(&init_user_ns, dir, dentry);
+	int error = may_create(user_ns, dir, dentry);
 	if (error)
 		return error;
 
@@ -2854,6 +2854,13 @@ int vfs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 		fsnotify_create(dir, dentry);
 	return error;
 }
+EXPORT_SYMBOL(vfs_mapped_create);
+
+int vfs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
+		bool want_excl)
+{
+	return vfs_mapped_create(&init_user_ns, dir, dentry, mode, want_excl);
+}
 EXPORT_SYMBOL(vfs_create);
 
 int vfs_mkobj(struct dentry *dentry, umode_t mode,
@@ -3313,7 +3320,9 @@ static int do_open(struct nameidata *nd,
 	return error;
 }
 
-struct dentry *vfs_tmpfile(struct dentry *dentry, umode_t mode, int open_flag)
+struct dentry *vfs_mapped_tmpfile(struct user_namespace *user_ns,
+				  struct dentry *dentry, umode_t mode,
+				  int open_flag)
 {
 	struct dentry *child = NULL;
 	struct inode *dir = dentry->d_inode;
@@ -3321,7 +3330,7 @@ struct dentry *vfs_tmpfile(struct dentry *dentry, umode_t mode, int open_flag)
 	int error;
 
 	/* we want directory to be writable */
-	error = inode_permission(dir, MAY_WRITE | MAY_EXEC);
+	error = mapped_inode_permission(user_ns, dir, MAY_WRITE | MAY_EXEC);
 	if (error)
 		goto out_err;
 	error = -EOPNOTSUPP;
@@ -3350,12 +3359,19 @@ struct dentry *vfs_tmpfile(struct dentry *dentry, umode_t mode, int open_flag)
 	dput(child);
 	return ERR_PTR(error);
 }
+EXPORT_SYMBOL(vfs_mapped_tmpfile);
+
+struct dentry *vfs_tmpfile(struct dentry *dentry, umode_t mode, int open_flag)
+{
+	return vfs_mapped_tmpfile(&init_user_ns, dentry, mode, open_flag);
+}
 EXPORT_SYMBOL(vfs_tmpfile);
 
 static int do_tmpfile(struct nameidata *nd, unsigned flags,
 		const struct open_flags *op,
 		struct file *file)
 {
+	struct user_namespace *user_ns;
 	struct dentry *child;
 	struct path path;
 	int error = path_lookupat(nd, flags | LOOKUP_DIRECTORY, &path);
@@ -3364,7 +3380,8 @@ static int do_tmpfile(struct nameidata *nd, unsigned flags,
 	error = mnt_want_write(path.mnt);
 	if (unlikely(error))
 		goto out;
-	child = vfs_tmpfile(path.dentry, op->mode, op->open_flag);
+	user_ns = mnt_user_ns(path.mnt);
+	child = vfs_mapped_tmpfile(user_ns, path.dentry, op->mode, op->open_flag);
 	error = PTR_ERR(child);
 	if (IS_ERR(child))
 		goto out2;
@@ -3576,10 +3593,11 @@ inline struct dentry *user_path_create(int dfd, const char __user *pathname,
 }
 EXPORT_SYMBOL(user_path_create);
 
-int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
+int vfs_mapped_mknod(struct user_namespace *user_ns, struct inode *dir,
+		     struct dentry *dentry, umode_t mode, dev_t dev)
 {
 	bool is_whiteout = S_ISCHR(mode) && dev == WHITEOUT_DEV;
-	int error = may_create(&init_user_ns, dir, dentry);
+	int error = may_create(user_ns, dir, dentry);
 
 	if (error)
 		return error;
@@ -3604,6 +3622,12 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
 		fsnotify_create(dir, dentry);
 	return error;
 }
+EXPORT_SYMBOL(vfs_mapped_mknod);
+
+int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
+{
+	return vfs_mapped_mknod(&init_user_ns, dir, dentry, mode, dev);
+}
 EXPORT_SYMBOL(vfs_mknod);
 
 static int may_mknod(umode_t mode)
@@ -3626,6 +3650,7 @@ static int may_mknod(umode_t mode)
 static long do_mknodat(int dfd, const char __user *filename, umode_t mode,
 		unsigned int dev)
 {
+	struct user_namespace *user_ns;
 	struct dentry *dentry;
 	struct path path;
 	int error;
@@ -3644,18 +3669,22 @@ static long do_mknodat(int dfd, const char __user *filename, umode_t mode,
 	error = security_path_mknod(&path, dentry, mode, dev);
 	if (error)
 		goto out;
+
+	user_ns = mnt_user_ns(path.mnt);
 	switch (mode & S_IFMT) {
 		case 0: case S_IFREG:
-			error = vfs_create(path.dentry->d_inode,dentry,mode,true);
+			error = vfs_mapped_create(user_ns, path.dentry->d_inode,
+						  dentry, mode, true);
 			if (!error)
 				ima_post_path_mknod(dentry);
 			break;
 		case S_IFCHR: case S_IFBLK:
-			error = vfs_mknod(path.dentry->d_inode,dentry,mode,
-					new_decode_dev(dev));
+			error = vfs_mapped_mknod(user_ns, path.dentry->d_inode,
+						 dentry, mode, new_decode_dev(dev));
 			break;
 		case S_IFIFO: case S_IFSOCK:
-			error = vfs_mknod(path.dentry->d_inode,dentry,mode,0);
+			error = vfs_mapped_mknod(user_ns, path.dentry->d_inode,
+						 dentry, mode, 0);
 			break;
 	}
 out:
@@ -3678,9 +3707,10 @@ SYSCALL_DEFINE3(mknod, const char __user *, filename, umode_t, mode, unsigned, d
 	return do_mknodat(AT_FDCWD, filename, mode, dev);
 }
 
-int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+int vfs_mapped_mkdir(struct user_namespace *user_ns, struct inode *dir,
+		     struct dentry *dentry, umode_t mode)
 {
-	int error = may_create(&init_user_ns, dir, dentry);
+	int error = may_create(user_ns, dir, dentry);
 	unsigned max_links = dir->i_sb->s_max_links;
 
 	if (error)
@@ -3702,6 +3732,12 @@ int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 		fsnotify_mkdir(dir, dentry);
 	return error;
 }
+EXPORT_SYMBOL(vfs_mapped_mkdir);
+
+int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	return vfs_mapped_mkdir(&init_user_ns, dir, dentry, mode);
+}
 EXPORT_SYMBOL(vfs_mkdir);
 
 static long do_mkdirat(int dfd, const char __user *pathname, umode_t mode)
@@ -3719,8 +3755,11 @@ static long do_mkdirat(int dfd, const char __user *pathname, umode_t mode)
 	if (!IS_POSIXACL(path.dentry->d_inode))
 		mode &= ~current_umask();
 	error = security_path_mkdir(&path, dentry, mode);
-	if (!error)
-		error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
+	if (!error) {
+		struct user_namespace *user_ns;
+		user_ns = mnt_user_ns(path.mnt);
+		error = vfs_mapped_mkdir(user_ns, path.dentry->d_inode, dentry, mode);
+	}
 	done_path_create(&path, dentry);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -3739,9 +3778,10 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
 	return do_mkdirat(AT_FDCWD, pathname, mode);
 }
 
-int vfs_rmdir(struct inode *dir, struct dentry *dentry)
+int vfs_mapped_rmdir(struct user_namespace *user_ns, struct inode *dir,
+		     struct dentry *dentry)
 {
-	int error = may_delete(&init_user_ns, dir, dentry, 1);
+	int error = may_delete(user_ns, dir, dentry, 1);
 
 	if (error)
 		return error;
@@ -3777,10 +3817,17 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
 		d_delete(dentry);
 	return error;
 }
+EXPORT_SYMBOL(vfs_mapped_rmdir);
+
+int vfs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	return vfs_mapped_rmdir(&init_user_ns, dir, dentry);
+}
 EXPORT_SYMBOL(vfs_rmdir);
 
 long do_rmdir(int dfd, struct filename *name)
 {
+	struct user_namespace *user_ns;
 	int error = 0;
 	struct dentry *dentry;
 	struct path path;
@@ -3821,7 +3868,8 @@ long do_rmdir(int dfd, struct filename *name)
 	error = security_path_rmdir(&path, dentry);
 	if (error)
 		goto exit3;
-	error = vfs_rmdir(path.dentry->d_inode, dentry);
+	user_ns = mnt_user_ns(path.mnt);
+	error = vfs_mapped_rmdir(user_ns, path.dentry->d_inode, dentry);
 exit3:
 	dput(dentry);
 exit2:
@@ -3842,28 +3890,11 @@ SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
 	return do_rmdir(AT_FDCWD, getname(pathname));
 }
 
-/**
- * vfs_unlink - unlink a filesystem object
- * @dir:	parent directory
- * @dentry:	victim
- * @delegated_inode: returns victim inode, if the inode is delegated.
- *
- * The caller must hold dir->i_mutex.
- *
- * If vfs_unlink discovers a delegation, it will return -EWOULDBLOCK and
- * return a reference to the inode in delegated_inode.  The caller
- * should then break the delegation on that inode and retry.  Because
- * breaking a delegation may take a long time, the caller should drop
- * dir->i_mutex before doing so.
- *
- * Alternatively, a caller may pass NULL for delegated_inode.  This may
- * be appropriate for callers that expect the underlying filesystem not
- * to be NFS exported.
- */
-int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegated_inode)
+int vfs_mapped_unlink(struct user_namespace *user_ns, struct inode *dir,
+		      struct dentry *dentry, struct inode **delegated_inode)
 {
 	struct inode *target = dentry->d_inode;
-	int error = may_delete(&init_user_ns, dir, dentry, 0);
+	int error = may_delete(user_ns, dir, dentry, 0);
 
 	if (error)
 		return error;
@@ -3899,6 +3930,30 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegate
 
 	return error;
 }
+EXPORT_SYMBOL(vfs_mapped_unlink);
+
+/**
+ * vfs_unlink - unlink a filesystem object
+ * @dir:	parent directory
+ * @dentry:	victim
+ * @delegated_inode: returns victim inode, if the inode is delegated.
+ *
+ * The caller must hold dir->i_mutex.
+ *
+ * If vfs_unlink discovers a delegation, it will return -EWOULDBLOCK and
+ * return a reference to the inode in delegated_inode.  The caller
+ * should then break the delegation on that inode and retry.  Because
+ * breaking a delegation may take a long time, the caller should drop
+ * dir->i_mutex before doing so.
+ *
+ * Alternatively, a caller may pass NULL for delegated_inode.  This may
+ * be appropriate for callers that expect the underlying filesystem not
+ * to be NFS exported.
+ */
+int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegated_inode)
+{
+	return vfs_mapped_unlink(&init_user_ns, dir, dentry, delegated_inode);
+}
 EXPORT_SYMBOL(vfs_unlink);
 
 /*
@@ -3934,6 +3989,8 @@ long do_unlinkat(int dfd, struct filename *name)
 	dentry = __lookup_hash(&last, path.dentry, lookup_flags);
 	error = PTR_ERR(dentry);
 	if (!IS_ERR(dentry)) {
+		struct user_namespace *user_ns;
+
 		/* Why not before? Because we want correct error value */
 		if (last.name[last.len])
 			goto slashes;
@@ -3944,7 +4001,8 @@ long do_unlinkat(int dfd, struct filename *name)
 		error = security_path_unlink(&path, dentry);
 		if (error)
 			goto exit2;
-		error = vfs_unlink(path.dentry->d_inode, dentry, &delegated_inode);
+		user_ns = mnt_user_ns(path.mnt);
+		error = vfs_mapped_unlink(user_ns, path.dentry->d_inode, dentry, &delegated_inode);
 exit2:
 		dput(dentry);
 	}
@@ -3993,9 +4051,10 @@ SYSCALL_DEFINE1(unlink, const char __user *, pathname)
 	return do_unlinkat(AT_FDCWD, getname(pathname));
 }
 
-int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
+int vfs_mapped_symlink(struct user_namespace *user_ns, struct inode *dir,
+		       struct dentry *dentry, const char *oldname)
 {
-	int error = may_create(&init_user_ns, dir, dentry);
+	int error = may_create(user_ns, dir, dentry);
 
 	if (error)
 		return error;
@@ -4012,6 +4071,12 @@ int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
 		fsnotify_create(dir, dentry);
 	return error;
 }
+EXPORT_SYMBOL(vfs_mapped_symlink);
+
+int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
+{
+	return vfs_mapped_symlink(&init_user_ns, dir, dentry, oldname);
+}
 EXPORT_SYMBOL(vfs_symlink);
 
 static long do_symlinkat(const char __user *oldname, int newdfd,
@@ -4033,8 +4098,12 @@ static long do_symlinkat(const char __user *oldname, int newdfd,
 		goto out_putname;
 
 	error = security_path_symlink(&path, dentry, from->name);
-	if (!error)
-		error = vfs_symlink(path.dentry->d_inode, dentry, from->name);
+	if (!error) {
+		struct user_namespace *user_ns;
+		user_ns = mnt_user_ns(path.mnt);
+		error = vfs_mapped_symlink(user_ns, path.dentry->d_inode,
+					   dentry, from->name);
+	}
 	done_path_create(&path, dentry);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -4057,8 +4126,9 @@ SYSCALL_DEFINE2(symlink, const char __user *, oldname, const char __user *, newn
 }
 
 /**
- * vfs_link - create a new link
+ * vfs_mapped_link - create a new link on an idmapped mount
  * @old_dentry:	object to be linked
+ * @user_ns:	the user namespace of the mount
  * @dir:	new parent
  * @new_dentry:	where to create the new link
  * @delegated_inode: returns inode needing a delegation break
@@ -4075,7 +4145,9 @@ SYSCALL_DEFINE2(symlink, const char __user *, oldname, const char __user *, newn
  * be appropriate for callers that expect the underlying filesystem not
  * to be NFS exported.
  */
-int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry, struct inode **delegated_inode)
+int vfs_mapped_link(struct dentry *old_dentry, struct user_namespace *user_ns,
+		    struct inode *dir, struct dentry *new_dentry,
+		    struct inode **delegated_inode)
 {
 	struct inode *inode = old_dentry->d_inode;
 	unsigned max_links = dir->i_sb->s_max_links;
@@ -4084,7 +4156,7 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	if (!inode)
 		return -ENOENT;
 
-	error = may_create(&init_user_ns, dir, new_dentry);
+	error = may_create(user_ns, dir, new_dentry);
 	if (error)
 		return error;
 
@@ -4134,6 +4206,33 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 		fsnotify_link(dir, inode, new_dentry);
 	return error;
 }
+EXPORT_SYMBOL(vfs_mapped_link);
+
+/**
+ * vfs_link - create a new link
+ * @old_dentry:	object to be linked
+ * @dir:	new parent
+ * @new_dentry:	where to create the new link
+ * @delegated_inode: returns inode needing a delegation break
+ *
+ * The caller must hold dir->i_mutex
+ *
+ * If vfs_link discovers a delegation on the to-be-linked file in need
+ * of breaking, it will return -EWOULDBLOCK and return a reference to the
+ * inode in delegated_inode.  The caller should then break the delegation
+ * and retry.  Because breaking a delegation may take a long time, the
+ * caller should drop the i_mutex before doing so.
+ *
+ * Alternatively, a caller may pass NULL for delegated_inode.  This may
+ * be appropriate for callers that expect the underlying filesystem not
+ * to be NFS exported.
+ */
+int vfs_link(struct dentry *old_dentry, struct inode *dir,
+	     struct dentry *new_dentry, struct inode **delegated_inode)
+{
+	return vfs_mapped_link(old_dentry, &init_user_ns, dir, new_dentry,
+			       delegated_inode);
+}
 EXPORT_SYMBOL(vfs_link);
 
 /*
@@ -4148,6 +4247,7 @@ EXPORT_SYMBOL(vfs_link);
 static int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 	      const char __user *newname, int flags)
 {
+	struct user_namespace *user_ns;
 	struct dentry *new_dentry;
 	struct path old_path, new_path;
 	struct inode *delegated_inode = NULL;
@@ -4189,7 +4289,9 @@ static int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 	error = security_path_link(old_path.dentry, &new_path, new_dentry);
 	if (error)
 		goto out_dput;
-	error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
+	user_ns = mnt_user_ns(new_path.mnt);
+	error = vfs_mapped_link(old_path.dentry, user_ns,
+				new_path.dentry->d_inode, new_dentry, &delegated_inode);
 out_dput:
 	done_path_create(&new_path, new_dentry);
 	if (delegated_inode) {
@@ -4221,19 +4323,9 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
 	return do_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
 }
 
-struct renamedata {
-	struct inode *old_dir;
-	struct dentry *old_dentry;
-	struct inode *new_dir;
-	struct dentry *new_dentry;
-	struct inode **delegated_inode;
-	unsigned int flags;
-} __randomize_layout;
-
-static int __vfs_rename(struct renamedata *rd)
+int vfs_mapped_rename(struct renamedata *rd)
 {
 	int error;
-	struct user_namespace *user_ns = &init_user_ns;
 	struct inode *old_dir = rd->old_dir, *new_dir = rd->new_dir;
 	struct dentry *old_dentry = rd->old_dentry,
 		      *new_dentry = rd->new_dentry;
@@ -4249,19 +4341,19 @@ static int __vfs_rename(struct renamedata *rd)
 	if (source == target)
 		return 0;
 
-	error = may_delete(user_ns, old_dir, old_dentry, is_dir);
+	error = may_delete(rd->old_user_ns, old_dir, old_dentry, is_dir);
 	if (error)
 		return error;
 
 	if (!target) {
-		error = may_create(user_ns, new_dir, new_dentry);
+		error = may_create(rd->new_user_ns, new_dir, new_dentry);
 	} else {
 		new_is_dir = d_is_dir(new_dentry);
 
 		if (!(flags & RENAME_EXCHANGE))
-			error = may_delete(user_ns, new_dir, new_dentry, is_dir);
+			error = may_delete(rd->new_user_ns, new_dir, new_dentry, is_dir);
 		else
-			error = may_delete(user_ns, new_dir, new_dentry, new_is_dir);
+			error = may_delete(rd->new_user_ns, new_dir, new_dentry, new_is_dir);
 	}
 	if (error)
 		return error;
@@ -4275,12 +4367,12 @@ static int __vfs_rename(struct renamedata *rd)
 	 */
 	if (new_dir != old_dir) {
 		if (is_dir) {
-			error = inode_permission(source, MAY_WRITE);
+			error = mapped_inode_permission(rd->old_user_ns, source, MAY_WRITE);
 			if (error)
 				return error;
 		}
 		if ((flags & RENAME_EXCHANGE) && new_is_dir) {
-			error = inode_permission(target, MAY_WRITE);
+			error = mapped_inode_permission(rd->new_user_ns, target, MAY_WRITE);
 			if (error)
 				return error;
 		}
@@ -4357,6 +4449,7 @@ static int __vfs_rename(struct renamedata *rd)
 
 	return error;
 }
+EXPORT_SYMBOL(vfs_mapped_rename);
 
 /**
  * vfs_rename - rename a filesystem object
@@ -4419,7 +4512,7 @@ int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 		.delegated_inode = delegated_inode,
 		.flags		 = flags,
 	};
-	return __vfs_rename(&rd);
+	return vfs_mapped_rename(&rd);
 }
 EXPORT_SYMBOL(vfs_rename);
 
@@ -4535,11 +4628,13 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 
 	rd.old_dir	   = old_path.dentry->d_inode;
 	rd.old_dentry	   = old_dentry;
+	rd.old_user_ns	   = mnt_user_ns(old_path.mnt);
 	rd.new_dir	   = new_path.dentry->d_inode;
 	rd.new_dentry	   = new_dentry;
+	rd.new_user_ns	   = mnt_user_ns(new_path.mnt);
 	rd.delegated_inode = &delegated_inode;
 	rd.flags	   = flags;
-	error = __vfs_rename(&rd);
+	error = vfs_mapped_rename(&rd);
 exit5:
 	dput(new_dentry);
 exit4:
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index f8880aa2ba0e..30ee48ddfaa2 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -224,7 +224,7 @@ static inline int ovl_do_rename(struct inode *olddir, struct dentry *olddentry,
 
 static inline int ovl_do_whiteout(struct inode *dir, struct dentry *dentry)
 {
-	int err = vfs_whiteout(dir, dentry);
+	int err = vfs_whiteout(&init_user_ns, dir, dentry);
 	pr_debug("whiteout(%pd2) = %i\n", dentry, err);
 	return err;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e66852dee65d..f523b1db48c4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1784,21 +1784,47 @@ extern bool mapped_inode_owner_or_capable(struct user_namespace *ns,
  * VFS helper functions..
  */
 extern int vfs_create(struct inode *, struct dentry *, umode_t, bool);
+extern int vfs_mapped_create(struct user_namespace *, struct inode *, struct dentry *, umode_t, bool);
 extern int vfs_mkdir(struct inode *, struct dentry *, umode_t);
+extern int vfs_mapped_mkdir(struct user_namespace *, struct inode *, struct dentry *, umode_t);
 extern int vfs_mknod(struct inode *, struct dentry *, umode_t, dev_t);
+extern int vfs_mapped_mknod(struct user_namespace *, struct inode *, struct dentry *, umode_t, dev_t);
 extern int vfs_symlink(struct inode *, struct dentry *, const char *);
+extern int vfs_mapped_symlink(struct user_namespace *, struct inode *, struct dentry *, const char *);
 extern int vfs_link(struct dentry *, struct inode *, struct dentry *, struct inode **);
+extern int vfs_mapped_link(struct dentry *, struct user_namespace *, struct inode *,
+		       struct dentry *, struct inode **);
 extern int vfs_rmdir(struct inode *, struct dentry *);
+extern int vfs_mapped_rmdir(struct user_namespace *, struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *, struct inode **);
-extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *, struct inode **, unsigned int);
+extern int vfs_mapped_unlink(struct user_namespace *, struct inode *, struct dentry *, struct inode **);
+
+struct renamedata {
+	struct user_namespace *old_user_ns;
+	struct inode *old_dir;
+	struct dentry *old_dentry;
+	struct user_namespace *new_user_ns;
+	struct inode *new_dir;
+	struct dentry *new_dentry;
+	struct inode **delegated_inode;
+	unsigned int flags;
+} __randomize_layout;
+
+extern int vfs_rename(struct inode *, struct dentry *, struct inode *,
+		      struct dentry *, struct inode **, unsigned int);
+extern int vfs_mapped_rename(struct renamedata *);
 
-static inline int vfs_whiteout(struct inode *dir, struct dentry *dentry)
+static inline int vfs_whiteout(struct user_namespace *user_ns,
+			       struct inode *dir, struct dentry *dentry)
 {
-	return vfs_mknod(dir, dentry, S_IFCHR | WHITEOUT_MODE, WHITEOUT_DEV);
+	return vfs_mapped_mknod(user_ns, dir, dentry, S_IFCHR | WHITEOUT_MODE, WHITEOUT_DEV);
 }
 
 extern struct dentry *vfs_tmpfile(struct dentry *dentry, umode_t mode,
 				  int open_flag);
+extern struct dentry *vfs_mapped_tmpfile(struct user_namespace *user_ns,
+				     struct dentry *dentry, umode_t mode,
+				     int open_flag);
 
 int vfs_mkobj(struct dentry *, umode_t,
 		int (*f)(struct dentry *, umode_t, void *),
-- 
2.29.0



* [PATCH 19/34] namei: add lookup helpers with idmapped mounts aware permission checking
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (17 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 18/34] namei: prepare for idmapped mounts Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 20/34] open: handle idmapped mounts in do_truncate() Christian Brauner
                   ` (18 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)

The lookup_one_len(), lookup_one_len_unlocked(), and lookup_positive_unlocked()
helpers are used by filesystems targeted in this first iteration to look up
dentries if the caller is privileged over the inode of the base dentry. Add
three new helpers lookup_one_len_mapped(), lookup_one_len_mapped_unlocked(),
and lookup_positive_mapped_unlocked() to handle idmapped mounts. If the inode
is accessed through an idmapped mount, it is mapped according to the mount's
user namespace. Afterwards the permission checks are identical to non-idmapped
mounts. If the initial user namespace is passed, all mapping operations are a
no-op, so non-idmapped mounts will not see a change in behavior and will also
not see any performance impact. It also means that the non-idmapped-mount aware
helpers can be implemented on top of their idmapped-mount aware counterparts by
passing the initial user namespace.
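
The layering of the lookup helpers can be sketched in userspace roughly as
follows. This is only a model under simplifying assumptions: all toy_* names
are hypothetical, the mapping is a plain uid shift, the search-permission check
is reduced to an owner comparison, and errors are returned as integers instead
of ERR_PTR() values.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping: uid seen through the mount = stored uid + shift. */
struct toy_ns { unsigned int shift; };
static const struct toy_ns toy_init_ns = { 0 };	/* identity mapping */

struct toy_entry { const char *name; int positive; };

struct toy_dir {
	unsigned int uid;		/* owner as stored on the filesystem */
	const struct toy_entry *entries;
	size_t nr;
};

/* Shared check, loosely modelled on lookup_one_len_common(). */
static int toy_lookup_common(const struct toy_ns *ns,
			     const struct toy_dir *base, unsigned int caller)
{
	return base->uid + ns->shift == caller ? 0 : -EACCES;
}

/* Mapping-aware lookup: may return a negative (non-existent) entry. */
static int toy_lookup_mapped(const struct toy_ns *ns, const struct toy_dir *base,
			     const char *name, unsigned int caller,
			     const struct toy_entry **out)
{
	int err = toy_lookup_common(ns, base, caller);
	size_t i;

	if (err)
		return err;
	for (i = 0; i < base->nr; i++) {
		if (strcmp(base->entries[i].name, name) == 0) {
			*out = &base->entries[i];
			return 0;
		}
	}
	return -ENOENT;
}

/* Positive variant built on top of the mapped lookup. */
static int toy_lookup_positive_mapped(const struct toy_ns *ns,
				      const struct toy_dir *base,
				      const char *name, unsigned int caller,
				      const struct toy_entry **out)
{
	int err = toy_lookup_mapped(ns, base, name, caller, out);

	if (!err && !(*out)->positive) {
		*out = NULL;
		return -ENOENT;	/* negative entries are reported as missing */
	}
	return err;
}

/* Legacy variant: same code path with the identity namespace. */
static int toy_lookup_positive(const struct toy_dir *base, const char *name,
			       unsigned int caller, const struct toy_entry **out)
{
	return toy_lookup_positive_mapped(&toy_init_ns, base, name, caller, out);
}
```

Only the common permission check consumes the namespace in this model; the
name matching itself does not depend on the mapping.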

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/namei.c            | 47 ++++++++++++++++++++++++++++++++-----------
 include/linux/namei.h |  6 ++++++
 2 files changed, 41 insertions(+), 12 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index a8a3de936cfc..7901ea09e80e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2517,8 +2517,9 @@ int vfs_path_lookup(struct dentry *dentry, struct vfsmount *mnt,
 }
 EXPORT_SYMBOL(vfs_path_lookup);
 
-static int lookup_one_len_common(const char *name, struct dentry *base,
-				 int len, struct qstr *this)
+static int lookup_one_len_common(const char *name, struct dentry *base, int len,
+				 struct qstr *this,
+				 struct user_namespace *mnt_user_ns)
 {
 	this->name = name;
 	this->len = len;
@@ -2546,7 +2547,7 @@ static int lookup_one_len_common(const char *name, struct dentry *base,
 			return err;
 	}
 
-	return inode_permission(base->d_inode, MAY_EXEC);
+	return mapped_inode_permission(mnt_user_ns, base->d_inode, MAY_EXEC);
 }
 
 /**
@@ -2570,7 +2571,7 @@ struct dentry *try_lookup_one_len(const char *name, struct dentry *base, int len
 
 	WARN_ON_ONCE(!inode_is_locked(base->d_inode));
 
-	err = lookup_one_len_common(name, base, len, &this);
+	err = lookup_one_len_common(name, base, len, &this, &init_user_ns);
 	if (err)
 		return ERR_PTR(err);
 
@@ -2589,7 +2590,8 @@ EXPORT_SYMBOL(try_lookup_one_len);
  *
  * The caller must hold base->i_mutex.
  */
-struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
+struct dentry *lookup_one_len_mapped(const char *name, struct dentry *base, int len,
+				 struct user_namespace *mnt_user_ns)
 {
 	struct dentry *dentry;
 	struct qstr this;
@@ -2597,13 +2599,19 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
 
 	WARN_ON_ONCE(!inode_is_locked(base->d_inode));
 
-	err = lookup_one_len_common(name, base, len, &this);
+	err = lookup_one_len_common(name, base, len, &this, mnt_user_ns);
 	if (err)
 		return ERR_PTR(err);
 
 	dentry = lookup_dcache(&this, base, 0);
 	return dentry ? dentry : __lookup_slow(&this, base, 0);
 }
+EXPORT_SYMBOL(lookup_one_len_mapped);
+
+struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
+{
+	return lookup_one_len_mapped(name, base, len, &init_user_ns);
+}
 EXPORT_SYMBOL(lookup_one_len);
 
 /**
@@ -2618,14 +2626,14 @@ EXPORT_SYMBOL(lookup_one_len);
  * Unlike lookup_one_len, it should be called without the parent
  * i_mutex held, and will take the i_mutex itself if necessary.
  */
-struct dentry *lookup_one_len_unlocked(const char *name,
-				       struct dentry *base, int len)
+struct dentry *lookup_one_len_mapped_unlocked(const char *name, struct dentry *base,
+					  int len, struct user_namespace *mnt_user_ns)
 {
 	struct qstr this;
 	int err;
 	struct dentry *ret;
 
-	err = lookup_one_len_common(name, base, len, &this);
+	err = lookup_one_len_common(name, base, len, &this, mnt_user_ns);
 	if (err)
 		return ERR_PTR(err);
 
@@ -2634,6 +2642,13 @@ struct dentry *lookup_one_len_unlocked(const char *name,
 		ret = lookup_slow(&this, base, 0);
 	return ret;
 }
+EXPORT_SYMBOL(lookup_one_len_mapped_unlocked);
+
+struct dentry *lookup_one_len_unlocked(const char *name,
+				       struct dentry *base, int len)
+{
+	return lookup_one_len_mapped_unlocked(name, base, len, &init_user_ns);
+}
 EXPORT_SYMBOL(lookup_one_len_unlocked);
 
 /*
@@ -2644,16 +2659,24 @@ EXPORT_SYMBOL(lookup_one_len_unlocked);
  * need to be very careful; pinned positives have ->d_inode stable, so
  * this one avoids such problems.
  */
-struct dentry *lookup_positive_unlocked(const char *name,
-				       struct dentry *base, int len)
+struct dentry *lookup_positive_mapped_unlocked(const char *name,
+					   struct dentry *base, int len,
+					   struct user_namespace *mnt_user_ns)
 {
-	struct dentry *ret = lookup_one_len_unlocked(name, base, len);
+	struct dentry *ret = lookup_one_len_mapped_unlocked(name, base, len, mnt_user_ns);
 	if (!IS_ERR(ret) && d_flags_negative(smp_load_acquire(&ret->d_flags))) {
 		dput(ret);
 		ret = ERR_PTR(-ENOENT);
 	}
 	return ret;
 }
+EXPORT_SYMBOL(lookup_positive_mapped_unlocked);
+
+struct dentry *lookup_positive_unlocked(const char *name,
+				       struct dentry *base, int len)
+{
+	return lookup_positive_mapped_unlocked(name, base, len, &init_user_ns);
+}
 EXPORT_SYMBOL(lookup_positive_unlocked);
 
 #ifdef CONFIG_UNIX98_PTYS
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a4bb992623c4..42dbe4c2653a 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -68,8 +68,14 @@ extern struct dentry *kern_path_locked(const char *, struct path *);
 
 extern struct dentry *try_lookup_one_len(const char *, struct dentry *, int);
 extern struct dentry *lookup_one_len(const char *, struct dentry *, int);
+extern struct dentry *lookup_one_len_mapped(const char *, struct dentry *, int,
+					struct user_namespace *);
 extern struct dentry *lookup_one_len_unlocked(const char *, struct dentry *, int);
+extern struct dentry *lookup_one_len_mapped_unlocked(const char *, struct dentry *,
+						 int, struct user_namespace *);
 extern struct dentry *lookup_positive_unlocked(const char *, struct dentry *, int);
+extern struct dentry *lookup_positive_mapped_unlocked(const char *, struct dentry *,
+						  int, struct user_namespace *);
 
 extern int follow_down_one(struct path *);
 extern int follow_down(struct path *);
-- 
2.29.0



* [PATCH 20/34] open: handle idmapped mounts in do_truncate()
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (18 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 19/34] namei: add lookup helpers with idmapped mounts aware permission checking Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 21/34] open: handle idmapped mounts Christian Brauner
                   ` (17 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)

When truncating files the vfs will verify that the caller is privileged over
the inode. Since the do_truncate() helper is only used in a few places in the
vfs code, extend it to handle idmapped mounts instead of adding a new helper.
If the inode is accessed through an idmapped mount, it is mapped according to
the mount's user namespace. Afterwards the permission checks are identical to
non-idmapped mounts. If the initial user namespace is passed, all mapping
operations are a no-op, so non-idmapped mounts will not see a change in
behavior and will also not see any performance impact.
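
To illustrate the calling convention, here is a small userspace sketch of a
truncate helper that takes the namespace as its first argument, with a caller
that derives it from the mount the file was opened through. All toy_* names
are hypothetical, the mapping is a plain uid shift, and the privilege check is
reduced to an owner comparison.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical mapping: uid seen through the mount = stored uid + shift. */
struct toy_ns { unsigned int shift; };

struct toy_mount { const struct toy_ns *ns; };

struct toy_inode {
	unsigned int uid;	/* owner as stored on the filesystem */
	long size;
};

struct toy_file {
	const struct toy_mount *mnt;	/* mount the file was opened through */
	struct toy_inode *inode;
};

/* Counterpart of mnt_user_ns() in this sketch. */
static const struct toy_ns *toy_mnt_ns(const struct toy_mount *mnt)
{
	return mnt->ns;
}

/* The existing helper is extended with a namespace argument. */
static int toy_do_truncate(const struct toy_ns *ns, struct toy_inode *inode,
			   long length, unsigned int caller)
{
	if (length < 0)
		return -EINVAL;
	/* Simplified policy: only the mapped owner may truncate. */
	if (inode->uid + ns->shift != caller)
		return -EPERM;
	inode->size = length;
	return 0;
}

/* A caller looks up the namespace from the file's mount and passes it on. */
static int toy_truncate_file(struct toy_file *file, long length,
			     unsigned int caller)
{
	return toy_do_truncate(toy_mnt_ns(file->mnt), file->inode, length, caller);
}
```

The same inode can then yield different results depending on which mount it
is reached through, while a mount carrying the identity mapping behaves as
before.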

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/coredump.c      | 12 +++++++++---
 fs/inode.c         | 13 +++++++++----
 fs/namei.c         |  6 +++---
 fs/open.c          | 21 +++++++++++++--------
 include/linux/fs.h |  4 ++--
 5 files changed, 36 insertions(+), 20 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 0cd9056d79cc..25beac7230ff 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -703,6 +703,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 			goto close_fail;
 		}
 	} else {
+		struct user_namespace *user_ns;
 		struct inode *inode;
 		int open_flags = O_CREAT | O_RDWR | O_NOFOLLOW |
 				 O_LARGEFILE | O_EXCL;
@@ -786,7 +787,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 			goto close_fail;
 		if (!(cprm.file->f_mode & FMODE_CAN_WRITE))
 			goto close_fail;
-		if (do_truncate(cprm.file->f_path.dentry, 0, 0, cprm.file))
+		user_ns = mnt_user_ns(cprm.file->f_path.mnt);
+		if (do_truncate(user_ns, cprm.file->f_path.dentry, 0, 0, cprm.file))
 			goto close_fail;
 	}
 
@@ -931,8 +933,12 @@ void dump_truncate(struct coredump_params *cprm)
 
 	if (file->f_op->llseek && file->f_op->llseek != no_llseek) {
 		offset = file->f_op->llseek(file, 0, SEEK_CUR);
-		if (i_size_read(file->f_mapping->host) < offset)
-			do_truncate(file->f_path.dentry, offset, 0, file);
+		if (i_size_read(file->f_mapping->host) < offset) {
+			struct user_namespace *user_ns;
+
+			user_ns = mnt_user_ns(file->f_path.mnt);
+			do_truncate(user_ns, file->f_path.dentry, offset, 0, file);
+		}
 	}
 }
 EXPORT_SYMBOL(dump_truncate);
diff --git a/fs/inode.c b/fs/inode.c
index 22de3cb3b1f4..a9e2c8232e61 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1904,7 +1904,8 @@ int dentry_needs_remove_privs(struct dentry *dentry)
 	return mask;
 }
 
-static int __remove_privs(struct dentry *dentry, int kill)
+static int __remove_privs(struct user_namespace *user_ns, struct dentry *dentry,
+			  int kill)
 {
 	struct iattr newattrs;
 
@@ -1913,7 +1914,7 @@ static int __remove_privs(struct dentry *dentry, int kill)
 	 * Note we call this on write, so notify_change will not
 	 * encounter any conflicting delegations:
 	 */
-	return notify_change(dentry, &newattrs, NULL);
+	return notify_mapped_change(user_ns, dentry, &newattrs, NULL);
 }
 
 /*
@@ -1939,8 +1940,12 @@ int file_remove_privs(struct file *file)
 	kill = dentry_needs_remove_privs(dentry);
 	if (kill < 0)
 		return kill;
-	if (kill)
-		error = __remove_privs(dentry, kill);
+	if (kill) {
+		struct user_namespace *user_ns;
+
+		user_ns = mnt_user_ns(file->f_path.mnt);
+		error = __remove_privs(user_ns, dentry, kill);
+	}
 	if (!error)
 		inode_has_no_xattr(inode);
 
diff --git a/fs/namei.c b/fs/namei.c
index 7901ea09e80e..76c9637eccb9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2985,9 +2985,9 @@ static int handle_truncate(struct file *filp)
 	if (!error)
 		error = security_path_truncate(path);
 	if (!error) {
-		error = do_truncate(path->dentry, 0,
-				    ATTR_MTIME|ATTR_CTIME|ATTR_OPEN,
-				    filp);
+		error = do_truncate(mnt_user_ns(filp->f_path.mnt),
+				    path->dentry, 0,
+				    ATTR_MTIME | ATTR_CTIME | ATTR_OPEN, filp);
 	}
 	put_write_access(inode);
 	return error;
diff --git a/fs/open.c b/fs/open.c
index 9af548fb841b..efa462b6b9c7 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -35,8 +35,8 @@
 
 #include "internal.h"
 
-int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
-	struct file *filp)
+int do_truncate(struct user_namespace *user_ns, struct dentry *dentry,
+		loff_t length, unsigned int time_attrs, struct file *filp)
 {
 	int ret;
 	struct iattr newattrs;
@@ -61,13 +61,14 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 
 	inode_lock(dentry->d_inode);
 	/* Note any delegations or leases have already been broken: */
-	ret = notify_change(dentry, &newattrs, NULL);
+	ret = notify_mapped_change(user_ns, dentry, &newattrs, NULL);
 	inode_unlock(dentry->d_inode);
 	return ret;
 }
 
 long vfs_truncate(const struct path *path, loff_t length)
 {
+	struct user_namespace *user_ns;
 	struct inode *inode;
 	long error;
 
@@ -83,7 +84,8 @@ long vfs_truncate(const struct path *path, loff_t length)
 	if (error)
 		goto out;
 
-	error = inode_permission(inode, MAY_WRITE);
+	user_ns = mnt_user_ns(path->mnt);
+	error = mapped_inode_permission(user_ns, inode, MAY_WRITE);
 	if (error)
 		goto mnt_drop_write_and_out;
 
@@ -107,7 +109,7 @@ long vfs_truncate(const struct path *path, loff_t length)
 	if (!error)
 		error = security_path_truncate(path);
 	if (!error)
-		error = do_truncate(path->dentry, length, 0, NULL);
+		error = do_truncate(user_ns, path->dentry, length, 0, NULL);
 
 put_write_and_out:
 	put_write_access(inode);
@@ -186,13 +188,16 @@ long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
 	/* Check IS_APPEND on real upper inode */
 	if (IS_APPEND(file_inode(f.file)))
 		goto out_putf;
-
 	sb_start_write(inode->i_sb);
 	error = locks_verify_truncate(inode, f.file, length);
 	if (!error)
 		error = security_path_truncate(&f.file->f_path);
-	if (!error)
-		error = do_truncate(dentry, length, ATTR_MTIME|ATTR_CTIME, f.file);
+	if (!error) {
+		struct user_namespace *user_ns;
+
+		user_ns = mnt_user_ns(f.file->f_path.mnt);
+		error = do_truncate(user_ns, dentry, length, ATTR_MTIME | ATTR_CTIME, f.file);
+	}
 	sb_end_write(inode->i_sb);
 out_putf:
 	fdput(f);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f523b1db48c4..bfcfa3d7374f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2610,8 +2610,8 @@ struct filename {
 static_assert(offsetof(struct filename, iname) % sizeof(long) == 0);
 
 extern long vfs_truncate(const struct path *, loff_t);
-extern int do_truncate(struct dentry *, loff_t start, unsigned int time_attrs,
-		       struct file *filp);
+extern int do_truncate(struct user_namespace *, struct dentry *, loff_t start,
+		       unsigned int time_attrs, struct file *filp);
 extern int vfs_fallocate(struct file *file, int mode, loff_t offset,
 			loff_t len);
 extern long do_sys_open(int dfd, const char __user *filename, int flags,
-- 
2.29.0



* [PATCH 21/34] open: handle idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (19 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 20/34] open: handle idmapped mounts in do_truncate() Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 22/34] af_unix: " Christian Brauner
                   ` (16 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

For core file operations such as changing directories or chrooting, determining
file access, and changing mode or ownership the vfs will verify that the caller
is privileged over the inode. Extend the various helpers to handle idmapped
mounts. If the inode is accessed through an idmapped mount it is mapped
according to the mount's user namespace. Afterwards the permission checks are
identical to non-idmapped mounts. When changing file ownership we need to map
the uid and gid from the mount's user namespace. If the initial user namespace
is passed all mapping operations are a nop so non-idmapped mounts will not see
a change in behavior and will also not see any performance impact.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/open.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index efa462b6b9c7..ca113399010a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -401,6 +401,7 @@ static const struct cred *access_override_creds(void)
 
 static long do_faccessat(int dfd, const char __user *filename, int mode, int flags)
 {
+	struct user_namespace *user_ns;
 	struct path path;
 	struct inode *inode;
 	int res;
@@ -441,7 +442,8 @@ static long do_faccessat(int dfd, const char __user *filename, int mode, int fla
 			goto out_path_release;
 	}
 
-	res = inode_permission(inode, mode | MAY_ACCESS);
+	user_ns = mnt_user_ns(path.mnt);
+	res = mapped_inode_permission(user_ns, inode, mode | MAY_ACCESS);
 	/* SuS v2 requires we report a read only fs too */
 	if (res || !(mode & S_IWOTH) || special_file(inode->i_mode))
 		goto out_path_release;
@@ -489,6 +491,7 @@ SYSCALL_DEFINE2(access, const char __user *, filename, int, mode)
 
 SYSCALL_DEFINE1(chdir, const char __user *, filename)
 {
+	struct user_namespace *user_ns;
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
@@ -497,7 +500,8 @@ SYSCALL_DEFINE1(chdir, const char __user *, filename)
 	if (error)
 		goto out;
 
-	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_CHDIR);
+	user_ns = mnt_user_ns(path.mnt);
+	error = mapped_inode_permission(user_ns, path.dentry->d_inode, MAY_EXEC | MAY_CHDIR);
 	if (error)
 		goto dput_and_out;
 
@@ -515,6 +519,7 @@ SYSCALL_DEFINE1(chdir, const char __user *, filename)
 
 SYSCALL_DEFINE1(fchdir, unsigned int, fd)
 {
+	struct user_namespace *user_ns;
 	struct fd f = fdget_raw(fd);
 	int error;
 
@@ -526,7 +531,8 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
 	if (!d_can_lookup(f.file->f_path.dentry))
 		goto out_putf;
 
-	error = inode_permission(file_inode(f.file), MAY_EXEC | MAY_CHDIR);
+	user_ns = mnt_user_ns(f.file->f_path.mnt);
+	error = mapped_inode_permission(user_ns, file_inode(f.file), MAY_EXEC | MAY_CHDIR);
 	if (!error)
 		set_fs_pwd(current->fs, &f.file->f_path);
 out_putf:
@@ -537,6 +543,7 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
 
 SYSCALL_DEFINE1(chroot, const char __user *, filename)
 {
+	struct user_namespace *user_ns;
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
@@ -545,7 +552,8 @@ SYSCALL_DEFINE1(chroot, const char __user *, filename)
 	if (error)
 		goto out;
 
-	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_CHDIR);
+	user_ns = mnt_user_ns(path.mnt);
+	error = mapped_inode_permission(user_ns, path.dentry->d_inode, MAY_EXEC | MAY_CHDIR);
 	if (error)
 		goto dput_and_out;
 
@@ -570,6 +578,7 @@ SYSCALL_DEFINE1(chroot, const char __user *, filename)
 
 int chmod_common(const struct path *path, umode_t mode)
 {
+	struct user_namespace *user_ns;
 	struct inode *inode = path->dentry->d_inode;
 	struct inode *delegated_inode = NULL;
 	struct iattr newattrs;
@@ -585,7 +594,8 @@ int chmod_common(const struct path *path, umode_t mode)
 		goto out_unlock;
 	newattrs.ia_mode = (mode & S_IALLUGO) | (inode->i_mode & ~S_IALLUGO);
 	newattrs.ia_valid = ATTR_MODE | ATTR_CTIME;
-	error = notify_change(path->dentry, &newattrs, &delegated_inode);
+	user_ns = mnt_user_ns(path->mnt);
+	error = notify_mapped_change(user_ns, path->dentry, &newattrs, &delegated_inode);
 out_unlock:
 	inode_unlock(inode);
 	if (delegated_inode) {
@@ -646,6 +656,7 @@ SYSCALL_DEFINE2(chmod, const char __user *, filename, umode_t, mode)
 
 int chown_common(const struct path *path, uid_t user, gid_t group)
 {
+	struct user_namespace *user_ns;
 	struct inode *inode = path->dentry->d_inode;
 	struct inode *delegated_inode = NULL;
 	int error;
@@ -656,6 +667,12 @@ int chown_common(const struct path *path, uid_t user, gid_t group)
 	uid = make_kuid(current_user_ns(), user);
 	gid = make_kgid(current_user_ns(), group);
 
+	user_ns = mnt_user_ns(path->mnt);
+	if (mnt_idmapped(path->mnt)) {
+		uid = kuid_from_mnt(user_ns, uid);
+		gid = kgid_from_mnt(user_ns, gid);
+	}
+
 retry_deleg:
 	newattrs.ia_valid =  ATTR_CTIME;
 	if (user != (uid_t) -1) {
@@ -676,7 +693,7 @@ int chown_common(const struct path *path, uid_t user, gid_t group)
 	inode_lock(inode);
 	error = security_path_chown(path, uid, gid);
 	if (!error)
-		error = notify_change(path->dentry, &newattrs, &delegated_inode);
+		error = notify_mapped_change(user_ns, path->dentry, &newattrs, &delegated_inode);
 	inode_unlock(inode);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
@@ -1133,7 +1150,7 @@ struct file *filp_open(const char *filename, int flags, umode_t mode)
 {
 	struct filename *name = getname_kernel(filename);
 	struct file *file = ERR_CAST(name);
-	
+
 	if (!IS_ERR(name)) {
 		file = file_open_name(name, flags, mode);
 		putname(name);
-- 
2.29.0



* [PATCH 22/34] af_unix: handle idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (20 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 21/34] open: handle idmapped mounts Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 23/34] utimes: " Christian Brauner
                   ` (15 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

When binding a non-abstract AF_UNIX socket it will gain a representation in the
filesystem. Enable the socket infrastructure to handle idmapped mounts by using
the new vfs_mapped_mknod() helper. Non-idmapped mounts will not see any altered
behavior.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 net/unix/af_unix.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 41c3303c3357..f79f7ce3243f 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -996,7 +996,7 @@ static int unix_mknod(const char *sun_path, umode_t mode, struct path *res)
 	 */
 	err = security_path_mknod(&path, dentry, mode, 0);
 	if (!err) {
-		err = vfs_mknod(d_inode(path.dentry), dentry, mode, 0);
+		err = vfs_mapped_mknod(mnt_user_ns(path.mnt), d_inode(path.dentry), dentry, mode, 0);
 		if (!err) {
 			res->mnt = mntget(path.mnt);
 			res->dentry = dget(dentry);
-- 
2.29.0



* [PATCH 23/34] utimes: handle idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (21 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 22/34] af_unix: " Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 24/34] would_dump: " Christian Brauner
                   ` (14 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

Enable the vfs_utimes() helper to handle idmapped mounts by passing down the
mount's user namespace to the earlier introduced notify_mapped_change() helper.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/utimes.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/utimes.c b/fs/utimes.c
index fd3cc4226224..1c0b1f56fce2 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -22,6 +22,7 @@ int vfs_utimes(const struct path *path, struct timespec64 *times)
 	struct iattr newattrs;
 	struct inode *inode = path->dentry->d_inode;
 	struct inode *delegated_inode = NULL;
+	struct user_namespace *user_ns;
 
 	if (times) {
 		if (!nsec_valid(times[0].tv_nsec) ||
@@ -61,8 +62,9 @@ int vfs_utimes(const struct path *path, struct timespec64 *times)
 		newattrs.ia_valid |= ATTR_TOUCH;
 	}
 retry_deleg:
+	user_ns = mnt_user_ns(path->mnt);
 	inode_lock(inode);
-	error = notify_change(path->dentry, &newattrs, &delegated_inode);
+	error = notify_mapped_change(user_ns, path->dentry, &newattrs, &delegated_inode);
 	inode_unlock(inode);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
-- 
2.29.0



* [PATCH 24/34] would_dump: handle idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (22 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 23/34] utimes: " Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 25/34] exec: " Christian Brauner
                   ` (13 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

When determining whether or not to create a coredump the vfs will verify that
the caller is privileged over the inode. Make the would_dump() helper handle
idmapped mounts by passing down the user namespace of the exec file's mount.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/exec.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 8e75d7a33514..76de175eeba8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1391,14 +1391,15 @@ EXPORT_SYMBOL(begin_new_exec);
 void would_dump(struct linux_binprm *bprm, struct file *file)
 {
 	struct inode *inode = file_inode(file);
-	if (inode_permission(inode, MAY_READ) < 0) {
+	struct user_namespace *ns = mnt_user_ns(file->f_path.mnt);
+	if (mapped_inode_permission(ns, inode, MAY_READ) < 0) {
 		struct user_namespace *old, *user_ns;
 		bprm->interp_flags |= BINPRM_FLAGS_ENFORCE_NONDUMP;
 
 		/* Ensure mm->user_ns contains the executable */
 		user_ns = old = bprm->mm->user_ns;
 		while ((user_ns != &init_user_ns) &&
-		       !privileged_wrt_inode_uidgid(user_ns, &init_user_ns, inode))
+		       !privileged_wrt_inode_uidgid(user_ns, ns, inode))
 			user_ns = user_ns->parent;
 
 		if (old != user_ns) {
-- 
2.29.0



* [PATCH 25/34] exec: handle idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (23 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 24/34] would_dump: " Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 26/34] fs: add helpers for idmap mounts Christian Brauner
                   ` (12 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

When executing a setuid binary the kernel will verify in bprm_fill_uid() that
the inode has a mapping in the caller's user namespace before setting the
caller's uid and gid. Let bprm_fill_uid() handle idmapped mounts. If the inode
is accessed through an idmapped mount it is mapped according to the mount's
user namespace. Afterwards the checks are identical to non-idmapped mounts. On
regular mounts this is a nop.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/exec.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 76de175eeba8..cd11ab505a41 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1567,6 +1567,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
 static void bprm_fill_uid(struct linux_binprm *bprm, struct file *file)
 {
 	/* Handle suid and sgid on files */
+	struct user_namespace *user_ns;
 	struct inode *inode;
 	unsigned int mode;
 	kuid_t uid;
@@ -1583,13 +1584,15 @@ static void bprm_fill_uid(struct linux_binprm *bprm, struct file *file)
 	if (!(mode & (S_ISUID|S_ISGID)))
 		return;
 
+	user_ns = mnt_user_ns(file->f_path.mnt);
+
 	/* Be careful if suid/sgid is set */
 	inode_lock(inode);
 
 	/* reload atomically mode/uid/gid now that lock held */
 	mode = inode->i_mode;
-	uid = inode->i_uid;
-	gid = inode->i_gid;
+	uid = i_uid_into_mnt(user_ns, inode);
+	gid = i_gid_into_mnt(user_ns, inode);
 	inode_unlock(inode);
 
 	/* We ignore suid/sgid if there are no mappings for them in the ns */
-- 
2.29.0



* [PATCH 26/34] fs: add helpers for idmap mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (24 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 25/34] exec: " Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 27/34] apparmor: handle idmapped mounts Christian Brauner
                   ` (11 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

When the kernel is configured with CONFIG_IDMAP_MOUNTS additional inode methods
are provided. A filesystem that is aware of idmapped mounts will receive the
user namespace the mount has been marked with as an additional argument. This
can be used for additional permission checking and also to enable filesystems
to translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

In this iteration I've decided to add a set of new inode methods instead of
adapting the existing ones. This is mainly done to keep the noise-level as low
as possible. But we're very happy to adapt the existing methods and all
filesystems using them instead of adding dedicated new helpers. In any case we
expect to be down to a single set of inode methods once we've transitioned
filesystems, whether or not we add new methods.
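
The dispatch pattern used by the iop_*() wrappers in this patch can be modelled
in userspace roughly as below. This is only a sketch: the toy_* types and the
two example filesystems are hypothetical, and the CONFIG_IDMAP_MOUNTS guard is
left out. The wrapper prefers the *_mapped method if the filesystem provides
one and otherwise falls back to the existing method, which never sees the
mount's user namespace.

```c
#include <assert.h>
#include <stddef.h>

struct toy_ns { int id; };
struct toy_inode;

struct toy_inode_ops {
	int (*mkdir)(struct toy_inode *dir, const char *name);
	/* Optional: NULL for filesystems unaware of idmapped mounts. */
	int (*mkdir_mapped)(struct toy_ns *ns, struct toy_inode *dir,
			    const char *name);
};

struct toy_inode {
	const struct toy_inode_ops *ops;
};

/* Prefer the mapped method, fall back to the legacy one. */
static int toy_iop_mkdir(struct toy_inode *dir, struct toy_ns *ns,
			 const char *name)
{
	assert(dir != NULL && dir->ops != NULL);
	if (dir->ops->mkdir_mapped)
		return dir->ops->mkdir_mapped(ns, dir, name);
	return dir->ops->mkdir(dir, name);
}

/* Legacy method: returns 1 so the caller can tell which path ran. */
static int legacy_mkdir(struct toy_inode *dir, const char *name)
{
	(void)dir;
	(void)name;
	return 1;
}

/* Mapped method: returns 2 so the caller can tell which path ran. */
static int aware_mkdir_mapped(struct toy_ns *ns, struct toy_inode *dir,
			      const char *name)
{
	(void)ns;
	(void)dir;
	(void)name;
	return 2;
}

static const struct toy_inode_ops legacy_ops = {
	.mkdir = legacy_mkdir,
};

static const struct toy_inode_ops aware_ops = {
	.mkdir = legacy_mkdir,
	.mkdir_mapped = aware_mkdir_mapped,
};
```

A directory inode using legacy_ops takes the fallback path and one using
aware_ops takes the mapped path, so unconverted filesystems keep working
unchanged while converted ones receive the extra argument.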

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/attr.c          |   2 +-
 fs/namei.c         |  24 +++++----
 fs/posix_acl.c     |   4 +-
 include/linux/fs.h | 129 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 146 insertions(+), 13 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 4daf6ac6de6d..d13ef3f8eac0 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -378,7 +378,7 @@ int notify_mapped_change(struct user_namespace *user_ns, struct dentry *dentry,
 		return error;
 
 	if (inode->i_op->setattr)
-		error = inode->i_op->setattr(dentry, attr);
+		error = iop_setattr(inode, user_ns, dentry, attr);
 	else
 		error = simple_setattr(dentry, attr);
 
diff --git a/fs/namei.c b/fs/namei.c
index 76c9637eccb9..d6dbfab126d7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -407,7 +407,7 @@ static inline int do_inode_permission(struct user_namespace *user_ns, struct ino
 {
 	if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
 		if (likely(inode->i_op->permission))
-			return inode->i_op->permission(inode, mask);
+			return iop_permission(inode, user_ns, inode, mask);
 
 		/* This gets set once for the inode lifetime */
 		spin_lock(&inode->i_lock);
@@ -2872,7 +2872,7 @@ int vfs_mapped_create(struct user_namespace *user_ns, struct inode *dir,
 	error = security_inode_create(dir, dentry, mode);
 	if (error)
 		return error;
-	error = dir->i_op->create(dir, dentry, mode, want_excl);
+	error = iop_create(dir, user_ns, dir, dentry, mode, want_excl);
 	if (!error)
 		fsnotify_create(dir, dentry);
 	return error;
@@ -3175,14 +3175,18 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 
 	/* Negative dentry, just create the file */
 	if (!dentry->d_inode && (open_flag & O_CREAT)) {
+		struct user_namespace *user_ns;
+
 		file->f_mode |= FMODE_CREATED;
 		audit_inode_child(dir_inode, dentry, AUDIT_TYPE_CHILD_CREATE);
 		if (!dir_inode->i_op->create) {
 			error = -EACCES;
 			goto out_dput;
 		}
-		error = dir_inode->i_op->create(dir_inode, dentry, mode,
-						open_flag & O_EXCL);
+
+		user_ns = mnt_user_ns(nd->path.mnt);
+		error = iop_create(dir_inode, user_ns, dir_inode, dentry, mode,
+				   open_flag & O_EXCL);
 		if (error)
 			goto out_dput;
 	}
@@ -3363,7 +3367,7 @@ struct dentry *vfs_mapped_tmpfile(struct user_namespace *user_ns,
 	child = d_alloc(dentry, &slash_name);
 	if (unlikely(!child))
 		goto out_err;
-	error = dir->i_op->tmpfile(dir, child, mode);
+	error = iop_tmpfile(dir, user_ns, dir, child, mode);
 	if (error)
 		goto out_err;
 	error = -ENOENT;
@@ -3640,7 +3644,7 @@ int vfs_mapped_mknod(struct user_namespace *user_ns, struct inode *dir,
 	if (error)
 		return error;
 
-	error = dir->i_op->mknod(dir, dentry, mode, dev);
+	error = iop_mknod(dir, user_ns, dir, dentry, mode, dev);
 	if (!error)
 		fsnotify_create(dir, dentry);
 	return error;
@@ -3750,7 +3754,7 @@ int vfs_mapped_mkdir(struct user_namespace *user_ns, struct inode *dir,
 	if (max_links && dir->i_nlink >= max_links)
 		return -EMLINK;
 
-	error = dir->i_op->mkdir(dir, dentry, mode);
+	error = iop_mkdir(dir, user_ns, dir, dentry, mode);
 	if (!error)
 		fsnotify_mkdir(dir, dentry);
 	return error;
@@ -4089,7 +4093,7 @@ int vfs_mapped_symlink(struct user_namespace *user_ns, struct inode *dir,
 	if (error)
 		return error;
 
-	error = dir->i_op->symlink(dir, dentry, oldname);
+	error = iop_symlink(dir, user_ns, dir, dentry, oldname);
 	if (!error)
 		fsnotify_create(dir, dentry);
 	return error;
@@ -4435,8 +4439,8 @@ int vfs_mapped_rename(struct renamedata *rd)
 		if (error)
 			goto out;
 	}
-	error = old_dir->i_op->rename(old_dir, old_dentry,
-				       new_dir, new_dentry, flags);
+	error = iop_rename(old_dir, rd->new_user_ns, old_dir, old_dentry,
+			   new_dir, new_dentry, flags);
 	if (error)
 		goto out;
 
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 665eb7921e1c..b8e204ac2caa 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -578,7 +578,7 @@ posix_mapped_acl_chmod(struct user_namespace *user_ns, struct inode *inode, umod
 	ret = __posix_acl_chmod(&acl, GFP_KERNEL, mode);
 	if (ret)
 		return ret;
-	ret = inode->i_op->set_acl(inode, acl, ACL_TYPE_ACCESS);
+	ret = iop_set_acl(inode, user_ns, inode, acl, ACL_TYPE_ACCESS);
 	posix_acl_release(acl);
 	return ret;
 }
@@ -925,7 +925,7 @@ set_posix_mapped_acl(struct user_namespace *user_ns, struct inode *inode,
 		if (ret)
 			return ret;
 	}
-	return inode->i_op->set_acl(inode, acl, type);
+	return iop_set_acl(inode, user_ns, inode, acl, type);
 }
 
 int
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bfcfa3d7374f..cb01141d726a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1978,8 +1978,137 @@ struct inode_operations {
 			   umode_t create_mode);
 	int (*tmpfile) (struct inode *, struct dentry *, umode_t);
 	int (*set_acl)(struct inode *, struct posix_acl *, int);
+#ifdef CONFIG_IDMAP_MOUNTS
+	int (*permission_mapped) (struct user_namespace *, struct inode *, int);
+	int (*create_mapped) (struct user_namespace *, struct inode *,
+			      struct dentry *, umode_t, bool);
+	int (*mknod_mapped) (struct user_namespace *, struct inode *,
+			     struct dentry *, umode_t, dev_t);
+	int (*mkdir_mapped) (struct user_namespace *, struct inode *,
+			     struct dentry *, umode_t);
+	int (*tmpfile_mapped) (struct user_namespace *, struct inode *,
+			       struct dentry *, umode_t);
+	int (*symlink_mapped) (struct user_namespace *, struct inode *,
+			       struct dentry *, const char *);
+	int (*rename_mapped) (struct user_namespace *, struct inode *,
+			      struct dentry *, struct inode *, struct dentry *,
+			      unsigned int);
+	int (*setattr_mapped) (struct user_namespace *, struct dentry *,
+			       struct iattr *);
+	int (*set_acl_mapped)(struct user_namespace *, struct inode *,
+			      struct posix_acl *, int);
+#endif
 } ____cacheline_aligned;
 
+static inline int iop_permission(struct inode *caller,
+				 struct user_namespace *user_ns,
+				 struct inode *inode, int mask)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	if (caller->i_op->permission_mapped)
+		return caller->i_op->permission_mapped(user_ns, inode, mask);
+#endif
+	return caller->i_op->permission(inode, mask);
+}
+
+static inline int iop_create(struct inode *caller,
+			     struct user_namespace *user_ns,
+			     struct inode *inode, struct dentry *dentry,
+			     umode_t mode, bool excl)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	if (caller->i_op->create_mapped)
+		return caller->i_op->create_mapped(user_ns, inode, dentry,
+						    mode, excl);
+#endif
+	return caller->i_op->create(inode, dentry, mode, excl);
+}
+
+static inline int iop_mknod(struct inode *caller,
+			    struct user_namespace *user_ns, struct inode *inode,
+			    struct dentry *dentry, umode_t mode, dev_t dev)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	if (caller->i_op->mknod_mapped)
+		return caller->i_op->mknod_mapped(user_ns, inode, dentry, mode, dev);
+#endif
+	return caller->i_op->mknod(inode, dentry, mode, dev);
+}
+
+static inline int iop_mkdir(struct inode *caller,
+			    struct user_namespace *user_ns, struct inode *inode,
+			    struct dentry *dentry, umode_t mode)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	if (caller->i_op->mkdir_mapped)
+		return caller->i_op->mkdir_mapped(user_ns, inode, dentry, mode);
+#endif
+	return caller->i_op->mkdir(inode, dentry, mode);
+}
+
+static inline int iop_tmpfile(struct inode *caller,
+			      struct user_namespace *user_ns,
+			      struct inode *inode, struct dentry *dentry,
+			      umode_t mode)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	if (caller->i_op->tmpfile_mapped)
+		return caller->i_op->tmpfile_mapped(user_ns, inode, dentry, mode);
+#endif
+	return caller->i_op->tmpfile(inode, dentry, mode);
+}
+
+static inline int iop_symlink(struct inode *caller,
+			      struct user_namespace *user_ns,
+			      struct inode *inode, struct dentry *dentry,
+			      const char *name)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	if (caller->i_op->symlink_mapped)
+		return caller->i_op->symlink_mapped(user_ns, inode, dentry, name);
+#endif
+	return caller->i_op->symlink(inode, dentry, name);
+}
+
+static inline int iop_rename(struct inode *caller,
+			     struct user_namespace *user_ns,
+			     struct inode *old_inode, struct dentry *old_dentry,
+			     struct inode *new_inode, struct dentry *new_dentry,
+			     unsigned int flags)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	if (caller->i_op->rename_mapped)
+		return caller->i_op->rename_mapped(user_ns, old_inode,
+						   old_dentry, new_inode,
+						   new_dentry, flags);
+#endif
+	return caller->i_op->rename(old_inode, old_dentry, new_inode,
+				    new_dentry, flags);
+}
+
+static inline int iop_setattr(struct inode *caller,
+			      struct user_namespace *user_ns,
+			      struct dentry *dentry, struct iattr *attr)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	if (caller->i_op->setattr_mapped)
+		return caller->i_op->setattr_mapped(user_ns, dentry, attr);
+#endif
+	return caller->i_op->setattr(dentry, attr);
+}
+
+static inline int iop_set_acl(struct inode *caller,
+			      struct user_namespace *user_ns,
+			      struct inode *inode, struct posix_acl *acl,
+			      int type)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+	if (caller->i_op->set_acl_mapped)
+		return caller->i_op->set_acl_mapped(user_ns, inode, acl, type);
+#endif
+	return caller->i_op->set_acl(inode, acl, type);
+}
+
 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
 				     struct iov_iter *iter)
 {
-- 
2.29.0



* [PATCH 27/34] apparmor: handle idmapped mounts
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)

AppArmor only ever uses the i_uid and i_gid for logging. This is already
broken in a number of places where the global root id is reported instead of
the i_uid or i_gid of the file. Nonetheless, be kind and log the mapped
i_uid if we're coming from an idmapped mount.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
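Note for readers unfamiliar with the helper: the mapping described in the
commit message can be modelled in userspace roughly as below. This is only a
sketch under stated assumptions. struct idmap_extent, map_id_down() and
model_i_uid_into_mnt() are hypothetical stand-ins, and the sketch assumes
that i_uid_into_mnt() amounts to looking up the raw i_uid in the mount's
user namespace, much like make_kuid() would. The real helper is introduced
earlier in this series and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID_ID ((uint32_t)-1)

/* Hypothetical stand-in for a single uid_map extent of a user namespace. */
struct idmap_extent {
	uint32_t first;       /* first id as seen inside the namespace */
	uint32_t lower_first; /* corresponding first id outside of it */
	uint32_t count;       /* number of consecutive ids covered */
};

/* Assumed make_kuid()-like lookup: namespace id -> id outside. */
static uint32_t map_id_down(const struct idmap_extent *e, uint32_t id)
{
	assert(e != NULL);
	if (id >= e->first && id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return MODEL_INVALID_ID; /* id has no mapping in this namespace */
}

/* Model of i_uid_into_mnt(): the uid AppArmor would log for this mount. */
static uint32_t model_i_uid_into_mnt(const struct idmap_extent *mnt_map,
				     uint32_t i_uid)
{
	return map_id_down(mnt_map, i_uid);
}
```

With an extent of { 0, 100000, 65536 }, a file whose i_uid is 1000 would be
logged as 101000 in this model, while an identity mapping such as the one of
the initial user namespace leaves it at 1000.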
 security/apparmor/domain.c |  9 ++++++---
 security/apparmor/file.c   |  5 ++++-
 security/apparmor/lsm.c    | 12 ++++++++----
 3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/security/apparmor/domain.c b/security/apparmor/domain.c
index f919ebd042fd..91b132961b67 100644
--- a/security/apparmor/domain.c
+++ b/security/apparmor/domain.c
@@ -10,12 +10,14 @@
 
 #include <linux/errno.h>
 #include <linux/fdtable.h>
+#include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/mount.h>
 #include <linux/syscalls.h>
 #include <linux/tracehook.h>
 #include <linux/personality.h>
 #include <linux/xattr.h>
+#include <linux/user_namespace.h>
 
 #include "include/audit.h"
 #include "include/apparmorfs.h"
@@ -858,8 +860,10 @@ int apparmor_bprm_creds_for_exec(struct linux_binprm *bprm)
 	const char *info = NULL;
 	int error = 0;
 	bool unsafe = false;
+	struct user_namespace *user_ns = mnt_user_ns(bprm->file->f_path.mnt);
+	kuid_t i_uid = i_uid_into_mnt(user_ns, file_inode(bprm->file));
 	struct path_cond cond = {
-		file_inode(bprm->file)->i_uid,
+		i_uid,
 		file_inode(bprm->file)->i_mode
 	};
 
@@ -967,8 +971,7 @@ int apparmor_bprm_creds_for_exec(struct linux_binprm *bprm)
 	error = fn_for_each(label, profile,
 			aa_audit_file(profile, &nullperms, OP_EXEC, MAY_EXEC,
 				      bprm->filename, NULL, new,
-				      file_inode(bprm->file)->i_uid, info,
-				      error));
+				      i_uid, info, error));
 	aa_put_label(new);
 	goto done;
 }
diff --git a/security/apparmor/file.c b/security/apparmor/file.c
index 92acf9a49405..d6d9e71f1900 100644
--- a/security/apparmor/file.c
+++ b/security/apparmor/file.c
@@ -11,6 +11,8 @@
 #include <linux/tty.h>
 #include <linux/fdtable.h>
 #include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
 
 #include "include/apparmor.h"
 #include "include/audit.h"
@@ -508,8 +510,9 @@ static int __file_path_perm(const char *op, struct aa_label *label,
 {
 	struct aa_profile *profile;
 	struct aa_perms perms = {};
+	struct user_namespace *user_ns = mnt_user_ns(file->f_path.mnt);
 	struct path_cond cond = {
-		.uid = file_inode(file)->i_uid,
+		.uid = i_uid_into_mnt(user_ns, file_inode(file)),
 		.mode = file_inode(file)->i_mode
 	};
 	char *buffer;
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index ffeaee5ed968..ece9afc3994f 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -224,7 +224,8 @@ static int common_perm(const char *op, const struct path *path, u32 mask,
  */
 static int common_perm_cond(const char *op, const struct path *path, u32 mask)
 {
-	struct path_cond cond = { d_backing_inode(path->dentry)->i_uid,
+	struct user_namespace *user_ns = mnt_user_ns(path->mnt);
+	struct path_cond cond = { i_uid_into_mnt(user_ns, d_backing_inode(path->dentry)),
 				  d_backing_inode(path->dentry)->i_mode
 	};
 
@@ -266,12 +267,13 @@ static int common_perm_rm(const char *op, const struct path *dir,
 			  struct dentry *dentry, u32 mask)
 {
 	struct inode *inode = d_backing_inode(dentry);
+	struct user_namespace *user_ns = mnt_user_ns(dir->mnt);
 	struct path_cond cond = { };
 
 	if (!inode || !path_mediated_fs(dentry))
 		return 0;
 
-	cond.uid = inode->i_uid;
+	cond.uid = i_uid_into_mnt(user_ns, inode);
 	cond.mode = inode->i_mode;
 
 	return common_perm_dir_dentry(op, dir, dentry, mask, &cond);
@@ -361,11 +363,12 @@ static int apparmor_path_rename(const struct path *old_dir, struct dentry *old_d
 
 	label = begin_current_label_crit_section();
 	if (!unconfined(label)) {
+		struct user_namespace *user_ns = mnt_user_ns(old_dir->mnt);
 		struct path old_path = { .mnt = old_dir->mnt,
 					 .dentry = old_dentry };
 		struct path new_path = { .mnt = new_dir->mnt,
 					 .dentry = new_dentry };
-		struct path_cond cond = { d_backing_inode(old_dentry)->i_uid,
+		struct path_cond cond = { i_uid_into_mnt(user_ns, d_backing_inode(old_dentry)),
 					  d_backing_inode(old_dentry)->i_mode
 		};
 
@@ -420,8 +423,9 @@ static int apparmor_file_open(struct file *file)
 
 	label = aa_get_newest_cred_label(file->f_cred);
 	if (!unconfined(label)) {
+		struct user_namespace *user_ns = mnt_user_ns(file->f_path.mnt);
 		struct inode *inode = file_inode(file);
-		struct path_cond cond = { inode->i_uid, inode->i_mode };
+		struct path_cond cond = { i_uid_into_mnt(user_ns, inode), inode->i_mode };
 
 		error = aa_path_perm(OP_OPEN, label, &file->f_path, 0,
 				     aa_map_file_to_perms(file), &cond);
-- 
2.29.0



* [PATCH 28/34] audit: handle idmapped mounts
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)

Audit sometimes logs the inode's i_uid and i_gid. Enable audit to log the
mapped i_uid and i_gid when the inode is accessed through an idmapped mount.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
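Note: as an illustration of the commit message, the effect on the audit
record can be sketched in userspace as follows. All model_* names are
hypothetical, and the single-extent lookup is an assumed simplification of
what i_uid_into_mnt() and i_gid_into_mnt() do. Callers without a mount at
hand pass an identity mapping, which is meant to mirror the use of
&init_user_ns in the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-extent id mapping (inside -> outside). */
struct model_extent {
	uint32_t first, lower_first, count;
};

/* Hypothetical stand-in for a user namespace with uid and gid maps. */
struct model_userns {
	struct model_extent uid_map;
	struct model_extent gid_map;
};

struct model_inode {
	uint32_t i_uid, i_gid;
};

/* Only the fields of struct audit_names that matter for this sketch. */
struct model_audit_names {
	uint32_t uid, gid;
};

static uint32_t model_map(const struct model_extent *e, uint32_t id)
{
	assert(e != NULL);
	if (id >= e->first && id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return (uint32_t)-1; /* unmapped */
}

/* Model of audit_copy_inode(): record the ids as seen through the mount. */
static void model_audit_copy_inode(struct model_audit_names *name,
				   const struct model_userns *user_ns,
				   const struct model_inode *inode)
{
	name->uid = model_map(&user_ns->uid_map, inode->i_uid);
	name->gid = model_map(&user_ns->gid_map, inode->i_gid);
}
```

Passing the identity namespace leaves the recorded ids untouched, so call
sites that have no mount available keep their previous behaviour in this
model.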
 fs/namei.c            | 14 +++++++-------
 include/linux/audit.h | 10 ++++++----
 ipc/mqueue.c          |  8 ++++----
 kernel/auditsc.c      | 26 ++++++++++++++------------
 4 files changed, 31 insertions(+), 27 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index d6dbfab126d7..545ce391a12d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1006,7 +1006,7 @@ static inline int may_follow_link(struct nameidata *nd, const struct inode *inod
 	if (nd->flags & LOOKUP_RCU)
 		return -ECHILD;
 
-	audit_inode(nd->name, nd->stack[0].link.dentry, 0);
+	audit_inode(nd->name, user_ns, nd->stack[0].link.dentry, 0);
 	audit_log_path_denied(AUDIT_ANOM_LINK, "follow_link");
 	return -EACCES;
 }
@@ -2413,7 +2413,7 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
 		retval = path_lookupat(&nd, flags | LOOKUP_REVAL, path);
 
 	if (likely(!retval))
-		audit_inode(name, path->dentry,
+		audit_inode(name, mnt_user_ns(path->mnt), path->dentry,
 			    flags & LOOKUP_MOUNTPOINT ? AUDIT_INODE_NOEVAL : 0);
 	restore_nameidata();
 	putname(name);
@@ -2455,7 +2455,7 @@ static struct filename *filename_parentat(int dfd, struct filename *name,
 	if (likely(!retval)) {
 		*last = nd.last;
 		*type = nd.last_type;
-		audit_inode(name, parent->dentry, AUDIT_INODE_PARENT);
+		audit_inode(name, mnt_user_ns(parent->mnt), parent->dentry, AUDIT_INODE_PARENT);
 	} else {
 		putname(name);
 		name = ERR_PTR(retval);
@@ -3239,7 +3239,7 @@ static const char *open_last_lookups(struct nameidata *nd,
 			if (unlikely(error))
 				return ERR_PTR(error);
 		}
-		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
+		audit_inode(nd->name, mnt_user_ns(nd->path.mnt), dir, AUDIT_INODE_PARENT);
 		/* trailing slashes? */
 		if (unlikely(nd->last.name[nd->last.len]))
 			return ERR_PTR(-EISDIR);
@@ -3305,7 +3305,7 @@ static int do_open(struct nameidata *nd,
 			return error;
 	}
 	if (!(file->f_mode & FMODE_CREATED))
-		audit_inode(nd->name, nd->path.dentry, 0);
+		audit_inode(nd->name, mnt_user_ns(nd->path.mnt), nd->path.dentry, 0);
 	if (open_flag & O_CREAT) {
 		if ((open_flag & O_EXCL) && !(file->f_mode & FMODE_CREATED))
 			return -EEXIST;
@@ -3414,7 +3414,7 @@ static int do_tmpfile(struct nameidata *nd, unsigned flags,
 		goto out2;
 	dput(path.dentry);
 	path.dentry = child;
-	audit_inode(nd->name, child, 0);
+	audit_inode(nd->name, user_ns, child, 0);
 	/* Don't check for other permissions, the inode was just created */
 	error = may_open(&path, 0, op->open_flag);
 	if (error)
@@ -3433,7 +3433,7 @@ static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file)
 	struct path path;
 	int error = path_lookupat(nd, flags, &path);
 	if (!error) {
-		audit_inode(nd->name, path.dentry, 0);
+		audit_inode(nd->name, mnt_user_ns(path.mnt), path.dentry, 0);
 		error = vfs_open(&path, file);
 		path_put(&path);
 	}
diff --git a/include/linux/audit.h b/include/linux/audit.h
index b3d859831a31..217d2b0c273e 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -293,8 +293,8 @@ extern void __audit_syscall_exit(int ret_success, long ret_value);
 extern struct filename *__audit_reusename(const __user char *uptr);
 extern void __audit_getname(struct filename *name);
 extern void __audit_getcwd(void);
-extern void __audit_inode(struct filename *name, const struct dentry *dentry,
-				unsigned int flags);
+extern void __audit_inode(struct filename *name, struct user_namespace *user_ns,
+			  const struct dentry *dentry, unsigned int flags);
 extern void __audit_file(const struct file *);
 extern void __audit_inode_child(struct inode *parent,
 				const struct dentry *dentry,
@@ -357,10 +357,11 @@ static inline void audit_getcwd(void)
 		__audit_getcwd();
 }
 static inline void audit_inode(struct filename *name,
+				struct user_namespace *user_ns,
 				const struct dentry *dentry,
 				unsigned int aflags) {
 	if (unlikely(!audit_dummy_context()))
-		__audit_inode(name, dentry, aflags);
+		__audit_inode(name, user_ns, dentry, aflags);
 }
 static inline void audit_file(struct file *file)
 {
@@ -371,7 +372,7 @@ static inline void audit_inode_parent_hidden(struct filename *name,
 						const struct dentry *dentry)
 {
 	if (unlikely(!audit_dummy_context()))
-		__audit_inode(name, dentry,
+		__audit_inode(name, &init_user_ns, dentry,
 				AUDIT_INODE_PARENT | AUDIT_INODE_HIDDEN);
 }
 static inline void audit_inode_child(struct inode *parent,
@@ -587,6 +588,7 @@ static inline void audit_getname(struct filename *name)
 static inline void audit_getcwd(void)
 { }
 static inline void audit_inode(struct filename *name,
+				struct user_namespace *user_ns,
 				const struct dentry *dentry,
 				unsigned int aflags)
 { }
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index beff0cfcd1e8..08ade6410563 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -849,8 +849,8 @@ static void remove_notification(struct mqueue_inode_info *info)
 	info->notify_user_ns = NULL;
 }
 
-static int prepare_open(struct dentry *dentry, int oflag, int ro,
-			umode_t mode, struct filename *name,
+static int prepare_open(struct user_namespace *user_ns, struct dentry *dentry,
+			int oflag, int ro, umode_t mode, struct filename *name,
 			struct mq_attr *attr)
 {
 	static const int oflag2acc[O_ACCMODE] = { MAY_READ, MAY_WRITE,
@@ -867,7 +867,7 @@ static int prepare_open(struct dentry *dentry, int oflag, int ro,
 				  mqueue_create_attr, attr);
 	}
 	/* it already existed */
-	audit_inode(name, dentry, 0);
+	audit_inode(name, user_ns, dentry, 0);
 	if ((oflag & (O_CREAT|O_EXCL)) == (O_CREAT|O_EXCL))
 		return -EEXIST;
 	if ((oflag & O_ACCMODE) == (O_RDWR | O_WRONLY))
@@ -903,7 +903,7 @@ static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
 		goto out_putfd;
 	}
 	path.mnt = mntget(mnt);
-	error = prepare_open(path.dentry, oflag, ro, mode, name, attr);
+	error = prepare_open(mnt_user_ns(path.mnt), path.dentry, oflag, ro, mode, name, attr);
 	if (!error) {
 		struct file *file = dentry_open(&path, oflag, current_cred());
 		if (!IS_ERR(file))
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 9cdecdea61d9..0dc26fcf4742 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1936,6 +1936,7 @@ void __audit_getname(struct filename *name)
 }
 
 static inline int audit_copy_fcaps(struct audit_names *name,
+				   struct user_namespace *user_ns,
 				   const struct dentry *dentry)
 {
 	struct cpu_vfs_cap_data caps;
@@ -1944,7 +1945,7 @@ static inline int audit_copy_fcaps(struct audit_names *name,
 	if (!dentry)
 		return 0;
 
-	rc = get_vfs_caps_from_disk(dentry, &caps);
+	rc = get_mapped_vfs_caps_from_disk(user_ns, dentry, &caps);
 	if (rc)
 		return rc;
 
@@ -1960,21 +1961,22 @@ static inline int audit_copy_fcaps(struct audit_names *name,
 
 /* Copy inode data into an audit_names. */
 static void audit_copy_inode(struct audit_names *name,
-			     const struct dentry *dentry,
-			     struct inode *inode, unsigned int flags)
+			     struct user_namespace *user_ns,
+			     const struct dentry *dentry, struct inode *inode,
+			     unsigned int flags)
 {
 	name->ino   = inode->i_ino;
 	name->dev   = inode->i_sb->s_dev;
 	name->mode  = inode->i_mode;
-	name->uid   = inode->i_uid;
-	name->gid   = inode->i_gid;
+	name->uid   = i_uid_into_mnt(user_ns, inode);
+	name->gid   = i_gid_into_mnt(user_ns, inode);
 	name->rdev  = inode->i_rdev;
 	security_inode_getsecid(inode, &name->osid);
 	if (flags & AUDIT_INODE_NOEVAL) {
 		name->fcap_ver = -1;
 		return;
 	}
-	audit_copy_fcaps(name, dentry);
+	audit_copy_fcaps(name, user_ns, dentry);
 }
 
 /**
@@ -1983,8 +1985,8 @@ static void audit_copy_inode(struct audit_names *name,
  * @dentry: dentry being audited
  * @flags: attributes for this particular entry
  */
-void __audit_inode(struct filename *name, const struct dentry *dentry,
-		   unsigned int flags)
+void __audit_inode(struct filename *name, struct user_namespace *user_ns,
+		   const struct dentry *dentry, unsigned int flags)
 {
 	struct audit_context *context = audit_context();
 	struct inode *inode = d_backing_inode(dentry);
@@ -2078,12 +2080,12 @@ void __audit_inode(struct filename *name, const struct dentry *dentry,
 		n->type = AUDIT_TYPE_NORMAL;
 	}
 	handle_path(dentry);
-	audit_copy_inode(n, dentry, inode, flags & AUDIT_INODE_NOEVAL);
+	audit_copy_inode(n, user_ns, dentry, inode, flags & AUDIT_INODE_NOEVAL);
 }
 
 void __audit_file(const struct file *file)
 {
-	__audit_inode(NULL, file->f_path.dentry, 0);
+	__audit_inode(NULL, mnt_user_ns(file->f_path.mnt), file->f_path.dentry, 0);
 }
 
 /**
@@ -2175,7 +2177,7 @@ void __audit_inode_child(struct inode *parent,
 		n = audit_alloc_name(context, AUDIT_TYPE_PARENT);
 		if (!n)
 			return;
-		audit_copy_inode(n, NULL, parent, 0);
+		audit_copy_inode(n, &init_user_ns, NULL, parent, 0);
 	}
 
 	if (!found_child) {
@@ -2194,7 +2196,7 @@ void __audit_inode_child(struct inode *parent,
 	}
 
 	if (inode)
-		audit_copy_inode(found_child, dentry, inode, 0);
+		audit_copy_inode(found_child, &init_user_ns, dentry, inode, 0);
 	else
 		found_child->ino = AUDIT_INO_UNSET;
 }
-- 
2.29.0



* [PATCH 29/34] ima: handle idmapped mounts
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)

IMA sometimes reads the inode's i_uid and compares it against a rule's
fowner. Enable IMA to handle idmapped mounts by passing down the mount's user
namespace. We simply make use of the helpers introduced earlier in this series.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
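Note: the fowner comparison mentioned in the commit message can be sketched
in userspace as follows. The model_* names are hypothetical, the lookup is
an assumed simplification of i_uid_into_mnt(), and only an equality
comparison is modelled here, although IMA rules may use other operators.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-extent id mapping (inside -> outside). */
struct model_extent {
	uint32_t first, lower_first, count;
};

/* Hypothetical stand-in for the fowner part of an IMA policy rule. */
struct model_rule {
	bool has_fowner;  /* rule carries an fowner= condition */
	uint32_t fowner;  /* uid the rule expects the file owner to have */
};

static uint32_t model_map(const struct model_extent *e, uint32_t id)
{
	assert(e != NULL);
	if (id >= e->first && id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return (uint32_t)-1; /* unmapped */
}

/*
 * Model of the fowner check: compare the rule against the owner as seen
 * through the mount instead of the raw i_uid.
 */
static bool model_fowner_matches(const struct model_rule *rule,
				 const struct model_extent *mnt_map,
				 uint32_t i_uid)
{
	if (!rule->has_fowner)
		return true;
	return rule->fowner == model_map(mnt_map, i_uid);
}
```

In this model a rule with fowner=101000 matches a file with i_uid 1000 only
when it is reached through a mount mapping 0 to 100000, and a rule without
an fowner condition matches regardless of the mapping.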
 fs/attr.c                                    |  2 +-
 fs/namei.c                                   |  4 +--
 include/linux/ima.h                          | 15 ++++++-----
 security/integrity/ima/ima.h                 | 19 ++++++++-----
 security/integrity/ima/ima_api.c             | 10 ++++---
 security/integrity/ima/ima_appraise.c        | 14 +++++-----
 security/integrity/ima/ima_asymmetric_keys.c |  2 +-
 security/integrity/ima/ima_main.c            | 28 ++++++++++++--------
 security/integrity/ima/ima_policy.c          | 17 ++++++------
 security/integrity/ima/ima_queue_keys.c      |  2 +-
 10 files changed, 66 insertions(+), 47 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index d13ef3f8eac0..9b05608bacd3 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -384,7 +384,7 @@ int notify_mapped_change(struct user_namespace *user_ns, struct dentry *dentry,
 
 	if (!error) {
 		fsnotify_change(dentry, ia_valid);
-		ima_inode_post_setattr(dentry);
+		ima_inode_post_setattr(user_ns, dentry);
 		evm_inode_post_setattr(dentry, ia_valid);
 	}
 
diff --git a/fs/namei.c b/fs/namei.c
index 545ce391a12d..ba78b57e1d86 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3379,7 +3379,7 @@ struct dentry *vfs_mapped_tmpfile(struct user_namespace *user_ns,
 		inode->i_state |= I_LINKABLE;
 		spin_unlock(&inode->i_lock);
 	}
-	ima_post_create_tmpfile(inode);
+	ima_post_create_tmpfile(user_ns, inode);
 	return child;
 
 out_err:
@@ -3703,7 +3703,7 @@ static long do_mknodat(int dfd, const char __user *filename, umode_t mode,
 			error = vfs_mapped_create(user_ns, path.dentry->d_inode,
 						  dentry, mode, true);
 			if (!error)
-				ima_post_path_mknod(dentry);
+				ima_post_path_mknod(user_ns, dentry);
 			break;
 		case S_IFCHR: case S_IFBLK:
 			error = vfs_mapped_mknod(user_ns, path.dentry->d_inode,
diff --git a/include/linux/ima.h b/include/linux/ima.h
index 8fa7bcfb2da2..c3e3c260ad40 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -16,7 +16,7 @@ struct linux_binprm;
 #ifdef CONFIG_IMA
 extern int ima_bprm_check(struct linux_binprm *bprm);
 extern int ima_file_check(struct file *file, int mask);
-extern void ima_post_create_tmpfile(struct inode *inode);
+extern void ima_post_create_tmpfile(struct user_namespace *user_ns, struct inode *inode);
 extern void ima_file_free(struct file *file);
 extern int ima_file_mmap(struct file *file, unsigned long prot);
 extern int ima_file_mprotect(struct vm_area_struct *vma, unsigned long prot);
@@ -27,7 +27,8 @@ extern int ima_read_file(struct file *file, enum kernel_read_file_id id,
 			 bool contents);
 extern int ima_post_read_file(struct file *file, void *buf, loff_t size,
 			      enum kernel_read_file_id id);
-extern void ima_post_path_mknod(struct dentry *dentry);
+extern void ima_post_path_mknod(struct user_namespace *user_ns,
+				struct dentry *dentry);
 extern int ima_file_hash(struct file *file, char *buf, size_t buf_size);
 extern void ima_kexec_cmdline(int kernel_fd, const void *buf, int size);
 
@@ -61,7 +62,8 @@ static inline int ima_file_check(struct file *file, int mask)
 	return 0;
 }
 
-static inline void ima_post_create_tmpfile(struct inode *inode)
+static inline void ima_post_create_tmpfile(struct user_namespace *user_ns,
+					   struct inode *inode)
 {
 }
 
@@ -105,7 +107,8 @@ static inline int ima_post_read_file(struct file *file, void *buf, loff_t size,
 	return 0;
 }
 
-static inline void ima_post_path_mknod(struct dentry *dentry)
+static inline void ima_post_path_mknod(struct user_namespace *user_ns,
+				       struct dentry *dentry)
 {
 	return;
 }
@@ -141,7 +144,7 @@ static inline void ima_post_key_create_or_update(struct key *keyring,
 
 #ifdef CONFIG_IMA_APPRAISE
 extern bool is_ima_appraise_enabled(void);
-extern void ima_inode_post_setattr(struct dentry *dentry);
+extern void ima_inode_post_setattr(struct user_namespace *user_ns, struct dentry *dentry);
 extern int ima_inode_setxattr(struct dentry *dentry, const char *xattr_name,
 		       const void *xattr_value, size_t xattr_value_len);
 extern int ima_inode_removexattr(struct dentry *dentry, const char *xattr_name);
@@ -151,7 +154,7 @@ static inline bool is_ima_appraise_enabled(void)
 	return 0;
 }
 
-static inline void ima_inode_post_setattr(struct dentry *dentry)
+static inline void ima_inode_post_setattr(struct user_namespace *user_ns, struct dentry *dentry)
 {
 	return;
 }
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 38043074ce5e..66b99df600cf 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -252,8 +252,9 @@ static inline void ima_process_queued_keys(void) {}
 #endif /* CONFIG_IMA_QUEUE_EARLY_BOOT_KEYS */
 
 /* LIM API function definitions */
-int ima_get_action(struct inode *inode, const struct cred *cred, u32 secid,
-		   int mask, enum ima_hooks func, int *pcr,
+int ima_get_action(struct user_namespace *user_ns, struct inode *inode,
+		   const struct cred *cred, u32 secid, int mask,
+		   enum ima_hooks func, int *pcr,
 		   struct ima_template_desc **template_desc,
 		   const char *keyring);
 int ima_must_measure(struct inode *inode, int mask, enum ima_hooks func);
@@ -265,7 +266,8 @@ void ima_store_measurement(struct integrity_iint_cache *iint, struct file *file,
 			   struct evm_ima_xattr_data *xattr_value,
 			   int xattr_len, const struct modsig *modsig, int pcr,
 			   struct ima_template_desc *template_desc);
-void process_buffer_measurement(struct inode *inode, const void *buf, int size,
+void process_buffer_measurement(struct user_namespace *user_ns,
+				struct inode *inode, const void *buf, int size,
 				const char *eventname, enum ima_hooks func,
 				int pcr, const char *keyring);
 void ima_audit_measurement(struct integrity_iint_cache *iint,
@@ -280,8 +282,9 @@ void ima_free_template_entry(struct ima_template_entry *entry);
 const char *ima_d_path(const struct path *path, char **pathbuf, char *filename);
 
 /* IMA policy related functions */
-int ima_match_policy(struct inode *inode, const struct cred *cred, u32 secid,
-		     enum ima_hooks func, int mask, int flags, int *pcr,
+int ima_match_policy(struct user_namespace *user_ns, struct inode *inode,
+		     const struct cred *cred, u32 secid, enum ima_hooks func,
+		     int mask, int flags, int *pcr,
 		     struct ima_template_desc **template_desc,
 		     const char *keyring);
 void ima_init_policy(void);
@@ -312,7 +315,8 @@ int ima_appraise_measurement(enum ima_hooks func,
 			     struct file *file, const unsigned char *filename,
 			     struct evm_ima_xattr_data *xattr_value,
 			     int xattr_len, const struct modsig *modsig);
-int ima_must_appraise(struct inode *inode, int mask, enum ima_hooks func);
+int ima_must_appraise(struct user_namespace *user_ns, struct inode *inode,
+		      int mask, enum ima_hooks func);
 void ima_update_xattr(struct integrity_iint_cache *iint, struct file *file);
 enum integrity_status ima_get_cache_status(struct integrity_iint_cache *iint,
 					   enum ima_hooks func);
@@ -339,7 +343,8 @@ static inline int ima_appraise_measurement(enum ima_hooks func,
 	return INTEGRITY_UNKNOWN;
 }
 
-static inline int ima_must_appraise(struct inode *inode, int mask,
+static inline int ima_must_appraise(struct user_namespace *user_ns,
+				    struct inode *inode, int mask,
 				    enum ima_hooks func)
 {
 	return 0;
diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index 4f39fb93f278..ec51ada849a5 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -162,6 +162,7 @@ void ima_add_violation(struct file *file, const unsigned char *filename,
 
 /**
  * ima_get_action - appraise & measure decision based on policy.
+ * @user_ns: userns of the mount through which the inode is accessed
  * @inode: pointer to the inode associated with the object being validated
  * @cred: pointer to credentials structure to validate
  * @secid: secid of the task being validated
@@ -183,8 +184,9 @@ void ima_add_violation(struct file *file, const unsigned char *filename,
  * Returns IMA_MEASURE, IMA_APPRAISE mask.
  *
  */
-int ima_get_action(struct inode *inode, const struct cred *cred, u32 secid,
-		   int mask, enum ima_hooks func, int *pcr,
+int ima_get_action(struct user_namespace *user_ns, struct inode *inode,
+		   const struct cred *cred, u32 secid, int mask,
+		   enum ima_hooks func, int *pcr,
 		   struct ima_template_desc **template_desc,
 		   const char *keyring)
 {
@@ -192,8 +194,8 @@ int ima_get_action(struct inode *inode, const struct cred *cred, u32 secid,
 
 	flags &= ima_policy_flag;
 
-	return ima_match_policy(inode, cred, secid, func, mask, flags, pcr,
-				template_desc, keyring);
+	return ima_match_policy(user_ns, inode, cred, secid, func, mask, flags,
+				pcr, template_desc, keyring);
 }
 
 /*
diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
index 3dd8c2e4314e..a8a5d7c45b72 100644
--- a/security/integrity/ima/ima_appraise.c
+++ b/security/integrity/ima/ima_appraise.c
@@ -63,7 +63,8 @@ bool is_ima_appraise_enabled(void)
  *
  * Return 1 to appraise or hash
  */
-int ima_must_appraise(struct inode *inode, int mask, enum ima_hooks func)
+int ima_must_appraise(struct user_namespace *user_ns, struct inode *inode,
+		      int mask, enum ima_hooks func)
 {
 	u32 secid;
 
@@ -71,8 +72,8 @@ int ima_must_appraise(struct inode *inode, int mask, enum ima_hooks func)
 		return 0;
 
 	security_task_getsecid(current, &secid);
-	return ima_match_policy(inode, current_cred(), secid, func, mask,
-				IMA_APPRAISE | IMA_HASH, NULL, NULL, NULL);
+	return ima_match_policy(user_ns, inode, current_cred(), secid, func,
+				mask, IMA_APPRAISE | IMA_HASH, NULL, NULL, NULL);
 }
 
 static int ima_fix_xattr(struct dentry *dentry,
@@ -345,7 +346,7 @@ int ima_check_blacklist(struct integrity_iint_cache *iint,
 
 		rc = is_binary_blacklisted(digest, digestsize);
 		if ((rc == -EPERM) && (iint->flags & IMA_MEASURE))
-			process_buffer_measurement(NULL, digest, digestsize,
+			process_buffer_measurement(NULL, NULL, digest, digestsize,
 						   "blacklisted-hash", NONE,
 						   pcr, NULL);
 	}
@@ -496,6 +497,7 @@ void ima_update_xattr(struct integrity_iint_cache *iint, struct file *file)
 
 /**
  * ima_inode_post_setattr - reflect file metadata changes
+ * @user_ns: user namespace of the mount
  * @dentry: pointer to the affected dentry
  *
  * Changes to a dentry's metadata might result in needing to appraise.
@@ -503,7 +505,7 @@ void ima_update_xattr(struct integrity_iint_cache *iint, struct file *file)
  * This function is called from notify_change(), which expects the caller
  * to lock the inode's i_mutex.
  */
-void ima_inode_post_setattr(struct dentry *dentry)
+void ima_inode_post_setattr(struct user_namespace *user_ns, struct dentry *dentry)
 {
 	struct inode *inode = d_backing_inode(dentry);
 	struct integrity_iint_cache *iint;
@@ -513,7 +515,7 @@ void ima_inode_post_setattr(struct dentry *dentry)
 	    || !(inode->i_opflags & IOP_XATTR))
 		return;
 
-	action = ima_must_appraise(inode, MAY_ACCESS, POST_SETATTR);
+	action = ima_must_appraise(user_ns, inode, MAY_ACCESS, POST_SETATTR);
 	if (!action)
 		__vfs_removexattr(dentry, XATTR_NAME_IMA);
 	iint = integrity_iint_find(inode);
diff --git a/security/integrity/ima/ima_asymmetric_keys.c b/security/integrity/ima/ima_asymmetric_keys.c
index 1c68c500c26f..9810f3bfa57f 100644
--- a/security/integrity/ima/ima_asymmetric_keys.c
+++ b/security/integrity/ima/ima_asymmetric_keys.c
@@ -58,7 +58,7 @@ void ima_post_key_create_or_update(struct key *keyring, struct key *key,
 	 * if the IMA policy is configured to measure a key linked
 	 * to the given keyring.
 	 */
-	process_buffer_measurement(NULL, payload, payload_len,
+	process_buffer_measurement(NULL, NULL, payload, payload_len,
 				   keyring->description, KEY_CHECK, 0,
 				   keyring->description);
 }
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 2d1af8899cab..562003bab943 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -218,8 +218,8 @@ static int process_measurement(struct file *file, const struct cred *cred,
 	 * bitmask based on the appraise/audit/measurement policy.
 	 * Included is the appraise submask.
 	 */
-	action = ima_get_action(inode, cred, secid, mask, func, &pcr,
-				&template_desc, NULL);
+	action = ima_get_action(mnt_user_ns(file->f_path.mnt), inode, cred,
+				secid, mask, func, &pcr, &template_desc, NULL);
 	violation_check = ((func == FILE_CHECK || func == MMAP_CHECK) &&
 			   (ima_policy_flag & IMA_MEASURE));
 	if (!action && !violation_check)
@@ -431,8 +431,9 @@ int ima_file_mprotect(struct vm_area_struct *vma, unsigned long prot)
 
 	security_task_getsecid(current, &secid);
 	inode = file_inode(vma->vm_file);
-	action = ima_get_action(inode, current_cred(), secid, MAY_EXEC,
-				MMAP_CHECK, &pcr, &template, 0);
+	action = ima_get_action(mnt_user_ns(vma->vm_file->f_path.mnt), inode,
+				current_cred(), secid, MAY_EXEC, MMAP_CHECK,
+				&pcr, &template, 0);
 
 	/* Is the mmap'ed file in policy? */
 	if (!(action & (IMA_MEASURE | IMA_APPRAISE_SUBMASK)))
@@ -568,12 +569,13 @@ EXPORT_SYMBOL_GPL(ima_file_hash);
  * Skip calling process_measurement(), but indicate which newly, created
  * tmpfiles are in policy.
  */
-void ima_post_create_tmpfile(struct inode *inode)
+void ima_post_create_tmpfile(struct user_namespace *user_ns,
+			     struct inode *inode)
 {
 	struct integrity_iint_cache *iint;
 	int must_appraise;
 
-	must_appraise = ima_must_appraise(inode, MAY_ACCESS, FILE_CHECK);
+	must_appraise = ima_must_appraise(user_ns, inode, MAY_ACCESS, FILE_CHECK);
 	if (!must_appraise)
 		return;
 
@@ -589,18 +591,19 @@ void ima_post_create_tmpfile(struct inode *inode)
 
 /**
  * ima_post_path_mknod - mark as a new inode
+ * @user_ns: user namespace of the mount
  * @dentry: newly created dentry
  *
  * Mark files created via the mknodat syscall as new, so that the
  * file data can be written later.
  */
-void ima_post_path_mknod(struct dentry *dentry)
+void ima_post_path_mknod(struct user_namespace *user_ns, struct dentry *dentry)
 {
 	struct integrity_iint_cache *iint;
 	struct inode *inode = dentry->d_inode;
 	int must_appraise;
 
-	must_appraise = ima_must_appraise(inode, MAY_ACCESS, FILE_CHECK);
+	must_appraise = ima_must_appraise(user_ns, inode, MAY_ACCESS, FILE_CHECK);
 	if (!must_appraise)
 		return;
 
@@ -780,6 +783,7 @@ int ima_post_load_data(char *buf, loff_t size,
 
 /*
  * process_buffer_measurement - Measure the buffer to ima log.
+ * @userns: user namespace of the mount through which the inode is accessed
  * @inode: inode associated with the object being measured (NULL for KEY_CHECK)
  * @buf: pointer to the buffer that needs to be added to the log.
  * @size: size of buffer(in bytes).
@@ -790,7 +794,8 @@ int ima_post_load_data(char *buf, loff_t size,
  *
  * Based on policy, the buffer is measured into the ima log.
  */
-void process_buffer_measurement(struct inode *inode, const void *buf, int size,
+void process_buffer_measurement(struct user_namespace *user_ns,
+				struct inode *inode, const void *buf, int size,
 				const char *eventname, enum ima_hooks func,
 				int pcr, const char *keyring)
 {
@@ -823,7 +828,7 @@ void process_buffer_measurement(struct inode *inode, const void *buf, int size,
 	 */
 	if (func) {
 		security_task_getsecid(current, &secid);
-		action = ima_get_action(inode, current_cred(), secid, 0, func,
+		action = ima_get_action(user_ns, inode, current_cred(), secid, 0, func,
 					&pcr, &template, keyring);
 		if (!(action & IMA_MEASURE))
 			return;
@@ -895,7 +900,8 @@ void ima_kexec_cmdline(int kernel_fd, const void *buf, int size)
 	if (!f.file)
 		return;
 
-	process_buffer_measurement(file_inode(f.file), buf, size,
+	process_buffer_measurement(mnt_user_ns(f.file->f_path.mnt),
+				   file_inode(f.file), buf, size,
 				   "kexec-cmdline", KEXEC_CMDLINE, 0, NULL);
 	fdput(f);
 }
diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
index 9b5adeaa47fc..003d974ce2f3 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -497,10 +497,10 @@ static bool ima_match_keyring(struct ima_rule_entry *rule,
  *
  * Returns true on rule match, false on failure.
  */
-static bool ima_match_rules(struct ima_rule_entry *rule, struct inode *inode,
+static bool ima_match_rules(struct ima_rule_entry *rule,
+			    struct user_namespace *user_ns, struct inode *inode,
 			    const struct cred *cred, u32 secid,
-			    enum ima_hooks func, int mask,
-			    const char *keyring)
+			    enum ima_hooks func, int mask, const char *keyring)
 {
 	int i;
 
@@ -539,7 +539,7 @@ static bool ima_match_rules(struct ima_rule_entry *rule, struct inode *inode,
 	}
 
 	if ((rule->flags & IMA_FOWNER) &&
-	    !rule->fowner_op(inode->i_uid, rule->fowner))
+	    !rule->fowner_op(i_uid_into_mnt(user_ns, inode), rule->fowner))
 		return false;
 	for (i = 0; i < MAX_LSM_RULES; i++) {
 		int rc = 0;
@@ -620,8 +620,9 @@ static int get_subaction(struct ima_rule_entry *rule, enum ima_hooks func)
  * list when walking it.  Reads are many orders of magnitude more numerous
  * than writes so ima_match_policy() is classical RCU candidate.
  */
-int ima_match_policy(struct inode *inode, const struct cred *cred, u32 secid,
-		     enum ima_hooks func, int mask, int flags, int *pcr,
+int ima_match_policy(struct user_namespace *user_ns, struct inode *inode,
+		     const struct cred *cred, u32 secid, enum ima_hooks func,
+		     int mask, int flags, int *pcr,
 		     struct ima_template_desc **template_desc,
 		     const char *keyring)
 {
@@ -637,8 +638,8 @@ int ima_match_policy(struct inode *inode, const struct cred *cred, u32 secid,
 		if (!(entry->action & actmask))
 			continue;
 
-		if (!ima_match_rules(entry, inode, cred, secid, func, mask,
-				     keyring))
+		if (!ima_match_rules(entry, user_ns, inode, cred, secid, func,
+				     mask, keyring))
 			continue;
 
 		action |= entry->flags & IMA_ACTION_FLAGS;
diff --git a/security/integrity/ima/ima_queue_keys.c b/security/integrity/ima/ima_queue_keys.c
index 69a8626a35c0..2bacc4f3e6ba 100644
--- a/security/integrity/ima/ima_queue_keys.c
+++ b/security/integrity/ima/ima_queue_keys.c
@@ -158,7 +158,7 @@ void ima_process_queued_keys(void)
 
 	list_for_each_entry_safe(entry, tmp, &ima_keys, list) {
 		if (!timer_expired)
-			process_buffer_measurement(NULL, entry->payload,
+			process_buffer_measurement(NULL, NULL, entry->payload,
 						   entry->payload_len,
 						   entry->keyring_name,
 						   KEY_CHECK, 0,
-- 
2.29.0



* [PATCH 30/34] ext4: support idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (28 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 29/34] ima: " Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 31/34] expfs: handle " Christian Brauner
                   ` (7 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)

This enables ext4 to support idmapped mounts. All the dedicated helpers we need
for this already exist. The vfs will have already made sure that the fsids can
be translated after having been shifted if they are on an idmapped mount.

This implements helpers for the new inode operations that we've added. The core
change is that a new inode is allocated based on the mount's user namespace.
Code duplication is virtually non-existent because we can implement the inode
methods that are not idmapped mount aware on top of the idmapped mount aware
inode methods. When the initial user namespace is passed, the idmapped mount
helpers are nops, and all mounts are marked with the initial user namespace by
default.
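
The layering described above can be modelled in userspace roughly as follows.
This is only a sketch: every toy_* name is hypothetical and merely stands in
for the kernel types and helpers; it is not the code from this patch. It
assumes a single-range mapping and treats the initial namespace as the
identity mapping.

```c
#include <stdbool.h>

#define TOY_INVALID_ID ((unsigned int)-1)

/* Hypothetical stand-in for struct user_namespace with one mapped range. */
struct toy_user_ns {
	bool is_init;		/* true for the initial user namespace */
	unsigned int first;	/* first id as stored on disk */
	unsigned int lower;	/* first id as seen through the mount */
	unsigned int count;	/* number of ids in the range */
};

static struct toy_user_ns toy_init_user_ns = { .is_init = true };

struct toy_inode {
	unsigned int i_uid;
};

/* Shift the caller's fsuid back to the on-disk id; a nop for the initial ns. */
static unsigned int toy_fsuid_into_mnt(const struct toy_user_ns *ns,
				       unsigned int fsuid)
{
	if (ns->is_init)
		return fsuid;
	if (fsuid >= ns->lower && fsuid - ns->lower < ns->count)
		return ns->first + (fsuid - ns->lower);
	return TOY_INVALID_ID;
}

/* The idmapped mount aware method holds the only real implementation. */
static int toy_create_mapped(const struct toy_user_ns *ns,
			     struct toy_inode *inode, unsigned int fsuid)
{
	unsigned int id = toy_fsuid_into_mnt(ns, fsuid);

	if (id == TOY_INVALID_ID)
		return -1;	/* caller has no mapping in this mount */
	inode->i_uid = id;
	return 0;
}

/* The non-aware method is a thin wrapper passing the initial namespace. */
static int toy_create(struct toy_inode *inode, unsigned int fsuid)
{
	return toy_create_mapped(&toy_init_user_ns, inode, fsuid);
}
```

With a toy mapping of { .first = 0, .lower = 1000, .count = 1 }, a create by
fsuid 1000 through toy_create_mapped() stores uid 0 in the inode, while
toy_create() stores the fsuid unchanged.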

It is also noteworthy that the idmapped mount implementation allows us to
cleanly handle ioctls too. I've kept this as a single patch for now since the
change is overall fairly mechanical, but I'm happy to split it.

Let's create a simple example where we idmap an ext4 filesystem:

 root@f2-vm:~# truncate -s 5G ext4.img

 root@f2-vm:~# mkfs.ext4 ./ext4.img
 mke2fs 1.45.5 (07-Jan-2020)
 Discarding device blocks: done
 Creating filesystem with 1310720 4k blocks and 327680 inodes
 Filesystem UUID: 3fd91794-c6ca-4b0f-9964-289a000919cf
 Superblock backups stored on blocks:
         32768, 98304, 163840, 229376, 294912, 819200, 884736

 Allocating group tables: done
 Writing inode tables: done
 Creating journal (16384 blocks): done
 Writing superblocks and filesystem accounting information: done

 root@f2-vm:~# losetup -f --show ./ext4.img
 /dev/loop0

 root@f2-vm:~# mount /dev/loop0 /mnt

 root@f2-vm:~# ls -al /mnt/
 total 24
 drwxr-xr-x  3 root root  4096 Oct 28 13:34 .
 drwxr-xr-x 30 root root  4096 Oct 28 13:22 ..
 drwx------  2 root root 16384 Oct 28 13:34 lost+found

 # Let's create an idmapped mount at /idmapped1 where we map uid and gid 0 to
 # uid and gid 1000
 root@f2-vm:/# ./mount2 -mb:0:1000:1 /mnt/ /idmapped1/

 root@f2-vm:/# ls -al /idmapped1/
 total 24
 drwxr-xr-x  3 ubuntu ubuntu  4096 Oct 28 13:34 .
 drwxr-xr-x 30 root   root    4096 Oct 28 13:22 ..
 drwx------  2 ubuntu ubuntu 16384 Oct 28 13:34 lost+found

 # Let's create an idmapped mount at /idmapped2 where we map uid and gid 0 to
 # uid and gid 2000
 root@f2-vm:/# ./mount2 -mb:0:2000:1 /mnt/ /idmapped2/

 root@f2-vm:/# ls -al /idmapped2/
 total 24
 drwxr-xr-x  3 2000 2000  4096 Oct 28 13:34 .
 drwxr-xr-x 31 root root  4096 Oct 28 13:39 ..
 drwx------  2 2000 2000 16384 Oct 28 13:34 lost+found

Let's create another example where we idmap the rootfs filesystem without a
mapping for uid 0 and gid 0:

 # Create an idmapped mount for a full POSIX range of the rootfs under /mnt
 # but without a mapping for uid 0 to reduce attack surface

 root@f2-vm:/# ./mount2 -mb:1:1:65536 / /mnt/

 # Since we don't have a mapping for uid and gid 0 all files owned by uid and
 # gid 0 should show up as uid and gid 65534:
 root@f2-vm:/# ls -al /mnt/
 total 664
 drwxr-xr-x 31 nobody nogroup   4096 Oct 28 13:39 .
 drwxr-xr-x 31 root   root      4096 Oct 28 13:39 ..
 lrwxrwxrwx  1 nobody nogroup      7 Aug 25 07:44 bin -> usr/bin
 drwxr-xr-x  4 nobody nogroup   4096 Oct 28 13:17 boot
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:48 dev
 drwxr-xr-x 81 nobody nogroup   4096 Oct 28 04:00 etc
 drwxr-xr-x  4 nobody nogroup   4096 Oct 28 04:00 home
 lrwxrwxrwx  1 nobody nogroup      7 Aug 25 07:44 lib -> usr/lib
 lrwxrwxrwx  1 nobody nogroup      9 Aug 25 07:44 lib32 -> usr/lib32
 lrwxrwxrwx  1 nobody nogroup      9 Aug 25 07:44 lib64 -> usr/lib64
 lrwxrwxrwx  1 nobody nogroup     10 Aug 25 07:44 libx32 -> usr/libx32
 drwx------  2 nobody nogroup  16384 Aug 25 07:47 lost+found
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 media
 drwxr-xr-x 31 nobody nogroup   4096 Oct 28 13:39 mnt
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 opt
 drwxr-xr-x  2 nobody nogroup   4096 Apr 15  2020 proc
 drwx--x--x  6 nobody nogroup   4096 Oct 28 13:34 root
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:46 run
 lrwxrwxrwx  1 nobody nogroup      8 Aug 25 07:44 sbin -> usr/sbin
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 srv
 drwxr-xr-x  2 nobody nogroup   4096 Apr 15  2020 sys
 drwxrwxrwt 10 nobody nogroup   4096 Oct 28 13:19 tmp
 drwxr-xr-x 14 nobody nogroup   4096 Oct 20 13:00 usr
 drwxr-xr-x 12 nobody nogroup   4096 Aug 25 07:45 var

 # Since we do have a mapping for uid and gid 1000 all files owned by uid and
 # gid 1000 should simply show up as uid and gid 1000:
 root@f2-vm:/# ls -al /mnt/home/ubuntu/
 total 40
 drwxr-xr-x 3 ubuntu ubuntu  4096 Oct 28 00:43 .
 drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 ..
 -rw------- 1 ubuntu ubuntu  2936 Oct 28 12:26 .bash_history
 -rw-r--r-- 1 ubuntu ubuntu   220 Feb 25  2020 .bash_logout
 -rw-r--r-- 1 ubuntu ubuntu  3771 Feb 25  2020 .bashrc
 -rw-r--r-- 1 ubuntu ubuntu   807 Feb 25  2020 .profile
 -rw-r--r-- 1 ubuntu ubuntu     0 Oct 16 16:11 .sudo_as_admin_successful
 -rw------- 1 ubuntu ubuntu  1144 Oct 28 00:43 .viminfo
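
The ownership shown in the listings above follows from simple range
arithmetic. Here is a small userspace sketch of it, assuming a single range
per mount and assuming that ids without a mapping are reported as the
overflow id 65534. The toy_* names are hypothetical and not part of this
series. In the kernel the lookup and the overflow substitution are separate
steps; the sketch collapses them into one function.

```c
#define TOY_OVERFLOW_ID 65534u

/* Hypothetical single-range mapping: on-disk ids [first, first + count)
 * appear through the mount as [lower, lower + count). */
struct toy_idmap {
	unsigned int first;
	unsigned int lower;
	unsigned int count;
};

/* Translate an on-disk id into the id shown through the idmapped mount. */
static unsigned int toy_id_into_mnt(const struct toy_idmap *map,
				    unsigned int disk_id)
{
	if (disk_id >= map->first && disk_id - map->first < map->count)
		return map->lower + (disk_id - map->first);
	return TOY_OVERFLOW_ID;	/* no mapping: shows up as nobody/nogroup */
}
```

In this model, { 0, 1000, 1 } and { 0, 2000, 1 } correspond to /idmapped1 and
/idmapped2, where on-disk uid 0 appears as 1000 and 2000 respectively. The
rootfs example corresponds to { 1, 1, 65536 }, where uid 0 falls outside the
range and appears as 65534 while uid 1000 stays 1000.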

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/ext4/acl.c            |  11 ++-
 fs/ext4/acl.h            |   3 +
 fs/ext4/ext4.h           |  14 ++--
 fs/ext4/file.c           |   4 ++
 fs/ext4/ialloc.c         |   7 +-
 fs/ext4/inode.c          |  27 ++++++--
 fs/ext4/ioctl.c          |  18 ++---
 fs/ext4/namei.c          | 145 ++++++++++++++++++++++++++++++++-------
 fs/ext4/super.c          |   4 ++
 fs/ext4/symlink.c        |   9 +++
 fs/ext4/xattr_hurd.c     |  22 ++++--
 fs/ext4/xattr_security.c |  18 ++++-
 fs/ext4/xattr_trusted.c  |  18 ++++-
 13 files changed, 242 insertions(+), 58 deletions(-)

diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index 68aaed48315f..9c7b898a7496 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -222,7 +222,8 @@ __ext4_set_acl(handle_t *handle, struct inode *inode, int type,
 }
 
 int
-ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
+ext4_set_acl_mapped(struct user_namespace *user_ns,
+		    struct inode *inode, struct posix_acl *acl, int type)
 {
 	handle_t *handle;
 	int error, credits, retries = 0;
@@ -245,7 +246,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	ext4_fc_start_update(inode);
 
 	if ((type == ACL_TYPE_ACCESS) && acl) {
-		error = posix_acl_update_mode(inode, &mode, &acl);
+		error = posix_mapped_acl_update_mode(user_ns, inode, &mode, &acl);
 		if (error)
 			goto out_stop;
 		if (mode != inode->i_mode)
@@ -266,6 +267,12 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	return error;
 }
 
+int
+ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
+{
+	return ext4_set_acl_mapped(&init_user_ns, inode, acl, type);
+}
+
 /*
  * Initialize the ACLs of a new inode. Called from ext4_new_inode.
  *
diff --git a/fs/ext4/acl.h b/fs/ext4/acl.h
index 9b63f5416a2f..aff92cd331cf 100644
--- a/fs/ext4/acl.h
+++ b/fs/ext4/acl.h
@@ -58,11 +58,14 @@ static inline int ext4_acl_count(size_t size)
 struct posix_acl *ext4_get_acl(struct inode *inode, int type);
 int ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type);
 extern int ext4_init_acl(handle_t *, struct inode *, struct inode *);
+int ext4_set_acl_mapped(struct user_namespace *user_ns, struct inode *inode,
+			struct posix_acl *acl, int type);
 
 #else  /* CONFIG_EXT4_FS_POSIX_ACL */
 #include <linux/sched.h>
 #define ext4_get_acl NULL
 #define ext4_set_acl NULL
+#define ext4_set_acl_mapped NULL
 
 static inline int
 ext4_init_acl(handle_t *handle, struct inode *inode, struct inode *dir)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 254d1c26bea8..fb8bad3dd94f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2714,18 +2714,19 @@ extern int ext4fs_dirhash(const struct inode *dir, const char *name, int len,
 
 /* ialloc.c */
 extern int ext4_mark_inode_used(struct super_block *sb, int ino);
-extern struct inode *__ext4_new_inode(handle_t *, struct inode *, umode_t,
+extern struct inode *__ext4_new_inode(struct user_namespace *, handle_t *,
+				      struct inode *, umode_t,
 				      const struct qstr *qstr, __u32 goal,
 				      uid_t *owner, __u32 i_flags,
 				      int handle_type, unsigned int line_no,
 				      int nblocks);
 
-#define ext4_new_inode(handle, dir, mode, qstr, goal, owner, i_flags) \
-	__ext4_new_inode((handle), (dir), (mode), (qstr), (goal), (owner), \
-			 i_flags, 0, 0, 0)
-#define ext4_new_inode_start_handle(dir, mode, qstr, goal, owner, \
+#define ext4_new_inode(handle, dir, mode, qstr, goal, owner, i_flags)          \
+	__ext4_new_inode(&init_user_ns, (handle), (dir), (mode), (qstr),       \
+			 (goal), (owner), i_flags, 0, 0, 0)
+#define ext4_new_inode_start_handle(user_ns, dir, mode, qstr, goal, owner, \
 				    type, nblocks)		    \
-	__ext4_new_inode(NULL, (dir), (mode), (qstr), (goal), (owner), \
+	__ext4_new_inode((user_ns), NULL, (dir), (mode), (qstr), (goal), (owner), \
 			 0, (type), __LINE__, (nblocks))
 
 
@@ -2833,6 +2834,7 @@ extern struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 
 extern int  ext4_write_inode(struct inode *, struct writeback_control *);
 extern int  ext4_setattr(struct dentry *, struct iattr *);
+extern int  ext4_setattr_mapped(struct user_namespace *, struct dentry *, struct iattr *);
 extern int  ext4_getattr(const struct path *, struct kstat *, u32, unsigned int);
 extern void ext4_evict_inode(struct inode *);
 extern void ext4_clear_inode(struct inode *);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d85412d12e3a..44880c0cdadf 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -919,5 +919,9 @@ const struct inode_operations ext4_file_inode_operations = {
 	.get_acl	= ext4_get_acl,
 	.set_acl	= ext4_set_acl,
 	.fiemap		= ext4_fiemap,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.setattr_mapped	= ext4_setattr_mapped,
+	.set_acl_mapped	= ext4_set_acl_mapped,
+#endif
 };
 
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index b215c564bc31..7676a2bcbb51 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -919,7 +919,8 @@ static int ext4_xattr_credits_for_new_inode(struct inode *dir, mode_t mode,
  * For other inodes, search forward from the parent directory's block
  * group to find a free inode.
  */
-struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
+struct inode *__ext4_new_inode(struct user_namespace *user_ns,
+			       handle_t *handle, struct inode *dir,
 			       umode_t mode, const struct qstr *qstr,
 			       __u32 goal, uid_t *owner, __u32 i_flags,
 			       int handle_type, unsigned int line_no,
@@ -969,10 +970,10 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 		i_gid_write(inode, owner[1]);
 	} else if (test_opt(sb, GRPID)) {
 		inode->i_mode = mode;
-		inode->i_uid = current_fsuid();
+		inode->i_uid = fsuid_into_mnt(user_ns);
 		inode->i_gid = dir->i_gid;
 	} else
-		inode_init_owner(inode, dir, mode);
+		mapped_inode_init_owner(inode, user_ns, dir, mode);
 
 	if (ext4_has_feature_project(sb) &&
 	    ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 03c2253005f0..c3c75fcae6bf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -20,6 +20,7 @@
  */
 
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/time.h>
 #include <linux/highuid.h>
 #include <linux/pagemap.h>
@@ -5280,7 +5281,7 @@ static void ext4_wait_for_tail_page_commit(struct inode *inode)
 }
 
 /*
- * ext4_setattr()
+ * __ext4_setattr()
  *
  * Called from notify_change.
  *
@@ -5303,7 +5304,8 @@ static void ext4_wait_for_tail_page_commit(struct inode *inode)
  *
  * Called with inode->i_mutex down.
  */
-int ext4_setattr(struct dentry *dentry, struct iattr *attr)
+static int __ext4_setattr(struct user_namespace *user_ns, struct dentry *dentry,
+			  struct iattr *attr)
 {
 	struct inode *inode = d_inode(dentry);
 	int error, rc = 0;
@@ -5321,7 +5323,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 				  ATTR_GID | ATTR_TIMES_SET))))
 		return -EPERM;
 
-	error = setattr_prepare(dentry, attr);
+	error = setattr_mapped_prepare(user_ns, dentry, attr);
 	if (error)
 		return error;
 
@@ -5496,7 +5498,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 	}
 
 	if (!error) {
-		setattr_copy(inode, attr);
+		setattr_mapped_copy(user_ns, inode, attr);
 		mark_inode_dirty(inode);
 	}
 
@@ -5508,7 +5510,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		ext4_orphan_del(NULL, inode);
 
 	if (!error && (ia_valid & ATTR_MODE))
-		rc = posix_acl_chmod(inode, inode->i_mode);
+		rc = posix_mapped_acl_chmod(user_ns, inode, inode->i_mode);
 
 err_out:
 	if  (error)
@@ -5519,9 +5521,21 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 	return error;
 }
 
+int ext4_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	return __ext4_setattr(&init_user_ns, dentry, attr);
+}
+
+int ext4_setattr_mapped(struct user_namespace *user_ns, struct dentry *dentry,
+			struct iattr *attr)
+{
+	return __ext4_setattr(user_ns, dentry, attr);
+}
+
 int ext4_getattr(const struct path *path, struct kstat *stat,
 		 u32 request_mask, unsigned int query_flags)
 {
+	struct user_namespace *user_ns;
 	struct inode *inode = d_inode(path->dentry);
 	struct ext4_inode *raw_inode;
 	struct ext4_inode_info *ei = EXT4_I(inode);
@@ -5555,7 +5569,8 @@ int ext4_getattr(const struct path *path, struct kstat *stat,
 				  STATX_ATTR_NODUMP |
 				  STATX_ATTR_VERITY);
 
-	generic_fillattr(inode, stat);
+	user_ns = mnt_user_ns(path->mnt);
+	mapped_generic_fillattr(user_ns, inode, stat);
 	return 0;
 }
 
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index f0381876a7e5..90d7860de36e 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -111,6 +111,7 @@ void ext4_reset_inode_seed(struct inode *inode)
  *
  */
 static long swap_inode_boot_loader(struct super_block *sb,
+				struct user_namespace *user_ns,
 				struct inode *inode)
 {
 	handle_t *handle;
@@ -139,7 +140,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
 	}
 
 	if (IS_RDONLY(inode) || IS_APPEND(inode) || IS_IMMUTABLE(inode) ||
-	    !inode_owner_or_capable(inode) || !capable(CAP_SYS_ADMIN)) {
+	    !mapped_inode_owner_or_capable(user_ns, inode) || !capable(CAP_SYS_ADMIN)) {
 		err = -EPERM;
 		goto journal_err_out;
 	}
@@ -814,6 +815,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 	struct inode *inode = file_inode(filp);
 	struct super_block *sb = inode->i_sb;
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct user_namespace *user_ns = mnt_user_ns(filp->f_path.mnt);
 	unsigned int flags;
 
 	ext4_debug("cmd = %u, arg = %lu\n", cmd, arg);
@@ -829,7 +831,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 	case FS_IOC_SETFLAGS: {
 		int err;
 
-		if (!inode_owner_or_capable(inode))
+		if (!mapped_inode_owner_or_capable(user_ns, inode))
 			return -EACCES;
 
 		if (get_user(flags, (int __user *) arg))
@@ -871,7 +873,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		__u32 generation;
 		int err;
 
-		if (!inode_owner_or_capable(inode))
+		if (!mapped_inode_owner_or_capable(user_ns, inode))
 			return -EPERM;
 
 		if (ext4_has_metadata_csum(inode->i_sb)) {
@@ -1010,7 +1012,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 	case EXT4_IOC_MIGRATE:
 	{
 		int err;
-		if (!inode_owner_or_capable(inode))
+		if (!mapped_inode_owner_or_capable(user_ns, inode))
 			return -EACCES;
 
 		err = mnt_want_write_file(filp);
@@ -1032,7 +1034,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 	case EXT4_IOC_ALLOC_DA_BLKS:
 	{
 		int err;
-		if (!inode_owner_or_capable(inode))
+		if (!mapped_inode_owner_or_capable(user_ns, inode))
 			return -EACCES;
 
 		err = mnt_want_write_file(filp);
@@ -1051,7 +1053,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		err = mnt_want_write_file(filp);
 		if (err)
 			return err;
-		err = swap_inode_boot_loader(sb, inode);
+		err = swap_inode_boot_loader(sb, user_ns, inode);
 		mnt_drop_write_file(filp);
 		return err;
 	}
@@ -1214,7 +1216,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 
 	case EXT4_IOC_CLEAR_ES_CACHE:
 	{
-		if (!inode_owner_or_capable(inode))
+		if (!mapped_inode_owner_or_capable(user_ns, inode))
 			return -EACCES;
 		ext4_clear_inode_es(inode);
 		return 0;
@@ -1260,7 +1262,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 			return -EFAULT;
 
 		/* Make sure caller has proper permission */
-		if (!inode_owner_or_capable(inode))
+		if (!mapped_inode_owner_or_capable(user_ns, inode))
 			return -EACCES;
 
 		if (fa.fsx_xflags & ~EXT4_SUPPORTED_FS_XFLAGS)
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 5159830dacb8..8c501899a1a8 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2606,8 +2606,8 @@ static int ext4_add_nondir(handle_t *handle,
  * If the create succeeds, we fill in the inode information
  * with d_instantiate().
  */
-static int ext4_create(struct inode *dir, struct dentry *dentry, umode_t mode,
-		       bool excl)
+static int __ext4_create(struct user_namespace *user_ns, struct inode *dir,
+			 struct dentry *dentry, umode_t mode, bool excl)
 {
 	handle_t *handle;
 	struct inode *inode, *inode_save;
@@ -2620,8 +2620,8 @@ static int ext4_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 	credits = (EXT4_DATA_TRANS_BLOCKS(dir->i_sb) +
 		   EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3);
 retry:
-	inode = ext4_new_inode_start_handle(dir, mode, &dentry->d_name, 0,
-					    NULL, EXT4_HT_DIR, credits);
+	inode = ext4_new_inode_start_handle(user_ns, dir, mode, &dentry->d_name,
+					    0, NULL, EXT4_HT_DIR, credits);
 	handle = ext4_journal_current_handle();
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
@@ -2643,8 +2643,14 @@ static int ext4_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 	return err;
 }
 
-static int ext4_mknod(struct inode *dir, struct dentry *dentry,
-		      umode_t mode, dev_t rdev)
+static int ext4_create(struct inode *dir, struct dentry *dentry, umode_t mode,
+		       bool excl)
+{
+	return __ext4_create(&init_user_ns, dir, dentry, mode, excl);
+}
+
+static int __ext4_mknod(struct user_namespace *user_ns, struct inode *dir,
+			struct dentry *dentry, umode_t mode, dev_t rdev)
 {
 	handle_t *handle;
 	struct inode *inode, *inode_save;
@@ -2657,8 +2663,8 @@ static int ext4_mknod(struct inode *dir, struct dentry *dentry,
 	credits = (EXT4_DATA_TRANS_BLOCKS(dir->i_sb) +
 		   EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3);
 retry:
-	inode = ext4_new_inode_start_handle(dir, mode, &dentry->d_name, 0,
-					    NULL, EXT4_HT_DIR, credits);
+	inode = ext4_new_inode_start_handle(user_ns, dir, mode, &dentry->d_name,
+					    0, NULL, EXT4_HT_DIR, credits);
 	handle = ext4_journal_current_handle();
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
@@ -2680,7 +2686,14 @@ static int ext4_mknod(struct inode *dir, struct dentry *dentry,
 	return err;
 }
 
-static int ext4_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
+static int ext4_mknod(struct inode *dir, struct dentry *dentry,
+			umode_t mode, dev_t rdev)
+{
+	return __ext4_mknod(&init_user_ns, dir, dentry, mode, rdev);
+}
+
+static int __ext4_tmpfile(struct user_namespace *user_ns, struct inode *dir,
+			  struct dentry *dentry, umode_t mode)
 {
 	handle_t *handle;
 	struct inode *inode;
@@ -2691,7 +2704,7 @@ static int ext4_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return err;
 
 retry:
-	inode = ext4_new_inode_start_handle(dir, mode,
+	inode = ext4_new_inode_start_handle(user_ns, dir, mode,
 					    NULL, 0, NULL,
 					    EXT4_HT_DIR,
 			EXT4_MAXQUOTAS_INIT_BLOCKS(dir->i_sb) +
@@ -2720,6 +2733,11 @@ static int ext4_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 	return err;
 }
 
+static int ext4_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	return __ext4_tmpfile(&init_user_ns, dir, dentry, mode);
+}
+
 struct ext4_dir_entry_2 *ext4_init_dot_dotdot(struct inode *inode,
 			  struct ext4_dir_entry_2 *de,
 			  int blocksize, int csum_size,
@@ -2789,7 +2807,8 @@ int ext4_init_new_dir(handle_t *handle, struct inode *dir,
 	return err;
 }
 
-static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+static int __ext4_mkdir(struct user_namespace *user_ns, struct inode *dir,
+			struct dentry *dentry, umode_t mode)
 {
 	handle_t *handle;
 	struct inode *inode;
@@ -2805,7 +2824,7 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	credits = (EXT4_DATA_TRANS_BLOCKS(dir->i_sb) +
 		   EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3);
 retry:
-	inode = ext4_new_inode_start_handle(dir, S_IFDIR | mode,
+	inode = ext4_new_inode_start_handle(user_ns, dir, S_IFDIR | mode,
 					    &dentry->d_name,
 					    0, NULL, EXT4_HT_DIR, credits);
 	handle = ext4_journal_current_handle();
@@ -2853,6 +2872,10 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	return err;
 }
 
+static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	return __ext4_mkdir(&init_user_ns, dir, dentry, mode);
+}
 /*
  * routine to check that the specified directory is empty (for rmdir)
  */
@@ -3301,8 +3324,8 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
 	return retval;
 }
 
-static int ext4_symlink(struct inode *dir,
-			struct dentry *dentry, const char *symname)
+static int __ext4_symlink(struct user_namespace *user_ns, struct inode *dir,
+			  struct dentry *dentry, const char *symname)
 {
 	handle_t *handle;
 	struct inode *inode;
@@ -3342,7 +3365,7 @@ static int ext4_symlink(struct inode *dir,
 			  EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3;
 	}
 
-	inode = ext4_new_inode_start_handle(dir, S_IFLNK|S_IRWXUGO,
+	inode = ext4_new_inode_start_handle(user_ns, dir, S_IFLNK|S_IRWXUGO,
 					    &dentry->d_name, 0, NULL,
 					    EXT4_HT_DIR, credits);
 	handle = ext4_journal_current_handle();
@@ -3431,6 +3454,12 @@ static int ext4_symlink(struct inode *dir,
 	return err;
 }
 
+static int ext4_symlink(struct inode *dir, struct dentry *dentry,
+			const char *symname)
+{
+	return __ext4_symlink(&init_user_ns, dir, dentry, symname);
+}
+
 int __ext4_link(struct inode *dir, struct inode *inode, struct dentry *dentry)
 {
 	handle_t *handle;
@@ -3674,7 +3703,8 @@ static void ext4_update_dir_count(handle_t *handle, struct ext4_renament *ent)
 	}
 }
 
-static struct inode *ext4_whiteout_for_rename(struct ext4_renament *ent,
+static struct inode *ext4_whiteout_for_rename(struct user_namespace *user_ns,
+					      struct ext4_renament *ent,
 					      int credits, handle_t **h)
 {
 	struct inode *wh;
@@ -3688,7 +3718,8 @@ static struct inode *ext4_whiteout_for_rename(struct ext4_renament *ent,
 	credits += (EXT4_MAXQUOTAS_TRANS_BLOCKS(ent->dir->i_sb) +
 		    EXT4_XATTR_TRANS_BLOCKS + 4);
 retry:
-	wh = ext4_new_inode_start_handle(ent->dir, S_IFCHR | WHITEOUT_MODE,
+	wh = ext4_new_inode_start_handle(user_ns, ent->dir,
+					 S_IFCHR | WHITEOUT_MODE,
 					 &ent->dentry->d_name, 0, NULL,
 					 EXT4_HT_DIR, credits);
 
@@ -3715,9 +3746,9 @@ static struct inode *ext4_whiteout_for_rename(struct ext4_renament *ent,
  * while new_{dentry,inode) refers to the destination dentry/inode
  * This comes from rename(const char *oldpath, const char *newpath)
  */
-static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
-		       struct inode *new_dir, struct dentry *new_dentry,
-		       unsigned int flags)
+static int ext4_rename(struct user_namespace *user_ns, struct inode *old_dir,
+		       struct dentry *old_dentry, struct inode *new_dir,
+		       struct dentry *new_dentry, unsigned int flags)
 {
 	handle_t *handle = NULL;
 	struct ext4_renament old = {
@@ -3801,7 +3832,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
 			goto end_rename;
 		}
 	} else {
-		whiteout = ext4_whiteout_for_rename(&old, credits, &handle);
+		whiteout = ext4_whiteout_for_rename(user_ns, &old, credits, &handle);
 		if (IS_ERR(whiteout)) {
 			retval = PTR_ERR(whiteout);
 			whiteout = NULL;
@@ -4093,9 +4124,9 @@ static int ext4_cross_rename(struct inode *old_dir, struct dentry *old_dentry,
 	return retval;
 }
 
-static int ext4_rename2(struct inode *old_dir, struct dentry *old_dentry,
-			struct inode *new_dir, struct dentry *new_dentry,
-			unsigned int flags)
+static int __ext4_rename2(struct user_namespace *user_ns, struct inode *old_dir,
+			  struct dentry *old_dentry, struct inode *new_dir,
+			  struct dentry *new_dentry, unsigned int flags)
 {
 	int err;
 
@@ -4115,9 +4146,59 @@ static int ext4_rename2(struct inode *old_dir, struct dentry *old_dentry,
 					 new_dir, new_dentry);
 	}
 
-	return ext4_rename(old_dir, old_dentry, new_dir, new_dentry, flags);
+	return ext4_rename(user_ns, old_dir, old_dentry, new_dir, new_dentry, flags);
+}
+
+static int ext4_rename2(struct inode *old_dir, struct dentry *old_dentry,
+			struct inode *new_dir, struct dentry *new_dentry,
+			unsigned int flags)
+{
+	return __ext4_rename2(&init_user_ns, old_dir, old_dentry, new_dir,
+			      new_dentry, flags);
+}
+
+#ifdef CONFIG_IDMAP_MOUNTS
+static int ext4_create_mapped(struct user_namespace *user_ns, struct inode *dir,
+			      struct dentry *dentry, umode_t mode, bool excl)
+{
+	return __ext4_create(user_ns, dir, dentry, mode, excl);
 }
 
+static int ext4_mknod_mapped(struct user_namespace *user_ns, struct inode *dir,
+			     struct dentry *dentry, umode_t mode, dev_t rdev)
+{
+	return __ext4_mknod(user_ns, dir, dentry, mode, rdev);
+}
+static int ext4_tmpfile_mapped(struct user_namespace *user_ns,
+			       struct inode *dir, struct dentry *dentry,
+			       umode_t mode)
+{
+	return __ext4_tmpfile(user_ns, dir, dentry, mode);
+}
+
+static int ext4_mkdir_mapped(struct user_namespace *user_ns, struct inode *dir,
+			     struct dentry *dentry, umode_t mode)
+{
+	return __ext4_mkdir(user_ns, dir, dentry, mode);
+}
+
+static int ext4_symlink_mapped(struct user_namespace *user_ns,
+			       struct inode *dir, struct dentry *dentry,
+			       const char *symname)
+{
+	return __ext4_symlink(user_ns, dir, dentry, symname);
+}
+
+static int ext4_rename2_mapped(struct user_namespace *user_ns,
+			       struct inode *old_dir, struct dentry *old_dentry,
+			       struct inode *new_dir, struct dentry *new_dentry,
+			       unsigned int flags)
+{
+	return __ext4_rename2(user_ns, old_dir, old_dentry, new_dir,
+			      new_dentry, flags);
+}
+#endif
+
 /*
  * directories can handle most operations...
  */
@@ -4138,6 +4219,16 @@ const struct inode_operations ext4_dir_inode_operations = {
 	.get_acl	= ext4_get_acl,
 	.set_acl	= ext4_set_acl,
 	.fiemap         = ext4_fiemap,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.create_mapped	= ext4_create_mapped,
+	.symlink_mapped	= ext4_symlink_mapped,
+	.mkdir_mapped	= ext4_mkdir_mapped,
+	.mknod_mapped	= ext4_mknod_mapped,
+	.tmpfile_mapped	= ext4_tmpfile_mapped,
+	.rename_mapped	= ext4_rename2_mapped,
+	.setattr_mapped	= ext4_setattr_mapped,
+	.set_acl_mapped	= ext4_set_acl_mapped,
+#endif
 };
 
 const struct inode_operations ext4_special_inode_operations = {
@@ -4146,4 +4237,8 @@ const struct inode_operations ext4_special_inode_operations = {
 	.listxattr	= ext4_listxattr,
 	.get_acl	= ext4_get_acl,
 	.set_acl	= ext4_set_acl,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.setattr_mapped	= ext4_setattr_mapped,
+	.set_acl_mapped	= ext4_set_acl_mapped,
+#endif
 };
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2fe141ff3c7e..fcbd9b8f7613 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -6641,7 +6641,11 @@ static struct file_system_type ext4_fs_type = {
 	.name		= "ext4",
 	.mount		= ext4_mount,
 	.kill_sb	= kill_block_super,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.fs_flags	= FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+#else
 	.fs_flags	= FS_REQUIRES_DEV,
+#endif
 };
 MODULE_ALIAS_FS("ext4");
 
diff --git a/fs/ext4/symlink.c b/fs/ext4/symlink.c
index dd05af983092..fced0a321c95 100644
--- a/fs/ext4/symlink.c
+++ b/fs/ext4/symlink.c
@@ -57,6 +57,9 @@ const struct inode_operations ext4_encrypted_symlink_inode_operations = {
 	.setattr	= ext4_setattr,
 	.getattr	= ext4_getattr,
 	.listxattr	= ext4_listxattr,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.setattr_mapped	= ext4_setattr_mapped,
+#endif
 };
 
 const struct inode_operations ext4_symlink_inode_operations = {
@@ -64,6 +67,9 @@ const struct inode_operations ext4_symlink_inode_operations = {
 	.setattr	= ext4_setattr,
 	.getattr	= ext4_getattr,
 	.listxattr	= ext4_listxattr,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.setattr_mapped	= ext4_setattr_mapped,
+#endif
 };
 
 const struct inode_operations ext4_fast_symlink_inode_operations = {
@@ -71,4 +77,7 @@ const struct inode_operations ext4_fast_symlink_inode_operations = {
 	.setattr	= ext4_setattr,
 	.getattr	= ext4_getattr,
 	.listxattr	= ext4_listxattr,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.setattr_mapped	= ext4_setattr_mapped,
+#endif
 };
diff --git a/fs/ext4/xattr_hurd.c b/fs/ext4/xattr_hurd.c
index 8cfa74a56361..e9dfeba4c340 100644
--- a/fs/ext4/xattr_hurd.c
+++ b/fs/ext4/xattr_hurd.c
@@ -31,10 +31,11 @@ ext4_xattr_hurd_get(const struct xattr_handler *handler,
 }
 
 static int
-ext4_xattr_hurd_set(const struct xattr_handler *handler,
-		    struct dentry *unused, struct inode *inode,
-		    const char *name, const void *value,
-		    size_t size, int flags)
+ext4_xattr_hurd_set_mapped(const struct xattr_handler *handler,
+			   struct user_namespace *user_ns,
+			   struct dentry *unused, struct inode *inode,
+			   const char *name, const void *value,
+			   size_t size, int flags)
 {
 	if (!test_opt(inode->i_sb, XATTR_USER))
 		return -EOPNOTSUPP;
@@ -43,9 +44,22 @@ ext4_xattr_hurd_set(const struct xattr_handler *handler,
 			      name, value, size, flags);
 }
 
+static int
+ext4_xattr_hurd_set(const struct xattr_handler *handler,
+		    struct dentry *unused, struct inode *inode,
+		    const char *name, const void *value,
+		    size_t size, int flags)
+{
+	return ext4_xattr_hurd_set_mapped(handler, &init_user_ns, unused, inode,
+					  name, value, size, flags);
+}
+
 const struct xattr_handler ext4_xattr_hurd_handler = {
 	.prefix	= XATTR_HURD_PREFIX,
 	.list	= ext4_xattr_hurd_list,
 	.get	= ext4_xattr_hurd_get,
 	.set	= ext4_xattr_hurd_set,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.set_mapped = ext4_xattr_hurd_set_mapped,
+#endif
 };
diff --git a/fs/ext4/xattr_security.c b/fs/ext4/xattr_security.c
index 197a9d8a15ef..0772b1c1b140 100644
--- a/fs/ext4/xattr_security.c
+++ b/fs/ext4/xattr_security.c
@@ -21,14 +21,25 @@ ext4_xattr_security_get(const struct xattr_handler *handler,
 			      name, buffer, size);
 }
 
+static int
+ext4_xattr_security_set_mapped(const struct xattr_handler *handler,
+			       struct user_namespace *user_ns,
+			       struct dentry *unused, struct inode *inode,
+			       const char *name, const void *value,
+			       size_t size, int flags)
+{
+	return ext4_xattr_set(inode, EXT4_XATTR_INDEX_SECURITY,
+			      name, value, size, flags);
+}
+
 static int
 ext4_xattr_security_set(const struct xattr_handler *handler,
 			struct dentry *unused, struct inode *inode,
 			const char *name, const void *value,
 			size_t size, int flags)
 {
-	return ext4_xattr_set(inode, EXT4_XATTR_INDEX_SECURITY,
-			      name, value, size, flags);
+	return ext4_xattr_security_set_mapped(handler, &init_user_ns, unused,
+					      inode, name, value, size, flags);
 }
 
 static int
@@ -62,4 +73,7 @@ const struct xattr_handler ext4_xattr_security_handler = {
 	.prefix	= XATTR_SECURITY_PREFIX,
 	.get	= ext4_xattr_security_get,
 	.set	= ext4_xattr_security_set,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.set_mapped = ext4_xattr_security_set_mapped,
+#endif
 };
diff --git a/fs/ext4/xattr_trusted.c b/fs/ext4/xattr_trusted.c
index e9389e5d75c3..aa407b60ce79 100644
--- a/fs/ext4/xattr_trusted.c
+++ b/fs/ext4/xattr_trusted.c
@@ -28,14 +28,25 @@ ext4_xattr_trusted_get(const struct xattr_handler *handler,
 			      name, buffer, size);
 }
 
+static int
+ext4_xattr_trusted_set_mapped(const struct xattr_handler *handler,
+			      struct user_namespace *user_ns,
+			      struct dentry *unused, struct inode *inode,
+			      const char *name, const void *value,
+			      size_t size, int flags)
+{
+	return ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED,
+			      name, value, size, flags);
+}
+
 static int
 ext4_xattr_trusted_set(const struct xattr_handler *handler,
 		       struct dentry *unused, struct inode *inode,
 		       const char *name, const void *value,
 		       size_t size, int flags)
 {
-	return ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED,
-			      name, value, size, flags);
+	return ext4_xattr_trusted_set_mapped(handler, &init_user_ns, unused,
+					     inode, name, value, size, flags);
 }
 
 const struct xattr_handler ext4_xattr_trusted_handler = {
@@ -43,4 +54,7 @@ const struct xattr_handler ext4_xattr_trusted_handler = {
 	.list	= ext4_xattr_trusted_list,
 	.get	= ext4_xattr_trusted_get,
 	.set	= ext4_xattr_trusted_set,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.set_mapped = ext4_xattr_trusted_set_mapped,
+#endif
 };
-- 
2.29.0



* [PATCH 31/34] expfs: handle idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (29 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 30/34] ext4: support " Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  0:32 ` [PATCH 32/34] overlayfs: handle idmapped lower directories Christian Brauner
                   ` (6 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

In follow-up patches we will port overlayfs to support idmapped mounts. Since
overlayfs makes use of expfs, port expfs to handle idmapped mounts by using our
newly introduced idmapped mount aware lookup helpers.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/exportfs/expfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 2dd55b172d57..8066e8bdc912 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -145,7 +145,7 @@ static struct dentry *reconnect_one(struct vfsmount *mnt,
 	if (err)
 		goto out_err;
 	dprintk("%s: found name: %s\n", __func__, nbuf);
-	tmp = lookup_one_len_unlocked(nbuf, parent, strlen(nbuf));
+	tmp = lookup_one_len_mapped_unlocked(nbuf, parent, strlen(nbuf), mnt_user_ns(mnt));
 	if (IS_ERR(tmp)) {
 		dprintk("%s: lookup failed: %d\n", __func__, PTR_ERR(tmp));
 		err = PTR_ERR(tmp);
@@ -525,7 +525,7 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
 		}
 
 		inode_lock(target_dir->d_inode);
-		nresult = lookup_one_len(nbuf, target_dir, strlen(nbuf));
+		nresult = lookup_one_len_mapped(nbuf, target_dir, strlen(nbuf), mnt_user_ns(mnt));
 		if (!IS_ERR(nresult)) {
 			if (unlikely(nresult->d_inode != result->d_inode)) {
 				dput(nresult);
-- 
2.29.0



* [PATCH 32/34] overlayfs: handle idmapped lower directories
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (30 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 31/34] expfs: handle " Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-30 11:10   ` Amir Goldstein
  2020-10-29  0:32 ` [PATCH 33/34] overlayfs: handle idmapped merged mounts Christian Brauner
                   ` (5 subsequent siblings)
  37 siblings, 1 reply; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

As an overlay filesystem overlayfs can be mounted on top of other filesystems
and bind-mounts. This means it can also be bind-mounted on top of one or
multiple idmapped lower directories and/or an idmapped upper directory.
In previous patches we've enabled the vfs to handle idmapped mounts and so we
should have all of the helpers available to let overlayfs handle idmapped
mounts. To handle such scenarios correctly overlayfs needs to be switched from
non-idmapped mount aware vfs helpers to idmapped mount aware vfs helpers.
In order to have overlayfs correctly support idmapped mounts as lower and upper
directories we need to pass down the mount's user namespace associated with the
lower and upper directories whenever we perform idmapped mount aware operations.

Luckily, when overlayfs is mounted it creates private mounts of the lower and
upper directories via clone_private_mount() which calls clone_mnt() internally.
If any of the lower or upper directories are on an idmapped mount then
clone_mnt() called in clone_private_mount() will also pin the user namespace
the vfsmount has been marked with. Overlayfs stashes the information about the
lower and upper directories and the mounts that they are on so that this
information can be retrieved when needed. This makes it possible to support
idmapped mounts as lower and upper directories. Support for idmapped merged
mounts will be added in a follow-up patch.

Whenever we perform idmapped mount aware operations we need to pass down the
mount's user namespace to the vfs helpers we've introduced in earlier patches.
Permission checks on the lower and upper directories are performed by switching
from the inode_permission() and inode_owner_or_capable() helpers to the new
mapped_inode_permission() and mapped_inode_owner_or_capable() helpers.
Similarly we switch from non-idmapped mount aware lookup helpers to
idmapped-mount aware lookup helpers. In all cases where we need to check
permissions in the lower or upper directories we pass down the mount associated
with the lower and upper directory at the time of creating the overlayfs mount.
This nicely lines up with the permission model outlined in the overlayfs
documentation (Special thanks to Amir for pointing me to this document!).
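
To illustrate the idea behind a mapped ownership check, here is a minimal
userspace sketch. It is not the kernel implementation: struct idmap_extent,
disk_to_mount() and model_owner_matches() are hypothetical names made up for
this illustration, and the model assumes a single contiguous mapping range and
ignores capabilities entirely. It only shows that the inode's on-disk uid is
first translated through the mount's mapping and then compared against the
caller's fsuid.

```c
#include <stdbool.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* Hypothetical model of one contiguous id mapping attached to a mount. */
struct idmap_extent {
	uint32_t disk_first;	/* first id as stored on disk */
	uint32_t mount_first;	/* first id as seen through the mount */
	uint32_t count;		/* number of ids covered by the mapping */
};

/* Translate an on-disk id into the id visible through the mount. */
static uint32_t disk_to_mount(const struct idmap_extent *m, uint32_t id)
{
	if (id < m->disk_first || id - m->disk_first >= m->count)
		return INVALID_ID;	/* id is not covered by the mapping */
	return m->mount_first + (id - m->disk_first);
}

/*
 * Model of the ownership part of a mapped permission check: the caller
 * counts as owner if its fsuid equals the inode's uid as seen through
 * the mount. Unmapped inodes never match.
 */
static bool model_owner_matches(const struct idmap_extent *m,
				uint32_t inode_disk_uid, uint32_t caller_fsuid)
{
	uint32_t mapped = disk_to_mount(m, inode_disk_uid);

	return mapped != INVALID_ID && mapped == caller_fsuid;
}
```

With a mapping of { 10000, 0, 10000 }, a caller with fsuid 1000 would match an
inode stored with uid 11000, while a caller with fsuid 11000 would not.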

Thanks to Amir for pointing me to the overlayfs permission model documentation!

A very special thank you to my friend Seth Forshee who has given invaluable
advice when coming up with these patches!

As an example let's create an overlayfs mount in the initial user namespace with
an idmapped lower and upper mount:

 # This is a directory where all file ownership starts with uid and gid 10000.
 root@f2-vm:/# ls -al /var/lib/lxc/f1/rootfs
 total 108
 drwxr-xr-x  20 10000 10000  4096 Oct 28 11:13 .
 drwxrwx---   4 10000 10000  4096 Oct 28 11:17 ..
 -rw-r--r--   1 10000 10000  7197 Oct 24 09:45 asdf
 drwxr-xr-x   2 10000 10000  4096 Oct 16 19:07 ASDF
 lrwxrwxrwx   1 10000 10000     7 Sep 24 07:43 bin -> usr/bin
 drwxr-xr-x   2 10000 10000  4096 Apr 15  2020 boot
 -rw-r--r--   1 10000 10000 13059 Oct  8 12:38 ccc
 drwxr-xr-x   2 11000 11000  4096 Oct 23 17:10 ddd
 drwxr-xr-x   3 10000 10000  4096 Sep 25 08:04 dev
 drwxr-xr-x  61 10000 10000  4096 Sep 25 08:04 etc

 # Create an idmapped mount on the host such that all files owned by uid and
 # gid 10000 show up as being owned by uid 0 and gid 0.
 /mount2 --idmap both:10000:0:10000 /var/lib/lxc/f1/rootfs/ /lower1/

 # Verify that the files show up as uid and gid 0 under the idmapped mount at /lower1
 root@f2-vm:/# ls -al /lower1/
 total 108
 drwxr-xr-x  20 root   root    4096 Oct 28 11:13 .
 drwxr-xr-x  29 root   root    4096 Oct 28 11:57 ..
 -rw-r--r--   1 root   root    7197 Oct 24 09:45 asdf
 drwxr-xr-x   2 root   root    4096 Oct 16 19:07 ASDF
 lrwxrwxrwx   1 root   root       7 Sep 24 07:43 bin -> usr/bin
 drwxr-xr-x   2 root   root    4096 Apr 15  2020 boot
 -rw-r--r--   1 root   root   13059 Oct  8 12:38 ccc
 drwxr-xr-x   2 ubuntu ubuntu  4096 Oct 23 17:10 ddd
 drwxr-xr-x   3 root   root    4096 Sep 25 08:04 dev
 drwxr-xr-x  61 root   root    4096 Sep 25 08:04 etc

 # Create an idmapped upper mount at /upper. Now, files created as id 0 will
 # show up as id 10000 in /upper and files created as id 1000 will show up as
 # id 11000 under /upper.
 /mount2 --idmap both:10000:0:10000 /upper /upper
 mkdir /upper/upper
 mkdir /upper/work

 # Create an overlayfs mount.
 mount -t overlay overlay -o lowerdir=/lower1/,upperdir=/upper/upper/,workdir=/upper/work/ /merged/

 root@f2-vm:/# ls -al /merged/
 total 124
 drwxr-xr-x   1 root   root    4096 Oct 25 23:04 .
 drwxr-xr-x  29 root   root    4096 Oct 28 12:07 ..
 -rw-r--r--   1 root   root    7197 Oct 24 09:45 asdf
 drwxr-xr-x   2 root   root    4096 Oct 16 19:07 ASDF
 lrwxrwxrwx   1 root   root       7 Sep 24 07:43 bin -> usr/bin
 drwxr-xr-x   2 root   root    4096 Apr 15  2020 boot
 -rw-r--r--   1 root   root   13059 Oct  8 12:38 ccc
 drwxr-xr-x   2 ubuntu ubuntu  4096 Oct 23 17:10 ddd
 drwxr-xr-x   3 root   root    4096 Sep 25 08:04 dev
 drwxr-xr-x  61 root   root    4096 Sep 25 08:04 etc

 # Create a file as root
 root@f2-vm:/merged# touch /merged/A-FILE

 root@f2-vm:/merged# ls -al /merged/A-FILE
 -rw-r--r-- 1 root root 0 Oct 28 12:16 /merged/A-FILE

 # Chown the file to a simple user
 root@f2-vm:/merged# chown 1000:1000 /merged/A-FILE

 root@f2-vm:/merged# ls -al /merged/A-FILE
 -rw-r--r-- 1 ubuntu ubuntu 0 Oct 28 12:16 /merged/A-FILE

 # Create a directory and delegate to simple user
 root@f2-vm:/merged# mkdir /merged/A-DIR

 root@f2-vm:/merged# chown 1000:1000 /merged/A-DIR/

 # Login as user
 root@f2-vm:/merged# sudo -u ubuntu -- bash -i

 # Create a file as simple user
 ubuntu@f2-vm:/merged$ touch /merged/A-DIR/A-USER-FILE

 ubuntu@f2-vm:/merged$ ls -al /merged/A-DIR/A-USER-FILE
 -rw-rw-r-- 1 ubuntu ubuntu 0 Oct 28 12:18 /merged/A-DIR/A-USER-FILE

 # Let's look at these files in our idmapped upper directory
 ubuntu@f2-vm:/$ ls -alR /upper/upper/
 /upper/upper/:
 total 12
 drwxr-xr-x 3 root   root   4096 Oct 28 12:23 .
 drwxr-xr-x 4 root   root   4096 Oct 21 13:48 ..
 drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 12:18 A-DIR
 -rw-r--r-- 1 ubuntu ubuntu    0 Oct 28 12:16 A-FILE

 /upper/upper/A-DIR:
 total 8
 drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 12:18 .
 drwxr-xr-x 3 root   root   4096 Oct 28 12:23 ..
 -rw-rw-r-- 1 ubuntu ubuntu    0 Oct 28 12:18 A-USER-FILE

 # Let's remove the idmapped /upper mount (overlayfs will have its own private mount anyway)
 umount /upper

 # Let's look at these files in our upper directory with the idmapped mount removed
 ubuntu@f2-vm:/$ ls -alR /upper/upper/
 /upper/upper/:
 total 12
 drwxr-xr-x 3 10000 10000 4096 Oct 28 12:23 .
 drwxr-xr-x 4 10000 10000 4096 Oct 21 13:48 ..
 drwxr-xr-x 2 11000 11000 4096 Oct 28 12:18 A-DIR
 -rw-r--r-- 1 11000 11000    0 Oct 28 12:16 A-FILE

 /upper/upper/A-DIR:
 total 8
 drwxr-xr-x 2 11000 11000 4096 Oct 28 12:18 .
 drwxr-xr-x 3 10000 10000 4096 Oct 28 12:23 ..
 -rw-rw-r-- 1 11000 11000    0 Oct 28 12:18 A-USER-FILE

 # Let's create a few acls from the /merged directory on an already existing file
 # triggering a copy-up operation
  root@f2-vm:/merged# setfacl -m u:1000:rwx /merged/asdf
  root@f2-vm:/merged# getfacl /merged/asdf
  getfacl: Removing leading '/' from absolute path names
  # file: merged/asdf
  # owner: root
  # group: root
  user::rw-
  user:ubuntu:rwx
  group::r--
  mask::rwx
  other::r--

  # Let's look at this file from our upper directory
  root@f2-vm:/merged# getfacl /upper/upper/asdf
  getfacl: Removing leading '/' from absolute path names
  # file: upper/upper/asdf
  # owner: 10000
  # group: 10000
  user::rw-
  user:11000:rwx
  group::r--
  mask::rwx
  other::r--
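
The id arithmetic in the transcript above can be reproduced with a small
userspace sketch. This is only an illustration: the field order of the
"both:10000:0:10000" argument (on-disk base, mount-visible base, range) is an
assumption inferred from the comments in the example, and parse_both_spec(),
mount_to_disk() and disk_to_mount() are hypothetical helpers, not part of
mount2 or the kernel.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical parsed form of "both:<disk>:<mount>:<count>". */
struct idmap_spec {
	uint32_t disk_first;	/* first id as stored on disk */
	uint32_t mount_first;	/* first id as seen through the mount */
	uint32_t count;		/* number of ids covered */
};

/* Parse the option string; returns 0 on success and -1 on malformed input. */
static int parse_both_spec(const char *s, struct idmap_spec *out)
{
	unsigned int disk, mnt, count;
	char tail;

	/* A fourth successful conversion would mean trailing garbage. */
	if (sscanf(s, "both:%u:%u:%u%c", &disk, &mnt, &count, &tail) != 3)
		return -1;
	out->disk_first = disk;
	out->mount_first = mnt;
	out->count = count;
	return 0;
}

/* Id used by a caller on the mount -> id written to disk, -1 if unmapped. */
static int64_t mount_to_disk(const struct idmap_spec *s, uint32_t id)
{
	if (id < s->mount_first || id - s->mount_first >= s->count)
		return -1;
	return (int64_t)s->disk_first + (id - s->mount_first);
}

/* Id stored on disk -> id shown through the mount, -1 if unmapped. */
static int64_t disk_to_mount(const struct idmap_spec *s, uint32_t id)
{
	if (id < s->disk_first || id - s->disk_first >= s->count)
		return -1;
	return (int64_t)s->mount_first + (id - s->disk_first);
}
```

Under these assumptions, mount_to_disk() turns id 0 into 10000 and id 1000 into
11000, matching the ownership seen in /upper/upper after the idmapped mount was
removed, and disk_to_mount() turns 11000 back into 1000, matching the "ubuntu"
entries shown through the idmapped mounts.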

Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/overlayfs/copy_up.c   | 100 +++++++++++++++-----------
 fs/overlayfs/dir.c       | 151 ++++++++++++++++++++++-----------------
 fs/overlayfs/export.c    |   3 +-
 fs/overlayfs/file.c      |  23 +++---
 fs/overlayfs/inode.c     |  89 ++++++++++++++++++-----
 fs/overlayfs/namei.c     |  64 ++++++++++-------
 fs/overlayfs/overlayfs.h | 149 ++++++++++++++++++++++++++------------
 fs/overlayfs/ovl_entry.h |   1 +
 fs/overlayfs/readdir.c   |  34 +++++----
 fs/overlayfs/super.c     | 106 ++++++++++++++++++++-------
 fs/overlayfs/util.c      |  38 +++++-----
 11 files changed, 494 insertions(+), 264 deletions(-)

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 955ecd4030f0..1b8721796fd4 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -43,7 +43,8 @@ static bool ovl_must_copy_xattr(const char *name)
 	       !strncmp(name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN);
 }
 
-int ovl_copy_xattr(struct super_block *sb, struct dentry *old,
+int ovl_copy_xattr(struct super_block *sb, struct user_namespace *old_user_ns,
+		   struct dentry *old, struct user_namespace *new_user_ns,
 		   struct dentry *new)
 {
 	ssize_t list_size, size, value_size = 0;
@@ -85,9 +86,9 @@ int ovl_copy_xattr(struct super_block *sb, struct dentry *old,
 		if (ovl_is_private_xattr(sb, name))
 			continue;
 retry:
-		size = vfs_getxattr(old, name, value, value_size);
+		size = vfs_mapped_getxattr(old_user_ns, old, name, value, value_size);
 		if (size == -ERANGE)
-			size = vfs_getxattr(old, name, NULL, 0);
+			size = vfs_mapped_getxattr(old_user_ns, old, name, NULL, 0);
 
 		if (size < 0) {
 			error = size;
@@ -114,7 +115,7 @@ int ovl_copy_xattr(struct super_block *sb, struct dentry *old,
 			error = 0;
 			continue; /* Discard */
 		}
-		error = vfs_setxattr(new, name, value, size, 0);
+		error = vfs_mapped_setxattr(new_user_ns, new, name, value, size, 0);
 		if (error) {
 			if (error != -EOPNOTSUPP || ovl_must_copy_xattr(name))
 				break;
@@ -228,17 +229,19 @@ static int ovl_copy_up_data(struct ovl_fs *ofs, struct path *old,
 	return error;
 }
 
-static int ovl_set_size(struct dentry *upperdentry, struct kstat *stat)
+static int ovl_set_size(struct user_namespace *user_ns,
+			struct dentry *upperdentry, struct kstat *stat)
 {
 	struct iattr attr = {
 		.ia_valid = ATTR_SIZE,
 		.ia_size = stat->size,
 	};
 
-	return notify_change(upperdentry, &attr, NULL);
+	return notify_mapped_change(user_ns, upperdentry, &attr, NULL);
 }
 
-static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
+static int ovl_set_timestamps(struct user_namespace *user_ns,
+			      struct dentry *upperdentry, struct kstat *stat)
 {
 	struct iattr attr = {
 		.ia_valid =
@@ -247,10 +250,11 @@ static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
 		.ia_mtime = stat->mtime,
 	};
 
-	return notify_change(upperdentry, &attr, NULL);
+	return notify_mapped_change(user_ns, upperdentry, &attr, NULL);
 }
 
-int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
+int ovl_set_attr(struct user_namespace *user_ns, struct dentry *upperdentry,
+		 struct kstat *stat)
 {
 	int err = 0;
 
@@ -259,7 +263,7 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
 			.ia_valid = ATTR_MODE,
 			.ia_mode = stat->mode,
 		};
-		err = notify_change(upperdentry, &attr, NULL);
+		err = notify_mapped_change(user_ns, upperdentry, &attr, NULL);
 	}
 	if (!err) {
 		struct iattr attr = {
@@ -267,10 +271,10 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
 			.ia_uid = stat->uid,
 			.ia_gid = stat->gid,
 		};
-		err = notify_change(upperdentry, &attr, NULL);
+		err = notify_mapped_change(user_ns, upperdentry, &attr, NULL);
 	}
 	if (!err)
-		ovl_set_timestamps(upperdentry, stat);
+		ovl_set_timestamps(user_ns, upperdentry, stat);
 
 	return err;
 }
@@ -356,8 +360,8 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
 }
 
 /* Store file handle of @upper dir in @index dir entry */
-static int ovl_set_upper_fh(struct ovl_fs *ofs, struct dentry *upper,
-			    struct dentry *index)
+static int ovl_set_upper_fh(struct ovl_fs *ofs, struct user_namespace *user_ns,
+			    struct dentry *upper, struct dentry *index)
 {
 	const struct ovl_fh *fh;
 	int err;
@@ -377,7 +381,8 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct dentry *upper,
  *
  * Caller must hold i_mutex on indexdir.
  */
-static int ovl_create_index(struct dentry *dentry, struct dentry *origin,
+static int ovl_create_index(struct user_namespace *user_ns,
+			    struct dentry *dentry, struct dentry *origin,
 			    struct dentry *upper)
 {
 	struct dentry *indexdir = ovl_indexdir(dentry->d_sb);
@@ -406,25 +411,25 @@ static int ovl_create_index(struct dentry *dentry, struct dentry *origin,
 	if (err)
 		return err;
 
-	temp = ovl_create_temp(indexdir, OVL_CATTR(S_IFDIR | 0));
+	temp = ovl_create_temp(user_ns, indexdir, OVL_CATTR(S_IFDIR | 0));
 	err = PTR_ERR(temp);
 	if (IS_ERR(temp))
 		goto free_name;
 
-	err = ovl_set_upper_fh(OVL_FS(dentry->d_sb), upper, temp);
+	err = ovl_set_upper_fh(OVL_FS(dentry->d_sb), user_ns, upper, temp);
 	if (err)
 		goto out;
 
-	index = lookup_one_len(name.name, indexdir, name.len);
+	index = lookup_one_len_mapped(name.name, indexdir, name.len, user_ns);
 	if (IS_ERR(index)) {
 		err = PTR_ERR(index);
 	} else {
-		err = ovl_do_rename(dir, temp, dir, index, 0);
+		err = ovl_do_rename(dir, user_ns, temp, dir, user_ns, index, 0);
 		dput(index);
 	}
 out:
 	if (err)
-		ovl_cleanup(dir, temp);
+		ovl_cleanup(user_ns, dir, temp);
 	dput(temp);
 free_name:
 	kfree(name.name);
@@ -441,6 +446,7 @@ struct ovl_copy_up_ctx {
 	struct dentry *destdir;
 	struct qstr destname;
 	struct dentry *workdir;
+	struct user_namespace *user_ns;
 	bool origin;
 	bool indexed;
 	bool metacopy;
@@ -463,16 +469,17 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
 		return err;
 
 	inode_lock_nested(udir, I_MUTEX_PARENT);
-	upper = lookup_one_len(c->dentry->d_name.name, upperdir,
-			       c->dentry->d_name.len);
+	upper = lookup_one_len_mapped(c->dentry->d_name.name, upperdir,
+				  c->dentry->d_name.len,
+				  c->user_ns);
 	err = PTR_ERR(upper);
 	if (!IS_ERR(upper)) {
-		err = ovl_do_link(ovl_dentry_upper(c->dentry), udir, upper);
+		err = ovl_do_link(c->user_ns, ovl_dentry_upper(c->dentry), udir, upper);
 		dput(upper);
 
 		if (!err) {
 			/* Restore timestamps on parent (best effort) */
-			ovl_set_timestamps(upperdir, &c->pstat);
+			ovl_set_timestamps(c->user_ns, upperdir, &c->pstat);
 			ovl_dentry_set_upper_alias(c->dentry);
 		}
 	}
@@ -509,7 +516,8 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
 			return err;
 	}
 
-	err = ovl_copy_xattr(c->dentry->d_sb, c->lowerpath.dentry, temp);
+	err = ovl_copy_xattr(c->dentry->d_sb, mnt_user_ns(c->lowerpath.mnt),
+			     c->lowerpath.dentry, c->user_ns, temp);
 	if (err)
 		return err;
 
@@ -535,9 +543,9 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
 
 	inode_lock(temp->d_inode);
 	if (S_ISREG(c->stat.mode))
-		err = ovl_set_size(temp, &c->stat);
+		err = ovl_set_size(c->user_ns, temp, &c->stat);
 	if (!err)
-		err = ovl_set_attr(temp, &c->stat);
+		err = ovl_set_attr(c->user_ns, temp, &c->stat);
 	inode_unlock(temp->d_inode);
 
 	return err;
@@ -598,7 +606,7 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c)
 	if (err)
 		goto unlock;
 
-	temp = ovl_create_temp(c->workdir, &cattr);
+	temp = ovl_create_temp(c->user_ns, c->workdir, &cattr);
 	ovl_revert_cu_creds(&cc);
 
 	err = PTR_ERR(temp);
@@ -610,17 +618,18 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c)
 		goto cleanup;
 
 	if (S_ISDIR(c->stat.mode) && c->indexed) {
-		err = ovl_create_index(c->dentry, c->lowerpath.dentry, temp);
+		err = ovl_create_index(c->user_ns, c->dentry, c->lowerpath.dentry, temp);
 		if (err)
 			goto cleanup;
 	}
 
-	upper = lookup_one_len(c->destname.name, c->destdir, c->destname.len);
+	upper = lookup_one_len_mapped(c->destname.name, c->destdir, c->destname.len,
+				  c->user_ns);
 	err = PTR_ERR(upper);
 	if (IS_ERR(upper))
 		goto cleanup;
 
-	err = ovl_do_rename(wdir, temp, udir, upper, 0);
+	err = ovl_do_rename(wdir, c->user_ns, temp, udir, c->user_ns, upper, 0);
 	dput(upper);
 	if (err)
 		goto cleanup;
@@ -637,7 +646,7 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c)
 	return err;
 
 cleanup:
-	ovl_cleanup(wdir, temp);
+	ovl_cleanup(c->user_ns, wdir, temp);
 	dput(temp);
 	goto unlock;
 }
@@ -654,7 +663,7 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c)
 	if (err)
 		return err;
 
-	temp = ovl_do_tmpfile(c->workdir, c->stat.mode);
+	temp = ovl_do_tmpfile(c->user_ns, c->workdir, c->stat.mode);
 	ovl_revert_cu_creds(&cc);
 
 	if (IS_ERR(temp))
@@ -666,10 +675,11 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c)
 
 	inode_lock_nested(udir, I_MUTEX_PARENT);
 
-	upper = lookup_one_len(c->destname.name, c->destdir, c->destname.len);
+	upper = lookup_one_len_mapped(c->destname.name, c->destdir, c->destname.len,
+				  c->user_ns);
 	err = PTR_ERR(upper);
 	if (!IS_ERR(upper)) {
-		err = ovl_do_link(temp, udir, upper);
+		err = ovl_do_link(c->user_ns, temp, udir, upper);
 		dput(upper);
 	}
 	inode_unlock(udir);
@@ -757,7 +767,7 @@ static int ovl_do_copy_up(struct ovl_copy_up_ctx *c)
 
 		/* Restore timestamps on parent (best effort) */
 		inode_lock(udir);
-		ovl_set_timestamps(c->destdir, &c->pstat);
+		ovl_set_timestamps(c->user_ns, c->destdir, &c->pstat);
 		inode_unlock(udir);
 
 		ovl_dentry_set_upper_alias(c->dentry);
@@ -786,12 +796,13 @@ static bool ovl_need_meta_copy_up(struct dentry *dentry, umode_t mode,
 	return true;
 }
 
-static ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value)
+static ssize_t ovl_getxattr(struct user_namespace *user_ns,
+			    struct dentry *dentry, char *name, char **value)
 {
 	ssize_t res;
 	char *buf;
 
-	res = vfs_getxattr(dentry, name, NULL, 0);
+	res = vfs_mapped_getxattr(user_ns, dentry, name, NULL, 0);
 	if (res == -ENODATA || res == -EOPNOTSUPP)
 		res = 0;
 
@@ -800,7 +811,7 @@ static ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value)
 		if (!buf)
 			return -ENOMEM;
 
-		res = vfs_getxattr(dentry, name, buf, res);
+		res = vfs_mapped_getxattr(user_ns, dentry, name, buf, res);
 		if (res < 0)
 			kfree(buf);
 		else
@@ -814,6 +825,7 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
 {
 	struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
 	struct path upperpath, datapath;
+	struct user_namespace *user_ns;
 	int err;
 	char *capability = NULL;
 	ssize_t cap_size;
@@ -827,8 +839,8 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
 		return -EIO;
 
 	if (c->stat.size) {
-		err = cap_size = ovl_getxattr(upperpath.dentry, XATTR_NAME_CAPS,
-					      &capability);
+		err = cap_size = ovl_getxattr(c->user_ns, upperpath.dentry,
+					      XATTR_NAME_CAPS, &capability);
 		if (cap_size < 0)
 			goto out;
 	}
@@ -841,9 +853,10 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
 	 * Writing to upper file will clear security.capability xattr. We
 	 * don't want that to happen for normal copy-up operation.
 	 */
+	user_ns = mnt_user_ns(upperpath.mnt);
 	if (capability) {
-		err = vfs_setxattr(upperpath.dentry, XATTR_NAME_CAPS,
-				   capability, cap_size, 0);
+		err = vfs_mapped_setxattr(user_ns, upperpath.dentry,
+				      XATTR_NAME_CAPS, capability, cap_size, 0);
 		if (err)
 			goto out_free;
 	}
@@ -887,6 +900,7 @@ static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
 		ovl_path_upper(parent, &parentpath);
 		ctx.destdir = parentpath.dentry;
 		ctx.destname = dentry->d_name;
+		ctx.user_ns = mnt_user_ns(parentpath.mnt);
 
 		err = vfs_getattr(&parentpath, &ctx.pstat,
 				  STATX_ATIME | STATX_MTIME,
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 28a075b5f5b2..23d09de00957 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -23,15 +23,16 @@ MODULE_PARM_DESC(redirect_max,
 
 static int ovl_set_redirect(struct dentry *dentry, bool samedir);
 
-int ovl_cleanup(struct inode *wdir, struct dentry *wdentry)
+int ovl_cleanup(struct user_namespace *user_ns, struct inode *wdir,
+		struct dentry *wdentry)
 {
 	int err;
 
 	dget(wdentry);
 	if (d_is_dir(wdentry))
-		err = ovl_do_rmdir(wdir, wdentry);
+		err = ovl_do_rmdir(user_ns, wdir, wdentry);
 	else
-		err = ovl_do_unlink(wdir, wdentry);
+		err = ovl_do_unlink(user_ns, wdir, wdentry);
 	dput(wdentry);
 
 	if (err) {
@@ -42,7 +43,8 @@ int ovl_cleanup(struct inode *wdir, struct dentry *wdentry)
 	return err;
 }
 
-struct dentry *ovl_lookup_temp(struct dentry *workdir)
+struct dentry *ovl_lookup_temp(struct user_namespace *user_ns,
+			       struct dentry *workdir)
 {
 	struct dentry *temp;
 	char name[20];
@@ -51,7 +53,7 @@ struct dentry *ovl_lookup_temp(struct dentry *workdir)
 	/* counter is allowed to wrap, since temp dentries are ephemeral */
 	snprintf(name, sizeof(name), "#%x", atomic_inc_return(&temp_id));
 
-	temp = lookup_one_len(name, workdir, strlen(name));
+	temp = lookup_one_len_mapped(name, workdir, strlen(name), user_ns);
 	if (!IS_ERR(temp) && temp->d_inode) {
 		pr_err("workdir/%s already exists\n", name);
 		dput(temp);
@@ -68,13 +70,14 @@ static struct dentry *ovl_whiteout(struct ovl_fs *ofs)
 	struct dentry *whiteout;
 	struct dentry *workdir = ofs->workdir;
 	struct inode *wdir = workdir->d_inode;
+	struct user_namespace *user_ns = ovl_upper_mnt_user_ns(ofs);
 
 	if (!ofs->whiteout) {
-		whiteout = ovl_lookup_temp(workdir);
+		whiteout = ovl_lookup_temp(user_ns, workdir);
 		if (IS_ERR(whiteout))
 			goto out;
 
-		err = ovl_do_whiteout(wdir, whiteout);
+		err = ovl_do_whiteout(user_ns, wdir, whiteout);
 		if (err) {
 			dput(whiteout);
 			whiteout = ERR_PTR(err);
@@ -84,11 +87,11 @@ static struct dentry *ovl_whiteout(struct ovl_fs *ofs)
 	}
 
 	if (ofs->share_whiteout) {
-		whiteout = ovl_lookup_temp(workdir);
+		whiteout = ovl_lookup_temp(user_ns, workdir);
 		if (IS_ERR(whiteout))
 			goto out;
 
-		err = ovl_do_link(ofs->whiteout, wdir, whiteout);
+		err = ovl_do_link(user_ns, ofs->whiteout, wdir, whiteout);
 		if (!err)
 			goto out;
 
@@ -110,6 +113,7 @@ int ovl_cleanup_and_whiteout(struct ovl_fs *ofs, struct inode *dir,
 			     struct dentry *dentry)
 {
 	struct inode *wdir = ofs->workdir->d_inode;
+	struct user_namespace *user_ns = ovl_upper_mnt_user_ns(ofs);
 	struct dentry *whiteout;
 	int err;
 	int flags = 0;
@@ -122,28 +126,28 @@ int ovl_cleanup_and_whiteout(struct ovl_fs *ofs, struct inode *dir,
 	if (d_is_dir(dentry))
 		flags = RENAME_EXCHANGE;
 
-	err = ovl_do_rename(wdir, whiteout, dir, dentry, flags);
+	err = ovl_do_rename(wdir, user_ns, whiteout, dir, user_ns, dentry, flags);
 	if (err)
 		goto kill_whiteout;
 	if (flags)
-		ovl_cleanup(wdir, dentry);
+		ovl_cleanup(user_ns, wdir, dentry);
 
 out:
 	dput(whiteout);
 	return err;
 
 kill_whiteout:
-	ovl_cleanup(wdir, whiteout);
+	ovl_cleanup(user_ns, wdir, whiteout);
 	goto out;
 }
 
-static int ovl_mkdir_real(struct inode *dir, struct dentry **newdentry,
-			  umode_t mode)
+static int ovl_mkdir_real(struct user_namespace *user_ns, struct inode *dir,
+			  struct dentry **newdentry, umode_t mode)
 {
 	int err;
 	struct dentry *d, *dentry = *newdentry;
 
-	err = ovl_do_mkdir(dir, dentry, mode);
+	err = ovl_do_mkdir(user_ns, dir, dentry, mode);
 	if (err)
 		return err;
 
@@ -155,8 +159,8 @@ static int ovl_mkdir_real(struct inode *dir, struct dentry **newdentry,
 	 * to it unhashed and negative. If that happens, try to
 	 * lookup a new hashed and positive dentry.
 	 */
-	d = lookup_one_len(dentry->d_name.name, dentry->d_parent,
-			   dentry->d_name.len);
+	d = lookup_one_len_mapped(dentry->d_name.name, dentry->d_parent,
+			      dentry->d_name.len, user_ns);
 	if (IS_ERR(d)) {
 		pr_warn("failed lookup after mkdir (%pd2, err=%i).\n",
 			dentry, err);
@@ -168,7 +172,8 @@ static int ovl_mkdir_real(struct inode *dir, struct dentry **newdentry,
 	return 0;
 }
 
-struct dentry *ovl_create_real(struct inode *dir, struct dentry *newdentry,
+struct dentry *ovl_create_real(struct user_namespace *user_ns,
+			       struct inode *dir, struct dentry *newdentry,
 			       struct ovl_cattr *attr)
 {
 	int err;
@@ -181,28 +186,28 @@ struct dentry *ovl_create_real(struct inode *dir, struct dentry *newdentry,
 		goto out;
 
 	if (attr->hardlink) {
-		err = ovl_do_link(attr->hardlink, dir, newdentry);
+		err = ovl_do_link(user_ns, attr->hardlink, dir, newdentry);
 	} else {
 		switch (attr->mode & S_IFMT) {
 		case S_IFREG:
-			err = ovl_do_create(dir, newdentry, attr->mode);
+			err = ovl_do_create(user_ns, dir, newdentry, attr->mode);
 			break;
 
 		case S_IFDIR:
 			/* mkdir is special... */
-			err =  ovl_mkdir_real(dir, &newdentry, attr->mode);
+			err =  ovl_mkdir_real(user_ns, dir, &newdentry, attr->mode);
 			break;
 
 		case S_IFCHR:
 		case S_IFBLK:
 		case S_IFIFO:
 		case S_IFSOCK:
-			err = ovl_do_mknod(dir, newdentry, attr->mode,
+			err = ovl_do_mknod(user_ns, dir, newdentry, attr->mode,
 					   attr->rdev);
 			break;
 
 		case S_IFLNK:
-			err = ovl_do_symlink(dir, newdentry, attr->link);
+			err = ovl_do_symlink(user_ns, dir, newdentry, attr->link);
 			break;
 
 		default:
@@ -224,10 +229,11 @@ struct dentry *ovl_create_real(struct inode *dir, struct dentry *newdentry,
 	return newdentry;
 }
 
-struct dentry *ovl_create_temp(struct dentry *workdir, struct ovl_cattr *attr)
+struct dentry *ovl_create_temp(struct user_namespace *user_ns, struct dentry *workdir,
+			       struct ovl_cattr *attr)
 {
-	return ovl_create_real(d_inode(workdir), ovl_lookup_temp(workdir),
-			       attr);
+	return ovl_create_real(user_ns, d_inode(workdir),
+			       ovl_lookup_temp(user_ns, workdir), attr);
 }
 
 static int ovl_set_opaque_xerr(struct dentry *dentry, struct dentry *upper,
@@ -323,16 +329,18 @@ static int ovl_create_upper(struct dentry *dentry, struct inode *inode,
 	struct dentry *upperdir = ovl_dentry_upper(dentry->d_parent);
 	struct inode *udir = upperdir->d_inode;
 	struct dentry *newdentry;
+	struct user_namespace *user_ns = ovl_dentry_mnt_user_ns(dentry);
 	int err;
 
 	if (!attr->hardlink && !IS_POSIXACL(udir))
 		attr->mode &= ~current_umask();
 
 	inode_lock_nested(udir, I_MUTEX_PARENT);
-	newdentry = ovl_create_real(udir,
-				    lookup_one_len(dentry->d_name.name,
-						   upperdir,
-						   dentry->d_name.len),
+	newdentry = ovl_create_real(user_ns, udir,
+				    lookup_one_len_mapped(dentry->d_name.name,
+						      upperdir,
+						      dentry->d_name.len,
+						      user_ns),
 				    attr);
 	err = PTR_ERR(newdentry);
 	if (IS_ERR(newdentry))
@@ -351,7 +359,7 @@ static int ovl_create_upper(struct dentry *dentry, struct inode *inode,
 	return err;
 
 out_cleanup:
-	ovl_cleanup(udir, newdentry);
+	ovl_cleanup(user_ns, udir, newdentry);
 	dput(newdentry);
 	goto out_unlock;
 }
@@ -363,6 +371,7 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry,
 	struct inode *wdir = workdir->d_inode;
 	struct dentry *upperdir = ovl_dentry_upper(dentry->d_parent);
 	struct inode *udir = upperdir->d_inode;
+	struct user_namespace *user_ns;
 	struct path upperpath;
 	struct dentry *upper;
 	struct dentry *opaquedir;
@@ -389,12 +398,13 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry,
 	if (upper->d_parent->d_inode != udir)
 		goto out_unlock;
 
-	opaquedir = ovl_create_temp(workdir, OVL_CATTR(stat.mode));
+	user_ns = mnt_user_ns(upperpath.mnt);
+	opaquedir = ovl_create_temp(user_ns, workdir, OVL_CATTR(stat.mode));
 	err = PTR_ERR(opaquedir);
 	if (IS_ERR(opaquedir))
 		goto out_unlock;
 
-	err = ovl_copy_xattr(dentry->d_sb, upper, opaquedir);
+	err = ovl_copy_xattr(dentry->d_sb, user_ns, upper, user_ns, opaquedir);
 	if (err)
 		goto out_cleanup;
 
@@ -403,17 +413,17 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry,
 		goto out_cleanup;
 
 	inode_lock(opaquedir->d_inode);
-	err = ovl_set_attr(opaquedir, &stat);
+	err = ovl_set_attr(user_ns, opaquedir, &stat);
 	inode_unlock(opaquedir->d_inode);
 	if (err)
 		goto out_cleanup;
 
-	err = ovl_do_rename(wdir, opaquedir, udir, upper, RENAME_EXCHANGE);
+	err = ovl_do_rename(wdir, user_ns, opaquedir, udir, user_ns, upper, RENAME_EXCHANGE);
 	if (err)
 		goto out_cleanup;
 
-	ovl_cleanup_whiteouts(upper, list);
-	ovl_cleanup(wdir, upper);
+	ovl_cleanup_whiteouts(user_ns, upper, list);
+	ovl_cleanup(user_ns, wdir, upper);
 	unlock_rename(workdir, upperdir);
 
 	/* dentry's upper doesn't match now, get rid of it */
@@ -422,7 +432,7 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry,
 	return opaquedir;
 
 out_cleanup:
-	ovl_cleanup(wdir, opaquedir);
+	ovl_cleanup(user_ns, wdir, opaquedir);
 	dput(opaquedir);
 out_unlock:
 	unlock_rename(workdir, upperdir);
@@ -430,7 +440,8 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry,
 	return ERR_PTR(err);
 }
 
-static int ovl_set_upper_acl(struct dentry *upperdentry, const char *name,
+static int ovl_set_upper_acl(struct user_namespace *user_ns,
+			     struct dentry *upperdentry, const char *name,
 			     const struct posix_acl *acl)
 {
 	void *buffer;
@@ -449,7 +460,7 @@ static int ovl_set_upper_acl(struct dentry *upperdentry, const char *name,
 	if (err < 0)
 		goto out_free;
 
-	err = vfs_setxattr(upperdentry, name, buffer, size, XATTR_CREATE);
+	err = vfs_mapped_setxattr(user_ns, upperdentry, name, buffer, size, XATTR_CREATE);
 out_free:
 	kfree(buffer);
 	return err;
@@ -464,6 +475,7 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
 	struct inode *udir = upperdir->d_inode;
 	struct dentry *upper;
 	struct dentry *newdentry;
+	struct user_namespace *user_ns;
 	int err;
 	struct posix_acl *acl, *default_acl;
 	bool hardlink = !!cattr->hardlink;
@@ -482,8 +494,9 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
 	if (err)
 		goto out;
 
-	upper = lookup_one_len(dentry->d_name.name, upperdir,
-			       dentry->d_name.len);
+	user_ns = ovl_dentry_mnt_user_ns(dentry->d_parent);
+	upper = lookup_one_len_mapped(dentry->d_name.name, upperdir,
+				  dentry->d_name.len, user_ns);
 	err = PTR_ERR(upper);
 	if (IS_ERR(upper))
 		goto out_unlock;
@@ -492,7 +505,7 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
 	if (d_is_negative(upper) || !IS_WHITEOUT(d_inode(upper)))
 		goto out_dput;
 
-	newdentry = ovl_create_temp(workdir, cattr);
+	newdentry = ovl_create_temp(user_ns, workdir, cattr);
 	err = PTR_ERR(newdentry);
 	if (IS_ERR(newdentry))
 		goto out_dput;
@@ -508,18 +521,18 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
 			.ia_mode = cattr->mode,
 		};
 		inode_lock(newdentry->d_inode);
-		err = notify_change(newdentry, &attr, NULL);
+		err = notify_mapped_change(user_ns, newdentry, &attr, NULL);
 		inode_unlock(newdentry->d_inode);
 		if (err)
 			goto out_cleanup;
 	}
 	if (!hardlink) {
-		err = ovl_set_upper_acl(newdentry, XATTR_NAME_POSIX_ACL_ACCESS,
+		err = ovl_set_upper_acl(user_ns, newdentry, XATTR_NAME_POSIX_ACL_ACCESS,
 					acl);
 		if (err)
 			goto out_cleanup;
 
-		err = ovl_set_upper_acl(newdentry, XATTR_NAME_POSIX_ACL_DEFAULT,
+		err = ovl_set_upper_acl(user_ns, newdentry, XATTR_NAME_POSIX_ACL_DEFAULT,
 					default_acl);
 		if (err)
 			goto out_cleanup;
@@ -530,14 +543,14 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
 		if (err)
 			goto out_cleanup;
 
-		err = ovl_do_rename(wdir, newdentry, udir, upper,
+		err = ovl_do_rename(wdir, user_ns, newdentry, udir, user_ns, upper,
 				    RENAME_EXCHANGE);
 		if (err)
 			goto out_cleanup;
 
-		ovl_cleanup(wdir, upper);
+		ovl_cleanup(user_ns, wdir, upper);
 	} else {
-		err = ovl_do_rename(wdir, newdentry, udir, upper, 0);
+		err = ovl_do_rename(wdir, user_ns, newdentry, udir, user_ns, upper, 0);
 		if (err)
 			goto out_cleanup;
 	}
@@ -556,7 +569,7 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
 	return err;
 
 out_cleanup:
-	ovl_cleanup(wdir, newdentry);
+	ovl_cleanup(user_ns, wdir, newdentry);
 	dput(newdentry);
 	goto out_dput;
 }
@@ -762,8 +775,9 @@ static int ovl_remove_and_whiteout(struct dentry *dentry,
 	if (err)
 		goto out_dput;
 
-	upper = lookup_one_len(dentry->d_name.name, upperdir,
-			       dentry->d_name.len);
+	upper = lookup_one_len_mapped(dentry->d_name.name, upperdir,
+				  dentry->d_name.len,
+				  ovl_upper_mnt_user_ns(ofs));
 	err = PTR_ERR(upper);
 	if (IS_ERR(upper))
 		goto out_unlock;
@@ -797,6 +811,7 @@ static int ovl_remove_upper(struct dentry *dentry, bool is_dir,
 {
 	struct dentry *upperdir = ovl_dentry_upper(dentry->d_parent);
 	struct inode *dir = upperdir->d_inode;
+	struct user_namespace *user_ns = ovl_dentry_mnt_user_ns(dentry->d_parent);
 	struct dentry *upper;
 	struct dentry *opaquedir = NULL;
 	int err;
@@ -809,8 +824,8 @@ static int ovl_remove_upper(struct dentry *dentry, bool is_dir,
 	}
 
 	inode_lock_nested(dir, I_MUTEX_PARENT);
-	upper = lookup_one_len(dentry->d_name.name, upperdir,
-			       dentry->d_name.len);
+	upper = lookup_one_len_mapped(dentry->d_name.name, upperdir,
+				  dentry->d_name.len, user_ns);
 	err = PTR_ERR(upper);
 	if (IS_ERR(upper))
 		goto out_unlock;
@@ -821,9 +836,9 @@ static int ovl_remove_upper(struct dentry *dentry, bool is_dir,
 		goto out_dput_upper;
 
 	if (is_dir)
-		err = vfs_rmdir(dir, upper);
+		err = vfs_mapped_rmdir(user_ns, dir, upper);
 	else
-		err = vfs_unlink(dir, upper, NULL);
+		err = vfs_mapped_unlink(user_ns, dir, upper, NULL);
 	ovl_dir_modified(dentry->d_parent, ovl_type_origin(dentry));
 
 	/*
@@ -920,7 +935,8 @@ static int ovl_do_remove(struct dentry *dentry, bool is_dir)
 	 */
 	upperdentry = ovl_dentry_upper(dentry);
 	if (upperdentry)
-		ovl_copyattr(d_inode(upperdentry), d_inode(dentry));
+		ovl_copyattr(ovl_inode_real_user_ns(d_inode(dentry)),
+			     d_inode(upperdentry), d_inode(dentry));
 
 out_drop_write:
 	ovl_drop_write(dentry);
@@ -1078,6 +1094,7 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
 	struct dentry *new_upperdir;
 	struct dentry *olddentry;
 	struct dentry *newdentry;
+	struct user_namespace *old_user_ns, *new_user_ns;
 	struct dentry *trap;
 	bool old_opaque;
 	bool new_opaque;
@@ -1181,10 +1198,12 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
 		}
 	}
 
+	old_user_ns = ovl_dentry_mnt_user_ns(old->d_parent);
+	new_user_ns = ovl_dentry_mnt_user_ns(new->d_parent);
 	trap = lock_rename(new_upperdir, old_upperdir);
 
-	olddentry = lookup_one_len(old->d_name.name, old_upperdir,
-				   old->d_name.len);
+	olddentry = lookup_one_len_mapped(old->d_name.name, old_upperdir,
+				      old->d_name.len, old_user_ns);
 	err = PTR_ERR(olddentry);
 	if (IS_ERR(olddentry))
 		goto out_unlock;
@@ -1193,8 +1212,8 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
 	if (!ovl_matches_upper(old, olddentry))
 		goto out_dput_old;
 
-	newdentry = lookup_one_len(new->d_name.name, new_upperdir,
-				   new->d_name.len);
+	newdentry = lookup_one_len_mapped(new->d_name.name, new_upperdir,
+				      new->d_name.len, new_user_ns);
 	err = PTR_ERR(newdentry);
 	if (IS_ERR(newdentry))
 		goto out_dput_old;
@@ -1241,13 +1260,13 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
 	if (err)
 		goto out_dput;
 
-	err = ovl_do_rename(old_upperdir->d_inode, olddentry,
-			    new_upperdir->d_inode, newdentry, flags);
+	err = ovl_do_rename(old_upperdir->d_inode, old_user_ns, olddentry,
+			    new_upperdir->d_inode, new_user_ns, newdentry, flags);
 	if (err)
 		goto out_dput;
 
 	if (cleanup_whiteout)
-		ovl_cleanup(old_upperdir->d_inode, newdentry);
+		ovl_cleanup(old_user_ns, old_upperdir->d_inode, newdentry);
 
 	if (overwrite && d_inode(new)) {
 		if (new_is_dir)
@@ -1262,9 +1281,9 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
 			 (d_inode(new) && ovl_type_origin(new)));
 
 	/* copy ctime: */
-	ovl_copyattr(d_inode(olddentry), d_inode(old));
+	ovl_copyattr(old_user_ns, d_inode(olddentry), d_inode(old));
 	if (d_inode(new) && ovl_dentry_upper(new))
-		ovl_copyattr(d_inode(newdentry), d_inode(new));
+		ovl_copyattr(new_user_ns, d_inode(newdentry), d_inode(new));
 
 out_dput:
 	dput(newdentry);
diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
index ed35be3fafc6..15620ddc458d 100644
--- a/fs/overlayfs/export.c
+++ b/fs/overlayfs/export.c
@@ -389,7 +389,8 @@ static struct dentry *ovl_lookup_real_one(struct dentry *connected,
 	 * pointer because we hold no lock on the real dentry.
 	 */
 	take_dentry_name_snapshot(&name, real);
-	this = lookup_one_len(name.name.name, connected, name.name.len);
+	this = lookup_one_len_mapped(name.name.name, connected, name.name.len,
+				 mnt_user_ns(layer->mnt));
 	err = PTR_ERR(this);
 	if (IS_ERR(this)) {
 		goto fail;
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index efccb7c1f9bc..a685c73b684a 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -40,6 +40,8 @@ static struct file *ovl_open_realfile(const struct file *file,
 				      struct inode *realinode)
 {
 	struct inode *inode = file_inode(file);
+	struct path realpath;
+	struct user_namespace *user_ns;
 	struct file *realfile;
 	const struct cred *old_cred;
 	int flags = file->f_flags | OVL_OPEN_FLAGS;
@@ -49,11 +51,13 @@ static struct file *ovl_open_realfile(const struct file *file,
 	if (flags & O_APPEND)
 		acc_mode |= MAY_APPEND;
 
+	ovl_path_real(file_dentry(file), &realpath);
+	user_ns = mnt_user_ns(realpath.mnt);
 	old_cred = ovl_override_creds(inode->i_sb);
-	err = inode_permission(realinode, MAY_OPEN | acc_mode);
+	err = mapped_inode_permission(user_ns, realinode, MAY_OPEN | acc_mode);
 	if (err) {
 		realfile = ERR_PTR(err);
-	} else if (!inode_owner_or_capable(realinode)) {
+	} else if (!mapped_inode_owner_or_capable(user_ns, realinode)) {
 		realfile = ERR_PTR(-EPERM);
 	} else {
 		realfile = open_with_fake_path(&file->f_path, flags, realinode,
@@ -269,7 +273,8 @@ static void ovl_aio_cleanup_handler(struct ovl_aio_req *aio_req)
 		__sb_writers_acquired(file_inode(iocb->ki_filp)->i_sb,
 				      SB_FREEZE_WRITE);
 		file_end_write(iocb->ki_filp);
-		ovl_copyattr(ovl_inode_real(inode), inode);
+		ovl_copyattr(ovl_inode_real_user_ns(inode),
+			     ovl_inode_real(inode), inode);
 	}
 
 	orig_iocb->ki_pos = iocb->ki_pos;
@@ -345,7 +350,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 
 	inode_lock(inode);
 	/* Update mode */
-	ovl_copyattr(ovl_inode_real(inode), inode);
+	ovl_copyattr(ovl_inode_real_user_ns(inode), ovl_inode_real(inode), inode);
 	ret = file_remove_privs(file);
 	if (ret)
 		goto out_unlock;
@@ -364,7 +369,8 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 				     ovl_iocb_to_rwf(ifl));
 		file_end_write(real.file);
 		/* Update size */
-		ovl_copyattr(ovl_inode_real(inode), inode);
+		ovl_copyattr(ovl_inode_real_user_ns(inode),
+			     ovl_inode_real(inode), inode);
 	} else {
 		struct ovl_aio_req *aio_req;
 
@@ -511,7 +517,7 @@ static long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len
 	revert_creds(old_cred);
 
 	/* Update size */
-	ovl_copyattr(ovl_inode_real(inode), inode);
+	ovl_copyattr(ovl_inode_real_user_ns(inode), ovl_inode_real(inode), inode);
 
 	fdput(real);
 
@@ -582,7 +588,7 @@ static long ovl_ioctl_set_flags(struct file *file, unsigned int cmd,
 	struct inode *inode = file_inode(file);
 	unsigned int oldflags;
 
-	if (!inode_owner_or_capable(inode))
+	if (!mapped_inode_owner_or_capable(mnt_user_ns(file->f_path.mnt), inode))
 		return -EACCES;
 
 	ret = mnt_want_write_file(file);
@@ -744,7 +750,8 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
 	revert_creds(old_cred);
 
 	/* Update size */
-	ovl_copyattr(ovl_inode_real(inode_out), inode_out);
+	ovl_copyattr(ovl_inode_real_user_ns(inode_out),
+		     ovl_inode_real(inode_out), inode_out);
 
 	fdput(real_in);
 	fdput(real_out);
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index b584dca845ba..b6c8b904f0e7 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -9,6 +9,7 @@
 #include <linux/cred.h>
 #include <linux/xattr.h>
 #include <linux/posix_acl.h>
+#include <linux/posix_acl_xattr.h>
 #include <linux/ratelimit.h>
 #include <linux/fiemap.h>
 #include "overlayfs.h"
@@ -46,6 +47,7 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 		err = ovl_copy_up_with_data(dentry);
 	if (!err) {
 		struct inode *winode = NULL;
+		struct user_namespace *upper_user_ns;
 
 		upperdentry = ovl_dentry_upper(dentry);
 
@@ -77,12 +79,20 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 		 */
 		attr->ia_valid &= ~ATTR_OPEN;
 
+		upper_user_ns = ovl_upper_mnt_user_ns(OVL_FS(dentry->d_sb));
+
+		if (attr->ia_valid & ATTR_UID)
+			attr->ia_uid = kuid_from_mnt(upper_user_ns, attr->ia_uid);
+		if (attr->ia_valid & ATTR_GID)
+			attr->ia_gid = kgid_from_mnt(upper_user_ns, attr->ia_gid);
+
 		inode_lock(upperdentry->d_inode);
 		old_cred = ovl_override_creds(dentry->d_sb);
-		err = notify_change(upperdentry, attr, NULL);
+		err = notify_mapped_change(upper_user_ns, upperdentry, attr, NULL);
 		revert_creds(old_cred);
 		if (!err)
-			ovl_copyattr(upperdentry->d_inode, dentry->d_inode);
+			ovl_copyattr(upper_user_ns, upperdentry->d_inode,
+				     dentry->d_inode);
 		inode_unlock(upperdentry->d_inode);
 
 		if (winode)
@@ -281,6 +291,7 @@ int ovl_permission(struct inode *inode, int mask)
 {
 	struct inode *upperinode = ovl_inode_upper(inode);
 	struct inode *realinode = upperinode ?: ovl_inode_lower(inode);
+	struct user_namespace *user_ns;
 	const struct cred *old_cred;
 	int err;
 
@@ -290,6 +301,11 @@ int ovl_permission(struct inode *inode, int mask)
 		return -ECHILD;
 	}
 
+	if (upperinode)
+		user_ns = ovl_upper_mnt_user_ns(OVL_FS(inode->i_sb));
+	else
+		user_ns = OVL_I(inode)->lower_user_ns;
+
 	/*
 	 * Check overlay inode with the creds of task and underlying inode
 	 * with creds of mounter
@@ -298,6 +314,7 @@ int ovl_permission(struct inode *inode, int mask)
 	if (err)
 		return err;
 
+	/* Handle idmapped lower mounts. */
 	old_cred = ovl_override_creds(inode->i_sb);
 	if (!upperinode &&
 	    !special_file(realinode->i_mode) && mask & MAY_WRITE) {
@@ -305,7 +322,7 @@ int ovl_permission(struct inode *inode, int mask)
 		/* Make sure mounter can read file for copy up later */
 		mask |= MAY_READ;
 	}
-	err = inode_permission(realinode, mask);
+	err = mapped_inode_permission(user_ns, realinode, mask);
 	revert_creds(old_cred);
 
 	return err;
@@ -337,16 +354,23 @@ int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
 		  const void *value, size_t size, int flags)
 {
 	int err;
+	void *val = NULL;
 	struct dentry *upperdentry = ovl_i_dentry_upper(inode);
 	struct dentry *realdentry = upperdentry ?: ovl_dentry_lower(dentry);
+	struct user_namespace *user_ns;
 	const struct cred *old_cred;
 
 	err = ovl_want_write(dentry);
 	if (err)
 		goto out;
 
+	if (upperdentry)
+		user_ns = ovl_upper_mnt_user_ns(OVL_FS(inode->i_sb));
+	else
+		user_ns = OVL_I(inode)->lower_user_ns;
+
 	if (!value && !upperdentry) {
-		err = vfs_getxattr(realdentry, name, NULL, 0);
+		err = vfs_mapped_getxattr(user_ns, realdentry, name, NULL, 0);
 		if (err < 0)
 			goto out_drop_write;
 	}
@@ -360,19 +384,37 @@ int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
 	}
 
 	old_cred = ovl_override_creds(dentry->d_sb);
-	if (value)
-		err = vfs_setxattr(realdentry, name, value, size, flags);
-	else {
+	if (value) {
+		val = kmalloc(size, GFP_KERNEL);
+		if (!val) {
+			err = -ENOMEM;
+			goto out_revert_creds;
+		}
+		memcpy(val, value, size);
+
+		if ((strcmp(name, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
+		    (strcmp(name, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
+			posix_acl_fix_xattr_from_user(user_ns, val, size);
+		else if (strcmp(name, XATTR_NAME_CAPS) == 0) {
+			err = cap_convert_nscap(user_ns, realdentry, &val, size);
+			if (err < 0)
+				goto out_revert_creds;
+			size = err;
+		}
+		err = vfs_mapped_setxattr(user_ns, realdentry, name, val, size, flags);
+	} else {
 		WARN_ON(flags != XATTR_REPLACE);
-		err = vfs_removexattr(realdentry, name);
+		err = vfs_mapped_removexattr(user_ns, realdentry, name);
 	}
+out_revert_creds:
 	revert_creds(old_cred);
 
 	/* copy c/mtime */
-	ovl_copyattr(d_inode(realdentry), inode);
+	ovl_copyattr(user_ns, d_inode(realdentry), inode);
 
 out_drop_write:
 	ovl_drop_write(dentry);
+	kfree(val);
 out:
 	return err;
 }
@@ -382,11 +421,22 @@ int ovl_xattr_get(struct dentry *dentry, struct inode *inode, const char *name,
 {
 	ssize_t res;
 	const struct cred *old_cred;
-	struct dentry *realdentry =
-		ovl_i_dentry_upper(inode) ?: ovl_dentry_lower(dentry);
+	struct dentry *realdentry = ovl_i_dentry_upper(inode);
+	struct user_namespace *user_ns;
+
+
+	if (realdentry) {
+		user_ns = ovl_upper_mnt_user_ns(OVL_FS(inode->i_sb));
+	} else {
+		realdentry = ovl_dentry_lower(dentry);
+		user_ns = OVL_I(inode)->lower_user_ns;
+	}
 
 	old_cred = ovl_override_creds(dentry->d_sb);
-	res = vfs_getxattr(realdentry, name, value, size);
+	res = vfs_mapped_getxattr(user_ns, realdentry, name, value, size);
+	if ((strcmp(name, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
+	    (strcmp(name, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
+		posix_acl_fix_xattr_to_user(user_ns, value, size);
 	revert_creds(old_cred);
 	return res;
 }
@@ -634,13 +684,15 @@ void ovl_inode_init(struct inode *inode, struct ovl_inode_params *oip,
 
 	if (oip->upperdentry)
 		OVL_I(inode)->__upperdentry = oip->upperdentry;
-	if (oip->lowerpath && oip->lowerpath->dentry)
+	if (oip->lowerpath && oip->lowerpath->dentry) {
 		OVL_I(inode)->lower = igrab(d_inode(oip->lowerpath->dentry));
+		OVL_I(inode)->lower_user_ns = get_user_ns(mnt_user_ns(oip->lowerpath->layer->mnt));
+	}
 	if (oip->lowerdata)
 		OVL_I(inode)->lowerdata = igrab(d_inode(oip->lowerdata));
 
 	realinode = ovl_inode_real(inode);
-	ovl_copyattr(realinode, inode);
+	ovl_copyattr(ovl_inode_real_user_ns(inode), realinode, inode);
 	ovl_copyflags(realinode, inode);
 	ovl_map_ino(inode, ino, fsid);
 }
@@ -751,8 +803,8 @@ unsigned int ovl_get_nlink(struct ovl_fs *ofs, struct dentry *lowerdentry,
 	if (!lowerdentry || !upperdentry || d_inode(lowerdentry)->i_nlink == 1)
 		return fallback;
 
-	err = ovl_do_getxattr(ofs, upperdentry, OVL_XATTR_NLINK,
-			      &buf, sizeof(buf) - 1);
+	err = ovl_do_getxattr(ofs, ovl_upper_mnt_user_ns(ofs), upperdentry,
+			      OVL_XATTR_NLINK, &buf, sizeof(buf) - 1);
 	if (err < 0)
 		goto fail;
 
@@ -956,6 +1008,7 @@ struct inode *ovl_get_inode(struct super_block *sb,
 	struct inode *realinode = upperdentry ? d_inode(upperdentry) : NULL;
 	struct inode *inode;
 	struct dentry *lowerdentry = lowerpath ? lowerpath->dentry : NULL;
+	struct user_namespace *user_ns = ovl_upper_mnt_user_ns(OVL_FS(sb));
 	bool bylower = ovl_hash_bylower(sb, upperdentry, lowerdentry,
 					oip->index);
 	int fsid = bylower ? lowerpath->layer->fsid : 0;
@@ -1028,8 +1081,10 @@ struct inode *ovl_get_inode(struct super_block *sb,
 
 	/* Check for non-merge dir that may have whiteouts */
 	if (is_dir) {
+		if (!upperdentry)
+			user_ns = mnt_user_ns(lowerpath->layer->mnt);
 		if (((upperdentry && lowerdentry) || oip->numlower > 1) ||
-		    ovl_check_origin_xattr(ofs, upperdentry ?: lowerdentry)) {
+		    ovl_check_origin_xattr(ofs, user_ns, upperdentry ?: lowerdentry)) {
 			ovl_set_flag(OVL_WHITEOUTS, inode);
 		}
 	}
diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index a6162c4076db..4a5e9ef6524b 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -25,14 +25,15 @@ struct ovl_lookup_data {
 	bool metacopy;
 };
 
-static int ovl_check_redirect(struct dentry *dentry, struct ovl_lookup_data *d,
+static int ovl_check_redirect(struct user_namespace *mnt_user_ns,
+			      struct dentry *dentry, struct ovl_lookup_data *d,
 			      size_t prelen, const char *post)
 {
 	int res;
 	char *buf;
 	struct ovl_fs *ofs = OVL_FS(d->sb);
 
-	buf = ovl_get_redirect_xattr(ofs, dentry, prelen + strlen(post));
+	buf = ovl_get_redirect_xattr(ofs, mnt_user_ns, dentry, prelen + strlen(post));
 	if (IS_ERR_OR_NULL(buf))
 		return PTR_ERR(buf);
 
@@ -110,8 +111,9 @@ static struct ovl_fh *ovl_get_fh(struct ovl_fs *ofs, struct dentry *dentry,
 {
 	int res, err;
 	struct ovl_fh *fh = NULL;
+	struct user_namespace *user_ns = ovl_upper_mnt_user_ns(ofs);
 
-	res = ovl_do_getxattr(ofs, dentry, ox, NULL, 0);
+	res = ovl_do_getxattr(ofs, user_ns, dentry, ox, NULL, 0);
 	if (res < 0) {
 		if (res == -ENODATA || res == -EOPNOTSUPP)
 			return NULL;
@@ -125,7 +127,7 @@ static struct ovl_fh *ovl_get_fh(struct ovl_fs *ofs, struct dentry *dentry,
 	if (!fh)
 		return ERR_PTR(-ENOMEM);
 
-	res = ovl_do_getxattr(ofs, dentry, ox, fh->buf, res);
+	res = ovl_do_getxattr(ofs, user_ns, dentry, ox, fh->buf, res);
 	if (res < 0)
 		goto fail;
 
@@ -188,16 +190,19 @@ struct dentry *ovl_decode_real_fh(struct ovl_fh *fh, struct vfsmount *mnt,
 	return real;
 }
 
-static bool ovl_is_opaquedir(struct super_block *sb, struct dentry *dentry)
+static bool ovl_is_opaquedir(struct super_block *sb,
+			     struct user_namespace *user_ns,
+			     struct dentry *dentry)
 {
-	return ovl_check_dir_xattr(sb, dentry, OVL_XATTR_OPAQUE);
+	return ovl_check_dir_xattr(sb, user_ns, dentry, OVL_XATTR_OPAQUE);
 }
 
 static struct dentry *ovl_lookup_positive_unlocked(const char *name,
 						   struct dentry *base, int len,
-						   bool drop_negative)
+						   bool drop_negative,
+						   struct user_namespace *mnt_user_ns)
 {
-	struct dentry *ret = lookup_one_len_unlocked(name, base, len);
+	struct dentry *ret = lookup_one_len_mapped_unlocked(name, base, len, mnt_user_ns);
 
 	if (!IS_ERR(ret) && d_flags_negative(smp_load_acquire(&ret->d_flags))) {
 		if (drop_negative && ret->d_lockref.count == 1) {
@@ -216,13 +221,14 @@ static struct dentry *ovl_lookup_positive_unlocked(const char *name,
 static int ovl_lookup_single(struct dentry *base, struct ovl_lookup_data *d,
 			     const char *name, unsigned int namelen,
 			     size_t prelen, const char *post,
-			     struct dentry **ret, bool drop_negative)
+			     struct dentry **ret, bool drop_negative,
+			     struct user_namespace *mnt_user_ns)
 {
 	struct dentry *this;
 	int err;
 	bool last_element = !post[0];
 
-	this = ovl_lookup_positive_unlocked(name, base, namelen, drop_negative);
+	this = ovl_lookup_positive_unlocked(name, base, namelen, drop_negative, mnt_user_ns);
 	if (IS_ERR(this)) {
 		err = PTR_ERR(this);
 		this = NULL;
@@ -253,7 +259,7 @@ static int ovl_lookup_single(struct dentry *base, struct ovl_lookup_data *d,
 			d->stop = true;
 			goto put_and_out;
 		}
-		err = ovl_check_metacopy_xattr(OVL_FS(d->sb), this);
+		err = ovl_check_metacopy_xattr(OVL_FS(d->sb), mnt_user_ns, this);
 		if (err < 0)
 			goto out_err;
 
@@ -273,14 +279,14 @@ static int ovl_lookup_single(struct dentry *base, struct ovl_lookup_data *d,
 		if (d->last)
 			goto out;
 
-		if (ovl_is_opaquedir(d->sb, this)) {
+		if (ovl_is_opaquedir(d->sb, mnt_user_ns, this)) {
 			d->stop = true;
 			if (last_element)
 				d->opaque = true;
 			goto out;
 		}
 	}
-	err = ovl_check_redirect(this, d, prelen, post);
+	err = ovl_check_redirect(mnt_user_ns, this, d, prelen, post);
 	if (err)
 		goto out_err;
 out:
@@ -298,7 +304,8 @@ static int ovl_lookup_single(struct dentry *base, struct ovl_lookup_data *d,
 }
 
 static int ovl_lookup_layer(struct dentry *base, struct ovl_lookup_data *d,
-			    struct dentry **ret, bool drop_negative)
+			    struct dentry **ret, bool drop_negative,
+			    struct user_namespace *mnt_user_ns)
 {
 	/* Counting down from the end, since the prefix can change */
 	size_t rem = d->name.len - 1;
@@ -307,7 +314,7 @@ static int ovl_lookup_layer(struct dentry *base, struct ovl_lookup_data *d,
 
 	if (d->name.name[0] != '/')
 		return ovl_lookup_single(base, d, d->name.name, d->name.len,
-					 0, "", ret, drop_negative);
+					 0, "", ret, drop_negative, mnt_user_ns);
 
 	while (!IS_ERR_OR_NULL(base) && d_can_lookup(base)) {
 		const char *s = d->name.name + d->name.len - rem;
@@ -321,7 +328,7 @@ static int ovl_lookup_layer(struct dentry *base, struct ovl_lookup_data *d,
 
 		err = ovl_lookup_single(base, d, s, thislen,
 					d->name.len - rem, next, &base,
-					drop_negative);
+					drop_negative, mnt_user_ns);
 		dput(dentry);
 		if (err)
 			return err;
@@ -666,7 +673,8 @@ struct dentry *ovl_get_index_fh(struct ovl_fs *ofs, struct ovl_fh *fh)
 	if (err)
 		return ERR_PTR(err);
 
-	index = lookup_positive_unlocked(name.name, ofs->indexdir, name.len);
+	index = lookup_positive_mapped_unlocked(name.name, ofs->indexdir, name.len,
+					    ovl_upper_mnt_user_ns(ofs));
 	kfree(name.name);
 	if (IS_ERR(index)) {
 		if (PTR_ERR(index) == -ENOENT)
@@ -698,7 +706,8 @@ struct dentry *ovl_lookup_index(struct ovl_fs *ofs, struct dentry *upper,
 	if (err)
 		return ERR_PTR(err);
 
-	index = lookup_positive_unlocked(name.name, ofs->indexdir, name.len);
+	index = lookup_positive_mapped_unlocked(name.name, ofs->indexdir, name.len,
+					    ovl_upper_mnt_user_ns(ofs));
 	if (IS_ERR(index)) {
 		err = PTR_ERR(index);
 		if (err == -ENOENT) {
@@ -798,7 +807,7 @@ static int ovl_fix_origin(struct ovl_fs *ofs, struct dentry *dentry,
 {
 	int err;
 
-	if (ovl_check_origin_xattr(ofs, upper))
+	if (ovl_check_origin_xattr(ofs, ovl_upper_mnt_user_ns(ofs), upper))
 		return 0;
 
 	err = ovl_want_write(dentry);
@@ -818,6 +827,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 {
 	struct ovl_entry *oe;
 	const struct cred *old_cred;
+	struct user_namespace *user_ns;
 	struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
 	struct ovl_entry *poe = dentry->d_parent->d_fsdata;
 	struct ovl_entry *roe = dentry->d_sb->s_root->d_fsdata;
@@ -850,7 +860,8 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 	old_cred = ovl_override_creds(dentry->d_sb);
 	upperdir = ovl_dentry_upper(dentry->d_parent);
 	if (upperdir) {
-		err = ovl_lookup_layer(upperdir, &d, &upperdentry, true);
+		user_ns = ovl_upper_mnt_user_ns(ofs);
+		err = ovl_lookup_layer(upperdir, &d, &upperdentry, true, user_ns);
 		if (err)
 			goto out;
 
@@ -905,7 +916,8 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 		else
 			d.last = lower.layer->idx == roe->numlower;
 
-		err = ovl_lookup_layer(lower.dentry, &d, &this, false);
+		user_ns = mnt_user_ns(lower.layer->mnt);
+		err = ovl_lookup_layer(lower.dentry, &d, &this, false, user_ns);
 		if (err)
 			goto out_put;
 
@@ -1062,14 +1074,15 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 	if (upperdentry)
 		ovl_dentry_set_upper_alias(dentry);
 	else if (index) {
+		struct user_namespace *upper_user_ns = ovl_upper_mnt_user_ns(ofs);
 		upperdentry = dget(index);
-		upperredirect = ovl_get_redirect_xattr(ofs, upperdentry, 0);
+		upperredirect = ovl_get_redirect_xattr(ofs, upper_user_ns, upperdentry, 0);
 		if (IS_ERR(upperredirect)) {
 			err = PTR_ERR(upperredirect);
 			upperredirect = NULL;
 			goto out_free_oe;
 		}
-		err = ovl_check_metacopy_xattr(ofs, upperdentry);
+		err = ovl_check_metacopy_xattr(ofs, upper_user_ns, upperdentry);
 		if (err < 0)
 			goto out_free_oe;
 		uppermetacopy = err;
@@ -1153,9 +1166,10 @@ bool ovl_lower_positive(struct dentry *dentry)
 	for (i = 0; !done && !positive && i < poe->numlower; i++) {
 		struct dentry *this;
 		struct dentry *lowerdir = poe->lowerstack[i].dentry;
+		struct user_namespace *user_ns = mnt_user_ns(poe->lowerstack[i].layer->mnt);
 
-		this = lookup_positive_unlocked(name->name, lowerdir,
-					       name->len);
+		this = lookup_positive_mapped_unlocked(name->name, lowerdir,
+						   name->len, user_ns);
 		if (IS_ERR(this)) {
 			switch (PTR_ERR(this)) {
 			case -ENOENT:
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 30ee48ddfaa2..63c257c3bfa8 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -7,6 +7,7 @@
 #include <linux/kernel.h>
 #include <linux/uuid.h>
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include "ovl_entry.h"
 
 #undef pr_fmt
@@ -119,72 +120,90 @@ static inline const char *ovl_xattr(struct ovl_fs *ofs, enum ovl_xattr ox)
 	return ovl_xattr_table[ox];
 }
 
-static inline int ovl_do_rmdir(struct inode *dir, struct dentry *dentry)
+static inline struct user_namespace *ovl_upper_mnt_user_ns(struct ovl_fs *ofs)
 {
-	int err = vfs_rmdir(dir, dentry);
+	return mnt_user_ns(ovl_upper_mnt(ofs));
+}
+
+static inline struct user_namespace *ovl_dentry_mnt_user_ns(struct dentry *dentry)
+{
+	return mnt_user_ns(ovl_upper_mnt(OVL_FS(dentry->d_sb)));
+}
+
+static inline int ovl_do_rmdir(struct user_namespace *user_ns,
+			       struct inode *dir, struct dentry *dentry)
+{
+	int err = vfs_mapped_rmdir(user_ns, dir, dentry);
 
 	pr_debug("rmdir(%pd2) = %i\n", dentry, err);
 	return err;
 }
 
-static inline int ovl_do_unlink(struct inode *dir, struct dentry *dentry)
+static inline int ovl_do_unlink(struct user_namespace *user_ns,
+				struct inode *dir, struct dentry *dentry)
 {
-	int err = vfs_unlink(dir, dentry, NULL);
+	int err = vfs_mapped_unlink(user_ns, dir, dentry, NULL);
 
 	pr_debug("unlink(%pd2) = %i\n", dentry, err);
 	return err;
 }
 
-static inline int ovl_do_link(struct dentry *old_dentry, struct inode *dir,
+static inline int ovl_do_link(struct user_namespace *user_ns,
+			      struct dentry *old_dentry, struct inode *dir,
 			      struct dentry *new_dentry)
 {
-	int err = vfs_link(old_dentry, dir, new_dentry, NULL);
+	int err = vfs_mapped_link(old_dentry, user_ns, dir, new_dentry, NULL);
 
 	pr_debug("link(%pd2, %pd2) = %i\n", old_dentry, new_dentry, err);
 	return err;
 }
 
-static inline int ovl_do_create(struct inode *dir, struct dentry *dentry,
+static inline int ovl_do_create(struct user_namespace *user_ns,
+				struct inode *dir, struct dentry *dentry,
 				umode_t mode)
 {
-	int err = vfs_create(dir, dentry, mode, true);
+	int err = vfs_mapped_create(user_ns, dir, dentry, mode, true);
 
 	pr_debug("create(%pd2, 0%o) = %i\n", dentry, mode, err);
 	return err;
 }
 
-static inline int ovl_do_mkdir(struct inode *dir, struct dentry *dentry,
+static inline int ovl_do_mkdir(struct user_namespace *user_ns,
+			       struct inode *dir, struct dentry *dentry,
 			       umode_t mode)
 {
-	int err = vfs_mkdir(dir, dentry, mode);
+	int err = vfs_mapped_mkdir(user_ns, dir, dentry, mode);
 	pr_debug("mkdir(%pd2, 0%o) = %i\n", dentry, mode, err);
 	return err;
 }
 
-static inline int ovl_do_mknod(struct inode *dir, struct dentry *dentry,
+static inline int ovl_do_mknod(struct user_namespace *user_ns,
+			       struct inode *dir, struct dentry *dentry,
 			       umode_t mode, dev_t dev)
 {
-	int err = vfs_mknod(dir, dentry, mode, dev);
+	int err = vfs_mapped_mknod(user_ns, dir, dentry, mode, dev);
 
 	pr_debug("mknod(%pd2, 0%o, 0%o) = %i\n", dentry, mode, dev, err);
 	return err;
 }
 
-static inline int ovl_do_symlink(struct inode *dir, struct dentry *dentry,
+static inline int ovl_do_symlink(struct user_namespace *user_ns,
+				 struct inode *dir, struct dentry *dentry,
 				 const char *oldname)
 {
-	int err = vfs_symlink(dir, dentry, oldname);
+	int err = vfs_mapped_symlink(user_ns, dir, dentry, oldname);
 
 	pr_debug("symlink(\"%s\", %pd2) = %i\n", oldname, dentry, err);
 	return err;
 }
 
-static inline ssize_t ovl_do_getxattr(struct ovl_fs *ofs, struct dentry *dentry,
-				      enum ovl_xattr ox, void *value,
-				      size_t size)
+static inline ssize_t ovl_do_getxattr(struct ovl_fs *ofs,
+				      struct user_namespace *user_ns,
+				      struct dentry *dentry, enum ovl_xattr ox,
+				      void *value, size_t size)
 {
 	const char *name = ovl_xattr(ofs, ox);
-	return vfs_getxattr(dentry, name, value, size);
+	return vfs_mapped_getxattr(user_ns, dentry, name, value, size);
 }
 
 static inline int ovl_do_setxattr(struct ovl_fs *ofs, struct dentry *dentry,
@@ -192,7 +211,8 @@ static inline int ovl_do_setxattr(struct ovl_fs *ofs, struct dentry *dentry,
 				  size_t size)
 {
 	const char *name = ovl_xattr(ofs, ox);
-	int err = vfs_setxattr(dentry, name, value, size, 0);
+	struct user_namespace *user_ns = ovl_upper_mnt_user_ns(ofs);
+	int err = vfs_mapped_setxattr(user_ns, dentry, name, value, size, 0);
 	pr_debug("setxattr(%pd2, \"%s\", \"%*pE\", %zu, 0) = %i\n",
 		 dentry, name, min((int)size, 48), value, size, err);
 	return err;
@@ -202,19 +222,31 @@ static inline int ovl_do_removexattr(struct ovl_fs *ofs, struct dentry *dentry,
 				     enum ovl_xattr ox)
 {
 	const char *name = ovl_xattr(ofs, ox);
-	int err = vfs_removexattr(dentry, name);
+	struct user_namespace *user_ns = ovl_upper_mnt_user_ns(ofs);
+	int err = vfs_mapped_removexattr(user_ns, dentry, name);
 	pr_debug("removexattr(%pd2, \"%s\") = %i\n", dentry, name, err);
 	return err;
 }
 
-static inline int ovl_do_rename(struct inode *olddir, struct dentry *olddentry,
-				struct inode *newdir, struct dentry *newdentry,
-				unsigned int flags)
+static inline int ovl_do_rename(struct inode *olddir,
+				struct user_namespace *old_user_ns,
+				struct dentry *olddentry, struct inode *newdir,
+				struct user_namespace *new_user_ns,
+				struct dentry *newdentry, unsigned int flags)
 {
 	int err;
+	struct renamedata data = {
+		.old_dir	= olddir,
+		.old_dentry	= olddentry,
+		.old_user_ns	= old_user_ns,
+		.new_dir	= newdir,
+		.new_dentry	= newdentry,
+		.new_user_ns	= new_user_ns,
+		.flags		= flags,
+	};
 
 	pr_debug("rename(%pd2, %pd2, 0x%x)\n", olddentry, newdentry, flags);
-	err = vfs_rename(olddir, olddentry, newdir, newdentry, NULL, flags);
+	err = vfs_mapped_rename(&data);
 	if (err) {
 		pr_debug("...rename(%pd2, %pd2, ...) = %i\n",
 			 olddentry, newdentry, err);
@@ -222,16 +254,18 @@ static inline int ovl_do_rename(struct inode *olddir, struct dentry *olddentry,
 	return err;
 }
 
-static inline int ovl_do_whiteout(struct inode *dir, struct dentry *dentry)
+static inline int ovl_do_whiteout(struct user_namespace *user_ns,
+				  struct inode *dir, struct dentry *dentry)
 {
-	int err = vfs_whiteout(&init_user_ns, dir, dentry);
+	int err = vfs_whiteout(user_ns, dir, dentry);
 	pr_debug("whiteout(%pd2) = %i\n", dentry, err);
 	return err;
 }
 
-static inline struct dentry *ovl_do_tmpfile(struct dentry *dentry, umode_t mode)
+static inline struct dentry *ovl_do_tmpfile(struct user_namespace *user_ns,
+					    struct dentry *dentry, umode_t mode)
 {
-	struct dentry *ret = vfs_tmpfile(dentry, mode, 0);
+	struct dentry *ret = vfs_mapped_tmpfile(user_ns, dentry, mode, 0);
 	int err = PTR_ERR_OR_ZERO(ret);
 
 	pr_debug("tmpfile(%pd2, 0%o) = %i\n", dentry, mode, err);
@@ -301,9 +335,10 @@ struct file *ovl_path_open(struct path *path, int flags);
 int ovl_copy_up_start(struct dentry *dentry, int flags);
 void ovl_copy_up_end(struct dentry *dentry);
 bool ovl_already_copied_up(struct dentry *dentry, int flags);
-bool ovl_check_origin_xattr(struct ovl_fs *ofs, struct dentry *dentry);
-bool ovl_check_dir_xattr(struct super_block *sb, struct dentry *dentry,
-			 enum ovl_xattr ox);
+bool ovl_check_origin_xattr(struct ovl_fs *ofs, struct user_namespace *user_ns,
+			    struct dentry *dentry);
+bool ovl_check_dir_xattr(struct super_block *sb, struct user_namespace *user_ns,
+			 struct dentry *dentry, enum ovl_xattr ox);
 int ovl_check_setxattr(struct dentry *dentry, struct dentry *upperdentry,
 		       enum ovl_xattr ox, const void *value, size_t size,
 		       int xerr);
@@ -318,15 +353,17 @@ bool ovl_need_index(struct dentry *dentry);
 int ovl_nlink_start(struct dentry *dentry);
 void ovl_nlink_end(struct dentry *dentry);
 int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir);
-int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct dentry *dentry);
+int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct user_namespace *user_ns,
+			     struct dentry *dentry);
 bool ovl_is_metacopy_dentry(struct dentry *dentry);
-char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
-			     int padding);
+char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct user_namespace *user_ns,
+			     struct dentry *dentry, int padding);
 
 static inline bool ovl_is_impuredir(struct super_block *sb,
 				    struct dentry *dentry)
 {
-	return ovl_check_dir_xattr(sb, dentry, OVL_XATTR_IMPURE);
+	return ovl_check_dir_xattr(sb, ovl_upper_mnt_user_ns(OVL_FS(sb)),
+				   dentry, OVL_XATTR_IMPURE);
 }
 
 /*
@@ -404,8 +441,7 @@ bool ovl_lower_positive(struct dentry *dentry);
 static inline int ovl_verify_origin(struct ovl_fs *ofs, struct dentry *upper,
 				    struct dentry *origin, bool set)
 {
-	return ovl_verify_set_fh(ofs, upper, OVL_XATTR_ORIGIN, origin,
-				 false, set);
+	return ovl_verify_set_fh(ofs, upper, OVL_XATTR_ORIGIN, origin, false, set);
 }
 
 static inline int ovl_verify_upper(struct ovl_fs *ofs, struct dentry *index,
@@ -418,7 +454,8 @@ static inline int ovl_verify_upper(struct ovl_fs *ofs, struct dentry *index,
 extern const struct file_operations ovl_dir_operations;
 struct file *ovl_dir_real_file(const struct file *file, bool want_upper);
 int ovl_check_empty_dir(struct dentry *dentry, struct list_head *list);
-void ovl_cleanup_whiteouts(struct dentry *upper, struct list_head *list);
+void ovl_cleanup_whiteouts(struct user_namespace *user_ns, struct dentry *upper,
+			   struct list_head *list);
 void ovl_cache_free(struct list_head *list);
 void ovl_dir_cache_free(struct inode *inode);
 int ovl_check_d_type_supported(struct path *realpath);
@@ -463,10 +500,11 @@ bool ovl_lookup_trap_inode(struct super_block *sb, struct dentry *dir);
 struct inode *ovl_get_trap_inode(struct super_block *sb, struct dentry *dir);
 struct inode *ovl_get_inode(struct super_block *sb,
 			    struct ovl_inode_params *oip);
-static inline void ovl_copyattr(struct inode *from, struct inode *to)
+static inline void ovl_copyattr(struct user_namespace *user_ns,
+				struct inode *from, struct inode *to)
 {
-	to->i_uid = from->i_uid;
-	to->i_gid = from->i_gid;
+	to->i_uid = i_uid_into_mnt(user_ns, from);
+	to->i_gid = i_gid_into_mnt(user_ns, from);
 	to->i_mode = from->i_mode;
 	to->i_atime = from->i_atime;
 	to->i_mtime = from->i_mtime;
@@ -474,6 +512,17 @@ static inline void ovl_copyattr(struct inode *from, struct inode *to)
 	i_size_write(to, i_size_read(from));
 }
 
+static inline struct user_namespace *ovl_inode_real_user_ns(struct inode *inode)
+{
+	struct inode *realinode;
+
+	realinode = ovl_inode_upper(inode);
+	if (realinode)
+		return ovl_upper_mnt_user_ns(OVL_FS(inode->i_sb));
+
+	return OVL_I(inode)->lower_user_ns;
+}
+
 static inline void ovl_copyflags(struct inode *from, struct inode *to)
 {
 	unsigned int mask = S_SYNC | S_IMMUTABLE | S_APPEND | S_NOATIME;
@@ -494,11 +543,15 @@ struct ovl_cattr {
 
 #define OVL_CATTR(m) (&(struct ovl_cattr) { .mode = (m) })
 
-struct dentry *ovl_create_real(struct inode *dir, struct dentry *newdentry,
+struct dentry *ovl_create_real(struct user_namespace *user_ns,
+			       struct inode *dir, struct dentry *newdentry,
 			       struct ovl_cattr *attr);
-int ovl_cleanup(struct inode *dir, struct dentry *dentry);
-struct dentry *ovl_lookup_temp(struct dentry *workdir);
-struct dentry *ovl_create_temp(struct dentry *workdir, struct ovl_cattr *attr);
+int ovl_cleanup(struct user_namespace *user_ns, struct inode *dir,
+		struct dentry *dentry);
+struct dentry *ovl_lookup_temp(struct user_namespace *user_ns,
+			       struct dentry *workdir);
+struct dentry *ovl_create_temp(struct user_namespace *user_ns,
+			       struct dentry *workdir, struct ovl_cattr *attr);
 
 /* file.c */
 extern const struct file_operations ovl_file_operations;
@@ -511,9 +564,11 @@ long ovl_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
 int ovl_copy_up(struct dentry *dentry);
 int ovl_copy_up_with_data(struct dentry *dentry);
 int ovl_maybe_copy_up(struct dentry *dentry, int flags);
-int ovl_copy_xattr(struct super_block *sb, struct dentry *old,
+int ovl_copy_xattr(struct super_block *sb, struct user_namespace *old_user_ns,
+		   struct dentry *old, struct user_namespace *new_user_ns,
 		   struct dentry *new);
-int ovl_set_attr(struct dentry *upper, struct kstat *stat);
+int ovl_set_attr(struct user_namespace *user_ns, struct dentry *upper,
+		 struct kstat *stat);
 struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool is_upper);
 int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
 		   struct dentry *upper);
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 1b5a2094df8e..c505605f27ee 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -126,6 +126,7 @@ struct ovl_inode {
 	struct inode vfs_inode;
 	struct dentry *__upperdentry;
 	struct inode *lower;
+	struct user_namespace *lower_user_ns;
 
 	/* synchronize copy up and more */
 	struct mutex lock;
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 01620ebae1bd..6927e4e93d44 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -264,7 +264,8 @@ static int ovl_fill_merge(struct dir_context *ctx, const char *name,
 		return ovl_fill_lowest(rdd, name, namelen, offset, ino, d_type);
 }
 
-static int ovl_check_whiteouts(struct dentry *dir, struct ovl_readdir_data *rdd)
+static int ovl_check_whiteouts(struct user_namespace *user_ns,
+			       struct dentry *dir, struct ovl_readdir_data *rdd)
 {
 	int err;
 	struct ovl_cache_entry *p;
@@ -278,7 +279,7 @@ static int ovl_check_whiteouts(struct dentry *dir, struct ovl_readdir_data *rdd)
 		while (rdd->first_maybe_whiteout) {
 			p = rdd->first_maybe_whiteout;
 			rdd->first_maybe_whiteout = p->next_maybe_whiteout;
-			dentry = lookup_one_len(p->name, dir, p->len);
+			dentry = lookup_one_len_mapped(p->name, dir, p->len, user_ns);
 			if (!IS_ERR(dentry)) {
 				p->is_whiteout = ovl_is_whiteout(dentry);
 				dput(dentry);
@@ -312,7 +313,8 @@ static inline int ovl_dir_read(struct path *realpath,
 	} while (!err && rdd->count);
 
 	if (!err && rdd->first_maybe_whiteout && rdd->dentry)
-		err = ovl_check_whiteouts(realpath->dentry, rdd);
+		err = ovl_check_whiteouts(mnt_user_ns(realpath->mnt),
+					  realpath->dentry, rdd);
 
 	fput(realfile);
 
@@ -491,7 +493,7 @@ static int ovl_cache_update_ino(struct path *path, struct ovl_cache_entry *p)
 			goto get;
 		}
 	}
-	this = lookup_one_len(p->name, dir, p->len);
+	this = lookup_one_len_mapped(p->name, dir, p->len, mnt_user_ns(path->mnt));
 	if (IS_ERR_OR_NULL(this) || !this->d_inode) {
 		if (IS_ERR(this)) {
 			err = PTR_ERR(this);
@@ -1020,7 +1022,8 @@ int ovl_check_empty_dir(struct dentry *dentry, struct list_head *list)
 	return err;
 }
 
-void ovl_cleanup_whiteouts(struct dentry *upper, struct list_head *list)
+void ovl_cleanup_whiteouts(struct user_namespace *user_ns, struct dentry *upper,
+			   struct list_head *list)
 {
 	struct ovl_cache_entry *p;
 
@@ -1031,7 +1034,7 @@ void ovl_cleanup_whiteouts(struct dentry *upper, struct list_head *list)
 		if (WARN_ON(!p->is_whiteout || !p->is_upper))
 			continue;
 
-		dentry = lookup_one_len(p->name, upper, p->len);
+		dentry = lookup_one_len_mapped(p->name, upper, p->len, user_ns);
 		if (IS_ERR(dentry)) {
 			pr_err("lookup '%s/%.*s' failed (%i)\n",
 			       upper->d_name.name, p->len, p->name,
@@ -1039,7 +1042,7 @@ void ovl_cleanup_whiteouts(struct dentry *upper, struct list_head *list)
 			continue;
 		}
 		if (dentry->d_inode)
-			ovl_cleanup(upper->d_inode, dentry);
+			ovl_cleanup(user_ns, upper->d_inode, dentry);
 		dput(dentry);
 	}
 	inode_unlock(upper->d_inode);
@@ -1130,7 +1133,8 @@ static int ovl_workdir_cleanup_recurse(struct path *path, int level)
 			err = -EINVAL;
 			break;
 		}
-		dentry = lookup_one_len(p->name, path->dentry, p->len);
+		dentry = lookup_one_len_mapped(p->name, path->dentry, p->len,
+					   mnt_user_ns(path->mnt));
 		if (IS_ERR(dentry))
 			continue;
 		if (dentry->d_inode)
@@ -1149,12 +1153,13 @@ int ovl_workdir_cleanup(struct inode *dir, struct vfsmount *mnt,
 			 struct dentry *dentry, int level)
 {
 	int err;
+	struct user_namespace *user_ns = mnt_user_ns(mnt);
 
 	if (!d_is_dir(dentry) || level > 1) {
-		return ovl_cleanup(dir, dentry);
+		return ovl_cleanup(user_ns, dir, dentry);
 	}
 
-	err = ovl_do_rmdir(dir, dentry);
+	err = ovl_do_rmdir(user_ns, dir, dentry);
 	if (err) {
 		struct path path = { .mnt = mnt, .dentry = dentry };
 
@@ -1162,7 +1167,7 @@ int ovl_workdir_cleanup(struct inode *dir, struct vfsmount *mnt,
 		err = ovl_workdir_cleanup_recurse(&path, level + 1);
 		inode_lock_nested(dir, I_MUTEX_PARENT);
 		if (!err)
-			err = ovl_cleanup(dir, dentry);
+			err = ovl_cleanup(user_ns, dir, dentry);
 	}
 
 	return err;
@@ -1175,6 +1180,7 @@ int ovl_indexdir_cleanup(struct ovl_fs *ofs)
 	struct dentry *index = NULL;
 	struct inode *dir = indexdir->d_inode;
 	struct path path = { .mnt = ovl_upper_mnt(ofs), .dentry = indexdir };
+	struct user_namespace *user_ns = mnt_user_ns(path.mnt);
 	LIST_HEAD(list);
 	struct rb_root root = RB_ROOT;
 	struct ovl_cache_entry *p;
@@ -1198,7 +1204,7 @@ int ovl_indexdir_cleanup(struct ovl_fs *ofs)
 			if (p->len == 2 && p->name[1] == '.')
 				continue;
 		}
-		index = lookup_one_len(p->name, indexdir, p->len);
+		index = lookup_one_len_mapped(p->name, indexdir, p->len, user_ns);
 		if (IS_ERR(index)) {
 			err = PTR_ERR(index);
 			index = NULL;
@@ -1216,7 +1222,7 @@ int ovl_indexdir_cleanup(struct ovl_fs *ofs)
 			goto next;
 		} else if (err == -ESTALE) {
 			/* Cleanup stale index entries */
-			err = ovl_cleanup(dir, index);
+			err = ovl_cleanup(user_ns, dir, index);
 		} else if (err != -ENOENT) {
 			/*
 			 * Abort mount to avoid corrupting the index if
@@ -1232,7 +1238,7 @@ int ovl_indexdir_cleanup(struct ovl_fs *ofs)
 			err = ovl_cleanup_and_whiteout(ofs, dir, index);
 		} else {
 			/* Cleanup orphan index entries */
-			err = ovl_cleanup(dir, index);
+			err = ovl_cleanup(user_ns, dir, index);
 		}
 
 		if (err)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..755f651587a5 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -203,6 +203,7 @@ static void ovl_destroy_inode(struct inode *inode)
 
 	dput(oi->__upperdentry);
 	iput(oi->lower);
+	put_user_ns(oi->lower_user_ns);
 	if (S_ISDIR(inode->i_mode))
 		ovl_dir_cache_free(inode);
 	else
@@ -699,13 +700,14 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
 {
 	struct inode *dir =  ofs->workbasedir->d_inode;
 	struct vfsmount *mnt = ovl_upper_mnt(ofs);
+	struct user_namespace *user_ns = mnt_user_ns(mnt);
 	struct dentry *work;
 	int err;
 	bool retried = false;
 
 	inode_lock_nested(dir, I_MUTEX_PARENT);
 retry:
-	work = lookup_one_len(name, ofs->workbasedir, strlen(name));
+	work = lookup_one_len_mapped(name, ofs->workbasedir, strlen(name), user_ns);
 
 	if (!IS_ERR(work)) {
 		struct iattr attr = {
@@ -731,7 +733,7 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
 			goto retry;
 		}
 
-		work = ovl_create_real(dir, work, OVL_CATTR(attr.ia_mode));
+		work = ovl_create_real(user_ns, dir, work, OVL_CATTR(attr.ia_mode));
 		err = PTR_ERR(work);
 		if (IS_ERR(work))
 			goto out_err;
@@ -749,17 +751,17 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
 		 * allowed as upper are limited to "normal" ones, where checking
 		 * for the above two errors is sufficient.
 		 */
-		err = vfs_removexattr(work, XATTR_NAME_POSIX_ACL_DEFAULT);
+		err = vfs_mapped_removexattr(user_ns, work, XATTR_NAME_POSIX_ACL_DEFAULT);
 		if (err && err != -ENODATA && err != -EOPNOTSUPP)
 			goto out_dput;
 
-		err = vfs_removexattr(work, XATTR_NAME_POSIX_ACL_ACCESS);
+		err = vfs_mapped_removexattr(user_ns, work, XATTR_NAME_POSIX_ACL_ACCESS);
 		if (err && err != -ENODATA && err != -EOPNOTSUPP)
 			goto out_dput;
 
 		/* Clear any inherited mode bits */
 		inode_lock(work->d_inode);
-		err = notify_change(work, &attr, NULL);
+		err = notify_mapped_change(user_ns, work, &attr, NULL);
 		inode_unlock(work->d_inode);
 		if (err)
 			goto out_dput;
@@ -934,10 +936,11 @@ ovl_posix_acl_xattr_get(const struct xattr_handler *handler,
 }
 
 static int __maybe_unused
-ovl_posix_acl_xattr_set(const struct xattr_handler *handler,
-			struct dentry *dentry, struct inode *inode,
-			const char *name, const void *value,
-			size_t size, int flags)
+ovl_posix_acl_xattr_set_mapped(const struct xattr_handler *handler,
+			       struct user_namespace *user_ns,
+			       struct dentry *dentry, struct inode *inode,
+			       const char *name, const void *value,
+			       size_t size, int flags)
 {
 	struct dentry *workdir = ovl_workdir(dentry);
 	struct inode *realinode = ovl_inode_real(inode);
@@ -960,7 +963,7 @@ ovl_posix_acl_xattr_set(const struct xattr_handler *handler,
 		goto out_acl_release;
 	}
 	err = -EPERM;
-	if (!inode_owner_or_capable(inode))
+	if (!mapped_inode_owner_or_capable(user_ns, inode))
 		goto out_acl_release;
 
 	posix_acl_release(acl);
@@ -971,8 +974,8 @@ ovl_posix_acl_xattr_set(const struct xattr_handler *handler,
 	 */
 	if (unlikely(inode->i_mode & S_ISGID) &&
 	    handler->flags == ACL_TYPE_ACCESS &&
-	    !in_group_p(inode->i_gid) &&
-	    !capable_wrt_inode_uidgid(inode, CAP_FSETID)) {
+	    !in_group_p(i_gid_into_mnt(user_ns, inode)) &&
+	    !capable_wrt_mapped_inode_uidgid(user_ns, inode, CAP_FSETID)) {
 		struct iattr iattr = { .ia_valid = ATTR_KILL_SGID };
 
 		err = ovl_setattr(dentry, &iattr);
@@ -982,7 +985,7 @@ ovl_posix_acl_xattr_set(const struct xattr_handler *handler,
 
 	err = ovl_xattr_set(dentry, inode, handler->name, value, size, flags);
 	if (!err)
-		ovl_copyattr(ovl_inode_real(inode), inode);
+		ovl_copyattr(ovl_inode_real_user_ns(inode), realinode, inode);
 
 	return err;
 
@@ -991,6 +994,16 @@ ovl_posix_acl_xattr_set(const struct xattr_handler *handler,
 	return err;
 }
 
+static int __maybe_unused
+ovl_posix_acl_xattr_set(const struct xattr_handler *handler,
+			struct dentry *dentry, struct inode *inode,
+			const char *name, const void *value,
+			size_t size, int flags)
+{
+	return ovl_posix_acl_xattr_set_mapped(handler, &init_user_ns, dentry,
+					      inode, name, value, size, flags);
+}
+
 static int ovl_own_xattr_get(const struct xattr_handler *handler,
 			     struct dentry *dentry, struct inode *inode,
 			     const char *name, void *buffer, size_t size)
@@ -998,6 +1011,17 @@ static int ovl_own_xattr_get(const struct xattr_handler *handler,
 	return -EOPNOTSUPP;
 }
 
+#ifdef CONFIG_IDMAP_MOUNTS
+static int ovl_own_xattr_set_mapped(const struct xattr_handler *handler,
+				    struct user_namespace *user_ns,
+				    struct dentry *dentry, struct inode *inode,
+				    const char *name, const void *value,
+				    size_t size, int flags)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
 static int ovl_own_xattr_set(const struct xattr_handler *handler,
 			     struct dentry *dentry, struct inode *inode,
 			     const char *name, const void *value,
@@ -1013,6 +1037,17 @@ static int ovl_other_xattr_get(const struct xattr_handler *handler,
 	return ovl_xattr_get(dentry, inode, name, buffer, size);
 }
 
+#ifdef CONFIG_IDMAP_MOUNTS
+static int ovl_other_xattr_set_mapped(const struct xattr_handler *handler,
+				      struct user_namespace *user_ns,
+				      struct dentry *dentry,
+				      struct inode *inode, const char *name,
+				      const void *value, size_t size, int flags)
+{
+	return ovl_xattr_set(dentry, inode, name, value, size, flags);
+}
+#endif
+
 static int ovl_other_xattr_set(const struct xattr_handler *handler,
 			       struct dentry *dentry, struct inode *inode,
 			       const char *name, const void *value,
@@ -1027,6 +1062,9 @@ ovl_posix_acl_access_xattr_handler = {
 	.flags = ACL_TYPE_ACCESS,
 	.get = ovl_posix_acl_xattr_get,
 	.set = ovl_posix_acl_xattr_set,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.set_mapped = ovl_posix_acl_xattr_set_mapped,
+#endif
 };
 
 static const struct xattr_handler __maybe_unused
@@ -1035,18 +1073,27 @@ ovl_posix_acl_default_xattr_handler = {
 	.flags = ACL_TYPE_DEFAULT,
 	.get = ovl_posix_acl_xattr_get,
 	.set = ovl_posix_acl_xattr_set,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.set_mapped = ovl_posix_acl_xattr_set_mapped,
+#endif
 };
 
 static const struct xattr_handler ovl_own_xattr_handler = {
 	.prefix	= OVL_XATTR_PREFIX,
 	.get = ovl_own_xattr_get,
 	.set = ovl_own_xattr_set,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.set_mapped = ovl_own_xattr_set_mapped,
+#endif
 };
 
 static const struct xattr_handler ovl_other_xattr_handler = {
 	.prefix	= "", /* catch all */
 	.get = ovl_other_xattr_get,
 	.set = ovl_other_xattr_set,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.set_mapped = ovl_other_xattr_set_mapped,
+#endif
 };
 
 static const struct xattr_handler *ovl_xattr_handlers[] = {
@@ -1164,7 +1211,8 @@ static int ovl_get_upper(struct super_block *sb, struct ovl_fs *ofs,
  * Returns 1 if RENAME_WHITEOUT is supported, 0 if not supported and
  * negative values if error is encountered.
  */
-static int ovl_check_rename_whiteout(struct dentry *workdir)
+static int ovl_check_rename_whiteout(struct user_namespace *user_ns,
+				     struct dentry *workdir)
 {
 	struct inode *dir = d_inode(workdir);
 	struct dentry *temp;
@@ -1175,12 +1223,12 @@ static int ovl_check_rename_whiteout(struct dentry *workdir)
 
 	inode_lock_nested(dir, I_MUTEX_PARENT);
 
-	temp = ovl_create_temp(workdir, OVL_CATTR(S_IFREG | 0));
+	temp = ovl_create_temp(user_ns, workdir, OVL_CATTR(S_IFREG | 0));
 	err = PTR_ERR(temp);
 	if (IS_ERR(temp))
 		goto out_unlock;
 
-	dest = ovl_lookup_temp(workdir);
+	dest = ovl_lookup_temp(user_ns, workdir);
 	err = PTR_ERR(dest);
 	if (IS_ERR(dest)) {
 		dput(temp);
@@ -1189,14 +1237,14 @@ static int ovl_check_rename_whiteout(struct dentry *workdir)
 
 	/* Name is inline and stable - using snapshot as a copy helper */
 	take_dentry_name_snapshot(&name, temp);
-	err = ovl_do_rename(dir, temp, dir, dest, RENAME_WHITEOUT);
+	err = ovl_do_rename(dir, user_ns, temp, dir, user_ns, dest, RENAME_WHITEOUT);
 	if (err) {
 		if (err == -EINVAL)
 			err = 0;
 		goto cleanup_temp;
 	}
 
-	whiteout = lookup_one_len(name.name.name, workdir, name.name.len);
+	whiteout = lookup_one_len_mapped(name.name.name, workdir, name.name.len, user_ns);
 	err = PTR_ERR(whiteout);
 	if (IS_ERR(whiteout))
 		goto cleanup_temp;
@@ -1205,11 +1253,11 @@ static int ovl_check_rename_whiteout(struct dentry *workdir)
 
 	/* Best effort cleanup of whiteout and temp file */
 	if (err)
-		ovl_cleanup(dir, whiteout);
+		ovl_cleanup(user_ns, dir, whiteout);
 	dput(whiteout);
 
 cleanup_temp:
-	ovl_cleanup(dir, temp);
+	ovl_cleanup(user_ns, dir, temp);
 	release_dentry_name_snapshot(&name);
 	dput(temp);
 	dput(dest);
@@ -1220,16 +1268,17 @@ static int ovl_check_rename_whiteout(struct dentry *workdir)
 	return err;
 }
 
-static struct dentry *ovl_lookup_or_create(struct dentry *parent,
+static struct dentry *ovl_lookup_or_create(struct user_namespace *user_ns,
+					   struct dentry *parent,
 					   const char *name, umode_t mode)
 {
 	size_t len = strlen(name);
 	struct dentry *child;
 
 	inode_lock_nested(parent->d_inode, I_MUTEX_PARENT);
-	child = lookup_one_len(name, parent, len);
+	child = lookup_one_len_mapped(name, parent, len, user_ns);
 	if (!IS_ERR(child) && !child->d_inode)
-		child = ovl_create_real(parent->d_inode, child,
+		child = ovl_create_real(user_ns, parent->d_inode, child,
 					OVL_CATTR(mode));
 	inode_unlock(parent->d_inode);
 	dput(parent);
@@ -1251,7 +1300,8 @@ static int ovl_create_volatile_dirty(struct ovl_fs *ofs)
 	const char *const *name = volatile_path;
 
 	for (ctr = ARRAY_SIZE(volatile_path); ctr; ctr--, name++) {
-		d = ovl_lookup_or_create(d, *name, ctr > 1 ? S_IFDIR : S_IFREG);
+		d = ovl_lookup_or_create(ovl_upper_mnt_user_ns(ofs), d, *name,
+					 ctr > 1 ? S_IFDIR : S_IFREG);
 		if (IS_ERR(d))
 			return PTR_ERR(d);
 	}
@@ -1264,6 +1314,7 @@ static int ovl_make_workdir(struct super_block *sb, struct ovl_fs *ofs,
 {
 	struct vfsmount *mnt = ovl_upper_mnt(ofs);
 	struct dentry *temp, *workdir;
+	struct user_namespace *user_ns = mnt_user_ns(mnt);
 	bool rename_whiteout;
 	bool d_type;
 	int fh_type;
@@ -1299,7 +1350,7 @@ static int ovl_make_workdir(struct super_block *sb, struct ovl_fs *ofs,
 		pr_warn("upper fs needs to support d_type.\n");
 
 	/* Check if upper/work fs supports O_TMPFILE */
-	temp = ovl_do_tmpfile(ofs->workdir, S_IFREG | 0);
+	temp = ovl_do_tmpfile(user_ns, ofs->workdir, S_IFREG | 0);
 	ofs->tmpfile = !IS_ERR(temp);
 	if (ofs->tmpfile)
 		dput(temp);
@@ -1308,7 +1359,7 @@ static int ovl_make_workdir(struct super_block *sb, struct ovl_fs *ofs,
 
 
 	/* Check if upper/work fs supports RENAME_WHITEOUT */
-	err = ovl_check_rename_whiteout(ofs->workdir);
+	err = ovl_check_rename_whiteout(user_ns, ofs->workdir);
 	if (err < 0)
 		goto out;
 
@@ -1423,6 +1474,7 @@ static int ovl_get_indexdir(struct super_block *sb, struct ovl_fs *ofs,
 {
 	struct vfsmount *mnt = ovl_upper_mnt(ofs);
 	struct dentry *indexdir;
+	struct user_namespace *user_ns = mnt_user_ns(mnt);
 	int err;
 
 	err = mnt_want_write(mnt);
@@ -1462,7 +1514,7 @@ static int ovl_get_indexdir(struct super_block *sb, struct ovl_fs *ofs,
 		 * "trusted.overlay.upper" to indicate that index may have
 		 * directory entries.
 		 */
-		if (ovl_check_origin_xattr(ofs, ofs->indexdir)) {
+		if (ovl_check_origin_xattr(ofs, user_ns, ofs->indexdir)) {
 			err = ovl_verify_set_fh(ofs, ofs->indexdir,
 						OVL_XATTR_ORIGIN,
 						upperpath->dentry, true, false);
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 23f475627d07..efd4dd067258 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -437,7 +437,8 @@ static void ovl_dentry_version_inc(struct dentry *dentry, bool impurity)
 void ovl_dir_modified(struct dentry *dentry, bool impurity)
 {
 	/* Copy mtime/ctime */
-	ovl_copyattr(d_inode(ovl_dentry_upper(dentry)), d_inode(dentry));
+	ovl_copyattr(ovl_upper_mnt_user_ns(OVL_FS(dentry->d_sb)),
+		     d_inode(ovl_dentry_upper(dentry)), d_inode(dentry));
 
 	ovl_dentry_version_inc(dentry, impurity);
 }
@@ -460,6 +461,7 @@ bool ovl_is_whiteout(struct dentry *dentry)
 struct file *ovl_path_open(struct path *path, int flags)
 {
 	struct inode *inode = d_inode(path->dentry);
+	struct user_namespace *user_ns = mnt_user_ns(path->mnt);
 	int err, acc_mode;
 
 	if (flags & ~(O_ACCMODE | O_LARGEFILE))
@@ -476,12 +478,12 @@ struct file *ovl_path_open(struct path *path, int flags)
 		BUG();
 	}
 
-	err = inode_permission(inode, acc_mode | MAY_OPEN);
+	err = mapped_inode_permission(user_ns, inode, acc_mode | MAY_OPEN);
 	if (err)
 		return ERR_PTR(err);
 
 	/* O_NOATIME is an optimization, don't fail if not permitted */
-	if (inode_owner_or_capable(inode))
+	if (mapped_inode_owner_or_capable(user_ns, inode))
 		flags |= O_NOATIME;
 
 	return dentry_open(path, flags, current_cred());
@@ -544,11 +546,12 @@ void ovl_copy_up_end(struct dentry *dentry)
 	ovl_inode_unlock(d_inode(dentry));
 }
 
-bool ovl_check_origin_xattr(struct ovl_fs *ofs, struct dentry *dentry)
+bool ovl_check_origin_xattr(struct ovl_fs *ofs, struct user_namespace *user_ns,
+			    struct dentry *dentry)
 {
 	int res;
 
-	res = ovl_do_getxattr(ofs, dentry, OVL_XATTR_ORIGIN, NULL, 0);
+	res = ovl_do_getxattr(ofs, user_ns, dentry, OVL_XATTR_ORIGIN, NULL, 0);
 
 	/* Zero size value means "copied up but origin unknown" */
 	if (res >= 0)
@@ -557,8 +560,8 @@ bool ovl_check_origin_xattr(struct ovl_fs *ofs, struct dentry *dentry)
 	return false;
 }
 
-bool ovl_check_dir_xattr(struct super_block *sb, struct dentry *dentry,
-			 enum ovl_xattr ox)
+bool ovl_check_dir_xattr(struct super_block *sb, struct user_namespace *user_ns,
+			 struct dentry *dentry, enum ovl_xattr ox)
 {
 	int res;
 	char val;
@@ -566,7 +569,7 @@ bool ovl_check_dir_xattr(struct super_block *sb, struct dentry *dentry,
 	if (!d_is_dir(dentry))
 		return false;
 
-	res = ovl_do_getxattr(OVL_FS(sb), dentry, ox, &val, 1);
+	res = ovl_do_getxattr(OVL_FS(sb), user_ns, dentry, ox, &val, 1);
 	if (res == 1 && val == 'y')
 		return true;
 
@@ -723,6 +726,7 @@ static void ovl_cleanup_index(struct dentry *dentry)
 	struct dentry *index = NULL;
 	struct inode *inode;
 	struct qstr name = { };
+	struct user_namespace *user_ns = ovl_dentry_mnt_user_ns(upperdentry);
 	int err;
 
 	err = ovl_get_index_name(lowerdentry, &name);
@@ -748,7 +752,7 @@ static void ovl_cleanup_index(struct dentry *dentry)
 	}
 
 	inode_lock_nested(dir, I_MUTEX_PARENT);
-	index = lookup_one_len(name.name, indexdir, name.len);
+	index = lookup_one_len_mapped(name.name, indexdir, name.len, user_ns);
 	err = PTR_ERR(index);
 	if (IS_ERR(index)) {
 		index = NULL;
@@ -758,7 +762,7 @@ static void ovl_cleanup_index(struct dentry *dentry)
 					       dir, index);
 	} else {
 		/* Cleanup orphan index entries */
-		err = ovl_cleanup(dir, index);
+		err = ovl_cleanup(user_ns, dir, index);
 	}
 
 	inode_unlock(dir);
@@ -867,7 +871,8 @@ int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir)
 }
 
 /* err < 0, 0 if no metacopy xattr, 1 if metacopy xattr found */
-int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct dentry *dentry)
+int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct user_namespace *user_ns,
+			     struct dentry *dentry)
 {
 	int res;
 
@@ -875,7 +880,7 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct dentry *dentry)
 	if (!S_ISREG(d_inode(dentry)->i_mode))
 		return 0;
 
-	res = ovl_do_getxattr(ofs, dentry, OVL_XATTR_METACOPY, NULL, 0);
+	res = ovl_do_getxattr(ofs, user_ns, dentry, OVL_XATTR_METACOPY, NULL, 0);
 	if (res < 0) {
 		if (res == -ENODATA || res == -EOPNOTSUPP)
 			return 0;
@@ -904,13 +909,14 @@ bool ovl_is_metacopy_dentry(struct dentry *dentry)
 	return (oe->numlower > 1);
 }
 
-char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
-			     int padding)
+char *ovl_get_redirect_xattr(struct ovl_fs *ofs,
+			     struct user_namespace *mnt_user_ns,
+			     struct dentry *dentry, int padding)
 {
 	int res;
 	char *s, *next, *buf = NULL;
 
-	res = ovl_do_getxattr(ofs, dentry, OVL_XATTR_REDIRECT, NULL, 0);
+	res = ovl_do_getxattr(ofs, mnt_user_ns, dentry, OVL_XATTR_REDIRECT, NULL, 0);
 	if (res == -ENODATA || res == -EOPNOTSUPP)
 		return NULL;
 	if (res < 0)
@@ -922,7 +928,7 @@ char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
 	if (!buf)
 		return ERR_PTR(-ENOMEM);
 
-	res = ovl_do_getxattr(ofs, dentry, OVL_XATTR_REDIRECT, buf, res);
+	res = ovl_do_getxattr(ofs, mnt_user_ns, dentry, OVL_XATTR_REDIRECT, buf, res);
 	if (res < 0)
 		goto fail;
 	if (res == 0)
-- 
2.29.0



* [PATCH 33/34] overlayfs: handle idmapped merged mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (31 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 32/34] overlayfs: handle idmapped lower directories Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-30  9:57   ` Amir Goldstein
  2020-10-29  0:32 ` [PATCH 34/34] fat: handle idmapped mounts Christian Brauner
                   ` (4 subsequent siblings)
  37 siblings, 1 reply; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

Now that overlayfs handles idmapped lower and upper mounts, enable it to support
idmapped merged directories, i.e. support idmapping overlayfs mounts themselves.

In contrast to supporting idmapped lower and upper mounts, this requires very
few changes. Permission checking will continue to work as before, i.e. we
will verify that we are privileged over the overlay inode itself and that we
are privileged over the lower and upper inodes.
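
The two-step check described above can be modelled in userspace roughly as
follows. This is a toy sketch, not the kernel code: it only looks at owner and
"other" mode bits, ignores groups, capabilities and ACLs, and every name in it
(toy_idmap, toy_inode, may_access, overlay_may_access) is hypothetical.

```c
#include <stdbool.h>

#define OVERFLOW_ID 65534u	/* id that unmapped owners are reported as */

/* Hypothetical single-extent idmap: fs ids [fs_first, fs_first + count)
 * are seen as [mnt_first, mnt_first + count) through the mount. */
struct toy_idmap {
	unsigned int fs_first;
	unsigned int mnt_first;
	unsigned int count;
};

/* Hypothetical inode: owner uid plus classic octal permission bits. */
struct toy_inode {
	unsigned int uid;
	unsigned int mode;
};

unsigned int id_into_mnt(const struct toy_idmap *m, unsigned int fs_id)
{
	if (fs_id >= m->fs_first && fs_id - m->fs_first < m->count)
		return m->mnt_first + (fs_id - m->fs_first);
	return OVERFLOW_ID;
}

/* Owner bits apply if the caller owns the inode as seen through the mount,
 * "other" bits apply otherwise. mask uses the usual 4/2/1 (r/w/x) values. */
bool may_access(const struct toy_idmap *m, const struct toy_inode *inode,
		unsigned int caller_uid, unsigned int mask)
{
	unsigned int owner = id_into_mnt(m, inode->uid);
	unsigned int bits = (caller_uid == owner) ? (inode->mode >> 6) & 7
						  : inode->mode & 7;

	return (bits & mask) == mask;
}

/* Step 1: overlay inode, caller's id, idmap of the merged mount.
 * Step 2: real (lower or upper) inode, mounter's id, idmap of that layer. */
bool overlay_may_access(const struct toy_idmap *merged_map,
			const struct toy_inode *ovl_inode,
			unsigned int caller_uid,
			const struct toy_idmap *real_map,
			const struct toy_inode *real_inode,
			unsigned int mounter_uid, unsigned int mask)
{
	if (!may_access(merged_map, ovl_inode, caller_uid, mask))
		return false;
	return may_access(real_map, real_inode, mounter_uid, mask);
}
```

Both checks have to pass: a caller who owns the overlay inode through the
merged mount is still refused if the mounter would not be allowed to perform
the same access on the underlying inode.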

A very special thank you to my friend Seth Forshee who has given invaluable
advice around the permission checking!

 # This is a directory where all file ownership starts with uid and gid 10000.
 root@f2-vm:/# ls -al /var/lib/lxc/f1/rootfs
 total 108
 drwxr-xr-x  20 10000 10000  4096 Oct 28 11:13 .
 drwxrwx---   4 10000 10000  4096 Oct 28 11:17 ..
 -rw-r--r--   1 10000 10000  7197 Oct 24 09:45 asdf
 drwxr-xr-x   2 10000 10000  4096 Oct 16 19:07 ASDF
 lrwxrwxrwx   1 10000 10000     7 Sep 24 07:43 bin -> usr/bin
 drwxr-xr-x   2 10000 10000  4096 Apr 15  2020 boot
 -rw-r--r--   1 10000 10000 13059 Oct  8 12:38 ccc
 drwxr-xr-x   2 11000 11000  4096 Oct 23 17:10 ddd
 drwxr-xr-x   3 10000 10000  4096 Sep 25 08:04 dev
 drwxr-xr-x  61 10000 10000  4096 Sep 25 08:04 etc

 # Create an idmapped mount on the host such that all files owned by uid and
 # gid 10000 show up as being owned by uid 0 and gid 0.
 /mount2 --idmap both:10000:0:10000 /var/lib/lxc/f1/rootfs/ /lower1/

 # Verify that the files show up as uid and gid 0 under the idmapped mount at /lower1
 root@f2-vm:/# ls -al /lower1/
 total 108
 drwxr-xr-x  20 root   root    4096 Oct 28 11:13 .
 drwxr-xr-x  29 root   root    4096 Oct 28 11:57 ..
 -rw-r--r--   1 root   root    7197 Oct 24 09:45 asdf
 drwxr-xr-x   2 root   root    4096 Oct 16 19:07 ASDF
 lrwxrwxrwx   1 root   root       7 Sep 24 07:43 bin -> usr/bin
 drwxr-xr-x   2 root   root    4096 Apr 15  2020 boot
 -rw-r--r--   1 root   root   13059 Oct  8 12:38 ccc
 drwxr-xr-x   2 ubuntu ubuntu  4096 Oct 23 17:10 ddd
 drwxr-xr-x   3 root   root    4096 Sep 25 08:04 dev
 drwxr-xr-x  61 root   root    4096 Sep 25 08:04 etc

 # Create an idmapped upper mount at /upper. Now, files created as id 0 will
 # show up as id 10000 in /upper and files created as id 1000 will show up as
 # id 11000 under /upper.
 /mount2 --idmap both:10000:0:10000 /upper /upper
 mkdir /upper/upper
 chown 10000:10000 /upper/upper
 mkdir /upper/work
 chown 10000:10000 /upper/work

 # Create an overlayfs mount.
 mount -t overlay overlay -o lowerdir=/lower1/,upperdir=/upper/upper/,workdir=/upper/work/ /merged/

 # Create a new user namespace with an id mapping of 0:20000:10000 and then
 # create an idmapped overlayfs merged mount with the same id mapping.
 ./mount2 -cb:0:20000:10000 -mb:0:20000:10000 /merged/ /merged/

 # Verify that we are in a user namespace with the expected id mapping
 root@f2-vm:/# cat /proc/self/uid_map
          0      20000      10000
 root@f2-vm:/# cat /proc/self/gid_map
          0      20000      10000

 # The merged mount should show up starting from uid and gid 0
 root@f2-vm:/# ls -al /merged/
 total 124
 drwxr-xr-x   1 root   root    4096 Oct 25 23:04 .
 drwxr-xr-x  29 root   root    4096 Oct 28 12:07 ..
 -rw-r--r--   1 root   root    7197 Oct 24 09:45 asdf
 drwxr-xr-x   2 root   root    4096 Oct 16 19:07 ASDF
 lrwxrwxrwx   1 root   root       7 Sep 24 07:43 bin -> usr/bin
 drwxr-xr-x   2 root   root    4096 Apr 15  2020 boot
 -rw-r--r--   1 root   root   13059 Oct  8 12:38 ccc
 drwxr-xr-x   2 ubuntu ubuntu  4096 Oct 23 17:10 ddd
 drwxr-xr-x   3 root   root    4096 Sep 25 08:04 dev
 drwxr-xr-x  61 root   root    4096 Sep 25 08:04 etc

 # Create a file as root
 root@f2-vm:/merged# touch /merged/A-FILE

 root@f2-vm:/merged# ls -al /merged/A-FILE
 -rw-r--r-- 1 root root 0 Oct 28 12:16 /merged/A-FILE

 # Chown the file to a simple user
 root@f2-vm:/merged# chown 1000:1000 /merged/A-FILE

 root@f2-vm:/merged# ls -al /merged/A-FILE
 -rw-r--r-- 1 ubuntu ubuntu 0 Oct 28 12:16 /merged/A-FILE

 # Create a directory and delegate to simple user
 root@f2-vm:/merged# mkdir /merged/A-DIR

 root@f2-vm:/merged# chown 1000:1000 /merged/A-DIR/

 # Let's look at these files in our idmapped upper directory from within the
 # user namespace. All files should appear to be owned by 65534:65534 since the
 # /upper mount is idmapped to 10000:0:10000 but we are in a user namespace
 # with an idmapping of 0:20000:10000.
 root@f2-vm:/merged# ls -al /upper/upper/
 total 12
 drwxr-xr-x 3 nobody nogroup 4096 Oct 28 13:26 .
 drwxr-xr-x 4 nobody nogroup 4096 Oct 21 13:48 ..
 drwxr-xr-x 2 nobody nogroup 4096 Oct 28 13:26 A-DIR
 -rw-r--r-- 1 nobody nogroup    0 Oct 28 13:25 A-FILE

 # Let's look at this /upper from the initial user namespace (where it is
 # mounted idmapped with 10000:0:10000)
 root@f2-vm:~# ls -al /upper/upper/
 total 12
 drwxr-xr-x 3 root   root   4096 Oct 28 13:26 .
 drwxr-xr-x 4 root   root   4096 Oct 21 13:48 ..
 drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 13:26 A-DIR
 -rw-r--r-- 1 ubuntu ubuntu    0 Oct 28 13:25 A-FILE

 # Let's remove the idmapped /upper mount (overlayfs will have its own private mount anyway)
 umount /upper

 # Let's look at these files in our upper directory with the idmapped mount removed
 root@f2-vm:~# ls -al /upper/upper/
 total 12
 drwxr-xr-x 3 10000 10000 4096 Oct 28 13:26 .
 drwxr-xr-x 4 10000 10000 4096 Oct 21 13:48 ..
 drwxr-xr-x 2 11000 11000 4096 Oct 28 13:26 A-DIR
 -rw-r--r-- 1 11000 11000    0 Oct 28 13:25 A-FILE

 # Let's set an acl from the /merged directory on an already existing file,
 # triggering a copy-up operation
 root@f2-vm:/merged# setfacl -m u:1000:rwx /merged/asdf
 root@f2-vm:/merged# getfacl /merged/asdf
 getfacl: Removing leading '/' from absolute path names
 # file: merged/asdf
 # owner: root
 # group: root
 user::rw-
 user:ubuntu:rwx
 group::r--
 mask::rwx
 other::r--

 # Let's look at this file from our upper directory from the initial user namespace
 root@f2-vm:/merged# getfacl /upper/upper/asdf
 getfacl: Removing leading '/' from absolute path names
 # file: upper/upper/asdf
 # owner: 10000
 # group: 10000
 user::rw-
 user:11000:rwx
 group::r--
 mask::rwx
 other::r--
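
The ownership values in the listings above follow from two translations
applied one after the other. The sketch below replays that arithmetic in plain
userspace C, assuming each mapping is a single contiguous extent. The struct
and function names are hypothetical and are not the kernel helpers used in the
patch.

```c
#include <stdio.h>

#define OVERFLOW_ID 65534u	/* unmapped ids are displayed as nobody/nogroup */

/* Hypothetical single-extent mapping: [first, first + count) -> [target, ...) */
struct id_extent {
	unsigned int first;
	unsigned int target;
	unsigned int count;
};

unsigned int map_or_overflow(const struct id_extent *e, unsigned int id)
{
	if (id >= e->first && id - e->first < e->count)
		return e->target + (id - e->first);
	return OVERFLOW_ID;
}

/* Step 1: on-disk id -> id seen through the idmapped /upper mount
 * (10000:0:10000 in the example). */
const struct id_extent upper_mnt = { 10000, 0, 10000 };

/* Step 2: kernel id -> id displayed inside the user namespace
 * (uid_map "0 20000 10000": kernel ids 20000..29999 appear as 0..9999). */
const struct id_extent userns_view = { 20000, 0, 10000 };

unsigned int seen_from_init_ns(unsigned int disk_id)
{
	return map_or_overflow(&upper_mnt, disk_id);
}

unsigned int seen_from_userns(unsigned int disk_id)
{
	return map_or_overflow(&userns_view, seen_from_init_ns(disk_id));
}

/* Print both views of one on-disk id. */
void show(const char *name, unsigned int disk_id)
{
	printf("%s: disk=%u init_ns=%u userns=%u\n", name, disk_id,
	       seen_from_init_ns(disk_id), seen_from_userns(disk_id));
}
```

For A-FILE, stored on disk as 11000, the /upper mount yields 1000 (ubuntu) in
the initial user namespace. Since 1000 is outside 20000..29999, the same file
is reported as 65534 from within the user namespace.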

Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/overlayfs/copy_up.c   |  4 +--
 fs/overlayfs/dir.c       | 68 ++++++++++++++++++++++++++++++++--------
 fs/overlayfs/inode.c     | 40 +++++++++++++++++------
 fs/overlayfs/overlayfs.h |  9 +++++-
 fs/overlayfs/super.c     |  3 ++
 5 files changed, 99 insertions(+), 25 deletions(-)

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 1b8721796fd4..91134203c511 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -268,8 +268,8 @@ int ovl_set_attr(struct user_namespace *user_ns, struct dentry *upperdentry,
 	if (!err) {
 		struct iattr attr = {
 			.ia_valid = ATTR_UID | ATTR_GID,
-			.ia_uid = stat->uid,
-			.ia_gid = stat->gid,
+			.ia_uid = kuid_from_mnt(user_ns, stat->uid),
+			.ia_gid = kgid_from_mnt(user_ns, stat->gid),
 		};
 		err = notify_mapped_change(user_ns, upperdentry, &attr, NULL);
 	}
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 23d09de00957..1c0153b1ad6d 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -625,7 +625,8 @@ static int ovl_create_or_link(struct dentry *dentry, struct inode *inode,
 	return err;
 }
 
-static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev,
+static int ovl_create_object(struct user_namespace *user_ns,
+			     struct dentry *dentry, int mode, dev_t rdev,
 			     const char *link)
 {
 	int err;
@@ -649,7 +650,7 @@ static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev,
 	inode->i_state |= I_CREATING;
 	spin_unlock(&inode->i_lock);
 
-	inode_init_owner(inode, dentry->d_parent->d_inode, mode);
+	mapped_inode_init_owner(inode, user_ns, dentry->d_parent->d_inode, mode);
 	attr.mode = inode->i_mode;
 
 	err = ovl_create_or_link(dentry, inode, &attr, false);
@@ -663,31 +664,55 @@ static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev,
 	return err;
 }
 
-static int ovl_create(struct inode *dir, struct dentry *dentry, umode_t mode,
-		      bool excl)
+static int ovl_create_mapped(struct user_namespace *user_ns, struct inode *dir,
+			     struct dentry *dentry, umode_t mode, bool excl)
 {
-	return ovl_create_object(dentry, (mode & 07777) | S_IFREG, 0, NULL);
+	return ovl_create_object(user_ns, dentry, (mode & 07777) | S_IFREG, 0, NULL);
+}
+
+static int ovl_create(struct inode *dir, struct dentry *dentry, umode_t mode, bool excl)
+{
+	return ovl_create_mapped(&init_user_ns, dir, dentry, mode, excl);
+}
+
+static int ovl_mkdir_mapped(struct user_namespace *user_ns, struct inode *dir,
+			    struct dentry *dentry, umode_t mode)
+{
+	return ovl_create_object(user_ns, dentry,
+				 (mode & 07777) | S_IFDIR, 0, NULL);
 }
 
 static int ovl_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 {
-	return ovl_create_object(dentry, (mode & 07777) | S_IFDIR, 0, NULL);
+	return ovl_mkdir_mapped(&init_user_ns, dir, dentry, mode);
 }
 
-static int ovl_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
-		     dev_t rdev)
+static int ovl_mknod_mapped(struct user_namespace *user_ns, struct inode *dir,
+			    struct dentry *dentry, umode_t mode, dev_t rdev)
 {
 	/* Don't allow creation of "whiteout" on overlay */
 	if (S_ISCHR(mode) && rdev == WHITEOUT_DEV)
 		return -EPERM;
 
-	return ovl_create_object(dentry, mode, rdev, NULL);
+	return ovl_create_object(user_ns, dentry, mode, rdev, NULL);
+}
+
+static int ovl_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
+		     dev_t rdev)
+{
+	return ovl_mknod_mapped(&init_user_ns, dir, dentry, mode, rdev);
+}
+
+static int ovl_symlink_mapped(struct user_namespace *user_ns, struct inode *dir,
+			      struct dentry *dentry, const char *link)
+{
+	return ovl_create_object(user_ns, dentry, S_IFLNK, 0, link);
 }
 
 static int ovl_symlink(struct inode *dir, struct dentry *dentry,
 		       const char *link)
 {
-	return ovl_create_object(dentry, S_IFLNK, 0, link);
+	return ovl_symlink_mapped(&init_user_ns, dir, dentry, link);
 }
 
 static int ovl_set_link_redirect(struct dentry *dentry)
@@ -1085,9 +1110,10 @@ static int ovl_set_redirect(struct dentry *dentry, bool samedir)
 	return err;
 }
 
-static int ovl_rename(struct inode *olddir, struct dentry *old,
-		      struct inode *newdir, struct dentry *new,
-		      unsigned int flags)
+static int ovl_rename_mapped(struct user_namespace *user_ns,
+			     struct inode *olddir, struct dentry *old,
+			     struct inode *newdir, struct dentry *new,
+			     unsigned int flags)
 {
 	int err;
 	struct dentry *old_upperdir;
@@ -1303,6 +1329,13 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
 	return err;
 }
 
+static int ovl_rename(struct inode *olddir, struct dentry *old,
+		      struct inode *newdir, struct dentry *new,
+		      unsigned int flags)
+{
+	return ovl_rename_mapped(&init_user_ns, olddir, old, newdir, new, flags);
+}
+
 const struct inode_operations ovl_dir_inode_operations = {
 	.lookup		= ovl_lookup,
 	.mkdir		= ovl_mkdir,
@@ -1319,4 +1352,13 @@ const struct inode_operations ovl_dir_inode_operations = {
 	.listxattr	= ovl_listxattr,
 	.get_acl	= ovl_get_acl,
 	.update_time	= ovl_update_time,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.permission_mapped	= ovl_permission_mapped,
+	.mkdir_mapped		= ovl_mkdir_mapped,
+	.create_mapped		= ovl_create_mapped,
+	.mknod_mapped		= ovl_mknod_mapped,
+	.symlink_mapped		= ovl_symlink_mapped,
+	.setattr_mapped		= ovl_setattr_mapped,
+	.rename_mapped		= ovl_rename_mapped,
+#endif
 };
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index b6c8b904f0e7..3831ddbe57f8 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -14,15 +14,15 @@
 #include <linux/fiemap.h>
 #include "overlayfs.h"
 
-
-int ovl_setattr(struct dentry *dentry, struct iattr *attr)
+int ovl_setattr_mapped(struct user_namespace *user_ns, struct dentry *dentry,
+		       struct iattr *attr)
 {
 	int err;
 	bool full_copy_up = false;
 	struct dentry *upperdentry;
 	const struct cred *old_cred;
 
-	err = setattr_prepare(dentry, attr);
+	err = setattr_mapped_prepare(user_ns, dentry, attr);
 	if (err)
 		return err;
 
@@ -104,6 +104,11 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 	return err;
 }
 
+int ovl_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	return ovl_setattr_mapped(&init_user_ns, dentry, attr);
+}
+
 static int ovl_map_dev_ino(struct dentry *dentry, struct kstat *stat, int fsid)
 {
 	bool samefs = ovl_same_fs(dentry->d_sb);
@@ -167,6 +172,7 @@ static int ovl_map_dev_ino(struct dentry *dentry, struct kstat *stat, int fsid)
 int ovl_getattr(const struct path *path, struct kstat *stat,
 		u32 request_mask, unsigned int flags)
 {
+	struct user_namespace *user_ns;
 	struct dentry *dentry = path->dentry;
 	enum ovl_path_type type;
 	struct path realpath;
@@ -284,14 +290,19 @@ int ovl_getattr(const struct path *path, struct kstat *stat,
 out:
 	revert_creds(old_cred);
 
+	user_ns = mnt_user_ns(path->mnt);
+	stat->uid = kuid_into_mnt(user_ns, stat->uid);
+	stat->gid = kgid_into_mnt(user_ns, stat->gid);
+
 	return err;
 }
 
-int ovl_permission(struct inode *inode, int mask)
+int ovl_permission_mapped(struct user_namespace *user_ns,
+			  struct inode *inode, int mask)
 {
 	struct inode *upperinode = ovl_inode_upper(inode);
 	struct inode *realinode = upperinode ?: ovl_inode_lower(inode);
-	struct user_namespace *user_ns;
+	struct user_namespace *real_user_ns;
 	const struct cred *old_cred;
 	int err;
 
@@ -302,15 +313,15 @@ int ovl_permission(struct inode *inode, int mask)
 	}
 
 	if (upperinode)
-		user_ns = ovl_upper_mnt_user_ns(OVL_FS(inode->i_sb));
+		real_user_ns = ovl_upper_mnt_user_ns(OVL_FS(inode->i_sb));
 	else
-		user_ns = OVL_I(inode)->lower_user_ns;
+		real_user_ns = OVL_I(inode)->lower_user_ns;
 
 	/*
 	 * Check overlay inode with the creds of task and underlying inode
 	 * with creds of mounter
 	 */
-	err = generic_permission(inode, mask);
+	err = mapped_generic_permission(user_ns, inode, mask);
 	if (err)
 		return err;
 
@@ -322,7 +333,7 @@ int ovl_permission(struct inode *inode, int mask)
 		/* Make sure mounter can read file for copy up later */
 		mask |= MAY_READ;
 	}
-	err = mapped_inode_permission(user_ns, realinode, mask);
+	err = mapped_inode_permission(real_user_ns, realinode, mask);
 	revert_creds(old_cred);
 
 	return err;
@@ -547,6 +558,10 @@ static const struct inode_operations ovl_file_inode_operations = {
 	.get_acl	= ovl_get_acl,
 	.update_time	= ovl_update_time,
 	.fiemap		= ovl_fiemap,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.permission_mapped	= ovl_permission_mapped,
+	.setattr_mapped		= ovl_setattr_mapped,
+#endif
 };
 
 static const struct inode_operations ovl_symlink_inode_operations = {
@@ -555,6 +570,9 @@ static const struct inode_operations ovl_symlink_inode_operations = {
 	.getattr	= ovl_getattr,
 	.listxattr	= ovl_listxattr,
 	.update_time	= ovl_update_time,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.setattr_mapped	= ovl_setattr_mapped,
+#endif
 };
 
 static const struct inode_operations ovl_special_inode_operations = {
@@ -564,6 +582,10 @@ static const struct inode_operations ovl_special_inode_operations = {
 	.listxattr	= ovl_listxattr,
 	.get_acl	= ovl_get_acl,
 	.update_time	= ovl_update_time,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.permission_mapped	= ovl_permission_mapped,
+	.setattr_mapped		= ovl_setattr_mapped,
+#endif
 };
 
 static const struct address_space_operations ovl_aops = {
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 63c257c3bfa8..97c8c34cd6d0 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -470,9 +470,16 @@ unsigned int ovl_get_nlink(struct ovl_fs *ofs, struct dentry *lowerdentry,
 			   struct dentry *upperdentry,
 			   unsigned int fallback);
 int ovl_setattr(struct dentry *dentry, struct iattr *attr);
+int ovl_setattr_mapped(struct user_namespace *user_ns, struct dentry *dentry,
+		       struct iattr *attr);
 int ovl_getattr(const struct path *path, struct kstat *stat,
 		u32 request_mask, unsigned int flags);
-int ovl_permission(struct inode *inode, int mask);
+int ovl_permission_mapped(struct user_namespace *user_ns,
+			  struct inode *inode, int mask);
+static inline int ovl_permission(struct inode *inode, int mask)
+{
+	return ovl_permission_mapped(&init_user_ns, inode, mask);
+}
 int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
 		  const void *value, size_t size, int flags);
 int ovl_xattr_get(struct dentry *dentry, struct inode *inode, const char *name,
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 755f651587a5..8db8ccd70b6f 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -2082,6 +2082,9 @@ static struct file_system_type ovl_fs_type = {
 	.name		= "overlay",
 	.mount		= ovl_mount,
 	.kill_sb	= kill_anon_super,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.fs_flags	= FS_ALLOW_IDMAP,
+#endif
 };
 MODULE_ALIAS_FS("overlay");
 
-- 
2.29.0



* [PATCH 34/34] fat: handle idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (32 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 33/34] overlayfs: handle idmapped merged mounts Christian Brauner
@ 2020-10-29  0:32 ` Christian Brauner
  2020-10-29  2:27 ` [PATCH 00/34] fs: " Dave Chinner
                   ` (3 subsequent siblings)
  37 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29  0:32 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig, linux-fsdevel
  Cc: John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
	Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
	OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
	Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
	Alban Crequy, Tycho Andersen, David Howells, James Bottomley,
	Jann Horn, Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux, Christian Brauner

Let fat handle idmapped mounts. This allows the same fat mount to appear
in multiple locations with different id mappings, which makes it possible to
expose a vfat-formatted USB stick to multiple users with different ids on the
host or in user namespaces:

mount -o uid=1000,gid=1000 /dev/sdb /mnt

u1001@f2-vm:/lower1$ ls -ln /mnt/
total 4
-rwxr-xr-x 1 1000 1000 4 Oct 28 03:44 aaa
-rwxr-xr-x 1 1000 1000 0 Oct 28 01:09 bbb
-rwxr-xr-x 1 1000 1000 0 Oct 28 01:10 ccc
-rwxr-xr-x 1 1000 1000 0 Oct 28 03:46 ddd
-rwxr-xr-x 1 1000 1000 0 Oct 28 04:01 eee

mount2 --idmap both:1000:1001:1

u1001@f2-vm:/lower1$ ls -ln /lower1/
total 4
-rwxr-xr-x 1 1001 1001 4 Oct 28 03:44 aaa
-rwxr-xr-x 1 1001 1001 0 Oct 28 01:09 bbb
-rwxr-xr-x 1 1001 1001 0 Oct 28 01:10 ccc
-rwxr-xr-x 1 1001 1001 0 Oct 28 03:46 ddd
-rwxr-xr-x 1 1001 1001 0 Oct 28 04:01 eee

u1001@f2-vm:/lower1$ touch /lower1/fff

u1001@f2-vm:/lower1$ ls -ln /lower1/fff
-rwxr-xr-x 1 1001 1001 0 Oct 28 04:03 /lower1/fff

u1001@f2-vm:/lower1$ ls -ln /mnt/fff
-rwxr-xr-x 1 1000 1000 0 Oct 28 04:03 /mnt/fff
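
The transcript exercises the mapping in both directions: ownership is
translated into the mount when listing files, and the caller's id is
translated back when a file is created. A minimal userspace sketch of that
round trip for the one-id mapping 1000:1001:1 follows. The names are
hypothetical, and real fat mounts derive ownership from the uid=/gid= mount
options rather than from on-disk data.

```c
#define OVERFLOW_ID 65534u	/* reported for ids the mapping does not cover */

/* Hypothetical single-extent idmap for one mount. */
struct id_extent {
	unsigned int fs_first;	/* first id on the filesystem side */
	unsigned int mnt_first;	/* first id as seen through the mount */
	unsigned int count;	/* number of consecutive ids covered */
};

/* Mapping used in the example: 1000 on /mnt appears as 1001 on /lower1. */
const struct id_extent usb_map = { 1000, 1001, 1 };

/* Filesystem-side id -> id reported through the idmapped mount (stat path). */
unsigned int id_into_mnt(const struct id_extent *m, unsigned int fs_id)
{
	if (fs_id >= m->fs_first && fs_id - m->fs_first < m->count)
		return m->mnt_first + (fs_id - m->fs_first);
	return OVERFLOW_ID;
}

/* Caller's id at the mount -> filesystem-side id (create/chown path).
 * Returns 0 on success and -1 if the id has no mapping. */
int id_from_mnt(const struct id_extent *m, unsigned int mnt_id,
		unsigned int *fs_id)
{
	if (mnt_id < m->mnt_first || mnt_id - m->mnt_first >= m->count)
		return -1;
	*fs_id = m->fs_first + (mnt_id - m->mnt_first);
	return 0;
}
```

With usb_map, id_into_mnt() turns 1000 into 1001, matching the /lower1
listing, and id_from_mnt() turns the creator's 1001 back into 1000, matching
what /mnt/fff shows. Any other id has no mapping in this one-id extent.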

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/fat/fat.h         |  2 ++
 fs/fat/file.c        | 27 +++++++++++++++++++--------
 fs/fat/namei_msdos.c |  7 +++++++
 fs/fat/namei_vfat.c  |  7 +++++++
 4 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/fs/fat/fat.h b/fs/fat/fat.h
index 922a0c6ba46c..56d661e93d2a 100644
--- a/fs/fat/fat.h
+++ b/fs/fat/fat.h
@@ -398,6 +398,8 @@ extern long fat_generic_ioctl(struct file *filp, unsigned int cmd,
 extern const struct file_operations fat_file_operations;
 extern const struct inode_operations fat_file_inode_operations;
 extern int fat_setattr(struct dentry *dentry, struct iattr *attr);
+extern int fat_setattr_mapped(struct user_namespace *user_ns,
+			      struct dentry *dentry, struct iattr *attr);
 extern void fat_truncate_blocks(struct inode *inode, loff_t offset);
 extern int fat_getattr(const struct path *path, struct kstat *stat,
 		       u32 request_mask, unsigned int flags);
diff --git a/fs/fat/file.c b/fs/fat/file.c
index f9ee27cf4d7c..f97d46711b37 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -398,7 +398,7 @@ int fat_getattr(const struct path *path, struct kstat *stat,
 		u32 request_mask, unsigned int flags)
 {
 	struct inode *inode = d_inode(path->dentry);
-	generic_fillattr(inode, stat);
+	mapped_generic_fillattr(mnt_user_ns(path->mnt), inode, stat);
 	stat->blksize = MSDOS_SB(inode->i_sb)->cluster_size;
 
 	if (MSDOS_SB(inode->i_sb)->options.nfs == FAT_NFS_NOSTALE_RO) {
@@ -447,12 +447,13 @@ static int fat_sanitize_mode(const struct msdos_sb_info *sbi,
 	return 0;
 }
 
-static int fat_allow_set_time(struct msdos_sb_info *sbi, struct inode *inode)
+static int fat_allow_set_time(struct user_namespace *user_ns,
+			      struct msdos_sb_info *sbi, struct inode *inode)
 {
 	umode_t allow_utime = sbi->options.allow_utime;
 
-	if (!uid_eq(current_fsuid(), inode->i_uid)) {
-		if (in_group_p(inode->i_gid))
+	if (!uid_eq(current_fsuid(), i_uid_into_mnt(user_ns, inode))) {
+		if (in_group_p(i_gid_into_mnt(user_ns, inode)))
 			allow_utime >>= 3;
 		if (allow_utime & MAY_WRITE)
 			return 1;
@@ -466,7 +467,8 @@ static int fat_allow_set_time(struct msdos_sb_info *sbi, struct inode *inode)
 /* valid file mode bits */
 #define FAT_VALID_MODE	(S_IFREG | S_IFDIR | S_IRWXUGO)
 
-int fat_setattr(struct dentry *dentry, struct iattr *attr)
+int fat_setattr_mapped(struct user_namespace *user_ns, struct dentry *dentry,
+		       struct iattr *attr)
 {
 	struct msdos_sb_info *sbi = MSDOS_SB(dentry->d_sb);
 	struct inode *inode = d_inode(dentry);
@@ -476,11 +478,11 @@ int fat_setattr(struct dentry *dentry, struct iattr *attr)
 	/* Check for setting the inode time. */
 	ia_valid = attr->ia_valid;
 	if (ia_valid & TIMES_SET_FLAGS) {
-		if (fat_allow_set_time(sbi, inode))
+		if (fat_allow_set_time(user_ns, sbi, inode))
 			attr->ia_valid &= ~TIMES_SET_FLAGS;
 	}
 
-	error = setattr_prepare(dentry, attr);
+	error = setattr_mapped_prepare(user_ns, dentry, attr);
 	attr->ia_valid = ia_valid;
 	if (error) {
 		if (sbi->options.quiet)
@@ -550,15 +552,24 @@ int fat_setattr(struct dentry *dentry, struct iattr *attr)
 		fat_truncate_time(inode, &attr->ia_mtime, S_MTIME);
 	attr->ia_valid &= ~(ATTR_ATIME|ATTR_CTIME|ATTR_MTIME);
 
-	setattr_copy(inode, attr);
+	setattr_mapped_copy(user_ns, inode, attr);
 	mark_inode_dirty(inode);
 out:
 	return error;
 }
+EXPORT_SYMBOL_GPL(fat_setattr_mapped);
+
+int fat_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	return fat_setattr_mapped(&init_user_ns, dentry, attr);
+}
 EXPORT_SYMBOL_GPL(fat_setattr);
 
 const struct inode_operations fat_file_inode_operations = {
 	.setattr	= fat_setattr,
 	.getattr	= fat_getattr,
 	.update_time	= fat_update_time,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.setattr_mapped	= fat_setattr_mapped,
+#endif
 };
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index 9d062886fbc1..4b89d27d128d 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -639,6 +639,9 @@ static const struct inode_operations msdos_dir_inode_operations = {
 	.setattr	= fat_setattr,
 	.getattr	= fat_getattr,
 	.update_time	= fat_update_time,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.setattr_mapped	= fat_setattr_mapped,
+#endif
 };
 
 static void setup(struct super_block *sb)
@@ -665,7 +668,11 @@ static struct file_system_type msdos_fs_type = {
 	.name		= "msdos",
 	.mount		= msdos_mount,
 	.kill_sb	= kill_block_super,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.fs_flags	= FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+#else
 	.fs_flags	= FS_REQUIRES_DEV,
+#endif
 };
 MODULE_ALIAS_FS("msdos");
 
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 0cdd0fb9f742..11fea59e1b77 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -1034,6 +1034,9 @@ static const struct inode_operations vfat_dir_inode_operations = {
 	.setattr	= fat_setattr,
 	.getattr	= fat_getattr,
 	.update_time	= fat_update_time,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.setattr_mapped	= fat_setattr_mapped,
+#endif
 };
 
 static void setup(struct super_block *sb)
@@ -1062,7 +1065,11 @@ static struct file_system_type vfat_fs_type = {
 	.name		= "vfat",
 	.mount		= vfat_mount,
 	.kill_sb	= kill_block_super,
+#ifdef CONFIG_IDMAP_MOUNTS
+	.fs_flags	= FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+#else
 	.fs_flags	= FS_REQUIRES_DEV,
+#endif
 };
 MODULE_ALIAS_FS("vfat");
 
-- 
2.29.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (33 preceding siblings ...)
  2020-10-29  0:32 ` [PATCH 34/34] fat: handle idmapped mounts Christian Brauner
@ 2020-10-29  2:27 ` Dave Chinner
  2020-10-29 16:19   ` Christian Brauner
  2020-10-29  7:20 ` Sargun Dhillon
                   ` (2 subsequent siblings)
  37 siblings, 1 reply; 68+ messages in thread
From: Dave Chinner @ 2020-10-29  2:27 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Christoph Hellwig, linux-fsdevel, John Johansen,
	James Morris, Mimi Zohar, Dmitry Kasatkin, Stephen Smalley,
	Casey Schaufler, Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi,
	Geoffrey Thomas, Mrunal Patel, Josh Triplett, Andy Lutomirski,
	Amir Goldstein, Miklos Szeredi, Theodore Tso, Alban Crequy,
	Tycho Andersen, David Howells, James Bottomley, Jann Horn,
	Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux

On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
> Hey everyone,
> 
> I vanished for a little while to focus on this work here so sorry for
> not being available by mail for a while.
> 
> Since quite a long time we have issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.
> 
> The solutions we have tried and proposed so far include the introduction
> of fsid mappings, a tiny overlay based filesystem, and an approach to
> call override creds in the vfs. None of these solutions have covered all
> of the above use-cases.
> 
> The solution proposed here has its origins in multiple discussions
> during Linux Plumbers 2017 during and after the end of the containers
> microconference.
> To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> James, and myself. A variant of the solution proposed here has also been
> discussed, again to the best of my knowledge, after a Linux conference
> in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> after Linux Plumbers.
> I've taken the time to finally implement a working version of this
> solution over the last weeks to the best of my abilities. Tycho has
> signed up for this slightly crazy endeavour as well and he has helped
> with the conversion of the xattr codepaths.
> 
> The core idea is to make idmappings a property of struct vfsmount
> instead of tying it to a process being inside of a user namespace which
> has been the case for all other proposed approaches.
> It means that idmappings become a property of bind-mounts, i.e. each
> bind-mount can have a separate idmapping. This has the obvious advantage
> that idmapped mounts can be created inside of the initial user
> namespace, i.e. on the host itself instead of requiring the caller to be
> located inside of a user namespace. This enables such use-cases as e.g.
> making a usb stick available in multiple locations with different
> idmappings (see the vfat port that is part of this patch series).
> 
> The vfsmount struct gains a new struct user_namespace member. The
> idmapping of the user namespace becomes the idmapping of the mount. A
> caller that is either privileged with respect to the user namespace of
> the superblock of the underlying filesystem or a caller that is
> privileged with respect to the user namespace a mount has been idmapped
> with can create a new bind-mount and mark it with a user namespace. The
> user namespace the mount will be marked with can be specified by passing
> a file descriptor referring to the user namespace as an argument to the
> new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
> By default vfsmounts are marked with the initial user namespace and no
> behavioral or performance changes should be observed. All mapping
> operations are nops for the initial user namespace.
> 
> When a file/inode is accessed through an idmapped mount the i_uid and
> i_gid of the inode will be remapped according to the user namespace the
> mount has been marked with. When a new object is created based on the
> fsuid and fsgid of the caller they will similarly be remapped according
> to the user namespace of the mount they are created from.
> 
> This means the user namespace of the mount needs to be passed down into
> a few relevant inode_operations. This mostly includes inode operations
> that create filesystem objects or change file attributes.

That's really quite ... messy.

Maybe I'm missing something, but if you have the user_ns to be used
for the VFS operation we are about to execute then why can't we use
the same model as current_fsuid/current_fsgid() for passing the
filesystem credentials down to the filesystem operations?  i.e.
attach it to the current->cred->fs_userns, and then the filesystem
code that actually needs to know the current userns can call
current_fs_user_ns() instead of current_user_ns().  i.e.

#define current_fs_user_ns()	\
	(current->cred->fs_userns ? current->cred->fs_userns \
				  : current->cred->user_ns)

At this point, the filesystem will now always have the correct
userns it is supposed to use for mapping the uid/gid, right?
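
The fallback in that macro can be modelled outside the kernel. What
follows is a minimal userspace sketch, not kernel code: struct
user_ns_model, struct cred_model and cred_fs_user_ns() are hypothetical
stand-ins, and fs_userns is the proposed credential member, which does
not exist in struct cred today.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace. */
struct user_ns_model {
	const char *name;
};

/* Hypothetical stand-in for struct cred with the proposed extra member. */
struct cred_model {
	const struct user_ns_model *user_ns;	/* namespace the task lives in */
	const struct user_ns_model *fs_userns;	/* proposed override, may be NULL */
};

/*
 * Models the proposed current_fs_user_ns(): prefer the filesystem
 * override when one is set, otherwise fall back to the task's own
 * user namespace.
 */
static const struct user_ns_model *cred_fs_user_ns(const struct cred_model *cred)
{
	assert(cred != NULL && cred->user_ns != NULL);
	return cred->fs_userns ? cred->fs_userns : cred->user_ns;
}
```

With fs_userns left NULL the helper returns the task's namespace, so
callers that never set the override would see no change.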

Also, if we are passing work off to worker threads, duplicating
the current creds will capture this information and won't leave
random landmines where stuff doesn't work as it should because the
worker thread is unaware of the userns that it is supposed to be
doing filesystem operations under...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (34 preceding siblings ...)
  2020-10-29  2:27 ` [PATCH 00/34] fs: " Dave Chinner
@ 2020-10-29  7:20 ` Sargun Dhillon
  2020-10-29 15:47 ` Eric W. Biederman
  2020-10-29 21:58 ` Andy Lutomirski
  37 siblings, 0 replies; 68+ messages in thread
From: Sargun Dhillon @ 2020-10-29  7:20 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Christoph Hellwig, linux-fsdevel,
	Lennart Poettering, Mimi Zohar, James Bottomley, Andreas Dilger,
	containers, Tycho Andersen, Miklos Szeredi, smbarber, linux-ext4,
	Mrunal Patel, Kees Cook, Arnd Bergmann, Jann Horn, selinux,
	Josh Triplett, Seth Forshee, Andy Lutomirski, OGAWA Hirofumi,
	Geoffrey Thomas, David Howells, John Johansen, Theodore Tso,
	Dmitry Kasatkin, Stephen Smalley, Jonathan Corbet, linux-unionfs,
	linux-security-module, linux-audit, Eric W. Biederman, linux-api,
	Casey Schaufler, Alban Crequy, linux-integrity, Todd Kjos

On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
> Hey everyone,
> 
> I vanished for a little while to focus on this work here so sorry for
> not being available by mail for a while.
> 
> Since quite a long time we have issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.
> 
> The solutions we have tried and proposed so far include the introduction
> of fsid mappings, a tiny overlay based filesystem, and an approach to
> call override creds in the vfs. None of these solutions have covered all
> of the above use-cases.
> 
> The solution proposed here has its origins in multiple discussions
> during Linux Plumbers 2017 during and after the end of the containers
> microconference.
> To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> James, and myself. A variant of the solution proposed here has also been
> discussed, again to the best of my knowledge, after a Linux conference
> in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> after Linux Plumbers.
> I've taken the time to finally implement a working version of this
> solution over the last weeks to the best of my abilities. Tycho has
> signed up for this slightly crazy endeavour as well and he has helped
> with the conversion of the xattr codepaths.
> 
> The core idea is to make idmappings a property of struct vfsmount
> instead of tying it to a process being inside of a user namespace which
> has been the case for all other proposed approaches.
> It means that idmappings become a property of bind-mounts, i.e. each
> bind-mount can have a separate idmapping. This has the obvious advantage
> that idmapped mounts can be created inside of the initial user
> namespace, i.e. on the host itself instead of requiring the caller to be
> located inside of a user namespace. This enables such use-cases as e.g.
> making a usb stick available in multiple locations with different
> idmappings (see the vfat port that is part of this patch series).
> 
> The vfsmount struct gains a new struct user_namespace member. The
> idmapping of the user namespace becomes the idmapping of the mount. A
> caller that is either privileged with respect to the user namespace of
> the superblock of the underlying filesystem or a caller that is
> privileged with respect to the user namespace a mount has been idmapped
> with can create a new bind-mount and mark it with a user namespace. The
> user namespace the mount will be marked with can be specified by passing
> a file descriptor referring to the user namespace as an argument to the
> new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
> By default vfsmounts are marked with the initial user namespace and no
> behavioral or performance changes should be observed. All mapping
> operations are nops for the initial user namespace.
> 
> When a file/inode is accessed through an idmapped mount the i_uid and
> i_gid of the inode will be remapped according to the user namespace the
> mount has been marked with. When a new object is created based on the
> fsuid and fsgid of the caller they will similarly be remapped according
> to the user namespace of the mount they are created from.
> 
> This means the user namespace of the mount needs to be passed down into
> a few relevant inode_operations. This mostly includes inode operations
> that create filesystem objects or change file attributes. Some of them
> such as ->getattr() don't even need to change since they pass down a
> struct path and thus the struct vfsmount is already available. Other
> inode operations need to be adapted to pass down the user namespace the
> vfsmount has been marked with. Al was nice enough to point out that he
> will not tolerate struct vfsmount being passed to filesystems and that I
> should pass down the user namespace directly; which is what I did.
> The inode struct itself is never altered whenever the i_uid and i_gid
> need to be mapped, i.e. i_uid and i_gid are only remapped at the time of
> the check. An inode once initialized (during lookup or object creation)
> is never altered when accessed through an idmapped mount.
> 
> To limit the amount of noise in this first iteration we have not changed
> the existing inode operations but rather introduced a few new struct
> inode operation methods such as ->mkdir_mapped which pass down the user
> namespace of the mount they have been called from. Should this solution
> be worth pursuing we have no problem adapting the existing inode
> operations instead.
> 
> In order to support idmapped mounts, filesystems need to be changed and
> mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. In this first
> iteration I tried to illustrate this by changing three different
> filesystem with different levels of complexity. Of course with some bias
> towards urgent use-cases and filesystems I was at least a little more
> familiar with. However, Tycho and I (and others) have no problem
> converting each filesystem one-by-one. This first iteration includes fat
> (msdos and vfat), ext4, and overlayfs (both with idmapped lower and
> upper directories and idmapped merged directories). I'm sure I haven't
> gotten everything right for all three of them in the first version of
> this patch.
> 

Thanks for this patchset. It's been a long time coming.

I'm curious how much the new fs mount APIs help in most cases, and whether
focusing on those could solve the problem for everything other than bind
mounts. Specifically, the idea of doing fsopen (creation of the fs_context)
under the user namespace in question, and relying on a user with CAP_SYS_ADMIN
to call fsmount[1]. I think this is actually especially valuable for places
like overlayfs that use the entire cred object, as opposed to just the
uid / gid. I imagine that soon, most filesystems will support the new mount
APIs, and not set the global flag if they don't need to.

How popular is the "vfsmount (bind mounts) needs different uid mappings" use 
case?

The other thing I worry about is the "What UID are you really?" game that's been 
a thing recently. For example, you can have a different user namespace UID 
mapping for your network namespace that netfilter checks[2], and a different one 
for your mount namespace, and a different one that the process is actually in.
This proliferation of different mappings makes auditing, and doing things like
writing perf tooling, more difficult (since I think bpf_get_current_uid_gid
still uses the initial user namespace [3]).

[1]: https://lore.kernel.org/linux-nfs/20201016123745.9510-4-sargun@sargun.me/T/#u
[2]: https://elixir.bootlin.com/linux/v5.9.1/source/net/netfilter/xt_owner.c#L37
[3]: https://elixir.bootlin.com/linux/v5.9.1/source/kernel/bpf/helpers.c#L196


* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (35 preceding siblings ...)
  2020-10-29  7:20 ` Sargun Dhillon
@ 2020-10-29 15:47 ` Eric W. Biederman
  2020-10-29 15:51   ` Aleksa Sarai
                     ` (2 more replies)
  2020-10-29 21:58 ` Andy Lutomirski
  37 siblings, 3 replies; 68+ messages in thread
From: Eric W. Biederman @ 2020-10-29 15:47 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Christoph Hellwig, linux-fsdevel, John Johansen,
	James Morris, Mimi Zohar, Dmitry Kasatkin, Stephen Smalley,
	Casey Schaufler, Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi,
	Geoffrey Thomas, Mrunal Patel, Josh Triplett, Andy Lutomirski,
	Amir Goldstein, Miklos Szeredi, Theodore Tso, Alban Crequy,
	Tycho Andersen, David Howells, James Bottomley, Jann Horn,
	Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, smbarber, Phil Estes, Serge Hallyn,
	Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux

Christian Brauner <christian.brauner@ubuntu.com> writes:

> Hey everyone,
>
> I vanished for a little while to focus on this work here so sorry for
> not being available by mail for a while.
>
> Since quite a long time we have issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.

Can you walk us through the motivating use case?

As of this year's LPC I had the distinct impression that the primary use
case for such a feature was due to the RLIMIT_NPROC problem where two
containers with the same users still wanted different uid mappings to
the disk because the users were conflicting with each other because of
the per user rlimits.

Fixing rlimits is straightforward to implement, and easier to manage
for implementations and administrators.



Reading up on systemd-homed it appears to be a way to have encrypted
home directories.  Those home directories can either be encrypted at the
fs or at the block level.  Those home directories appear to have the
goal of being luggable between systems.  If the systems in question
don't have common administration of uids and gids after lugging your
encrypted home directory to another system chowning the files is
required.

Is the use case you are looking at removing the need for
systemd-homed to chown after lugging encrypted home directories
from one system to another?  Why would it be desirable to avoid the
chown?


If the goal is to solve fragmented administration of uid assignment I
suggest that it might be better to solve the administration problem so
that all of the uids of interest get assigned the same way on all of the
systems of interest.  To the extent it is possible to solve the uid
assignment problem that would seem to have fewer long term
administrative problems.

Eric

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29 15:47 ` Eric W. Biederman
@ 2020-10-29 15:51   ` Aleksa Sarai
  2020-10-29 16:37     ` Eric W. Biederman
  2020-10-29 16:05   ` Lennart Poettering
       [not found]   ` <20201029161231.GA108315@cisco>
  2 siblings, 1 reply; 68+ messages in thread
From: Aleksa Sarai @ 2020-10-29 15:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Alexander Viro, Christoph Hellwig,
	linux-fsdevel, John Johansen, James Morris, Mimi Zohar,
	Dmitry Kasatkin, Stephen Smalley, Casey Schaufler, Arnd Bergmann,
	Andreas Dilger, OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel,
	Josh Triplett, Andy Lutomirski, Amir Goldstein, Miklos Szeredi,
	Theodore Tso, Alban Crequy, Tycho Andersen, David Howells,
	James Bottomley, Jann Horn, Seth Forshee, Stéphane Graber,
	Lennart Poettering, smbarber, Phil Estes, Serge Hallyn,
	Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux

On 2020-10-29, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Christian Brauner <christian.brauner@ubuntu.com> writes:
> 
> > Hey everyone,
> >
> > I vanished for a little while to focus on this work here so sorry for
> > not being available by mail for a while.
> >
> > Since quite a long time we have issues with sharing mounts between
> > multiple unprivileged containers with different id mappings, sharing a
> > rootfs between multiple containers with different id mappings, and also
> > sharing regular directories and filesystems between users with different
> > uids and gids. The latter use-cases have become even more important with
> > the availability and adoption of systemd-homed (cf. [1]) to implement
> > portable home directories.
> 
> Can you walk us through the motivating use case?
> 
> As of this year's LPC I had the distinct impression that the primary use
> case for such a feature was due to the RLIMIT_NPROC problem where two
> containers with the same users still wanted different uid mappings to
> the disk because the users were conflicting with each other because of
> the per user rlimits.
> 
> Fixing rlimits is straightforward to implement, and easier to manage
> for implementations and administrators.

This is separate from the question of "isolated user namespaces" and
managing different mappings between containers. This patchset is solving
the same problem that shiftfs solved -- sharing a single directory tree
between containers that have different ID mappings. Neither rlimits nor
any of the other proposals we discussed at LPC will help with this problem.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29 15:47 ` Eric W. Biederman
  2020-10-29 15:51   ` Aleksa Sarai
@ 2020-10-29 16:05   ` Lennart Poettering
  2020-10-29 16:36     ` Sargun Dhillon
  2020-10-29 16:54     ` Eric W. Biederman
       [not found]   ` <20201029161231.GA108315@cisco>
  2 siblings, 2 replies; 68+ messages in thread
From: Lennart Poettering @ 2020-10-29 16:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Alexander Viro, Christoph Hellwig,
	linux-fsdevel, John Johansen, James Morris, Mimi Zohar,
	Dmitry Kasatkin, Stephen Smalley, Casey Schaufler, Arnd Bergmann,
	Andreas Dilger, OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel,
	Josh Triplett, Andy Lutomirski, Amir Goldstein, Miklos Szeredi,
	Theodore Tso, Alban Crequy, Tycho Andersen, David Howells,
	James Bottomley, Jann Horn, Seth Forshee, Stéphane Graber,
	Aleksa Sarai, smbarber, Phil Estes, Serge Hallyn, Kees Cook,
	Todd Kjos, Jonathan Corbet, containers, linux-security-module,
	linux-api, linux-ext4, linux-unionfs, linux-audit,
	linux-integrity, selinux

On Do, 29.10.20 10:47, Eric W. Biederman (ebiederm@xmission.com) wrote:

> Is the use case you are looking at removing the need for
> systemd-homed to chown after lugging encrypted home directories
> from one system to another?  Why would it be desirable to avoid the
> chown?

Yes, I am very interested in seeing Christian's work succeed, for the
use case in systemd-homed. In systemd-homed each user gets their own
private file system, and these file systems shall be owned by the user's
local UID, regardless of which system they are used on. The UID should be
an artifact of the local, individual system in this model, and thus
the UID of the same user/home might be picked as 1010 on system A,
as 1543 on another, and as 1323 on a third, and it shouldn't
matter. This way, home directories become migratable without having to
universally sync UID assignments: it doesn't matter anymore what the
local UID is.

Right now we do a recursive chown() at login time to ensure the home
dir is properly owned. This has two disadvantages:

1. It's slow. In particular on large home dirs, it takes a while to go
   through the whole user's homedir tree and chown/adjust ACLs for
   everything.

2. Because it is so slow we take a shortcut right now: if the
   top-level home dir inode itself is owned by the correct user, we
   skip the recursive chowning. This means in the typical case where a
   user uses the same system most of the time, and thus the UID is
   stable we can avoid the slowness. But this comes at a drawback: if
   the user for some reason ends up with files in their homedir owned
   by an unrelated user, then we'll never notice or readjust.
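
The login-time behaviour described in these two points can be sketched
roughly as follows. This is a minimal sketch under stated assumptions,
not systemd-homed's actual code: ensure_home_owned() and chown_one() are
hypothetical names, ACL adjustment is left out, and only standard POSIX
calls (lstat, lchown, nftw) are used.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ftw.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* nftw() callbacks take no user argument, so the target ids live here. */
static uid_t target_uid;
static gid_t target_gid;

/* Called by nftw() for every entry in the tree; chowns without following symlinks. */
static int chown_one(const char *path, const struct stat *sb, int typeflag,
		     struct FTW *ftwbuf)
{
	(void)sb;
	(void)typeflag;
	(void)ftwbuf;
	return lchown(path, target_uid, target_gid) == 0 ? 0 : -1;
}

/*
 * Returns 0 if the top-level inode already has the right owner and the
 * walk was skipped (the shortcut), 1 if the whole tree was chowned, and
 * -1 on error. The shortcut never looks below the top-level directory,
 * so stray files owned by another user go unnoticed.
 */
static int ensure_home_owned(const char *home, uid_t uid, gid_t gid)
{
	struct stat st;

	assert(home != NULL);
	if (lstat(home, &st) != 0)
		return -1;
	if (st.st_uid == uid)
		return 0;

	target_uid = uid;
	target_gid = gid;
	if (nftw(home, chown_one, 16, FTW_PHYS) != 0)
		return -1;
	return 1;
}
```

The cost of the slow path grows with the number of entries in the home
directory, which is why the shortcut exists in the first place.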

> If the goal is to solve fragmented administration of uid assignment I
> suggest that it might be better to solve the administration problem so
> that all of the uids of interest get assigned the same way on all of the
> systems of interest.

Well, the goal is to make things simple and be able to use the home
dir everywhere without any prior preparation, without central UID
assignment authority.

The goal is to have a scheme that requires no administration, by
making the UID management problem go away. Hence, if you suggest
solving this by having a central administrative authority: this is
exactly what the model wants to get away from.

Or to say this differently: just because I personally use three
different computers, I certainly don't want to set up LDAP or sync
UIDs manually.

Lennart

--
Lennart Poettering, Berlin

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29  2:27 ` [PATCH 00/34] fs: " Dave Chinner
@ 2020-10-29 16:19   ` Christian Brauner
  0 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-29 16:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Christoph Hellwig, linux-fsdevel, John Johansen,
	James Morris, Mimi Zohar, Dmitry Kasatkin, Stephen Smalley,
	Casey Schaufler, Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi,
	Geoffrey Thomas, Mrunal Patel, Josh Triplett, Andy Lutomirski,
	Amir Goldstein, Miklos Szeredi, Theodore Tso, Alban Crequy,
	Tycho Andersen, David Howells, James Bottomley, Jann Horn,
	Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux

On Thu, Oct 29, 2020 at 01:27:33PM +1100, Dave Chinner wrote:
> On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
> > Hey everyone,
> > 
> > I vanished for a little while to focus on this work here so sorry for
> > not being available by mail for a while.
> > 
> > Since quite a long time we have issues with sharing mounts between
> > multiple unprivileged containers with different id mappings, sharing a
> > rootfs between multiple containers with different id mappings, and also
> > sharing regular directories and filesystems between users with different
> > uids and gids. The latter use-cases have become even more important with
> > the availability and adoption of systemd-homed (cf. [1]) to implement
> > portable home directories.
> > 
> > The solutions we have tried and proposed so far include the introduction
> > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > call override creds in the vfs. None of these solutions have covered all
> > of the above use-cases.
> > 
> > The solution proposed here has its origins in multiple discussions
> > during Linux Plumbers 2017 during and after the end of the containers
> > microconference.
> > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > James, and myself. A variant of the solution proposed here has also been
> > discussed, again to the best of my knowledge, after a Linux conference
> > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > after Linux Plumbers.
> > I've taken the time to finally implement a working version of this
> > solution over the last weeks to the best of my abilities. Tycho has
> > signed up for this slightly crazy endeavour as well and he has helped
> > with the conversion of the xattr codepaths.
> > 
> > The core idea is to make idmappings a property of struct vfsmount
> > instead of tying it to a process being inside of a user namespace which
> > has been the case for all other proposed approaches.
> > It means that idmappings become a property of bind-mounts, i.e. each
> > bind-mount can have a separate idmapping. This has the obvious advantage
> > that idmapped mounts can be created inside of the initial user
> > namespace, i.e. on the host itself instead of requiring the caller to be
> > located inside of a user namespace. This enables such use-cases as e.g.
> > making a usb stick available in multiple locations with different
> > idmappings (see the vfat port that is part of this patch series).
> > 
> > The vfsmount struct gains a new struct user_namespace member. The
> > idmapping of the user namespace becomes the idmapping of the mount. A
> > caller that is either privileged with respect to the user namespace of
> > the superblock of the underlying filesystem or a caller that is
> > privileged with respect to the user namespace a mount has been idmapped
> > with can create a new bind-mount and mark it with a user namespace. The
> > user namespace the mount will be marked with can be specified by passing
> > a file descriptor referring to the user namespace as an argument to the
> > new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
> > By default vfsmounts are marked with the initial user namespace and no
> > behavioral or performance changes should be observed. All mapping
> > operations are nops for the initial user namespace.
> > 
> > When a file/inode is accessed through an idmapped mount the i_uid and
> > i_gid of the inode will be remapped according to the user namespace the
> > mount has been marked with. When a new object is created based on the
> > fsuid and fsgid of the caller they will similarly be remapped according
> > to the user namespace of the mount they are created from.
> > 
> > This means the user namespace of the mount needs to be passed down into
> > a few relevant inode_operations. This mostly includes inode operations
> > that create filesystem objects or change file attributes.
> 
> That's really quite ... messy.

I don't agree. It's changes all across the vfs but it's not hacky in any
way since it cleanly passes down an additional argument (I'm biased of
course).

> 
> Maybe I'm missing something, but if you have the user_ns to be used
> for the VFS operation we are about to execute then why can't we use
> the same model as current_fsuid/current_fsgid() for passing the
> filesystem credentials down to the filesystem operations?  i.e.
> attach it to the current->cred->fs_userns, and then the filesystem
> code that actually needs to know the current userns can call
> current_fs_user_ns() instead of current_user_ns().  i.e.
> 
> #define current_fs_user_ns()	\
> 	(current->cred->fs_userns ? current->cred->fs_userns \
> 				  : current->cred->user_ns)
> 
> At this point, the filesystem will now always have the correct
> userns it is supposed to use for mapping the uid/gid, right?

Thanks for this interesting idea! I have some trouble with it, though.

This approach (always) seemed conceptually wrong to me. Like Tycho said
somewhere else this basically would act like a global variable which
isn't great.

There's also a substantial difference in that the current fsuid
and fsgid are an actual property of the caller's creds so to have them in
there makes perfect sense. But the user namespace of the vfsmount is a
property of the mount and as such gluing it to the caller's creds when
calling into the vfs is just weird and I would very much like to avoid
this. If inodes didn't have an i_sb member we wouldn't suddenly start
to pass down the s_user_ns via the caller's creds to the filesystems.
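
To make the distinction concrete, here is a minimal userspace sketch of
handing the mount's user namespace to a check as an explicit argument
instead of reading it from the caller's creds. It is a model, not the
code in this series: struct id_extent, struct userns_model, map_id() and
owner_matches() are hypothetical names, and an empty extent table is
simply assumed to stand for the initial user namespace, where mapping is
a nop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* One contiguous range: ids [from, from + count) map to [to, to + count). */
struct id_extent {
	uint32_t from;
	uint32_t to;
	uint32_t count;
};

/* Hypothetical stand-in for the user namespace a mount is marked with. */
struct userns_model {
	const struct id_extent *extents;
	size_t nr_extents;	/* 0 models the initial namespace (identity) */
};

/* Remap an id through the namespace; the stored id is never modified. */
static uint32_t map_id(const struct userns_model *ns, uint32_t id)
{
	assert(ns != NULL);
	if (ns->nr_extents == 0)
		return id;
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct id_extent *e = &ns->extents[i];

		if (id >= e->from && id - e->from < e->count)
			return e->to + (id - e->from);
	}
	return INVALID_ID;
}

/*
 * Ownership check that receives the mount's namespace as a parameter.
 * i_uid is remapped only for the duration of the comparison.
 */
static int owner_matches(const struct userns_model *mnt_ns, uint32_t i_uid,
			 uint32_t caller_fsuid)
{
	return map_id(mnt_ns, i_uid) == caller_fsuid;
}
```

The same inode can then be checked through two mounts with different
mappings without either check touching the caller's creds or the inode.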

I'm also not fond of having to call prepare_creds() and override_creds()
all across the vfs. It's messy and prepare_creds() is especially
problematic during RCU pathwalk where we can't call it. We could
in path_init() at the start of every lookup operation call
prepare_creds() and then override them when we need to switch the
fs_userns global variable and then put_cred() at the end of every path
walk in terminate_walk(). But this means penalizing every lookup
operation with an additional prepare_creds() which needs to be called
at least once, I think. Then during lookup we would need to
override/change this new global fs_userns variable potentially at each
mountpoint crossing to switch back to the correct fs_userns for idmapped
and non-idmapped mounts. We'd also need to rearrange a bunch of
terminate_walk() calls and we would likely end up with a comparable
amount of changes, only that they would indeed be more messy since we're
strapping fs_userns to the caller's creds.
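
To make the difference concrete, here is a small userspace toy (not kernel
code). Every name in it (struct userns, struct mount, ambient_fs_userns and
the walk_*() helpers) is made up for illustration, and the "idmapping" is
just a fixed offset. It only shows that ambient state has to be switched at
every mount crossing, while an explicit argument cannot go stale:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the real kernel types. */
struct userns { unsigned int shift; };	/* toy idmapping: add a fixed offset */
struct mount { const struct userns *mnt_userns; };

/* Model 1: the mapping is ambient state hanging off the caller. */
const struct userns *ambient_fs_userns;

unsigned int owner_ambient(unsigned int fsuid)
{
	assert(ambient_fs_userns != NULL);
	return fsuid + ambient_fs_userns->shift;
}

/* Model 2: the mapping travels with the mount as an explicit argument. */
unsigned int owner_explicit(const struct userns *mnt_userns, unsigned int fsuid)
{
	assert(mnt_userns != NULL);
	return fsuid + mnt_userns->shift;
}

/* Cross from mount a to mount b; return the owner computed on b. */
unsigned int walk_ambient(const struct mount *a, const struct mount *b,
			  unsigned int fsuid, int forget_switch)
{
	ambient_fs_userns = a->mnt_userns;
	(void)owner_ambient(fsuid);
	if (!forget_switch)
		ambient_fs_userns = b->mnt_userns;	/* needed at every crossing */
	return owner_ambient(fsuid);
}

unsigned int walk_explicit(const struct mount *a, const struct mount *b,
			   unsigned int fsuid)
{
	(void)owner_explicit(a->mnt_userns, fsuid);
	return owner_explicit(b->mnt_userns, fsuid);
}
```

With mount a shifted by 100000 and mount b unshifted, the explicit walk
always uses b's mapping on b. The ambient walk only does so if nobody
forgets the switch; otherwise a's mapping leaks onto b.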

An alternative might be to have a combined approach where you pass the
user namespace around in the vfs but when calling into the filesystem
use the fs_userns global variable approach but I would very much prefer
to avoid this and instead cleanly pass down the user namespace
correctly. That's more work, it'll take longer but I and others are
around to see these changes through.

> 
> Also, if we are passing work off to worker threads, duplicating
> the current creds will capture this information and won't leave
> random landmines where stuff doesn't work as it should because the
> worker thread is unaware of the userns that it is supposed to be
> doing filesystem operations under...

That seems like a problem that can be handled by simply keeping the
userns around similar to how we need to already keep creds around.

Christian

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 00/34] fs: idmapped mounts
       [not found]   ` <20201029161231.GA108315@cisco>
@ 2020-10-29 16:23     ` Serge E. Hallyn
  2020-10-29 16:44     ` Eric W. Biederman
  1 sibling, 0 replies; 68+ messages in thread
From: Serge E. Hallyn @ 2020-10-29 16:23 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Eric W. Biederman, Andy Lutomirski, Mimi Zohar, James Bottomley,
	Andreas Dilger, Miklos Szeredi, smbarber, Christoph Hellwig,
	Casey Schaufler, linux-ext4, Mrunal Patel, Kees Cook,
	Arnd Bergmann, Jann Horn, selinux, Josh Triplett, Seth Forshee,
	Alexander Viro, Lennart Poettering, OGAWA Hirofumi,
	Geoffrey Thomas, David Howells, John Johansen, Theodore Tso,
	Dmitry Kasatkin, containers, Jonathan Corbet, linux-unionfs,
	linux-security-module, linux-audit, linux-api, linux-fsdevel,
	Alban Crequy, linux-integrity, Stephen Smalley, Todd Kjos

On Thu, Oct 29, 2020 at 10:12:31AM -0600, Tycho Andersen wrote:
> Hi Eric,
> 
> On Thu, Oct 29, 2020 at 10:47:49AM -0500, Eric W. Biederman wrote:
> > Christian Brauner <christian.brauner@ubuntu.com> writes:
> > 
> > > Hey everyone,
> > >
> > > I vanished for a little while to focus on this work here so sorry for
> > > not being available by mail for a while.
> > >
> > > Since quite a long time we have issues with sharing mounts between
> > > multiple unprivileged containers with different id mappings, sharing a
> > > rootfs between multiple containers with different id mappings, and also
> > > sharing regular directories and filesystems between users with different
> > > uids and gids. The latter use-cases have become even more important with
> > > the availability and adoption of systemd-homed (cf. [1]) to implement
> > > portable home directories.
> > 
> > Can you walk us through the motivating use case?
> > 
> > As of this year's LPC I had the distinct impression that the primary use
> > case for such a feature was due to the RLIMIT_NPROC problem where two
> > containers with the same users still wanted different uid mappings to
> > the disk because the users were conflicting with each other because of
> > the per user rlimits.
> > 
> > Fixing rlimits is straight forward to implement, and easier to manage
> > for implementations and administrators.
> 
> Our use case is to have the same directory exposed to several
> different containers which each have disjoint ID mappings.
> 
> > Reading up on systemd-homed it appears to be a way to have encrypted
> > home directories.  Those home directories can either be encrypted at the
> > fs or at the block level.  Those home directories appear to have the
> > goal of being luggable between systems.  If the systems in question
> > don't have common administration of uids and gids after lugging your
> > encrypted home directory to another system chowning the files is
> > required.
> > 
> > Is that the use case you are looking at removing the need for
> > systemd-homed to avoid chowning after lugging encrypted home directories
> > from one system to another?  Why would it be desirable to avoid the
> > chown?
> 
> Not just systemd-homed, but LXD has to do this, as does our
> application at Cisco, and presumably others.
> 
> Several reasons:
> 
> * the chown is slow
> * the chown requires somewhere to write the delta in metadata (e.g. an
>   overlay workdir, or an LV or something), and there are N copies of
>   this delta, one for each container.
> * it means we need to have a +w filesystem at some point during
>   execution.
> * it's ugly :). Conceptually, the kernel solves the uid shifting
>   problem for us for most other kernel subsystems (including in a
>   limited way fscaps) by configuring a user namespace. It feels like
>   we should be able to do the same with the VFS.

And chown prevents the same inode from being shared by different
containers through different id mappings.  You can overlay, but then
they can't actually share updates.

-serge

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29 16:05   ` Lennart Poettering
@ 2020-10-29 16:36     ` Sargun Dhillon
  2020-10-29 16:54     ` Eric W. Biederman
  1 sibling, 0 replies; 68+ messages in thread
From: Sargun Dhillon @ 2020-10-29 16:36 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Eric W. Biederman, Christian Brauner, Alexander Viro,
	Christoph Hellwig, linux-fsdevel, John Johansen, James Morris,
	Mimi Zohar, Dmitry Kasatkin, Stephen Smalley, Casey Schaufler,
	Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi, Geoffrey Thomas,
	Mrunal Patel, Josh Triplett, Andy Lutomirski, Amir Goldstein,
	Miklos Szeredi, Theodore Tso, Alban Crequy, Tycho Andersen,
	David Howells, James Bottomley, Jann Horn, Seth Forshee,
	Stéphane Graber, Aleksa Sarai, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux

On Thu, Oct 29, 2020 at 05:05:02PM +0100, Lennart Poettering wrote:
> On Do, 29.10.20 10:47, Eric W. Biederman (ebiederm@xmission.com) wrote:
> 
> > Is that the use case you are looking at removing the need for
> > systemd-homed to avoid chowning after lugging encrypted home directories
> > from one system to another?  Why would it be desirable to avoid the
> > chown?
> 
> Yes, I am very interested in seeing Christian's work succeed, for the
> usecase in systemd-homed. In systemd-homed each user gets their own
> private file system, and these fs shall be owned by the user's local
> UID, regardless in which system it is used. The UID should be an
> artifact of the local, individual system in this model, and thus
> the UID of the same user/home on system A might be picked as 1010
> and on another as 1543, and on a third as 1323, and it shouldn't
> matter. This way, home directories become migratable without having to
> universally sync UID assignments: it doesn't matter anymore what the
> local UID is.
> 
> Right now we do a recursive chown() at login time to ensure the home
> dir is properly owned. This has two disadvantages:
> 
> 1. It's slow. In particular on large home dirs, it takes a while to go
>    through the whole user's homedir tree and chown/adjust ACLs for
>    everything.
> 
> 2. Because it is so slow we take a shortcut right now: if the
>    top-level home dir inode itself is owned by the correct user, we
>    skip the recursive chowning. This means in the typical case where a
>    user uses the same system most of the time, and thus the UID is
>    stable we can avoid the slowness. But this comes at a drawback: if
>    the user for some reason ends up with files in their homedir owned
>    by an unrelated user, then we'll never notice or readjust.
> 
> > If the goal is to solve fragmented administration of uid assignment I
> > suggest that it might be better to solve the administration problem so
> > that all of the uids of interest get assigned the same way on all of the
> > systems of interest.
> 
> Well, the goal is to make things simple and be able to use the home
> dir everywhere without any prior preparation, without central UID
> assignment authority.
> 
> The goal is to have a scheme that requires no administration, by
> making the UID management problem go away. Hence, if you suggest
> solving this by having a central administrative authority: this is
> exactly what the model wants to get away from.
> 
> Or to say this differently: just because I personally use three
> different computers, I certainly don't want to set up LDAP or sync
> UIDs manually.
> 
> Lennart
> 
> --
> Lennart Poettering, Berlin

Can you help me understand systemd-homed a little bit?

In the man page it says:

systemd-homed is a system service that may be used to create, remove, change or 
inspect home areas (directories and network mounts and real or loopback block 
devices with a filesystem, optionally encrypted).

It seems that the "underlay" (if you'll call it that; maybe there is a better
term) can be a standalone block device (this sounds close to
systemd-machined?), a btrfs subvolume (which receives its own superblock, IIRC;
I might be wrong, it's been a while since I've used btrfs), or just a directory
that's mapped?

What decides whether it's just a directory and bind-mounted (or a similar 
vfsmount), or an actual superblock?

How does the mapping of "real UIDs" to "namespace UIDs" work when it's just a
bind mount? From the perspective of multiple user namespaces, are all
"underlying" UIDs mapped through, or if I try to look at another user's
home directory will they not show up?

Is there a reason you can't / don't / won't use overlayfs instead of bind mounts?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29 15:51   ` Aleksa Sarai
@ 2020-10-29 16:37     ` Eric W. Biederman
  2020-10-30  2:18       ` Serge E. Hallyn
                         ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Eric W. Biederman @ 2020-10-29 16:37 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Christian Brauner, Alexander Viro, Christoph Hellwig,
	linux-fsdevel, John Johansen, James Morris, Mimi Zohar,
	Dmitry Kasatkin, Stephen Smalley, Casey Schaufler, Arnd Bergmann,
	Andreas Dilger, OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel,
	Josh Triplett, Andy Lutomirski, Amir Goldstein, Miklos Szeredi,
	Theodore Tso, Alban Crequy, Tycho Andersen, David Howells,
	James Bottomley, Jann Horn, Seth Forshee, Stéphane Graber,
	Lennart Poettering, smbarber, Phil Estes, Serge Hallyn,
	Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux

Aleksa Sarai <cyphar@cyphar.com> writes:

> On 2020-10-29, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Christian Brauner <christian.brauner@ubuntu.com> writes:
>> 
>> > Hey everyone,
>> >
>> > I vanished for a little while to focus on this work here so sorry for
>> > not being available by mail for a while.
>> >
>> > Since quite a long time we have issues with sharing mounts between
>> > multiple unprivileged containers with different id mappings, sharing a
>> > rootfs between multiple containers with different id mappings, and also
>> > sharing regular directories and filesystems between users with different
>> > uids and gids. The latter use-cases have become even more important with
>> > the availability and adoption of systemd-homed (cf. [1]) to implement
>> > portable home directories.
>> 
>> Can you walk us through the motivating use case?
>> 
>> As of this year's LPC I had the distinct impression that the primary use
>> case for such a feature was due to the RLIMIT_NPROC problem where two
>> containers with the same users still wanted different uid mappings to
>> the disk because the users were conflicting with each other because of
>> the per user rlimits.
>> 
>> Fixing rlimits is straight forward to implement, and easier to manage
>> for implementations and administrators.
>
> This is separate to the question of "isolated user namespaces" and
> managing different mappings between containers. This patchset is solving
> the same problem that shiftfs solved -- sharing a single directory tree
> between containers that have different ID mappings. rlimits (nor any of
> the other proposals we discussed at LPC) will help with this problem.

First and foremost: A uid shift on write to a filesystem is a security
bug waiting to happen.  This is especially true in the context of
facilities like io_uring, which play very aggressive games with how
process context makes it to system calls.

The only reason containers were not immediately exploitable when io_uring
was introduced is because the mechanisms are built so that even if
something escapes containment the security properties still apply.
Changes to the uid when writing to the filesystem do not have that
property.  The tiniest slip in containment will be a security issue.

This is not even the least bit theoretical.  I have seen reports of how
shiftfs+overlayfs created a situation where anyone could read
/etc/shadow.

If you are going to write using the same uid to disk from different
containers the question becomes why can't those containers configure
those users to use the same kuid?

What fixing rlimits does is it fixes one of the reasons that different
containers could not share the same kuid for users that want to write to
disk with the same uid.


I humbly suggest that it will be more secure, and easier to maintain for
both developers and users if we fix the reasons people want different
containers to have the same user running with different kuids.

If not what are the reasons we fundamentally need the same on-disk user
using multiple kuids in the kernel?

Eric

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 00/34] fs: idmapped mounts
       [not found]   ` <20201029161231.GA108315@cisco>
  2020-10-29 16:23     ` Serge E. Hallyn
@ 2020-10-29 16:44     ` Eric W. Biederman
  2020-10-29 18:04       ` Stéphane Graber
  1 sibling, 1 reply; 68+ messages in thread
From: Eric W. Biederman @ 2020-10-29 16:44 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Christian Brauner, Lennart Poettering, Mimi Zohar, David Howells,
	Andreas Dilger, containers, Miklos Szeredi, smbarber,
	Christoph Hellwig, linux-ext4, Mrunal Patel, Kees Cook,
	Arnd Bergmann, Jann Horn, selinux, Josh Triplett, linux-fsdevel,
	Alexander Viro, Andy Lutomirski, OGAWA Hirofumi, Geoffrey Thomas,
	James Bottomley, John Johansen, Theodore Tso, Seth Forshee,
	Dmitry Kasatkin, Stephen Smalley, Jonathan Corbet, linux-unionfs,
	linux-security-module, linux-audit, linux-api, Casey Schaufler,
	Alban Crequy, linux-integrity, Todd Kjos

Tycho Andersen <tycho@tycho.pizza> writes:

> Hi Eric,
>
> On Thu, Oct 29, 2020 at 10:47:49AM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@ubuntu.com> writes:
>> 
>> > Hey everyone,
>> >
>> > I vanished for a little while to focus on this work here so sorry for
>> > not being available by mail for a while.
>> >
>> > Since quite a long time we have issues with sharing mounts between
>> > multiple unprivileged containers with different id mappings, sharing a
>> > rootfs between multiple containers with different id mappings, and also
>> > sharing regular directories and filesystems between users with different
>> > uids and gids. The latter use-cases have become even more important with
>> > the availability and adoption of systemd-homed (cf. [1]) to implement
>> > portable home directories.
>> 
>> Can you walk us through the motivating use case?
>> 
>> As of this year's LPC I had the distinct impression that the primary use
>> case for such a feature was due to the RLIMIT_NPROC problem where two
>> containers with the same users still wanted different uid mappings to
>> the disk because the users were conflicting with each other because of
>> the per user rlimits.
>> 
>> Fixing rlimits is straight forward to implement, and easier to manage
>> for implementations and administrators.
>
> Our use case is to have the same directory exposed to several
> different containers which each have disjoint ID mappings.

Why do you have disjoint ID mappings for the users that are writing
to disk with the same ID?

>> Reading up on systemd-homed it appears to be a way to have encrypted
>> home directories.  Those home directories can either be encrypted at the
>> fs or at the block level.  Those home directories appear to have the
>> goal of being luggable between systems.  If the systems in question
>> don't have common administration of uids and gids after lugging your
>> encrypted home directory to another system chowning the files is
>> required.
>> 
>> Is that the use case you are looking at removing the need for
>> systemd-homed to avoid chowning after lugging encrypted home directories
>> from one system to another?  Why would it be desirable to avoid the
>> chown?
>
> Not just systemd-homed, but LXD has to do this,

I asked why the same disk users are assigned different kuids and the
only reason I have heard that LXD does this is the RLIMIT_NPROC problem.

Perhaps there is another reason.

In part this is why I am eager to hear people's use cases, and why I was
trying very hard to make certain we get the requirements.

I want the real requirements though and some thought, not just "we did
this and it hurts".  Changing the uids on write is a very hard problem,
and not just in implementing it but also in maintaining and
understanding what is going on.

Eric

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29 16:05   ` Lennart Poettering
  2020-10-29 16:36     ` Sargun Dhillon
@ 2020-10-29 16:54     ` Eric W. Biederman
  1 sibling, 0 replies; 68+ messages in thread
From: Eric W. Biederman @ 2020-10-29 16:54 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Christian Brauner, Alexander Viro, Christoph Hellwig,
	linux-fsdevel, John Johansen, James Morris, Mimi Zohar,
	Dmitry Kasatkin, Stephen Smalley, Casey Schaufler, Arnd Bergmann,
	Andreas Dilger, OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel,
	Josh Triplett, Andy Lutomirski, Amir Goldstein, Miklos Szeredi,
	Theodore Tso, Alban Crequy, Tycho Andersen, David Howells,
	James Bottomley, Jann Horn, Seth Forshee, Stéphane Graber,
	Aleksa Sarai, smbarber, Phil Estes, Serge Hallyn, Kees Cook,
	Todd Kjos, Jonathan Corbet, containers, linux-security-module,
	linux-api, linux-ext4, linux-unionfs, linux-audit,
	linux-integrity, selinux

Lennart Poettering <lennart@poettering.net> writes:

> On Do, 29.10.20 10:47, Eric W. Biederman (ebiederm@xmission.com) wrote:
>
>> Is that the use case you are looking at removing the need for
>> systemd-homed to avoid chowning after lugging encrypted home directories
>> from one system to another?  Why would it be desirable to avoid the
>> chown?
>
> Yes, I am very interested in seeing Christian's work succeed, for the
> usecase in systemd-homed. In systemd-homed each user gets their own
> private file system, and these fs shall be owned by the user's local
> UID, regardless in which system it is used. The UID should be an
> artifact of the local, individual system in this model, and thus
> the UID of the same user/home on system A might be picked as 1010
> and on another as 1543, and on a third as 1323, and it shouldn't
> matter. This way, home directories become migratable without having to
> universally sync UID assignments: it doesn't matter anymore what the
> local UID is.
>
> Right now we do a recursive chown() at login time to ensure the home
> dir is properly owned. This has two disadvantages:
>
> 1. It's slow. In particular on large home dirs, it takes a while to go
>    through the whole user's homedir tree and chown/adjust ACLs for
>    everything.
>
> 2. Because it is so slow we take a shortcut right now: if the
>    top-level home dir inode itself is owned by the correct user, we
>    skip the recursive chowning. This means in the typical case where a
>    user uses the same system most of the time, and thus the UID is
>    stable we can avoid the slowness. But this comes at a drawback: if
>    the user for some reason ends up with files in their homedir owned
>    by an unrelated user, then we'll never notice or readjust.


The classic solution to this problem for removable media is the
uid=XXX and gid=XXX mount options.

I suspect a similar solution can apply here.

I don't think you need a solution that requires different kuids
to be able to write to the same filesystem uid.
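
As a sketch of what I mean, assuming a filesystem such as vfat that
accepts these options: the uid= and gid= options and mount(2) are real,
the two helper names below are made up for illustration, and the device
and target paths are whatever the caller supplies.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/types.h>

/*
 * Hypothetical helper: format "uid=N,gid=M" into buf.
 * Returns the string length, or -1 if buf is too small.
 */
int format_owner_opts(char *buf, size_t len, uid_t uid, gid_t gid)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len, "uid=%u,gid=%u",
		     (unsigned int)uid, (unsigned int)gid);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Hypothetical helper: mount a vfat device so that every file appears
 * owned by uid:gid. Requires CAP_SYS_ADMIN; returns mount(2)'s result.
 */
int mount_vfat_owned_by(const char *dev, const char *target,
			uid_t uid, gid_t gid)
{
	char opts[64];

	if (format_owner_opts(opts, sizeof(opts), uid, gid) < 0)
		return -1;
	return mount(dev, target, "vfat", MS_NOSUID | MS_NODEV, opts);
}
```

The same media can then be mounted with uid=1010 on one machine and
uid=1543 on another without touching anything on disk.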

>> If the goal is to solve fragmented administration of uid assignment I
>> suggest that it might be better to solve the administration problem so
>> that all of the uids of interest get assigned the same way on all of the
>> systems of interest.
>
> Well, the goal is to make things simple and be able to use the home
> dir everywhere without any prior preparation, without central UID
> assignment authority.
>
> The goal is to have a scheme that requires no administration, by
> making the UID management problem go away. Hence, if you suggest
> solving this by having a central administrative authority: this is
> exactly what the model wants to get away from.

For files that can be accessed by more than a single user this is
fundamentally necessary.  Otherwise group permissions and ACLs cannot
work.  They wind up as meaningless garbage, because without some kind of
synchronization those other users and groups simply cannot be
represented.

> Or to say this differently: just because I personally use three
> different computers, I certainly don't want to set up LDAP or sync
> UIDs manually.

If they are single-user systems why should you need to?

But if permissions on files are going to be at all meaningful it is
fundamentally a requirement that there be no confusion about which
party the other parties are talking about.  To the best of my knowledge
syncing uids/usernames between machines is as simple as it can get.

Eric

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29 16:44     ` Eric W. Biederman
@ 2020-10-29 18:04       ` Stéphane Graber
  0 siblings, 0 replies; 68+ messages in thread
From: Stéphane Graber @ 2020-10-29 18:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Tycho Andersen, Andy Lutomirski, Mimi Zohar, James Bottomley,
	Andreas Dilger, Miklos Szeredi, Stephen Barber,
	Christoph Hellwig, Casey Schaufler, linux-ext4, Mrunal Patel,
	Kees Cook, Arnd Bergmann, Jann Horn, selinux, Josh Triplett,
	Seth Forshee, Alexander Viro, Lennart Poettering, OGAWA Hirofumi,
	Geoffrey Thomas, David Howells, John Johansen, Theodore Tso,
	Dmitry Kasatkin, Linux Containers, Jonathan Corbet, overlayfs,
	linux-security-module, linux-audit, Linux API, linux-fsdevel,
	Alban Crequy, linux-integrity, Stephen Smalley, Todd Kjos

On Thu, Oct 29, 2020 at 12:45 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Tycho Andersen <tycho@tycho.pizza> writes:
>
> > Hi Eric,
> >
> > On Thu, Oct 29, 2020 at 10:47:49AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@ubuntu.com> writes:
> >>
> >> > Hey everyone,
> >> >
> >> > I vanished for a little while to focus on this work here so sorry for
> >> > not being available by mail for a while.
> >> >
> >> > Since quite a long time we have issues with sharing mounts between
> >> > multiple unprivileged containers with different id mappings, sharing a
> >> > rootfs between multiple containers with different id mappings, and also
> >> > sharing regular directories and filesystems between users with different
> >> > uids and gids. The latter use-cases have become even more important with
> >> > the availability and adoption of systemd-homed (cf. [1]) to implement
> >> > portable home directories.
> >>
> >> Can you walk us through the motivating use case?
> >>
> >> As of this year's LPC I had the distinct impression that the primary use
> >> case for such a feature was due to the RLIMIT_NPROC problem where two
> >> containers with the same users still wanted different uid mappings to
> >> the disk because the users were conflicting with each other because of
> >> the per user rlimits.
> >>
> >> Fixing rlimits is straight forward to implement, and easier to manage
> >> for implementations and administrators.
> >
> > Our use case is to have the same directory exposed to several
> > different containers which each have disjoint ID mappings.
>
> Why do you have disjoint ID mappings for the users that are writing
> to disk with the same ID?
>
> >> Reading up on systemd-homed it appears to be a way to have encrypted
> >> home directories.  Those home directories can either be encrypted at the
> >> fs or at the block level.  Those home directories appear to have the
> >> goal of being luggable between systems.  If the systems in question
> >> don't have common administration of uids and gids after lugging your
> >> encrypted home directory to another system chowning the files is
> >> required.
> >>
> >> Is that the use case you are looking at removing the need for
> >> systemd-homed to avoid chowning after lugging encrypted home directories
> >> from one system to another?  Why would it be desirable to avoid the
> >> chown?
> >
> > Not just systemd-homed, but LXD has to do this,
>
> I asked why the same disk users are assigned different kuids and the
> only reason I have heard that LXD does this is the RLIMIT_NPROC problem.
>
> Perhaps there is another reason.
>
> In part this is why I am eager to hear people's use cases, and why I was
> trying very hard to make certain we get the requirements.
>
> I want the real requirements though and some thought, not just "we did
> this and it hurts".  Changing the uids on write is a very hard problem,
> and not just in implementing it but also in maintaining and
> understanding what is going on.

The most common cases where shiftfs is used or where folks would like
to use it today are (by importance):
 - Fast container creation (by not having to uid/gid shift all files
in the downloaded image)
 - Sharing data between the host system and a container (some paths
under /home being the most common)
 - Sharing data between unprivileged containers with a disjointed map
 - Sharing data between multiple containers, some privileged, some unprivileged

Fixing the ulimit issue only takes care of one of those (3rd item), it
does not solve any of the other cases.

The first item on there alone can be quite significant. Creation and
startup of a regular Debian container on my system takes around 500ms
when shiftfs is used (btrfs/lvm/zfs copy-on-write clone of the image,
setup shiftfs, start container) compared to 2-3s when running without
it (same clone, followed by rewrite of all uid/gid present on the fs,
including acls and capabilities, then start container). And that's on
a fast system with an NVME SSD and a small rootfs. We have had reports
of a few users running on slow spinning rust with large containers
where shifting can take several minutes.

The second item can technically be worked around without shifted
bind-mounts by doing userns map hole punching, mapping the user's
uid/gid from the host straight into the container. The downside to
this is that another shifting pass becomes needed for any file outside
of the bind-mounted path (or it would become owned by -1/-1) and it's
very much not dynamic, requiring the container be stopped, config
updated by the user, /etc/subuid and subgid maps being updated and
container started back up. If you need another user/group be exposed,
start all over again...
This is far more complex, slow and disruptive than the shifted
approach where we just need to do:
   lxc config device add MY-CONTAINER home disk source=/home path=/home shift=true
To inject a new mount of /home from the host into the container with a
shifting layer in place, no need to reconfig subuid/subgid, no need to
re-create the userns to update the mapping and no need to go through
the container's rootfs for any file which may now need remapping
because of the map change.
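
To illustrate the hole-punching side effect, here is a simplified
userspace sketch of a uid_map lookup. The extent layout mirrors the
three-column uid_map format, but struct id_extent, id_from_host() and the
two example maps (a 65536-id range based at host uid 100000, and host uid
1000 as the punched hole) are assumptions made up for this example:

```c
#include <assert.h>
#include <stddef.h>

/* One uid_map line: ids [inside, inside+count) map to [outside, outside+count). */
struct id_extent { unsigned int inside, outside, count; };

#define UNMAPPED ((unsigned int)-1)

/* Translate a host id to the id seen in the container, or UNMAPPED. */
unsigned int id_from_host(const struct id_extent *map, size_t n,
			  unsigned int host_id)
{
	size_t i;

	assert(map != NULL);
	for (i = 0; i < n; i++) {
		if (host_id >= map[i].outside &&
		    host_id - map[i].outside < map[i].count)
			return map[i].inside + (host_id - map[i].outside);
	}
	return UNMAPPED;
}

/* Plain map: container 0-65535 backed by host 100000-165535. */
const struct id_extent plain_map[] = {
	{ 0, 100000, 65536 },
};

/* Same range with a hole: container uid 1000 is host uid 1000. */
const struct id_extent punched_map[] = {
	{ 0,    100000, 1000  },
	{ 1000, 1000,   1     },
	{ 1001, 101001, 64535 },
};
```

Under plain_map a file owned by host uid 101000 shows up as uid 1000 in
the container. Under punched_map host uid 1000 shows up as uid 1000
instead, and anything still owned by host uid 101000 is no longer mapped
at all, which is why another shifting pass over the rootfs is needed.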

Stéphane

> Eric

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
                   ` (36 preceding siblings ...)
  2020-10-29 15:47 ` Eric W. Biederman
@ 2020-10-29 21:58 ` Andy Lutomirski
  2020-10-30 12:01   ` Christian Brauner
  37 siblings, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2020-10-29 21:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Christoph Hellwig, linux-fsdevel, John Johansen,
	James Morris, Mimi Zohar, Dmitry Kasatkin, Stephen Smalley,
	Casey Schaufler, Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi,
	Geoffrey Thomas, Mrunal Patel, Josh Triplett, Andy Lutomirski,
	Amir Goldstein, Miklos Szeredi, Theodore Tso, Alban Crequy,
	Tycho Andersen, David Howells, James Bottomley, Jann Horn,
	Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux



> On Oct 28, 2020, at 5:35 PM, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> 
> Hey everyone,
> 
> I vanished for a little while to focus on this work here so sorry for
> not being available by mail for a while.
> 
> Since quite a long time we have issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.
> 
> The solutions we have tried and proposed so far include the introduction
> of fsid mappings, a tiny overlay based filesystem, and an approach to
> call override creds in the vfs. None of these solutions have covered all
> of the above use-cases.
> 
> The solution proposed here has its origins in multiple discussions
> during Linux Plumbers 2017 during and after the end of the containers
> microconference.
> To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> James, and myself. A variant of the solution proposed here has also been
> discussed, again to the best of my knowledge, after a Linux conference
> in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> after Linux Plumbers.
> I've taken the time to finally implement a working version of this
> solution over the last weeks to the best of my abilities. Tycho has
> signed up for this slightly crazy endeavour as well and he has helped
> with the conversion of the xattr codepaths.
> 
> The core idea is to make idmappings a property of struct vfsmount
> instead of tying it to a process being inside of a user namespace which
> has been the case for all other proposed approaches.
> It means that idmappings become a property of bind-mounts, i.e. each
> bind-mount can have a separate idmapping. This has the obvious advantage
> that idmapped mounts can be created inside of the initial user
> namespace, i.e. on the host itself instead of requiring the caller to be
> located inside of a user namespace. This enables such use-cases as e.g.
> making a usb stick available in multiple locations with different
> idmappings (see the vfat port that is part of this patch series).
> 
> The vfsmount struct gains a new struct user_namespace member. The
> idmapping of the user namespace becomes the idmapping of the mount. A
> caller that is either privileged with respect to the user namespace of
> the superblock of the underlying filesystem or a caller that is
> privileged with respect to the user namespace a mount has been idmapped
> with can create a new bind-mount and mark it with a user namespace.

So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside.

For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege.

Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this.  Inside the namespace, the user creates a suid-root file.

Now, outside the namespace, the user has privilege over the namespace.  (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.) So the user makes a new bind mount and idmaps it to the init namespace. Game over.

So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace.  We already do the latter for the vfsmnt’s mntns’s userns.

Hmm.  What happens if we require that an idmap userns equal the vfsmnt’s mntns’s userns?  Is that too limiting?

I hope that whatever solution gets used is straightforward enough to wrap one’s head around.

> When a file/inode is accessed through an idmapped mount the i_uid and
> i_gid of the inode will be remapped according to the user namespace the
> mount has been marked with. When a new object is created based on the
> fsuid and fsgid of the caller they will similarly be remapped according
> to the user namespace of the mount they are created from.

By “mapped according to”, I presume you mean that the on-disk uid/gid is the uid/gid as seen in the user namespace in question.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29 16:37     ` Eric W. Biederman
@ 2020-10-30  2:18       ` Serge E. Hallyn
  2020-10-30 15:07       ` Seth Forshee
  2020-11-03 14:10       ` Alban Crequy
  2 siblings, 0 replies; 68+ messages in thread
From: Serge E. Hallyn @ 2020-10-30  2:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aleksa Sarai, Christian Brauner, Alexander Viro,
	Christoph Hellwig, linux-fsdevel, John Johansen, James Morris,
	Mimi Zohar, Dmitry Kasatkin, Stephen Smalley, Casey Schaufler,
	Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi, Geoffrey Thomas,
	Mrunal Patel, Josh Triplett, Andy Lutomirski, Amir Goldstein,
	Miklos Szeredi, Theodore Tso, Alban Crequy, Tycho Andersen,
	David Howells, James Bottomley, Jann Horn, Seth Forshee,
	Stéphane Graber, Lennart Poettering, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux

On Thu, Oct 29, 2020 at 11:37:23AM -0500, Eric W. Biederman wrote:
> Aleksa Sarai <cyphar@cyphar.com> writes:
> 
> > On 2020-10-29, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> Christian Brauner <christian.brauner@ubuntu.com> writes:
> >> 
> >> > Hey everyone,
> >> >
> >> > I vanished for a little while to focus on this work here so sorry for
> >> > not being available by mail for a while.
> >> >
> >> > Since quite a long time we have issues with sharing mounts between
> >> > multiple unprivileged containers with different id mappings, sharing a
> >> > rootfs between multiple containers with different id mappings, and also
> >> > sharing regular directories and filesystems between users with different
> >> > uids and gids. The latter use-cases have become even more important with
> >> > the availability and adoption of systemd-homed (cf. [1]) to implement
> >> > portable home directories.
> >> 
> >> Can you walk us through the motivating use case?
> >> 
> >> As of this year's LPC I had the distinct impression that the primary use
> >> case for such a feature was due to the RLIMIT_NPROC problem where two
> >> containers with the same users still wanted different uid mappings to
> >> the disk because the users were conflicting with each other because of
> >> the per user rlimits.
> >> 
> >> Fixing rlimits is straightforward to implement, and easier to manage
> >> for implementations and administrators.
> >
> > This is separate to the question of "isolated user namespaces" and
> > managing different mappings between containers. This patchset is solving
> > the same problem that shiftfs solved -- sharing a single directory tree
> > between containers that have different ID mappings. Neither rlimits nor
> > any of the other proposals we discussed at LPC will help with this problem.
> 
> First and foremost: A uid shift on write to a filesystem is a security
> bug waiting to happen.  This is especially true in the context of facilities
> like iouring, that play very aggressive games with how process context
> makes it to system calls.
> 
> The only reason containers were not immediately exploitable when iouring
> was introduced is because the mechanisms are built so that even if
> something escapes containment the security properties still apply.
> Changes to the uid when writing to the filesystem does not have that
> property.  The tiniest slip in containment will be a security issue.
> 
> This is not even the least bit theoretical.  I have seen reports of how
> shiftfs+overlayfs created a situation where anyone could read
> /etc/shadow.
> 
> If you are going to write using the same uid to disk from different
> containers the question becomes why can't those containers configure
> those users to use the same kuid?

Because if user 'myapp' in two otherwise isolated containers both have
the same kuid, so that they can write to a shared directory, then root
in container 1 has privilege over all files owned by 'myapp' in
container 2.

Whereas if they can each have distinct kuids, but when writing to the
shared fs have a shared uid not otherwise belonging to either container,
their rootfs's can remain completely off limits to each other.

> What fixing rlimits does is it fixes one of the reasons that different
> containers could not share the same kuid for users that want to write to
> disk with the same uid.
> 
> 
> I humbly suggest that it will be more secure, and easier to maintain for
> both developers and users if we fix the reasons people want different
> containers to have the same user running with different kuids.
> 
> If not what are the reasons we fundamentally need the same on-disk user
> using multiple kuids in the kernel?
> 
> Eric

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 33/34] overlayfs: handle idmapped merged mounts
  2020-10-29  0:32 ` [PATCH 33/34] overlayfs: handle idmapped merged mounts Christian Brauner
@ 2020-10-30  9:57   ` Amir Goldstein
  0 siblings, 0 replies; 68+ messages in thread
From: Amir Goldstein @ 2020-10-30  9:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Christoph Hellwig, linux-fsdevel, John Johansen,
	James Morris, Mimi Zohar, Dmitry Kasatkin, Stephen Smalley,
	Casey Schaufler, Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi,
	Geoffrey Thomas, Mrunal Patel, Josh Triplett, Andy Lutomirski,
	Miklos Szeredi, Theodore Tso, Alban Crequy, Tycho Andersen,
	David Howells, James Bottomley, Jann Horn, Seth Forshee,
	Stéphane Graber, Aleksa Sarai, Lennart Poettering,
	Eric W. Biederman, smbarber, Phil Estes, Serge Hallyn, Kees Cook,
	Todd Kjos, Jonathan Corbet, Linux Containers, LSM List,
	Linux API, Ext4, overlayfs, Linux Audit, linux-integrity,
	selinux

> -int ovl_permission(struct inode *inode, int mask)
> +int ovl_permission_mapped(struct user_namespace *user_ns,
> +                         struct inode *inode, int mask)
>  {
>         struct inode *upperinode = ovl_inode_upper(inode);
>         struct inode *realinode = upperinode ?: ovl_inode_lower(inode);
> -       struct user_namespace *user_ns;
> +       struct user_namespace *real_user_ns;
>         const struct cred *old_cred;
>         int err;
>
> @@ -302,15 +313,15 @@ int ovl_permission(struct inode *inode, int mask)
>         }
>
>         if (upperinode)
> -               user_ns = ovl_upper_mnt_user_ns(OVL_FS(inode->i_sb));
> +               real_user_ns = ovl_upper_mnt_user_ns(OVL_FS(inode->i_sb));
>         else
> -               user_ns = OVL_I(inode)->lower_user_ns;
> +               real_user_ns = OVL_I(inode)->lower_user_ns;

These changes look strange in this patch. Better to use real_user_ns in the
previous patch.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 32/34] overlayfs: handle idmapped lower directories
  2020-10-29  0:32 ` [PATCH 32/34] overlayfs: handle idmapped lower directories Christian Brauner
@ 2020-10-30 11:10   ` Amir Goldstein
  2020-10-30 11:52     ` Christian Brauner
  0 siblings, 1 reply; 68+ messages in thread
From: Amir Goldstein @ 2020-10-30 11:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: overlayfs, Miklos Szeredi, Linux Containers, linux-fsdevel,
	linux-kernel, Al Viro, Christoph Hellwig

[reducing CC list for overlayfs specific comments]

On Thu, Oct 29, 2020 at 2:41 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> As an overlay filesystem overlayfs can be mounted on top of other filesystems
> and bind-mounts. This means it can also be bind-mounted on top of one or
> multiple idmapped lower directories and/or an idmapped upper directory.
> In previous patches we've enabled the vfs to handle idmapped mounts and so we
> should have all of the helpers available to let overlayfs handle idmapped
> mounts. To handle such scenarios correctly overlayfs needs to be switched from
> non-idmapped mount aware vfs helpers to idmapped mount aware vfs helpers.
> In order to have overlayfs correctly support idmapped mounts as lower and upper
> directories we need to pass down the mount's user namespace associated with the
> lower and upper directories whenever we perform idmapped mount aware operations.
>
> Luckily, when overlayfs is mounted it creates private mounts of the lower and
> upper directories via clone_private_mount() which calls clone_mnt() internally.
> If any of the lower or upper directories are on an idmapped mount then
> clone_mnt() called in clone_private_mount() will also pin the user namespace
> the vfsmount has been marked with. Overlayfs stashes the information about the
> lower and upper directories and the mounts that they are on so that this
> information can be retrieved when needed. This makes it possible to support
> idmapped mounts as lower and upper directories. Support for idmapped merged
> mounts will be added in a follow-up patch.
>
> Whenever we perform idmap mount aware operations we need to pass down the
> mount's user namespace to the vfs helpers we've introduced in earlier patches.
> Permission checks on the lower and upper directories are performed by switching
> from the inode_permission() and inode_owner_or_capable() helpers to the new
> mapped_inode_permission() and mapped_inode_owner_or_capable() helpers.
> Similarly we switch from non-idmapped mount aware lookup helpers to
> idmapped-mount aware lookup helpers. In all cases where we need to check
> permissions in the lower or upper directories we pass down the mount associated
> with the lower and upper directory at the time of creating the overlayfs mount.
> This nicely lines up with the permission model outlined in the overlayfs
> documentation (Special thanks to Amir for pointing me to this document!).
>
> Thanks to Amir for pointing me to the overlayfs permission model documentation!
>
> A very special thank you to my friend Seth Forshee who has given invaluable
> advice when coming up with these patches!
>
> As an example let's create an overlayfs mount in the initial user namespace with
> an idmapped lower and upper mount:
>
>  # This is a directory where all file ownership starts with uid and gid 10000.
>  root@f2-vm:/# ls -al /var/lib/lxc/f1/rootfs
>  total 108
>  drwxr-xr-x  20 10000 10000  4096 Oct 28 11:13 .
>  drwxrwx---   4 10000 10000  4096 Oct 28 11:17 ..
>  -rw-r--r--   1 10000 10000  7197 Oct 24 09:45 asdf
>  drwxr-xr-x   2 10000 10000  4096 Oct 16 19:07 ASDF
>  lrwxrwxrwx   1 10000 10000     7 Sep 24 07:43 bin -> usr/bin
>  drwxr-xr-x   2 10000 10000  4096 Apr 15  2020 boot
>  -rw-r--r--   1 10000 10000 13059 Oct  8 12:38 ccc
>  drwxr-xr-x   2 11000 11000  4096 Oct 23 17:10 ddd
>  drwxr-xr-x   3 10000 10000  4096 Sep 25 08:04 dev
>  drwxr-xr-x  61 10000 10000  4096 Sep 25 08:04 etc
>
>  # Create an idmapped mount on the host such that all files owned by uid and
>  # gid 10000 show up as being owned by uid 0 and gid 0.
>  /mount2 --idmap both:10000:0:10000 /var/lib/lxc/f1/rootfs/ /lower1/
>
>  # Verify that the files show up as uid and gid 0 under the idmapped mount at /lower1
>  root@f2-vm:/# ls -al /lower1/
>  total 108
>  drwxr-xr-x  20 root   root    4096 Oct 28 11:13 .
>  drwxr-xr-x  29 root   root    4096 Oct 28 11:57 ..
>  -rw-r--r--   1 root   root    7197 Oct 24 09:45 asdf
>  drwxr-xr-x   2 root   root    4096 Oct 16 19:07 ASDF
>  lrwxrwxrwx   1 root   root       7 Sep 24 07:43 bin -> usr/bin
>  drwxr-xr-x   2 root   root    4096 Apr 15  2020 boot
>  -rw-r--r--   1 root   root   13059 Oct  8 12:38 ccc
>  drwxr-xr-x   2 ubuntu ubuntu  4096 Oct 23 17:10 ddd
>  drwxr-xr-x   3 root   root    4096 Sep 25 08:04 dev
>  drwxr-xr-x  61 root   root    4096 Sep 25 08:04 etc
>
>  # Create an idmapped upper mount at /upper. Now, files created as id 0 will
>  # show up as id 10000 in /upper and files created as id 1000 will show up as
>  # id 11000 under /upper.
>  /mount2 --idmap both:10000:0:10000 /upper /upper
>  mkdir /upper/upper
>  mkdir /upper/work
>
>  # Create an overlayfs mount.
>  mount -t overlay overlay -o lowerdir=/lower1/,upperdir=/upper/upper/,workdir=/upper/work/ /merged/
>
>  root@f2-vm:/# ls -al /merged/
>  total 124
>  drwxr-xr-x   1 root   root    4096 Oct 25 23:04 .
>  drwxr-xr-x  29 root   root    4096 Oct 28 12:07 ..
>  -rw-r--r--   1 root   root    7197 Oct 24 09:45 asdf
>  drwxr-xr-x   2 root   root    4096 Oct 16 19:07 ASDF
>  lrwxrwxrwx   1 root   root       7 Sep 24 07:43 bin -> usr/bin
>  drwxr-xr-x   2 root   root    4096 Apr 15  2020 boot
>  -rw-r--r--   1 root   root   13059 Oct  8 12:38 ccc
>  drwxr-xr-x   2 ubuntu ubuntu  4096 Oct 23 17:10 ddd
>  drwxr-xr-x   3 root   root    4096 Sep 25 08:04 dev
>  drwxr-xr-x  61 root   root    4096 Sep 25 08:04 etc
>
>  # Create a file as root
>  root@f2-vm:/merged# touch /merged/A-FILE
>
>  root@f2-vm:/merged# ls -al /merged/A-FILE
>  -rw-r--r-- 1 root root 0 Oct 28 12:16 /merged/A-FILE
>
>  # Chown the file to a simple user
>  root@f2-vm:/merged# chown 1000:1000 /merged/A-FILE
>
>  root@f2-vm:/merged# ls -al /merged/A-FILE
>  -rw-r--r-- 1 ubuntu ubuntu 0 Oct 28 12:16 /merged/A-FILE
>
>  # Create a directory and delegate to simple user
>  root@f2-vm:/merged# mkdir /merged/A-DIR
>
>  root@f2-vm:/merged# chown 1000:1000 /merged/A-DIR/
>
>  # Login as user
>  root@f2-vm:/merged# sudo -u ubuntu -- bash -i
>
>  # Create a file as simple user
>  ubuntu@f2-vm:/merged$ touch /merged/A-DIR/A-USER-FILE
>
>  ubuntu@f2-vm:/merged$ ls -al /merged/A-DIR/A-USER-FILE
>  -rw-rw-r-- 1 ubuntu ubuntu 0 Oct 28 12:18 /merged/A-DIR/A-USER-FILE
>
>  # Let's look at these files in our idmapped upper directory
>  ubuntu@f2-vm:/$ ls -alR /upper/upper/
>  /upper/upper/:
>  total 12
>  drwxr-xr-x 3 root   root   4096 Oct 28 12:23 .
>  drwxr-xr-x 4 root   root   4096 Oct 21 13:48 ..
>  drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 12:18 A-DIR
>  -rw-r--r-- 1 ubuntu ubuntu    0 Oct 28 12:16 A-FILE
>
>  /upper/upper/A-DIR:
>  total 8
>  drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 12:18 .
>  drwxr-xr-x 3 root   root   4096 Oct 28 12:23 ..
>  -rw-rw-r-- 1 ubuntu ubuntu    0 Oct 28 12:18 A-USER-FILE
>
>  # Let's remove the idmapped /upper mount (overlayfs will have its own private mount anyway)
>  umount /upper
>
>  # Let's look at these files in our upper directory with the idmapped mount removed
>  ubuntu@f2-vm:/$ ls -alR /upper/upper/
>  /upper/upper/:
>  total 12
>  drwxr-xr-x 3 10000 10000 4096 Oct 28 12:23 .
>  drwxr-xr-x 4 10000 10000 4096 Oct 21 13:48 ..
>  drwxr-xr-x 2 11000 11000 4096 Oct 28 12:18 A-DIR
>  -rw-r--r-- 1 11000 11000    0 Oct 28 12:16 A-FILE
>
>  /upper/upper/A-DIR:
>  total 8
>  drwxr-xr-x 2 11000 11000 4096 Oct 28 12:18 .
>  drwxr-xr-x 3 10000 10000 4096 Oct 28 12:23 ..
>  -rw-rw-r-- 1 11000 11000    0 Oct 28 12:18 A-USER-FILE
>
>  # Let's create a few acls from the /merged directory on an already existing file
>  # triggering a copy-up operation
>   root@f2-vm:/merged# setfacl -m u:1000:rwx /merged/asdf
>   root@f2-vm:/merged# getfacl /merged/asdf
>   getfacl: Removing leading '/' from absolute path names
>   # file: merged/asdf
>   # owner: root
>   # group: root
>   user::rw-
>   user:ubuntu:rwx
>   group::r--
>   mask::rwx
>   other::r--
>
>   # Let's look at this file from our upper directory
>   root@f2-vm:/merged# getfacl /upper/upper/asdf
>   getfacl: Removing leading '/' from absolute path names
>   # file: upper/upper/asdf
>   # owner: 10000
>   # group: 10000
>   user::rw-
>   user:11000:rwx
>   group::r--
>   mask::rwx
>   other::r--
>
> Cc: Seth Forshee <seth.forshee@canonical.com>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
>  fs/overlayfs/copy_up.c   | 100 +++++++++++++++-----------
>  fs/overlayfs/dir.c       | 151 ++++++++++++++++++++++-----------------
>  fs/overlayfs/export.c    |   3 +-
>  fs/overlayfs/file.c      |  23 +++---
>  fs/overlayfs/inode.c     |  89 ++++++++++++++++++-----
>  fs/overlayfs/namei.c     |  64 ++++++++++-------
>  fs/overlayfs/overlayfs.h | 149 ++++++++++++++++++++++++++------------
>  fs/overlayfs/ovl_entry.h |   1 +
>  fs/overlayfs/readdir.c   |  34 +++++----
>  fs/overlayfs/super.c     | 106 ++++++++++++++++++++-------
>  fs/overlayfs/util.c      |  38 +++++-----
>  11 files changed, 494 insertions(+), 264 deletions(-)
>

Christian,

I think the vfs patches raised some eyebrows, but I can understand why you
started with the explicit opt-in approach.
If for nothing else, this posting is very educational about all the places
that idmapping affects.

WRT this overlayfs patch: it takes the same approach as the vfs patches, passing
user_ns all over the place, but there is usually no need for that, because in
most cases there is a much less intrusive way to get to the mnt.

Some examples below.

> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> index 955ecd4030f0..1b8721796fd4 100644
> --- a/fs/overlayfs/copy_up.c
> +++ b/fs/overlayfs/copy_up.c
> @@ -43,7 +43,8 @@ static bool ovl_must_copy_xattr(const char *name)
>                !strncmp(name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN);
>  }
>
> -int ovl_copy_xattr(struct super_block *sb, struct dentry *old,
> +int ovl_copy_xattr(struct super_block *sb, struct user_namespace *old_user_ns,
> +                  struct dentry *old, struct user_namespace *new_user_ns,
>                    struct dentry *new)


In this helper both callers already have the old path, so can pass it
into the helper.
new is always upper (because we never modify lower), so if you pass
ovl_fs to the helper (we are slowly passing it down to all helpers) you
will have the upper mnt.
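
Roughly what I have in mind, as a standalone sketch rather than kernel code. The struct definitions are minimal stand-ins, the bodies of mnt_user_ns() and ovl_upper_mnt_user_ns() are stubs (only the names come from this series), and struct copy_xattr_ns plus ovl_copy_xattr_ns() are hypothetical, just to show that both namespaces can be derived from ovl_fs and the old path.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel types, only what the sketch needs. */
struct user_namespace { int level; };
struct vfsmount { struct user_namespace *user_ns; };
struct dentry { const char *name; };
struct path { struct vfsmount *mnt; struct dentry *dentry; };
struct ovl_fs { struct vfsmount *upper_mnt; };

/* Name as used in the series; the body is a stub (assumption). */
static struct user_namespace *mnt_user_ns(const struct vfsmount *mnt)
{
	return mnt->user_ns;
}

/* Name as used in the series; the body is a stub (assumption). */
static struct user_namespace *ovl_upper_mnt_user_ns(const struct ovl_fs *ofs)
{
	return mnt_user_ns(ofs->upper_mnt);
}

/* Hypothetical: the two namespaces the xattr copy helper needs. */
struct copy_xattr_ns {
	struct user_namespace *old_user_ns;
	struct user_namespace *new_user_ns;
};

/* Hypothetical: derive both from ovl_fs and the old path. */
static struct copy_xattr_ns ovl_copy_xattr_ns(const struct ovl_fs *ofs,
					      const struct path *old)
{
	struct copy_xattr_ns ns;

	assert(ofs != NULL && old != NULL);
	/* The caller already has the old path, so take the ns from its mnt. */
	ns.old_user_ns = mnt_user_ns(old->mnt);
	/* new is always upper, so ovl_fs is enough to find its mnt. */
	ns.new_user_ns = ovl_upper_mnt_user_ns(ofs);
	return ns;
}
```

That way the helper's signature only grows by things the callers already hold, instead of two extra user_ns arguments.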

>  {
>         ssize_t list_size, size, value_size = 0;
> @@ -85,9 +86,9 @@ int ovl_copy_xattr(struct super_block *sb, struct dentry *old,
>                 if (ovl_is_private_xattr(sb, name))
>                         continue;
>  retry:
> -               size = vfs_getxattr(old, name, value, value_size);
> +               size = vfs_mapped_getxattr(old_user_ns, old, name, value, value_size);
>                 if (size == -ERANGE)
> -                       size = vfs_getxattr(old, name, NULL, 0);
> +                       size = vfs_mapped_getxattr(old_user_ns, old, name, NULL, 0);
>
>                 if (size < 0) {
>                         error = size;
> @@ -114,7 +115,7 @@ int ovl_copy_xattr(struct super_block *sb, struct dentry *old,
>                         error = 0;
>                         continue; /* Discard */
>                 }
> -               error = vfs_setxattr(new, name, value, size, 0);
> +               error = vfs_mapped_setxattr(new_user_ns, new, name, value, size, 0);
>                 if (error) {
>                         if (error != -EOPNOTSUPP || ovl_must_copy_xattr(name))
>                                 break;
> @@ -228,17 +229,19 @@ static int ovl_copy_up_data(struct ovl_fs *ofs, struct path *old,
>         return error;
>  }
>
> -static int ovl_set_size(struct dentry *upperdentry, struct kstat *stat)
> +static int ovl_set_size(struct user_namespace *user_ns,
> +                       struct dentry *upperdentry, struct kstat *stat)
>  {
>         struct iattr attr = {
>                 .ia_valid = ATTR_SIZE,
>                 .ia_size = stat->size,
>         };
>
> -       return notify_change(upperdentry, &attr, NULL);
> +       return notify_mapped_change(user_ns, upperdentry, &attr, NULL);
>  }
>
> -static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
> +static int ovl_set_timestamps(struct user_namespace *user_ns,
> +                             struct dentry *upperdentry, struct kstat *stat)
>  {
>         struct iattr attr = {
>                 .ia_valid =
> @@ -247,10 +250,11 @@ static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
>                 .ia_mtime = stat->mtime,
>         };
>
> -       return notify_change(upperdentry, &attr, NULL);
> +       return notify_mapped_change(user_ns, upperdentry, &attr, NULL);
>  }
>
> -int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
> +int ovl_set_attr(struct user_namespace *user_ns, struct dentry *upperdentry,
> +                struct kstat *stat)
>  {
>         int err = 0;
>
> @@ -259,7 +263,7 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
>                         .ia_valid = ATTR_MODE,
>                         .ia_mode = stat->mode,
>                 };
> -               err = notify_change(upperdentry, &attr, NULL);
> +               err = notify_mapped_change(user_ns, upperdentry, &attr, NULL);
>         }
>         if (!err) {
>                 struct iattr attr = {
> @@ -267,10 +271,10 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
>                         .ia_uid = stat->uid,
>                         .ia_gid = stat->gid,
>                 };
> -               err = notify_change(upperdentry, &attr, NULL);
> +               err = notify_mapped_change(user_ns, upperdentry, &attr, NULL);
>         }
>         if (!err)
> -               ovl_set_timestamps(upperdentry, stat);
> +               ovl_set_timestamps(user_ns, upperdentry, stat);
>
>         return err;
>  }

Same for all those helpers that get upperdentry: you can pass ovl_fs.

> @@ -356,8 +360,8 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
>  }
>
>  /* Store file handle of @upper dir in @index dir entry */
> -static int ovl_set_upper_fh(struct ovl_fs *ofs, struct dentry *upper,
> -                           struct dentry *index)
> +static int ovl_set_upper_fh(struct ovl_fs *ofs, struct user_namespace *user_ns,
> +                           struct dentry *upper, struct dentry *index)
>  {

And here you already have ovl_fs.

>         const struct ovl_fh *fh;
>         int err;
> @@ -377,7 +381,8 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct dentry *upper,
>   *
>   * Caller must hold i_mutex on indexdir.
>   */
> -static int ovl_create_index(struct dentry *dentry, struct dentry *origin,
> +static int ovl_create_index(struct user_namespace *user_ns,
> +                           struct dentry *dentry, struct dentry *origin,
>                             struct dentry *upper)
>  {
>         struct dentry *indexdir = ovl_indexdir(dentry->d_sb);
> @@ -406,25 +411,25 @@ static int ovl_create_index(struct dentry *dentry, struct dentry *origin,
>         if (err)
>                 return err;
>
> -       temp = ovl_create_temp(indexdir, OVL_CATTR(S_IFDIR | 0));
> +       temp = ovl_create_temp(user_ns, indexdir, OVL_CATTR(S_IFDIR | 0));
>         err = PTR_ERR(temp);
>         if (IS_ERR(temp))
>                 goto free_name;
>
> -       err = ovl_set_upper_fh(OVL_FS(dentry->d_sb), upper, temp);
> +       err = ovl_set_upper_fh(OVL_FS(dentry->d_sb), user_ns, upper, temp);
>         if (err)
>                 goto out;
>
> -       index = lookup_one_len(name.name, indexdir, name.len);
> +       index = lookup_one_len_mapped(name.name, indexdir, name.len, user_ns);
>         if (IS_ERR(index)) {
>                 err = PTR_ERR(index);
>         } else {
> -               err = ovl_do_rename(dir, temp, dir, index, 0);
> +               err = ovl_do_rename(dir, user_ns, temp, dir, user_ns, index, 0);
>                 dput(index);
>         }
>  out:
>         if (err)
> -               ovl_cleanup(dir, temp);
> +               ovl_cleanup(user_ns, dir, temp);
>         dput(temp);
>  free_name:
>         kfree(name.name);
> @@ -441,6 +446,7 @@ struct ovl_copy_up_ctx {
>         struct dentry *destdir;
>         struct qstr destname;
>         struct dentry *workdir;
> +       struct user_namespace *user_ns;
>         bool origin;
>         bool indexed;
>         bool metacopy;
> @@ -463,16 +469,17 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
>                 return err;
>
>         inode_lock_nested(udir, I_MUTEX_PARENT);
> -       upper = lookup_one_len(c->dentry->d_name.name, upperdir,
> -                              c->dentry->d_name.len);
> +       upper = lookup_one_len_mapped(c->dentry->d_name.name, upperdir,
> +                                 c->dentry->d_name.len,
> +                                 c->user_ns);
>         err = PTR_ERR(upper);
>         if (!IS_ERR(upper)) {
> -               err = ovl_do_link(ovl_dentry_upper(c->dentry), udir, upper);
> +               err = ovl_do_link(c->user_ns, ovl_dentry_upper(c->dentry), udir, upper);
>                 dput(upper);
>
>                 if (!err) {
>                         /* Restore timestamps on parent (best effort) */
> -                       ovl_set_timestamps(upperdir, &c->pstat);
> +                       ovl_set_timestamps(c->user_ns, upperdir, &c->pstat);
>                         ovl_dentry_set_upper_alias(c->dentry);
>                 }
>         }
> @@ -509,7 +516,8 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
>                         return err;
>         }
>
> -       err = ovl_copy_xattr(c->dentry->d_sb, c->lowerpath.dentry, temp);
> +       err = ovl_copy_xattr(c->dentry->d_sb, mnt_user_ns(c->lowerpath.mnt),
> +                            c->lowerpath.dentry, c->user_ns, temp);
>         if (err)
>                 return err;
>
> @@ -535,9 +543,9 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
>
>         inode_lock(temp->d_inode);
>         if (S_ISREG(c->stat.mode))
> -               err = ovl_set_size(temp, &c->stat);
> +               err = ovl_set_size(c->user_ns, temp, &c->stat);
>         if (!err)
> -               err = ovl_set_attr(temp, &c->stat);
> +               err = ovl_set_attr(c->user_ns, temp, &c->stat);
>         inode_unlock(temp->d_inode);
>
>         return err;
> @@ -598,7 +606,7 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c)
>         if (err)
>                 goto unlock;
>
> -       temp = ovl_create_temp(c->workdir, &cattr);
> +       temp = ovl_create_temp(c->user_ns, c->workdir, &cattr);
>         ovl_revert_cu_creds(&cc);
>
>         err = PTR_ERR(temp);
> @@ -610,17 +618,18 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c)
>                 goto cleanup;
>
>         if (S_ISDIR(c->stat.mode) && c->indexed) {
> -               err = ovl_create_index(c->dentry, c->lowerpath.dentry, temp);
> +               err = ovl_create_index(c->user_ns, c->dentry, c->lowerpath.dentry, temp);
>                 if (err)
>                         goto cleanup;
>         }
>
> -       upper = lookup_one_len(c->destname.name, c->destdir, c->destname.len);
> +       upper = lookup_one_len_mapped(c->destname.name, c->destdir, c->destname.len,
> +                                 c->user_ns);
>         err = PTR_ERR(upper);
>         if (IS_ERR(upper))
>                 goto cleanup;
>
> -       err = ovl_do_rename(wdir, temp, udir, upper, 0);
> +       err = ovl_do_rename(wdir, c->user_ns, temp, udir, c->user_ns, upper, 0);
>         dput(upper);
>         if (err)
>                 goto cleanup;
> @@ -637,7 +646,7 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c)
>         return err;
>
>  cleanup:
> -       ovl_cleanup(wdir, temp);
> +       ovl_cleanup(c->user_ns, wdir, temp);
>         dput(temp);
>         goto unlock;
>  }
> @@ -654,7 +663,7 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c)
>         if (err)
>                 return err;
>
> -       temp = ovl_do_tmpfile(c->workdir, c->stat.mode);
> +       temp = ovl_do_tmpfile(c->user_ns, c->workdir, c->stat.mode);
>         ovl_revert_cu_creds(&cc);
>
>         if (IS_ERR(temp))
> @@ -666,10 +675,11 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c)
>
>         inode_lock_nested(udir, I_MUTEX_PARENT);
>
> -       upper = lookup_one_len(c->destname.name, c->destdir, c->destname.len);
> +       upper = lookup_one_len_mapped(c->destname.name, c->destdir, c->destname.len,
> +                                 c->user_ns);
>         err = PTR_ERR(upper);
>         if (!IS_ERR(upper)) {
> -               err = ovl_do_link(temp, udir, upper);
> +               err = ovl_do_link(c->user_ns, temp, udir, upper);
>                 dput(upper);
>         }
>         inode_unlock(udir);
> @@ -757,7 +767,7 @@ static int ovl_do_copy_up(struct ovl_copy_up_ctx *c)
>
>                 /* Restore timestamps on parent (best effort) */
>                 inode_lock(udir);
> -               ovl_set_timestamps(c->destdir, &c->pstat);
> +               ovl_set_timestamps(c->user_ns, c->destdir, &c->pstat);
>                 inode_unlock(udir);
>
>                 ovl_dentry_set_upper_alias(c->dentry);
> @@ -786,12 +796,13 @@ static bool ovl_need_meta_copy_up(struct dentry *dentry, umode_t mode,
>         return true;
>  }
>
> -static ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value)
> +static ssize_t ovl_getxattr(struct user_namespace *user_ns,
> +                           struct dentry *dentry, char *name, char **value)
>  {
>         ssize_t res;
>         char *buf;
>
> -       res = vfs_getxattr(dentry, name, NULL, 0);
> +       res = vfs_mapped_getxattr(user_ns, dentry, name, NULL, 0);
>         if (res == -ENODATA || res == -EOPNOTSUPP)
>                 res = 0;
>
> @@ -800,7 +811,7 @@ static ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value)
>                 if (!buf)
>                         return -ENOMEM;
>
> -               res = vfs_getxattr(dentry, name, buf, res);
> +               res = vfs_mapped_getxattr(user_ns, dentry, name, buf, res);
>                 if (res < 0)
>                         kfree(buf);
>                 else
> @@ -814,6 +825,7 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
>  {
>         struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
>         struct path upperpath, datapath;
> +       struct user_namespace *user_ns;
>         int err;
>         char *capability = NULL;
>         ssize_t cap_size;
> @@ -827,8 +839,8 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
>                 return -EIO;
>
>         if (c->stat.size) {
> -               err = cap_size = ovl_getxattr(upperpath.dentry, XATTR_NAME_CAPS,
> -                                             &capability);
> +               err = cap_size = ovl_getxattr(c->user_ns, upperpath.dentry,
> +                                             XATTR_NAME_CAPS, &capability);
>                 if (cap_size < 0)
>                         goto out;
>         }
> @@ -841,9 +853,10 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
>          * Writing to upper file will clear security.capability xattr. We
>          * don't want that to happen for normal copy-up operation.
>          */
> +       user_ns = mnt_user_ns(upperpath.mnt);
>         if (capability) {
> -               err = vfs_setxattr(upperpath.dentry, XATTR_NAME_CAPS,
> -                                  capability, cap_size, 0);
> +               err = vfs_mapped_setxattr(user_ns, upperpath.dentry,
> +                                     XATTR_NAME_CAPS, capability, cap_size, 0);
>                 if (err)
>                         goto out_free;
>         }
> @@ -887,6 +900,7 @@ static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
>                 ovl_path_upper(parent, &parentpath);
>                 ctx.destdir = parentpath.dentry;
>                 ctx.destname = dentry->d_name;
> +               ctx.user_ns = mnt_user_ns(parentpath.mnt);
>
>                 err = vfs_getattr(&parentpath, &ctx.pstat,
>                                   STATX_ATIME | STATX_MTIME,
> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> index 28a075b5f5b2..23d09de00957 100644
> --- a/fs/overlayfs/dir.c
> +++ b/fs/overlayfs/dir.c
> @@ -23,15 +23,16 @@ MODULE_PARM_DESC(redirect_max,
>
>  static int ovl_set_redirect(struct dentry *dentry, bool samedir);
>
> -int ovl_cleanup(struct inode *wdir, struct dentry *wdentry)
> +int ovl_cleanup(struct user_namespace *user_ns, struct inode *wdir,
> +               struct dentry *wdentry)

I think this one and ovl_lookup_temp() should also be passed ovl_fs, but we
need to be careful that upper_mnt is already assigned when they are called
from ovl_fill_super() (it should be).

>  {
>         int err;
>
>         dget(wdentry);
>         if (d_is_dir(wdentry))
> -               err = ovl_do_rmdir(wdir, wdentry);
> +               err = ovl_do_rmdir(user_ns, wdir, wdentry);
>         else
> -               err = ovl_do_unlink(wdir, wdentry);
> +               err = ovl_do_unlink(user_ns, wdir, wdentry);
>         dput(wdentry);
>
>         if (err) {
> @@ -42,7 +43,8 @@ int ovl_cleanup(struct inode *wdir, struct dentry *wdentry)
>         return err;
>  }
>
> -struct dentry *ovl_lookup_temp(struct dentry *workdir)
> +struct dentry *ovl_lookup_temp(struct user_namespace *user_ns,
> +                              struct dentry *workdir)
>  {
>         struct dentry *temp;
>         char name[20];
> @@ -51,7 +53,7 @@ struct dentry *ovl_lookup_temp(struct dentry *workdir)
>         /* counter is allowed to wrap, since temp dentries are ephemeral */
>         snprintf(name, sizeof(name), "#%x", atomic_inc_return(&temp_id));
>
> -       temp = lookup_one_len(name, workdir, strlen(name));
> +       temp = lookup_one_len_mapped(name, workdir, strlen(name), user_ns);
>         if (!IS_ERR(temp) && temp->d_inode) {
>                 pr_err("workdir/%s already exists\n", name);
>                 dput(temp);
> @@ -68,13 +70,14 @@ static struct dentry *ovl_whiteout(struct ovl_fs *ofs)
>         struct dentry *whiteout;
>         struct dentry *workdir = ofs->workdir;
>         struct inode *wdir = workdir->d_inode;
> +       struct user_namespace *user_ns = ovl_upper_mnt_user_ns(ofs);
>
>         if (!ofs->whiteout) {
> -               whiteout = ovl_lookup_temp(workdir);
> +               whiteout = ovl_lookup_temp(user_ns, workdir);
>                 if (IS_ERR(whiteout))
>                         goto out;
>
> -               err = ovl_do_whiteout(wdir, whiteout);
> +               err = ovl_do_whiteout(user_ns, wdir, whiteout);
>                 if (err) {
>                         dput(whiteout);
>                         whiteout = ERR_PTR(err);
> @@ -84,11 +87,11 @@ static struct dentry *ovl_whiteout(struct ovl_fs *ofs)
>         }
>
>         if (ofs->share_whiteout) {
> -               whiteout = ovl_lookup_temp(workdir);
> +               whiteout = ovl_lookup_temp(user_ns, workdir);
>                 if (IS_ERR(whiteout))
>                         goto out;
>
> -               err = ovl_do_link(ofs->whiteout, wdir, whiteout);
> +               err = ovl_do_link(user_ns, ofs->whiteout, wdir, whiteout);
>                 if (!err)
>                         goto out;
>
> @@ -110,6 +113,7 @@ int ovl_cleanup_and_whiteout(struct ovl_fs *ofs, struct inode *dir,
>                              struct dentry *dentry)
>  {
>         struct inode *wdir = ofs->workdir->d_inode;
> +       struct user_namespace *user_ns = ovl_upper_mnt_user_ns(ofs);
>         struct dentry *whiteout;
>         int err;
>         int flags = 0;
> @@ -122,28 +126,28 @@ int ovl_cleanup_and_whiteout(struct ovl_fs *ofs, struct inode *dir,
>         if (d_is_dir(dentry))
>                 flags = RENAME_EXCHANGE;
>
> -       err = ovl_do_rename(wdir, whiteout, dir, dentry, flags);
> +       err = ovl_do_rename(wdir, user_ns, whiteout, dir, user_ns, dentry, flags);
>         if (err)
>                 goto kill_whiteout;
>         if (flags)
> -               ovl_cleanup(wdir, dentry);
> +               ovl_cleanup(user_ns, wdir, dentry);
>
>  out:
>         dput(whiteout);
>         return err;
>
>  kill_whiteout:
> -       ovl_cleanup(wdir, whiteout);
> +       ovl_cleanup(user_ns, wdir, whiteout);
>         goto out;
>  }
>
> -static int ovl_mkdir_real(struct inode *dir, struct dentry **newdentry,
> -                         umode_t mode)
> +static int ovl_mkdir_real(struct user_namespace *user_ns, struct inode *dir,
> +                         struct dentry **newdentry, umode_t mode)
>  {
>         int err;
>         struct dentry *d, *dentry = *newdentry;
>
> -       err = ovl_do_mkdir(dir, dentry, mode);
> +       err = ovl_do_mkdir(user_ns, dir, dentry, mode);
>         if (err)
>                 return err;
>
> @@ -155,8 +159,8 @@ static int ovl_mkdir_real(struct inode *dir, struct dentry **newdentry,
>          * to it unhashed and negative. If that happens, try to
>          * lookup a new hashed and positive dentry.
>          */
> -       d = lookup_one_len(dentry->d_name.name, dentry->d_parent,
> -                          dentry->d_name.len);
> +       d = lookup_one_len_mapped(dentry->d_name.name, dentry->d_parent,
> +                             dentry->d_name.len, user_ns);
>         if (IS_ERR(d)) {
>                 pr_warn("failed lookup after mkdir (%pd2, err=%i).\n",
>                         dentry, err);
> @@ -168,7 +172,8 @@ static int ovl_mkdir_real(struct inode *dir, struct dentry **newdentry,
>         return 0;
>  }
>
> -struct dentry *ovl_create_real(struct inode *dir, struct dentry *newdentry,
> +struct dentry *ovl_create_real(struct user_namespace *user_ns,
> +                              struct inode *dir, struct dentry *newdentry,
>                                struct ovl_cattr *attr)
>  {
>         int err;
> @@ -181,28 +186,28 @@ struct dentry *ovl_create_real(struct inode *dir, struct dentry *newdentry,
>                 goto out;
>
>         if (attr->hardlink) {
> -               err = ovl_do_link(attr->hardlink, dir, newdentry);
> +               err = ovl_do_link(user_ns, attr->hardlink, dir, newdentry);
>         } else {
>                 switch (attr->mode & S_IFMT) {
>                 case S_IFREG:
> -                       err = ovl_do_create(dir, newdentry, attr->mode);
> +                       err = ovl_do_create(user_ns, dir, newdentry, attr->mode);
>                         break;
>
>                 case S_IFDIR:
>                         /* mkdir is special... */
> -                       err =  ovl_mkdir_real(dir, &newdentry, attr->mode);
> +                       err =  ovl_mkdir_real(user_ns, dir, &newdentry, attr->mode);
>                         break;
>
>                 case S_IFCHR:
>                 case S_IFBLK:
>                 case S_IFIFO:
>                 case S_IFSOCK:
> -                       err = ovl_do_mknod(dir, newdentry, attr->mode,
> +                       err = ovl_do_mknod(user_ns, dir, newdentry, attr->mode,
>                                            attr->rdev);
>                         break;
>
>                 case S_IFLNK:
> -                       err = ovl_do_symlink(dir, newdentry, attr->link);
> +                       err = ovl_do_symlink(user_ns, dir, newdentry, attr->link);
>                         break;
>
>                 default:
> @@ -224,10 +229,11 @@ struct dentry *ovl_create_real(struct inode *dir, struct dentry *newdentry,
>         return newdentry;
>  }
>

Pass ovl_fs to all those create helpers.

> -struct dentry *ovl_create_temp(struct dentry *workdir, struct ovl_cattr *attr)
> +struct dentry *ovl_create_temp(struct user_namespace *user_ns, struct dentry *workdir,
> +                              struct ovl_cattr *attr)

Not only should this get ovl_fs, but workdir could later be taken from
ofs->workdir.
It's subtle because there is one caller, ovl_create_index(), that needs
special care, so leave that cleanup to me.

Anyway, I think you get the idea, but I am getting ahead of myself.
Let's wait and see how the vfs patches play out.
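
The review comments above repeatedly suggest handing ovl_fs to the helpers and
deriving the user namespace from the upper mount, instead of threading user_ns
through every signature. A minimal userspace sketch of that shape follows. All
demo_* types and functions are hypothetical stand-ins, not kernel code. The
NULL check models the ovl_fill_super() ordering concern, and returning -1 in
that case is an assumption made only for illustration.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct demo_userns { int id; };
struct demo_vfsmount { struct demo_userns *mnt_userns; };
struct demo_ovl_fs { struct demo_vfsmount *upper_mnt; };

/* Derive the namespace from the upper mount; NULL if it is not assigned yet. */
struct demo_userns *demo_upper_mnt_user_ns(const struct demo_ovl_fs *ofs)
{
	return ofs->upper_mnt ? ofs->upper_mnt->mnt_userns : NULL;
}

/*
 * A helper that takes ofs instead of an explicit user_ns argument.
 * On success it reports the namespace it would use through *used.
 */
int demo_cleanup(const struct demo_ovl_fs *ofs, struct demo_userns **used)
{
	struct demo_userns *ns = demo_upper_mnt_user_ns(ofs);

	if (!ns)
		return -1; /* called before upper_mnt was set up */
	*used = ns;
	return 0;
}
```

With this shape a caller only needs the ovl_fs it already holds, and a call
made before the upper mount exists fails visibly instead of using a stale or
missing namespace.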

One naive question:

If we want to avoid all the vfs API churn, we can store the mnt_user_ns
in the current cred.
Overlayfs can prepare one cred copy per layer at mount time, and we can
pass the ovl_layer id to ovl_override_creds().
In some functions that access several layers, such as ovl_lookup(), we
will need to take care of overriding different creds per layer.
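
To make the idea concrete, here is a rough userspace model of per-layer creds:
one cred copy per layer is filled in at "mount time", and an override call
selects the copy by layer index. Every sketch_* name is hypothetical, and the
global sketch_current_cred merely stands in for the task's current cred; this
is not how the kernel's cred machinery is implemented.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_LAYERS 4

/* Hypothetical stand-ins; the real kernel types look nothing like this. */
struct sketch_userns { unsigned int first_id; };
struct sketch_cred { const struct sketch_userns *mnt_user_ns; };

struct sketch_ovl_fs {
	size_t numlayer;
	struct sketch_cred layer_cred[SKETCH_MAX_LAYERS];
};

/* Stand-in for the task's current cred. */
const struct sketch_cred *sketch_current_cred;

/* "Mount time": prepare one cred copy per layer. */
int sketch_prepare_layer_creds(struct sketch_ovl_fs *ofs,
			       const struct sketch_userns *const *layer_ns,
			       size_t numlayer)
{
	size_t i;

	if (numlayer > SKETCH_MAX_LAYERS)
		return -1;
	for (i = 0; i < numlayer; i++)
		ofs->layer_cred[i].mnt_user_ns = layer_ns[i];
	ofs->numlayer = numlayer;
	return 0;
}

/* Switch to the cred of the given layer and return the previous cred. */
const struct sketch_cred *sketch_override_creds(const struct sketch_ovl_fs *ofs,
						size_t layer)
{
	const struct sketch_cred *old = sketch_current_cred;

	assert(layer < ofs->numlayer);
	sketch_current_cred = &ofs->layer_cred[layer];
	return old;
}

/* Restore whatever cred was active before the override. */
void sketch_revert_creds(const struct sketch_cred *old)
{
	sketch_current_cred = old;
}
```

A function that touches several layers would call sketch_override_creds()
with a different index before each access and revert afterwards.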

The concerns raised about storing userns in current wrt io_uring are not
relevant for overlayfs access to underlying layers.

I know you and others have listed some use cases that do not involve
overlayfs, but perhaps as a first step this approach will be a much easier
sell and can cover some of the use cases?

Also, if this is an interesting direction, there is no problem allowing an
overlayfs configuration with no lower layers (just upper). I have a patch
somewhere.
From a user experience POV it would be just like shiftfs, but all the
implementation would be in vfs and the userns would be attached to vfs
mounts. Overlayfs would just play the role of storing the private mount
clone and overriding credentials before access to the private mount.

Thanks,
Amir.


* Re: [PATCH 32/34] overlayfs: handle idmapped lower directories
  2020-10-30 11:10   ` Amir Goldstein
@ 2020-10-30 11:52     ` Christian Brauner
  0 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-30 11:52 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: overlayfs, Miklos Szeredi, Linux Containers, linux-fsdevel,
	linux-kernel, Al Viro, Christoph Hellwig

On Fri, Oct 30, 2020 at 01:10:52PM +0200, Amir Goldstein wrote:
> [reducing CC list for overlayfs specific comments]
> 
> On Thu, Oct 29, 2020 at 2:41 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > As an overlay filesystem overlayfs can be mounted on top of other filesystems
> > and bind-mounts. This means it can also be bind-mounted on top of one or
> > multiple idmapped lower directories and/or an idmapped upper directory.
> > In previous patches we've enabled the vfs to handle idmapped mounts and so we
> > should have all of the helpers available to let overlayfs handle idmapped
> > mounts. To handle such scenarios correctly overlayfs needs to be switched from
> > non-idmapped mount aware vfs helpers to idmapped mount aware vfs helpers.
> > In order to have overlayfs correctly support idmapped mounts as lower and upper
> > directories we need to pass down the mount's user namespace associated with the
> > lower and upper directories whenever we perform idmapped mount aware operations.
> >
> > Luckily, when overlayfs is mounted it creates private mounts of the lower and
> > upper directories via clone_private_mount() which calls clone_mnt() internally.
> > If any of the lower or upper directories are on an idmapped mount then
> > clone_mnt() called in clone_private_mount() will also pin the user namespace
> > the vfsmount has been marked with. Overlayfs stashes the information about the
> > lower and upper directories and the mounts that they are on so that this
> > information can be retrieved when needed. This makes it possible to support
> > idmapped mounts as lower and upper directories. Support for idmapped merged
> > mounts will be added in a follow-up patch.
> >
> > Whenever we perform idmapped mount aware operations we need to pass down the
> > mount's user namespace to the vfs helpers we've introduced in earlier patches.
> > Permission checks on the lower and upper directories are performed by switching
> > from the inode_permission() and inode_owner_or_capable() helpers to the new
> > mapped_inode_permission() and mapped_inode_owner_or_capable() helpers.
> > Similarly we switch from non-idmapped mount aware lookup helpers to
> > idmapped-mount aware lookup helpers. In all cases where we need to check
> > permissions in the lower or upper directories we pass down the mount associated
> > with the lower and upper directory at the time of creating the overlayfs mount.
> > This nicely lines up with the permission model outlined in the overlayfs
> > documentation (Special thanks to Amir for pointing me to this document!).
> >
> > Thanks to Amir for pointing me to the overlayfs permission model documentation!
> >
> > A very special thank you to my friend Seth Forshee who has given invaluable
> > advice when coming up with these patches!
> >
> > As an example let's create an overlayfs mount in the initial user namespace with
> > an idmapped lower and upper mount:
> >
> >  # This is a directory where all file ownership starts with uid and gid 10000.
> >  root@f2-vm:/# ls -al /var/lib/lxc/f1/rootfs
> >  total 108
> >  drwxr-xr-x  20 10000 10000  4096 Oct 28 11:13 .
> >  drwxrwx---   4 10000 10000  4096 Oct 28 11:17 ..
> >  -rw-r--r--   1 10000 10000  7197 Oct 24 09:45 asdf
> >  drwxr-xr-x   2 10000 10000  4096 Oct 16 19:07 ASDF
> >  lrwxrwxrwx   1 10000 10000     7 Sep 24 07:43 bin -> usr/bin
> >  drwxr-xr-x   2 10000 10000  4096 Apr 15  2020 boot
> >  -rw-r--r--   1 10000 10000 13059 Oct  8 12:38 ccc
> >  drwxr-xr-x   2 11000 11000  4096 Oct 23 17:10 ddd
> >  drwxr-xr-x   3 10000 10000  4096 Sep 25 08:04 dev
> >  drwxr-xr-x  61 10000 10000  4096 Sep 25 08:04 etc
> >
> >  # Create an idmapped mount on the host such that all files owned by uid and
> >  # gid 10000 show up as being owned by uid 0 and gid 0.
> >  /mount2 --idmap both:10000:0:10000 /var/lib/lxc/f1/rootfs/ /lower1/
> >
> >  # Verify that the files show up as uid and gid 0 under the idmapped mount at /lower1
> >  root@f2-vm:/# ls -al /lower1/
> >  total 108
> >  drwxr-xr-x  20 root   root    4096 Oct 28 11:13 .
> >  drwxr-xr-x  29 root   root    4096 Oct 28 11:57 ..
> >  -rw-r--r--   1 root   root    7197 Oct 24 09:45 asdf
> >  drwxr-xr-x   2 root   root    4096 Oct 16 19:07 ASDF
> >  lrwxrwxrwx   1 root   root       7 Sep 24 07:43 bin -> usr/bin
> >  drwxr-xr-x   2 root   root    4096 Apr 15  2020 boot
> >  -rw-r--r--   1 root   root   13059 Oct  8 12:38 ccc
> >  drwxr-xr-x   2 ubuntu ubuntu  4096 Oct 23 17:10 ddd
> >  drwxr-xr-x   3 root   root    4096 Sep 25 08:04 dev
> >  drwxr-xr-x  61 root   root    4096 Sep 25 08:04 etc
> >
> >  # Create an idmapped upper mount at /upper. Now, files created as id 0 will
> >  # show up as id 10000 in /upper and files created as id 1000 will show up as
> >  # id 11000 under /upper.
> >  /mount2 --idmap both:10000:0:10000 /upper /upper
> >  mkdir /upper/upper
> >  mkdir /upper/work
> >
> >  # Create an overlayfs mount.
> >  mount -t overlay overlay -o lowerdir=/lower1/,upperdir=/upper/upper/,workdir=/upper/work/ /merged/
> >
> >  root@f2-vm:/# ls -al /merged/
> >  total 124
> >  drwxr-xr-x   1 root   root    4096 Oct 25 23:04 .
> >  drwxr-xr-x  29 root   root    4096 Oct 28 12:07 ..
> >  -rw-r--r--   1 root   root    7197 Oct 24 09:45 asdf
> >  drwxr-xr-x   2 root   root    4096 Oct 16 19:07 ASDF
> >  lrwxrwxrwx   1 root   root       7 Sep 24 07:43 bin -> usr/bin
> >  drwxr-xr-x   2 root   root    4096 Apr 15  2020 boot
> >  -rw-r--r--   1 root   root   13059 Oct  8 12:38 ccc
> >  drwxr-xr-x   2 ubuntu ubuntu  4096 Oct 23 17:10 ddd
> >  drwxr-xr-x   3 root   root    4096 Sep 25 08:04 dev
> >  drwxr-xr-x  61 root   root    4096 Sep 25 08:04 etc
> >
> >  # Create a file as root
> >  root@f2-vm:/merged# touch /merged/A-FILE
> >
> >  root@f2-vm:/merged# ls -al /merged/A-FILE
> >  -rw-r--r-- 1 root root 0 Oct 28 12:16 /merged/A-FILE
> >
> >  # Chown the file to a simple user
> >  root@f2-vm:/merged# chown 1000:1000 /merged/A-FILE
> >
> >  root@f2-vm:/merged# ls -al /merged/A-FILE
> >  -rw-r--r-- 1 ubuntu ubuntu 0 Oct 28 12:16 /merged/A-FILE
> >
> >  # Create a directory and delegate to simple user
> >  root@f2-vm:/merged# mkdir /merged/A-DIR
> >
> >  root@f2-vm:/merged# chown 1000:1000 /merged/A-DIR/
> >
> >  # Login as user
> >  root@f2-vm:/merged# sudo -u ubuntu -- bash -i
> >
> >  # Create a file as simple user
> >  ubuntu@f2-vm:/merged$ touch /merged/A-DIR/A-USER-FILE
> >
> >  ubuntu@f2-vm:/merged$ ls -al /merged/A-DIR/A-USER-FILE
> >  -rw-rw-r-- 1 ubuntu ubuntu 0 Oct 28 12:18 /merged/A-DIR/A-USER-FILE
> >
> >  # Let's look at these files in our idmapped upper directory
> >  ubuntu@f2-vm:/$ ls -alR /upper/upper/
> >  /upper/upper/:
> >  total 12
> >  drwxr-xr-x 3 root   root   4096 Oct 28 12:23 .
> >  drwxr-xr-x 4 root   root   4096 Oct 21 13:48 ..
> >  drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 12:18 A-DIR
> >  -rw-r--r-- 1 ubuntu ubuntu    0 Oct 28 12:16 A-FILE
> >
> >  /upper/upper/A-DIR:
> >  total 8
> >  drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 12:18 .
> >  drwxr-xr-x 3 root   root   4096 Oct 28 12:23 ..
> >  -rw-rw-r-- 1 ubuntu ubuntu    0 Oct 28 12:18 A-USER-FILE
> >
> >  # Let's remove the idmapped /upper mount (overlayfs will have its own private mount anyway)
> >  umount /upper
> >
> >  # Let's look at these files in our upper directory with the idmapped mount removed
> >  ubuntu@f2-vm:/$ ls -alR /upper/upper/
> >  /upper/upper/:
> >  total 12
> >  drwxr-xr-x 3 10000 10000 4096 Oct 28 12:23 .
> >  drwxr-xr-x 4 10000 10000 4096 Oct 21 13:48 ..
> >  drwxr-xr-x 2 11000 11000 4096 Oct 28 12:18 A-DIR
> >  -rw-r--r-- 1 11000 11000    0 Oct 28 12:16 A-FILE
> >
> >  /upper/upper/A-DIR:
> >  total 8
> >  drwxr-xr-x 2 11000 11000 4096 Oct 28 12:18 .
> >  drwxr-xr-x 3 10000 10000 4096 Oct 28 12:23 ..
> >  -rw-rw-r-- 1 11000 11000    0 Oct 28 12:18 A-USER-FILE
> >
> >  # Let's create a few acls from the /merged directory on an already existing file
> >  # triggering a copy-up operation
> >   root@f2-vm:/merged# setfacl -m u:1000:rwx /merged/asdf
> >   root@f2-vm:/merged# getfacl /merged/asdf
> >   getfacl: Removing leading '/' from absolute path names
> >   # file: merged/asdf
> >   # owner: root
> >   # group: root
> >   user::rw-
> >   user:ubuntu:rwx
> >   group::r--
> >   mask::rwx
> >   other::r--
> >
> >   # Let's look at this file from our upper directory
> >   root@f2-vm:/merged# getfacl /upper/upper/asdf
> >   getfacl: Removing leading '/' from absolute path names
> >   # file: upper/upper/asdf
> >   # owner: 10000
> >   # group: 10000
> >   user::rw-
> >   user:11000:rwx
> >   group::r--
> >   mask::rwx
> >   other::r--
> >
> > Cc: Seth Forshee <seth.forshee@canonical.com>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> >  fs/overlayfs/copy_up.c   | 100 +++++++++++++++-----------
> >  fs/overlayfs/dir.c       | 151 ++++++++++++++++++++++-----------------
> >  fs/overlayfs/export.c    |   3 +-
> >  fs/overlayfs/file.c      |  23 +++---
> >  fs/overlayfs/inode.c     |  89 ++++++++++++++++++-----
> >  fs/overlayfs/namei.c     |  64 ++++++++++-------
> >  fs/overlayfs/overlayfs.h | 149 ++++++++++++++++++++++++++------------
> >  fs/overlayfs/ovl_entry.h |   1 +
> >  fs/overlayfs/readdir.c   |  34 +++++----
> >  fs/overlayfs/super.c     | 106 ++++++++++++++++++++-------
> >  fs/overlayfs/util.c      |  38 +++++-----
> >  11 files changed, 494 insertions(+), 264 deletions(-)
> >
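
The ownership shifts in the quoted example (10000 on disk shown as 0, 11000
shown as 1000, and files created as 1000 stored as 11000) follow from simple
offset arithmetic over one contiguous range. The sketch below models the
"both:10000:0:10000" argument as (first on-disk id, first id seen through the
mount, range length), an interpretation inferred from the listings above. The
toy_* names and the invalid-id sentinel are hypothetical and are not the
kernel's mapping helpers.

```c
#include <stdint.h>

#define TOY_ID_INVALID UINT32_MAX

/* One contiguous mapping: [disk_first, disk_first + count) <-> [mnt_first, ...) */
struct toy_idmap {
	uint32_t disk_first;
	uint32_t mnt_first;
	uint32_t count;
};

/* Translate an on-disk id into the id seen through the idmapped mount. */
uint32_t toy_disk_to_mnt(const struct toy_idmap *m, uint32_t disk_id)
{
	if (disk_id < m->disk_first || disk_id - m->disk_first >= m->count)
		return TOY_ID_INVALID;
	return m->mnt_first + (disk_id - m->disk_first);
}

/* Translate an id used through the mount into the id stored on disk. */
uint32_t toy_mnt_to_disk(const struct toy_idmap *m, uint32_t mnt_id)
{
	if (mnt_id < m->mnt_first || mnt_id - m->mnt_first >= m->count)
		return TOY_ID_INVALID;
	return m->disk_first + (mnt_id - m->mnt_first);
}
```

Ids outside the mapped range have no counterpart on the other side, which the
sketch signals with TOY_ID_INVALID.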
> 
> Christian,
> 
> I think the vfs patches raised some eyebrows, but I can understand why you
> started with the explicit opt-in approach.
> If nothing else, this posting is very educational about all the places that
> idmapping affects.
> 
> WRT this overlayfs patch: it takes the same approach as the vfs patches,
> passing user_ns all over the place, but there is usually no need for that,
> because in most cases there is a much less intrusive way to get to the mnt.
> 
> Some examples below.
> 
> > diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> > index 955ecd4030f0..1b8721796fd4 100644
> > --- a/fs/overlayfs/copy_up.c
> > +++ b/fs/overlayfs/copy_up.c
> > @@ -43,7 +43,8 @@ static bool ovl_must_copy_xattr(const char *name)
> >                !strncmp(name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN);
> >  }
> >
> > -int ovl_copy_xattr(struct super_block *sb, struct dentry *old,
> > +int ovl_copy_xattr(struct super_block *sb, struct user_namespace *old_user_ns,
> > +                  struct dentry *old, struct user_namespace *new_user_ns,
> >                    struct dentry *new)
> 
> 
> In this helper both callers already have the old path, so they can pass it
> into the helper.
> new is always upper (because we never modify lower), so if you pass
> ovl_fs to the helper (we are slowly passing it down to all helpers) you
> will have the upper mnt.
> 
> >  {
> >         ssize_t list_size, size, value_size = 0;
> > @@ -85,9 +86,9 @@ int ovl_copy_xattr(struct super_block *sb, struct dentry *old,
> >                 if (ovl_is_private_xattr(sb, name))
> >                         continue;
> >  retry:
> > -               size = vfs_getxattr(old, name, value, value_size);
> > +               size = vfs_mapped_getxattr(old_user_ns, old, name, value, value_size);
> >                 if (size == -ERANGE)
> > -                       size = vfs_getxattr(old, name, NULL, 0);
> > +                       size = vfs_mapped_getxattr(old_user_ns, old, name, NULL, 0);
> >
> >                 if (size < 0) {
> >                         error = size;
> > @@ -114,7 +115,7 @@ int ovl_copy_xattr(struct super_block *sb, struct dentry *old,
> >                         error = 0;
> >                         continue; /* Discard */
> >                 }
> > -               error = vfs_setxattr(new, name, value, size, 0);
> > +               error = vfs_mapped_setxattr(new_user_ns, new, name, value, size, 0);
> >                 if (error) {
> >                         if (error != -EOPNOTSUPP || ovl_must_copy_xattr(name))
> >                                 break;
> > @@ -228,17 +229,19 @@ static int ovl_copy_up_data(struct ovl_fs *ofs, struct path *old,
> >         return error;
> >  }
> >
> > -static int ovl_set_size(struct dentry *upperdentry, struct kstat *stat)
> > +static int ovl_set_size(struct user_namespace *user_ns,
> > +                       struct dentry *upperdentry, struct kstat *stat)
> >  {
> >         struct iattr attr = {
> >                 .ia_valid = ATTR_SIZE,
> >                 .ia_size = stat->size,
> >         };
> >
> > -       return notify_change(upperdentry, &attr, NULL);
> > +       return notify_mapped_change(user_ns, upperdentry, &attr, NULL);
> >  }
> >
> > -static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
> > +static int ovl_set_timestamps(struct user_namespace *user_ns,
> > +                             struct dentry *upperdentry, struct kstat *stat)
> >  {
> >         struct iattr attr = {
> >                 .ia_valid =
> > @@ -247,10 +250,11 @@ static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
> >                 .ia_mtime = stat->mtime,
> >         };
> >
> > -       return notify_change(upperdentry, &attr, NULL);
> > +       return notify_mapped_change(user_ns, upperdentry, &attr, NULL);
> >  }
> >
> > -int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
> > +int ovl_set_attr(struct user_namespace *user_ns, struct dentry *upperdentry,
> > +                struct kstat *stat)
> >  {
> >         int err = 0;
> >
> > @@ -259,7 +263,7 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
> >                         .ia_valid = ATTR_MODE,
> >                         .ia_mode = stat->mode,
> >                 };
> > -               err = notify_change(upperdentry, &attr, NULL);
> > +               err = notify_mapped_change(user_ns, upperdentry, &attr, NULL);
> >         }
> >         if (!err) {
> >                 struct iattr attr = {
> > @@ -267,10 +271,10 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
> >                         .ia_uid = stat->uid,
> >                         .ia_gid = stat->gid,
> >                 };
> > -               err = notify_change(upperdentry, &attr, NULL);
> > +               err = notify_mapped_change(user_ns, upperdentry, &attr, NULL);
> >         }
> >         if (!err)
> > -               ovl_set_timestamps(upperdentry, stat);
> > +               ovl_set_timestamps(user_ns, upperdentry, stat);
> >
> >         return err;
> >  }
> 
> Same for all those helpers that get upperdentry: you can pass ovl_fs.
> 
> > @@ -356,8 +360,8 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
> >  }
> >
> >  /* Store file handle of @upper dir in @index dir entry */
> > -static int ovl_set_upper_fh(struct ovl_fs *ofs, struct dentry *upper,
> > -                           struct dentry *index)
> > +static int ovl_set_upper_fh(struct ovl_fs *ofs, struct user_namespace *user_ns,
> > +                           struct dentry *upper, struct dentry *index)
> >  {
> 
> And here you already have ovl_fs.
> 
> >         const struct ovl_fh *fh;
> >         int err;
> > @@ -377,7 +381,8 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct dentry *upper,
> >   *
> >   * Caller must hold i_mutex on indexdir.
> >   */
> > -static int ovl_create_index(struct dentry *dentry, struct dentry *origin,
> > +static int ovl_create_index(struct user_namespace *user_ns,
> > +                           struct dentry *dentry, struct dentry *origin,
> >                             struct dentry *upper)
> >  {
> >         struct dentry *indexdir = ovl_indexdir(dentry->d_sb);
> > @@ -406,25 +411,25 @@ static int ovl_create_index(struct dentry *dentry, struct dentry *origin,
> >         if (err)
> >                 return err;
> >
> > -       temp = ovl_create_temp(indexdir, OVL_CATTR(S_IFDIR | 0));
> > +       temp = ovl_create_temp(user_ns, indexdir, OVL_CATTR(S_IFDIR | 0));
> >         err = PTR_ERR(temp);
> >         if (IS_ERR(temp))
> >                 goto free_name;
> >
> > -       err = ovl_set_upper_fh(OVL_FS(dentry->d_sb), upper, temp);
> > +       err = ovl_set_upper_fh(OVL_FS(dentry->d_sb), user_ns, upper, temp);
> >         if (err)
> >                 goto out;
> >
> > -       index = lookup_one_len(name.name, indexdir, name.len);
> > +       index = lookup_one_len_mapped(name.name, indexdir, name.len, user_ns);
> >         if (IS_ERR(index)) {
> >                 err = PTR_ERR(index);
> >         } else {
> > -               err = ovl_do_rename(dir, temp, dir, index, 0);
> > +               err = ovl_do_rename(dir, user_ns, temp, dir, user_ns, index, 0);
> >                 dput(index);
> >         }
> >  out:
> >         if (err)
> > -               ovl_cleanup(dir, temp);
> > +               ovl_cleanup(user_ns, dir, temp);
> >         dput(temp);
> >  free_name:
> >         kfree(name.name);
> > @@ -441,6 +446,7 @@ struct ovl_copy_up_ctx {
> >         struct dentry *destdir;
> >         struct qstr destname;
> >         struct dentry *workdir;
> > +       struct user_namespace *user_ns;
> >         bool origin;
> >         bool indexed;
> >         bool metacopy;
> > @@ -463,16 +469,17 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
> >                 return err;
> >
> >         inode_lock_nested(udir, I_MUTEX_PARENT);
> > -       upper = lookup_one_len(c->dentry->d_name.name, upperdir,
> > -                              c->dentry->d_name.len);
> > +       upper = lookup_one_len_mapped(c->dentry->d_name.name, upperdir,
> > +                                 c->dentry->d_name.len,
> > +                                 c->user_ns);
> >         err = PTR_ERR(upper);
> >         if (!IS_ERR(upper)) {
> > -               err = ovl_do_link(ovl_dentry_upper(c->dentry), udir, upper);
> > +               err = ovl_do_link(c->user_ns, ovl_dentry_upper(c->dentry), udir, upper);
> >                 dput(upper);
> >
> >                 if (!err) {
> >                         /* Restore timestamps on parent (best effort) */
> > -                       ovl_set_timestamps(upperdir, &c->pstat);
> > +                       ovl_set_timestamps(c->user_ns, upperdir, &c->pstat);
> >                         ovl_dentry_set_upper_alias(c->dentry);
> >                 }
> >         }
> > @@ -509,7 +516,8 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
> >                         return err;
> >         }
> >
> > -       err = ovl_copy_xattr(c->dentry->d_sb, c->lowerpath.dentry, temp);
> > +       err = ovl_copy_xattr(c->dentry->d_sb, mnt_user_ns(c->lowerpath.mnt),
> > +                            c->lowerpath.dentry, c->user_ns, temp);
> >         if (err)
> >                 return err;
> >
> > @@ -535,9 +543,9 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
> >
> >         inode_lock(temp->d_inode);
> >         if (S_ISREG(c->stat.mode))
> > -               err = ovl_set_size(temp, &c->stat);
> > +               err = ovl_set_size(c->user_ns, temp, &c->stat);
> >         if (!err)
> > -               err = ovl_set_attr(temp, &c->stat);
> > +               err = ovl_set_attr(c->user_ns, temp, &c->stat);
> >         inode_unlock(temp->d_inode);
> >
> >         return err;
> > @@ -598,7 +606,7 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c)
> >         if (err)
> >                 goto unlock;
> >
> > -       temp = ovl_create_temp(c->workdir, &cattr);
> > +       temp = ovl_create_temp(c->user_ns, c->workdir, &cattr);
> >         ovl_revert_cu_creds(&cc);
> >
> >         err = PTR_ERR(temp);
> > @@ -610,17 +618,18 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c)
> >                 goto cleanup;
> >
> >         if (S_ISDIR(c->stat.mode) && c->indexed) {
> > -               err = ovl_create_index(c->dentry, c->lowerpath.dentry, temp);
> > +               err = ovl_create_index(c->user_ns, c->dentry, c->lowerpath.dentry, temp);
> >                 if (err)
> >                         goto cleanup;
> >         }
> >
> > -       upper = lookup_one_len(c->destname.name, c->destdir, c->destname.len);
> > +       upper = lookup_one_len_mapped(c->destname.name, c->destdir, c->destname.len,
> > +                                 c->user_ns);
> >         err = PTR_ERR(upper);
> >         if (IS_ERR(upper))
> >                 goto cleanup;
> >
> > -       err = ovl_do_rename(wdir, temp, udir, upper, 0);
> > +       err = ovl_do_rename(wdir, c->user_ns, temp, udir, c->user_ns, upper, 0);
> >         dput(upper);
> >         if (err)
> >                 goto cleanup;
> > @@ -637,7 +646,7 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c)
> >         return err;
> >
> >  cleanup:
> > -       ovl_cleanup(wdir, temp);
> > +       ovl_cleanup(c->user_ns, wdir, temp);
> >         dput(temp);
> >         goto unlock;
> >  }
> > @@ -654,7 +663,7 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c)
> >         if (err)
> >                 return err;
> >
> > -       temp = ovl_do_tmpfile(c->workdir, c->stat.mode);
> > +       temp = ovl_do_tmpfile(c->user_ns, c->workdir, c->stat.mode);
> >         ovl_revert_cu_creds(&cc);
> >
> >         if (IS_ERR(temp))
> > @@ -666,10 +675,11 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c)
> >
> >         inode_lock_nested(udir, I_MUTEX_PARENT);
> >
> > -       upper = lookup_one_len(c->destname.name, c->destdir, c->destname.len);
> > +       upper = lookup_one_len_mapped(c->destname.name, c->destdir, c->destname.len,
> > +                                 c->user_ns);
> >         err = PTR_ERR(upper);
> >         if (!IS_ERR(upper)) {
> > -               err = ovl_do_link(temp, udir, upper);
> > +               err = ovl_do_link(c->user_ns, temp, udir, upper);
> >                 dput(upper);
> >         }
> >         inode_unlock(udir);
> > @@ -757,7 +767,7 @@ static int ovl_do_copy_up(struct ovl_copy_up_ctx *c)
> >
> >                 /* Restore timestamps on parent (best effort) */
> >                 inode_lock(udir);
> > -               ovl_set_timestamps(c->destdir, &c->pstat);
> > +               ovl_set_timestamps(c->user_ns, c->destdir, &c->pstat);
> >                 inode_unlock(udir);
> >
> >                 ovl_dentry_set_upper_alias(c->dentry);
> > @@ -786,12 +796,13 @@ static bool ovl_need_meta_copy_up(struct dentry *dentry, umode_t mode,
> >         return true;
> >  }
> >
> > -static ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value)
> > +static ssize_t ovl_getxattr(struct user_namespace *user_ns,
> > +                           struct dentry *dentry, char *name, char **value)
> >  {
> >         ssize_t res;
> >         char *buf;
> >
> > -       res = vfs_getxattr(dentry, name, NULL, 0);
> > +       res = vfs_mapped_getxattr(user_ns, dentry, name, NULL, 0);
> >         if (res == -ENODATA || res == -EOPNOTSUPP)
> >                 res = 0;
> >
> > @@ -800,7 +811,7 @@ static ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value)
> >                 if (!buf)
> >                         return -ENOMEM;
> >
> > -               res = vfs_getxattr(dentry, name, buf, res);
> > +               res = vfs_mapped_getxattr(user_ns, dentry, name, buf, res);
> >                 if (res < 0)
> >                         kfree(buf);
> >                 else
> > @@ -814,6 +825,7 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
> >  {
> >         struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
> >         struct path upperpath, datapath;
> > +       struct user_namespace *user_ns;
> >         int err;
> >         char *capability = NULL;
> >         ssize_t cap_size;
> > @@ -827,8 +839,8 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
> >                 return -EIO;
> >
> >         if (c->stat.size) {
> > -               err = cap_size = ovl_getxattr(upperpath.dentry, XATTR_NAME_CAPS,
> > -                                             &capability);
> > +               err = cap_size = ovl_getxattr(c->user_ns, upperpath.dentry,
> > +                                             XATTR_NAME_CAPS, &capability);
> >                 if (cap_size < 0)
> >                         goto out;
> >         }
> > @@ -841,9 +853,10 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
> >          * Writing to upper file will clear security.capability xattr. We
> >          * don't want that to happen for normal copy-up operation.
> >          */
> > +       user_ns = mnt_user_ns(upperpath.mnt);
> >         if (capability) {
> > -               err = vfs_setxattr(upperpath.dentry, XATTR_NAME_CAPS,
> > -                                  capability, cap_size, 0);
> > +               err = vfs_mapped_setxattr(user_ns, upperpath.dentry,
> > +                                     XATTR_NAME_CAPS, capability, cap_size, 0);
> >                 if (err)
> >                         goto out_free;
> >         }
> > @@ -887,6 +900,7 @@ static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
> >                 ovl_path_upper(parent, &parentpath);
> >                 ctx.destdir = parentpath.dentry;
> >                 ctx.destname = dentry->d_name;
> > +               ctx.user_ns = mnt_user_ns(parentpath.mnt);
> >
> >                 err = vfs_getattr(&parentpath, &ctx.pstat,
> >                                   STATX_ATIME | STATX_MTIME,
> > diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> > index 28a075b5f5b2..23d09de00957 100644
> > --- a/fs/overlayfs/dir.c
> > +++ b/fs/overlayfs/dir.c
> > @@ -23,15 +23,16 @@ MODULE_PARM_DESC(redirect_max,
> >
> >  static int ovl_set_redirect(struct dentry *dentry, bool samedir);
> >
> > -int ovl_cleanup(struct inode *wdir, struct dentry *wdentry)
> > +int ovl_cleanup(struct user_namespace *user_ns, struct inode *wdir,
> > +               struct dentry *wdentry)
> 
> I think this one and lookup_temp should also pass ovl_fs, but
> need to be careful when they are called from ovl_fill_super()
> that upper_mnt is already assigned (should be).
> 
> >  {
> >         int err;
> >
> >         dget(wdentry);
> >         if (d_is_dir(wdentry))
> > -               err = ovl_do_rmdir(wdir, wdentry);
> > +               err = ovl_do_rmdir(user_ns, wdir, wdentry);
> >         else
> > -               err = ovl_do_unlink(wdir, wdentry);
> > +               err = ovl_do_unlink(user_ns, wdir, wdentry);
> >         dput(wdentry);
> >
> >         if (err) {
> > @@ -42,7 +43,8 @@ int ovl_cleanup(struct inode *wdir, struct dentry *wdentry)
> >         return err;
> >  }
> >
> > -struct dentry *ovl_lookup_temp(struct dentry *workdir)
> > +struct dentry *ovl_lookup_temp(struct user_namespace *user_ns,
> > +                              struct dentry *workdir)
> >  {
> >         struct dentry *temp;
> >         char name[20];
> > @@ -51,7 +53,7 @@ struct dentry *ovl_lookup_temp(struct dentry *workdir)
> >         /* counter is allowed to wrap, since temp dentries are ephemeral */
> >         snprintf(name, sizeof(name), "#%x", atomic_inc_return(&temp_id));
> >
> > -       temp = lookup_one_len(name, workdir, strlen(name));
> > +       temp = lookup_one_len_mapped(name, workdir, strlen(name), user_ns);
> >         if (!IS_ERR(temp) && temp->d_inode) {
> >                 pr_err("workdir/%s already exists\n", name);
> >                 dput(temp);
> > @@ -68,13 +70,14 @@ static struct dentry *ovl_whiteout(struct ovl_fs *ofs)
> >         struct dentry *whiteout;
> >         struct dentry *workdir = ofs->workdir;
> >         struct inode *wdir = workdir->d_inode;
> > +       struct user_namespace *user_ns = ovl_upper_mnt_user_ns(ofs);
> >
> >         if (!ofs->whiteout) {
> > -               whiteout = ovl_lookup_temp(workdir);
> > +               whiteout = ovl_lookup_temp(user_ns, workdir);
> >                 if (IS_ERR(whiteout))
> >                         goto out;
> >
> > -               err = ovl_do_whiteout(wdir, whiteout);
> > +               err = ovl_do_whiteout(user_ns, wdir, whiteout);
> >                 if (err) {
> >                         dput(whiteout);
> >                         whiteout = ERR_PTR(err);
> > @@ -84,11 +87,11 @@ static struct dentry *ovl_whiteout(struct ovl_fs *ofs)
> >         }
> >
> >         if (ofs->share_whiteout) {
> > -               whiteout = ovl_lookup_temp(workdir);
> > +               whiteout = ovl_lookup_temp(user_ns, workdir);
> >                 if (IS_ERR(whiteout))
> >                         goto out;
> >
> > -               err = ovl_do_link(ofs->whiteout, wdir, whiteout);
> > +               err = ovl_do_link(user_ns, ofs->whiteout, wdir, whiteout);
> >                 if (!err)
> >                         goto out;
> >
> > @@ -110,6 +113,7 @@ int ovl_cleanup_and_whiteout(struct ovl_fs *ofs, struct inode *dir,
> >                              struct dentry *dentry)
> >  {
> >         struct inode *wdir = ofs->workdir->d_inode;
> > +       struct user_namespace *user_ns = ovl_upper_mnt_user_ns(ofs);
> >         struct dentry *whiteout;
> >         int err;
> >         int flags = 0;
> > @@ -122,28 +126,28 @@ int ovl_cleanup_and_whiteout(struct ovl_fs *ofs, struct inode *dir,
> >         if (d_is_dir(dentry))
> >                 flags = RENAME_EXCHANGE;
> >
> > -       err = ovl_do_rename(wdir, whiteout, dir, dentry, flags);
> > +       err = ovl_do_rename(wdir, user_ns, whiteout, dir, user_ns, dentry, flags);
> >         if (err)
> >                 goto kill_whiteout;
> >         if (flags)
> > -               ovl_cleanup(wdir, dentry);
> > +               ovl_cleanup(user_ns, wdir, dentry);
> >
> >  out:
> >         dput(whiteout);
> >         return err;
> >
> >  kill_whiteout:
> > -       ovl_cleanup(wdir, whiteout);
> > +       ovl_cleanup(user_ns, wdir, whiteout);
> >         goto out;
> >  }
> >
> > -static int ovl_mkdir_real(struct inode *dir, struct dentry **newdentry,
> > -                         umode_t mode)
> > +static int ovl_mkdir_real(struct user_namespace *user_ns, struct inode *dir,
> > +                         struct dentry **newdentry, umode_t mode)
> >  {
> >         int err;
> >         struct dentry *d, *dentry = *newdentry;
> >
> > -       err = ovl_do_mkdir(dir, dentry, mode);
> > +       err = ovl_do_mkdir(user_ns, dir, dentry, mode);
> >         if (err)
> >                 return err;
> >
> > @@ -155,8 +159,8 @@ static int ovl_mkdir_real(struct inode *dir, struct dentry **newdentry,
> >          * to it unhashed and negative. If that happens, try to
> >          * lookup a new hashed and positive dentry.
> >          */
> > -       d = lookup_one_len(dentry->d_name.name, dentry->d_parent,
> > -                          dentry->d_name.len);
> > +       d = lookup_one_len_mapped(dentry->d_name.name, dentry->d_parent,
> > +                             dentry->d_name.len, user_ns);
> >         if (IS_ERR(d)) {
> >                 pr_warn("failed lookup after mkdir (%pd2, err=%i).\n",
> >                         dentry, err);
> > @@ -168,7 +172,8 @@ static int ovl_mkdir_real(struct inode *dir, struct dentry **newdentry,
> >         return 0;
> >  }
> >
> > -struct dentry *ovl_create_real(struct inode *dir, struct dentry *newdentry,
> > +struct dentry *ovl_create_real(struct user_namespace *user_ns,
> > +                              struct inode *dir, struct dentry *newdentry,
> >                                struct ovl_cattr *attr)
> >  {
> >         int err;
> > @@ -181,28 +186,28 @@ struct dentry *ovl_create_real(struct inode *dir, struct dentry *newdentry,
> >                 goto out;
> >
> >         if (attr->hardlink) {
> > -               err = ovl_do_link(attr->hardlink, dir, newdentry);
> > +               err = ovl_do_link(user_ns, attr->hardlink, dir, newdentry);
> >         } else {
> >                 switch (attr->mode & S_IFMT) {
> >                 case S_IFREG:
> > -                       err = ovl_do_create(dir, newdentry, attr->mode);
> > +                       err = ovl_do_create(user_ns, dir, newdentry, attr->mode);
> >                         break;
> >
> >                 case S_IFDIR:
> >                         /* mkdir is special... */
> > -                       err =  ovl_mkdir_real(dir, &newdentry, attr->mode);
> > +                       err =  ovl_mkdir_real(user_ns, dir, &newdentry, attr->mode);
> >                         break;
> >
> >                 case S_IFCHR:
> >                 case S_IFBLK:
> >                 case S_IFIFO:
> >                 case S_IFSOCK:
> > -                       err = ovl_do_mknod(dir, newdentry, attr->mode,
> > +                       err = ovl_do_mknod(user_ns, dir, newdentry, attr->mode,
> >                                            attr->rdev);
> >                         break;
> >
> >                 case S_IFLNK:
> > -                       err = ovl_do_symlink(dir, newdentry, attr->link);
> > +                       err = ovl_do_symlink(user_ns, dir, newdentry, attr->link);
> >                         break;
> >
> >                 default:
> > @@ -224,10 +229,11 @@ struct dentry *ovl_create_real(struct inode *dir, struct dentry *newdentry,
> >         return newdentry;
> >  }
> >
> 
> ovl_fs to all those create helpers
> 
> > -struct dentry *ovl_create_temp(struct dentry *workdir, struct ovl_cattr *attr)
> > +struct dentry *ovl_create_temp(struct user_namespace *user_ns, struct dentry *workdir,
> > +                              struct ovl_cattr *attr)
> 
> Not only should this get ovl_fs, but workdir could be later taken from
> ofs->workdir.
> It's subtle because there is one caller ovl_create_index() that needs special
> care so leave that cleanup to me.
> 
> Anyway, I think you get the idea, but I am jumping ahead of myself.
> Let's wait and see how the vfs patches play out.
> 
> One naive question:
> 
> If we want to avoid all the vfs API churn we can store the mnt_user_ns
> in current cred.
> Overlayfs can prepare at mount time one cred copy per layer and in
> ovl_override_creds() we can pass the ovl_layer id.
> In some functions, such as ovl_lookup() that access several layers, we
> will need to take care of overriding different creds per layer.
> 
> The concerns raised about storing userns in current wrt io_uring are
> not relevant for overlayfs access to underlying layers.
> 
> I know you and others have listed some use cases that are not
> involving overlayfs, but perhaps as a first step, this approach will
> be a much easier sell and can cover some of the use cases?

The vfs approach was always going to raise eyebrows. That doesn't mean
it's not the correct way to do it. Any larger change to the vfs would
have had, and did have, the same effect.
I think that we should go for a proper generic solution instead of one
that only works for a single filesystem. Overlayfs is a use-case that
we're keen to handle as part of the larger approach to this.

Christian

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29 21:58 ` Andy Lutomirski
@ 2020-10-30 12:01   ` Christian Brauner
  2020-10-30 16:17     ` Serge E. Hallyn
  2020-10-31 17:43     ` Andy Lutomirski
  0 siblings, 2 replies; 68+ messages in thread
From: Christian Brauner @ 2020-10-30 12:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexander Viro, Christoph Hellwig, linux-fsdevel, John Johansen,
	James Morris, Mimi Zohar, Dmitry Kasatkin, Stephen Smalley,
	Casey Schaufler, Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi,
	Geoffrey Thomas, Mrunal Patel, Josh Triplett, Andy Lutomirski,
	Amir Goldstein, Miklos Szeredi, Theodore Tso, Alban Crequy,
	Tycho Andersen, David Howells, James Bottomley, Jann Horn,
	Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux

On Thu, Oct 29, 2020 at 02:58:55PM -0700, Andy Lutomirski wrote:
> 
> 
> > On Oct 28, 2020, at 5:35 PM, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> > 
> > Hey everyone,
> > 
> > I vanished for a little while to focus on this work here so sorry for
> > not being available by mail for a while.
> > 
> > Since quite a long time we have issues with sharing mounts between
> > multiple unprivileged containers with different id mappings, sharing a
> > rootfs between multiple containers with different id mappings, and also
> > sharing regular directories and filesystems between users with different
> > uids and gids. The latter use-cases have become even more important with
> > the availability and adoption of systemd-homed (cf. [1]) to implement
> > portable home directories.
> > 
> > The solutions we have tried and proposed so far include the introduction
> > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > call override creds in the vfs. None of these solutions have covered all
> > of the above use-cases.
> > 
> > The solution proposed here has its origins in multiple discussions
> > during Linux Plumbers 2017 during and after the end of the containers
> > microconference.
> > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > James, and myself. A variant of the solution proposed here has also been
> > discussed, again to the best of my knowledge, after a Linux conference
> > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > after Linux Plumbers.
> > I've taken the time to finally implement a working version of this
> > solution over the last weeks to the best of my abilities. Tycho has
> > signed up for this slightly crazy endeavour as well and he has helped
> > with the conversion of the xattr codepaths.
> > 
> > The core idea is to make idmappings a property of struct vfsmount
> > instead of tying it to a process being inside of a user namespace which
> > has been the case for all other proposed approaches.
> > It means that idmappings become a property of bind-mounts, i.e. each
> > bind-mount can have a separate idmapping. This has the obvious advantage
> > that idmapped mounts can be created inside of the initial user
> > namespace, i.e. on the host itself instead of requiring the caller to be
> > located inside of a user namespace. This enables such use-cases as e.g.
> > making a usb stick available in multiple locations with different
> > idmappings (see the vfat port that is part of this patch series).
> > 
> > The vfsmount struct gains a new struct user_namespace member. The
> > idmapping of the user namespace becomes the idmapping of the mount. A
> > caller that is either privileged with respect to the user namespace of
> > the superblock of the underlying filesystem or a caller that is
> > privileged with respect to the user namespace a mount has been idmapped
> > with can create a new bind-mount and mark it with a user namespace.
> 
> So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside.
> 
> For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege.
> 
> Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this.  Inside the namespace, the user creates a suid-root file.
> 
> Now, outside the namespace, the user has privilege over the namespace.  (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.) So the user makes a new bind mount and idmaps it to the init namespace. Game over.
> 
> So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace.  We already do the latter for the vfsmnt’s mntns’s userns.

With this series, in order to create an idmapped mount the user must
either have CAP_SYS_ADMIN in the user namespace of the superblock of the
underlying filesystem or, if the mount is already idmapped and they want
to create another idmapped mount from it, they must have CAP_SYS_ADMIN
in the userns that the mount is currently marked with. It is also not
possible to change the idmapping of a mount once it has been idmapped,
i.e. the user must create a new detached bind-mount first.
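
As a rough illustration only, the privilege rule described above can be
written as a small userspace predicate. This is a sketch, not code from
the series: the function name and its three boolean inputs are
hypothetical stand-ins for the capability checks the kernel would
perform.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate, not a kernel function.
 *
 * admin_in_sb_userns:  caller has CAP_SYS_ADMIN in the user namespace
 *                      of the superblock of the underlying filesystem.
 * mnt_is_idmapped:     the source mount is already marked with a userns.
 * admin_in_mnt_userns: caller has CAP_SYS_ADMIN in the userns the
 *                      source mount is marked with.
 */
bool may_create_idmapped_mount(bool admin_in_sb_userns,
			       bool mnt_is_idmapped,
			       bool admin_in_mnt_userns)
{
	/* Privilege over the superblock's userns is always sufficient. */
	if (admin_in_sb_userns)
		return true;

	/* Otherwise the source mount must be idmapped and the caller
	 * must be privileged in the userns it is marked with. */
	return mnt_is_idmapped && admin_in_mnt_userns;
}
```

Note that being privileged in some user namespace is not enough on its
own in this sketch: for a mount that is not yet idmapped only the
superblock check applies.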

> 
> Hmm.  What happens if we require that an idmap userns equal the vfsmnt’s mntns’s userns?  Is that too limiting?
> 
> I hope that whatever solution gets used is straightforward enough to wrap one’s head around.
> 
> > When a file/inode is accessed through an idmapped mount the i_uid and
> > i_gid of the inode will be remapped according to the user namespace the
> > mount has been marked with. When a new object is created based on the
> > fsuid and fsgid of the caller they will similarly be remapped according
> > to the user namespace of the mount they are created from.
> 
> By “mapped according to”, I presume you mean that the on-disk uid/gid is the uid/gid as seen in the user namespace in question.

If I understand you correctly, then yes.
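
To make that reading concrete, here is a minimal userspace sketch,
assuming a mount idmapping that consists of a single extent (ids
ns_first..ns_first+count-1 inside the namespace correspond to
host_first..host_first+count-1 outside it). The struct, the function
names and ID_INVALID are made up for illustration and are not the
helpers used in the patches.

```c
#include <stdint.h>

#define ID_INVALID ((uint32_t)-1)	/* hypothetical "unmapped" marker */

/* Hypothetical single-extent idmapping of a mount. */
struct idmap_extent {
	uint32_t ns_first;	/* first id as seen inside the userns */
	uint32_t host_first;	/* first corresponding id outside it */
	uint32_t count;		/* number of ids in the extent */
};

/* The on-disk id is treated as an id inside the mount's userns and
 * translated to the id callers see when going through the mount. */
uint32_t disk_to_mount_view(const struct idmap_extent *m, uint32_t disk_id)
{
	if (disk_id < m->ns_first || disk_id - m->ns_first >= m->count)
		return ID_INVALID;
	return m->host_first + (disk_id - m->ns_first);
}

/* On creation the caller's fsuid/fsgid is translated the other way
 * round to obtain the id that ends up on disk. */
uint32_t caller_to_disk(const struct idmap_extent *m, uint32_t fsid)
{
	if (fsid < m->host_first || fsid - m->host_first >= m->count)
		return ID_INVALID;
	return m->ns_first + (fsid - m->host_first);
}
```

With an extent of { 0, 100000, 65536 } a file owned by on-disk uid 1000
shows up as uid 101000 through the mount, and a caller with fsuid 101000
creates files that are stored with uid 1000.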

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29 16:37     ` Eric W. Biederman
  2020-10-30  2:18       ` Serge E. Hallyn
@ 2020-10-30 15:07       ` Seth Forshee
  2020-10-30 16:03         ` Serge E. Hallyn
  2020-11-03 14:10       ` Alban Crequy
  2 siblings, 1 reply; 68+ messages in thread
From: Seth Forshee @ 2020-10-30 15:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aleksa Sarai, Christian Brauner, Alexander Viro,
	Christoph Hellwig, linux-fsdevel, John Johansen, James Morris,
	Mimi Zohar, Dmitry Kasatkin, Stephen Smalley, Casey Schaufler,
	Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi, Geoffrey Thomas,
	Mrunal Patel, Josh Triplett, Andy Lutomirski, Amir Goldstein,
	Miklos Szeredi, Theodore Tso, Alban Crequy, Tycho Andersen,
	David Howells, James Bottomley, Jann Horn, Stéphane Graber,
	Lennart Poettering, smbarber, Phil Estes, Serge Hallyn,
	Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux

On Thu, Oct 29, 2020 at 11:37:23AM -0500, Eric W. Biederman wrote:
> First and foremost: A uid shift on write to a filesystem is a security
> bug waiting to happen.  This is especially true in the context of
> facilities like io_uring, that play very aggressive games with how
> process context makes it to system calls.
> 
> The only reason containers were not immediately exploitable when
> io_uring was introduced is because the mechanisms are built so that even
> if something escapes containment the security properties still apply.
> Changes to the uid when writing to the filesystem do not have that
> property.  The tiniest slip in containment will be a security issue.
> 
> This is not even the least bit theoretical.  I have seen reports of how
> shiftfs+overlayfs created a situation where anyone could read
> /etc/shadow.

This bug was the result of a complex interaction with several
contributing factors. It's fair to say that one component was overlayfs
writing through an id-shifted mount, but the primary cause was related
to how copy-up was done coupled with allowing unprivileged overlayfs
mounts in a user ns. Checks that the mounter had access to the lower fs
file were not done before copying data up, and so the file was copied up
temporarily to the id shifted upperdir. Even though it was immediately
removed, other factors made it possible for the user to get the file
contents from the upperdir.

Regardless, I do think you raise a good point. We need to be wary of any
place the kernel could open files through a shifted mount, especially
when the open could be influenced by userspace.

Perhaps kernel file opens through shifted mounts should be opt-in.
I.e. unless a flag is passed, or a different open interface is used, the
open will fail if the dentry being opened is subject to id shifting.
This way any kernel writes which would be subject to id shifting will
only happen through code which has been written to take it into account.
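
A very rough sketch of what such an opt-in check could look like,
modelled in plain C. Nothing here exists in the kernel: the flag name,
the function and the choice of -EPERM as the error are all assumptions
made for the sake of the example.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag a kernel-internal caller would pass to state that
 * it has been written with id shifting in mind. */
#define KOPEN_ALLOW_IDSHIFT 0x1u

/*
 * Hypothetical gate for kernel-internal opens: returns 0 if the open
 * may proceed, or a negative errno (assumed: -EPERM) if the dentry is
 * subject to id shifting and the caller did not opt in.
 */
int kernel_open_permitted(bool dentry_is_id_shifted, unsigned int kopen_flags)
{
	if (dentry_is_id_shifted && !(kopen_flags & KOPEN_ALLOW_IDSHIFT))
		return -EPERM;
	return 0;
}
```

Existing callers that pass no flag keep working on ordinary mounts and
fail closed on shifted ones, which is the property being asked for.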

Seth

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-30 15:07       ` Seth Forshee
@ 2020-10-30 16:03         ` Serge E. Hallyn
  0 siblings, 0 replies; 68+ messages in thread
From: Serge E. Hallyn @ 2020-10-30 16:03 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Eric W. Biederman, Aleksa Sarai, Christian Brauner,
	Alexander Viro, Christoph Hellwig, linux-fsdevel, John Johansen,
	James Morris, Mimi Zohar, Dmitry Kasatkin, Stephen Smalley,
	Casey Schaufler, Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi,
	Geoffrey Thomas, Mrunal Patel, Josh Triplett, Andy Lutomirski,
	Amir Goldstein, Miklos Szeredi, Theodore Tso, Alban Crequy,
	Tycho Andersen, David Howells, James Bottomley, Jann Horn,
	Stéphane Graber, Lennart Poettering, smbarber, Phil Estes,
	Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, containers,
	linux-security-module, linux-api, linux-ext4, linux-unionfs,
	linux-audit, linux-integrity, selinux

On Fri, Oct 30, 2020 at 10:07:48AM -0500, Seth Forshee wrote:
> On Thu, Oct 29, 2020 at 11:37:23AM -0500, Eric W. Biederman wrote:
> > First and foremost: A uid shift on write to a filesystem is a security
> > bug waiting to happen.  This is especially true in the context of
> > facilities like io_uring, that play very aggressive games with how
> > process context makes it to system calls.
> > 
> > The only reason containers were not immediately exploitable when
> > io_uring was introduced is because the mechanisms are built so that even
> > if something escapes containment the security properties still apply.
> > Changes to the uid when writing to the filesystem do not have that
> > property.  The tiniest slip in containment will be a security issue.
> > 
> > This is not even the least bit theoretical.  I have seen reports of how
> > shiftfs+overlayfs created a situation where anyone could read
> > /etc/shadow.
> 
> This bug was the result of a complex interaction with several
> contributing factors. It's fair to say that one component was overlayfs
> writing through an id-shifted mount, but the primary cause was related
> to how copy-up was done coupled with allowing unprivileged overlayfs
> mounts in a user ns. Checks that the mounter had access to the lower fs
> file were not done before copying data up, and so the file was copied up
> temporarily to the id shifted upperdir. Even though it was immediately
> removed, other factors made it possible for the user to get the file
> contents from the upperdir.
> 
> Regardless, I do think you raise a good point. We need to be wary of any
> place the kernel could open files through a shifted mount, especially
> when the open could be influenced by userspace.
> 
> Perhaps kernel file opens through shifted mounts should be opt-in.
> I.e. unless a flag is passed, or a different open interface is used, the
> open will fail if the dentry being opened is subject to id shifting.
> This way any kernel writes which would be subject to id shifting will
> only happen through code which has been written to take it into account.

For my use cases, it would be fine to require opt-in at original fs
mount time by init_user_ns admin.  I.e.
    mount -o allow_idmap /dev/mapper/whoozit /whatzit
I'm quite certain I would always be sharing a separate LV or loopback or
tmpfs.

-serge

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-30 12:01   ` Christian Brauner
@ 2020-10-30 16:17     ` Serge E. Hallyn
  2020-10-31 17:43     ` Andy Lutomirski
  1 sibling, 0 replies; 68+ messages in thread
From: Serge E. Hallyn @ 2020-10-30 16:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andy Lutomirski, Alexander Viro, Christoph Hellwig,
	linux-fsdevel, John Johansen, James Morris, Mimi Zohar,
	Dmitry Kasatkin, Stephen Smalley, Casey Schaufler, Arnd Bergmann,
	Andreas Dilger, OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel,
	Josh Triplett, Andy Lutomirski, Amir Goldstein, Miklos Szeredi,
	Theodore Tso, Alban Crequy, Tycho Andersen, David Howells,
	James Bottomley, Jann Horn, Seth Forshee, Stéphane Graber,
	Aleksa Sarai, Lennart Poettering, Eric W. Biederman, smbarber,
	Phil Estes, Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet,
	containers, linux-security-module, linux-api, linux-ext4,
	linux-unionfs, linux-audit, linux-integrity, selinux

On Fri, Oct 30, 2020 at 01:01:57PM +0100, Christian Brauner wrote:
> On Thu, Oct 29, 2020 at 02:58:55PM -0700, Andy Lutomirski wrote:
> > 
> > 
> > > On Oct 28, 2020, at 5:35 PM, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> > > 
> > > Hey everyone,
> > > 
> > > I vanished for a little while to focus on this work here so sorry for
> > > not being available by mail for a while.
> > > 
> > > Since quite a long time we have issues with sharing mounts between
> > > multiple unprivileged containers with different id mappings, sharing a
> > > rootfs between multiple containers with different id mappings, and also
> > > sharing regular directories and filesystems between users with different
> > > uids and gids. The latter use-cases have become even more important with
> > > the availability and adoption of systemd-homed (cf. [1]) to implement
> > > portable home directories.
> > > 
> > > The solutions we have tried and proposed so far include the introduction
> > > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > > call override creds in the vfs. None of these solutions have covered all
> > > of the above use-cases.
> > > 
> > > The solution proposed here has its origins in multiple discussions
> > > during Linux Plumbers 2017 during and after the end of the containers
> > > microconference.
> > > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > > James, and myself. A variant of the solution proposed here has also been
> > > discussed, again to the best of my knowledge, after a Linux conference
> > > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > > after Linux Plumbers.
> > > I've taken the time to finally implement a working version of this
> > > solution over the last weeks to the best of my abilities. Tycho has
> > > signed up for this slightly crazy endeavour as well and he has helped
> > > with the conversion of the xattr codepaths.
> > > 
> > > The core idea is to make idmappings a property of struct vfsmount
> > > instead of tying it to a process being inside of a user namespace which
> > > has been the case for all other proposed approaches.
> > > It means that idmappings become a property of bind-mounts, i.e. each
> > > bind-mount can have a separate idmapping. This has the obvious advantage
> > > that idmapped mounts can be created inside of the initial user
> > > namespace, i.e. on the host itself instead of requiring the caller to be
> > > located inside of a user namespace. This enables such use-cases as e.g.
> > > making a usb stick available in multiple locations with different
> > > idmappings (see the vfat port that is part of this patch series).
> > > 
> > > The vfsmount struct gains a new struct user_namespace member. The
> > > idmapping of the user namespace becomes the idmapping of the mount. A
> > > caller that is either privileged with respect to the user namespace of
> > > the superblock of the underlying filesystem or a caller that is
> > > privileged with respect to the user namespace a mount has been idmapped
> > > with can create a new bind-mount and mark it with a user namespace.
> > 
> > So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside.
> > 
> > For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege.
> > 
> > Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this.  Inside the namespace, the user creates a suid-root file.
> > 
> > Now, outside the namespace, the user has privilege over the namespace.  (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.) So the user makes a new bind mount and idmaps it to the init namespace. Game over.
> > 
> > So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace.  We already do the latter for the vfsmnt’s mntns’s userns.
> 
> With this series, in order to create an idmapped mount the user must
> either be cap_sys_admin in the superblock of the underlying filesystem
> or if the mount is already idmapped and they want to create another
> idmapped mount from it they must have cap_sys_admin in the userns that
> the mount is currently marked with. It is also not possible to change
> an idmapped mount once it has been idmapped, i.e. the user must create a
> new detached bind-mount first.

Yeah I spent quite some time last night trying to figure out the scenario
you were presenting, but I failed.  Andy, could you either rephrase or
give a more concrete end to end attack scenario?

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-30 12:01   ` Christian Brauner
  2020-10-30 16:17     ` Serge E. Hallyn
@ 2020-10-31 17:43     ` Andy Lutomirski
  1 sibling, 0 replies; 68+ messages in thread
From: Andy Lutomirski @ 2020-10-31 17:43 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Christoph Hellwig, Linux FS Devel, John Johansen,
	James Morris, Mimi Zohar, Dmitry Kasatkin, Stephen Smalley,
	Casey Schaufler, Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi,
	Geoffrey Thomas, Mrunal Patel, Josh Triplett, Andy Lutomirski,
	Amir Goldstein, Miklos Szeredi, Theodore Tso, Alban Crequy,
	Tycho Andersen, David Howells, James Bottomley, Jann Horn,
	Seth Forshee, Stéphane Graber, Aleksa Sarai,
	Lennart Poettering, Eric W. Biederman, Stephen Barber,
	Phil Estes, Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet,
	Linux Containers, LSM List, Linux API, Ext4 Developers List,
	linux-unionfs, linux-audit, linux-integrity, selinux

On Fri, Oct 30, 2020 at 5:02 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Thu, Oct 29, 2020 at 02:58:55PM -0700, Andy Lutomirski wrote:
> >
> >
> > > On Oct 28, 2020, at 5:35 PM, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> > >
> > > Hey everyone,
> > >
> > > I vanished for a little while to focus on this work here so sorry for
> > > not being available by mail for a while.
> > >
> > > Since quite a long time we have issues with sharing mounts between
> > > multiple unprivileged containers with different id mappings, sharing a
> > > rootfs between multiple containers with different id mappings, and also
> > > sharing regular directories and filesystems between users with different
> > > uids and gids. The latter use-cases have become even more important with
> > > the availability and adoption of systemd-homed (cf. [1]) to implement
> > > portable home directories.
> > >
> > > The solutions we have tried and proposed so far include the introduction
> > > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > > call override creds in the vfs. None of these solutions have covered all
> > > of the above use-cases.
> > >
> > > The solution proposed here has its origins in multiple discussions
> > > during Linux Plumbers 2017 during and after the end of the containers
> > > microconference.
> > > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > > James, and myself. A variant of the solution proposed here has also been
> > > discussed, again to the best of my knowledge, after a Linux conference
> > > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > > after Linux Plumbers.
> > > I've taken the time to finally implement a working version of this
> > > solution over the last weeks to the best of my abilities. Tycho has
> > > signed up for this slightly crazy endeavour as well and he has helped
> > > with the conversion of the xattr codepaths.
> > >
> > > The core idea is to make idmappings a property of struct vfsmount
> > > instead of tying it to a process being inside of a user namespace which
> > > has been the case for all other proposed approaches.
> > > It means that idmappings become a property of bind-mounts, i.e. each
> > > bind-mount can have a separate idmapping. This has the obvious advantage
> > > that idmapped mounts can be created inside of the initial user
> > > namespace, i.e. on the host itself instead of requiring the caller to be
> > > located inside of a user namespace. This enables such use-cases as e.g.
> > > making a usb stick available in multiple locations with different
> > > idmappings (see the vfat port that is part of this patch series).
> > >
> > > The vfsmount struct gains a new struct user_namespace member. The
> > > idmapping of the user namespace becomes the idmapping of the mount. A
> > > caller that is either privileged with respect to the user namespace of
> > > the superblock of the underlying filesystem or a caller that is
> > > privileged with respect to the user namespace a mount has been idmapped
> > > with can create a new bind-mount and mark it with a user namespace.
> >
> > So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside.
> >
> > For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege.
> >
> > Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this.  Inside the namespace, the user creates a suid-root file.
> >
> > Now, outside the namespace, the user has privilege over the namespace.  (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.) So the user makes a new bind mount and idmaps it to the init namespace. Game over.
> >
> > So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace.  We already do the latter for the vfsmnt’s mntns’s userns.
>
> With this series, in order to create an idmapped mount the user must
> either be cap_sys_admin in the superblock of the underlying filesystem
> or if the mount is already idmapped and they want to create another
> idmapped mount from it they must have cap_sys_admin in the userns that
> the mount is currently marked with. It is also not possible to change
> an idmapped mount once it has been idmapped, i.e. the user must create a
> new detached bind-mount first.

I think my attack might not work, but I also think I didn't explain it
very well.  Let me try again.  I'll also try to lay out what I
understand the rules of idmaps to be so that you can correct me when
I'm inevitably wrong :)

First, background: there are a bunch of user namespaces around.  Every
superblock has one, every idmapped mount has one, and every vfsmnt
also (indirectly) has one: mnt->mnt_ns->user_ns.  So, if you're
looking at a given vfsmnt, you have three user namespaces that are
relevant, in addition to whatever namespaces are active for the task
(or kernel thread) accessing that mount.  I'm wondering whether
mnt_user_ns() should possibly have a name that makes it clear that it
refers to the idmap namespace and not mnt->mnt_ns->user_ns.

So here's the attack.  An attacker with uid=1000 creates a userns N
(so the attacker owns the ns and 1000 outside maps to 0 inside).  N is
a child of init_user_ns.  Now the attacker creates a mount namespace M
inside the userns and, potentially with the help of a container
management tool, creates an idmapped filesystem mount F inside M.  So,
playing fast and loose with my ampersands:

F->mnt_ns == M
F->mnt_ns->user_ns == N
mnt_user_ns(F) == N

I expect that this wouldn't be a particularly uncommon setup.  Now the
user has the ability to create files with inode->uid == 0 and the SUID
bit set on their filesystem.  This isn't terribly different from FUSE,
except that the mount won't have nosuid set, whereas at least many
uses of unprivileged FUSE would have nosuid set.  So that's the thing that
makes me a little bit nervous.  But it actually seems likely that I
was wrong and this is okay.  Specifically, to exploit this using
kernel mechanisms, one would need to pass a mnt_may_suid() check,
which means that one would need to acquire a mount of F in one's
current mount namespace, and one would need one's current user
namespace to be init_user_ns (or something else sensitive).  But you
already need to own the namespace to create mounts, unless you have a
way to confuse some existing user tooling.  You would also need to be
in F's superblock's user_ns (second line of mnt_may_suid()), which
totally kills this type of attack if F's superblock is in the
container's user_ns, but I wouldn't count on that.
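
To make the three conditions above concrete, here is a minimal userspace
model of the check.  It is only a sketch: every model_* name is a stand-in
invented for illustration, the namespace hierarchy is reduced to a parent
pointer, and the idmap namespace of the mount is deliberately not consulted,
matching the reasoning in the paragraph above.

```c
#include <assert.h>   /* only needed by callers that self-check the model */
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: every model_* name is a stand-in, not a kernel symbol. */
#define MODEL_MNT_NOSUID 0x01

struct model_userns { const struct model_userns *parent; }; /* NULL parent = initial userns */
struct model_mnt_ns { const struct model_userns *user_ns; };
struct model_sb     { const struct model_userns *s_user_ns; };

struct model_mount {
	int flags;
	const struct model_mnt_ns *mnt_ns;  /* mount namespace the mount lives in */
	const struct model_sb *sb;          /* superblock of the underlying fs */
};

struct model_task {
	const struct model_userns *user_ns; /* caller's current user namespace */
	const struct model_mnt_ns *mnt_ns;  /* caller's current mount namespace */
};

/* True if @ns is @target itself or one of its descendants. */
static bool model_in_userns(const struct model_userns *target,
			    const struct model_userns *ns)
{
	for (; ns; ns = ns->parent)
		if (ns == target)
			return true;
	return false;
}

/*
 * suid is honoured only if the mount is not nosuid, the mount belongs to
 * the caller's mount namespace, and the caller is inside the user
 * namespace of the superblock (or a descendant of it).
 */
static bool model_mnt_may_suid(const struct model_task *task,
			       const struct model_mount *mnt)
{
	return !(mnt->flags & MODEL_MNT_NOSUID) &&
	       mnt->mnt_ns == task->mnt_ns &&
	       model_in_userns(mnt->sb->s_user_ns, task->user_ns);
}
```

In this model a task sitting in the initial user and mount namespaces fails
the check for F because F lives in M, and it also fails for any mount whose
superblock belongs to N, since the initial namespace is not a descendant
of N.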

So maybe this is all fine.  I'll continue to try to poke holes in it,
but perhaps there aren't any holes to poke.  I'll also continue to try
to see if I can state the security properties of idmap in a way that
is clear and obviously has nice properties.

Why are you allowing the creation of a new idmapped mount if you have
cap_sys_admin over an existing idmap userns but not over the
superblock's userns?  I assume this is for a nested container use
case, but can you spell out a specific example usage?
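
For reference while discussing this, the creation rule as I read it from the
quoted reply can be modelled in userspace roughly as below.  This is a sketch
under assumptions: the model_* names are made up, capabilities are reduced to
"the namespace in which the task holds CAP_SYS_ADMIN" (which also covers that
namespace's descendants), and the real series may well check more than this.

```c
#include <assert.h>   /* only needed by callers that self-check the model */
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: every model_* name is hypothetical. */
struct model_userns { const struct model_userns *parent; };

struct model_task {
	/* namespace in which the task holds CAP_SYS_ADMIN, NULL if none */
	const struct model_userns *admin_ns;
};

struct model_mount {
	const struct model_userns *sb_user_ns; /* userns of the superblock */
	const struct model_userns *idmap_ns;   /* NULL = mount is not idmapped */
};

/* Simplified capability check: admin_ns covers itself and its descendants. */
static bool model_ns_capable(const struct model_task *task,
			     const struct model_userns *target)
{
	if (!task->admin_ns)
		return false;
	for (; target; target = target->parent)
		if (target == task->admin_ns)
			return true;
	return false;
}

/* May @task create a new idmapped mount from @src? */
static bool model_may_create_idmapped(const struct model_task *task,
				      const struct model_mount *src)
{
	/* Case 1: privileged over the superblock's user namespace. */
	if (model_ns_capable(task, src->sb_user_ns))
		return true;
	/* Case 2: source is already idmapped and task is privileged over that userns. */
	return src->idmap_ns && model_ns_capable(task, src->idmap_ns);
}

/* An idmapped mount cannot be re-idmapped; returns 0 on success, -1 otherwise. */
static int model_set_idmap(struct model_mount *mnt,
			   const struct model_userns *ns)
{
	if (mnt->idmap_ns)
		return -1;
	mnt->idmap_ns = ns;
	return 0;
}
```

Case 2 is the one my question is about: a task that is root only in N is
refused for a plain mount of a host-owned superblock, but is allowed once that
mount has been idmapped with N.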

--Andy

* Re: [PATCH 01/34] namespace: take lock_mount_hash() directly when changing flags
  2020-10-29  0:32 ` [PATCH 01/34] namespace: take lock_mount_hash() directly when changing flags Christian Brauner
@ 2020-11-01 14:41   ` Christoph Hellwig
  2020-11-02 13:33     ` Christian Brauner
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2020-11-01 14:41 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Christoph Hellwig, linux-fsdevel, John Johansen,
	James Morris, Mimi Zohar, Dmitry Kasatkin, Stephen Smalley,
	Casey Schaufler, Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi,
	Geoffrey Thomas, Mrunal Patel, Josh Triplett, Andy Lutomirski,
	Amir Goldstein, Miklos Szeredi, Theodore Tso, Alban Crequy,
	Tycho Andersen, David Howells, James Bottomley, Jann Horn,
	Seth Forshee, Stéphane Graber, Aleksa Sarai, Lennart Poettering,
	Eric W. Biederman, smbarber, Phil Estes, Serge Hallyn, Kees Cook,
	Todd Kjos, Jonathan Corbet, containers, linux-security-module,
	linux-api, linux-ext4, linux-unionfs, linux-audit,
	linux-integrity, selinux

> index cebaa3e81794..20ee291a7af4 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -463,7 +463,6 @@ static int mnt_make_readonly(struct mount *mnt)
>  {
>  	int ret = 0;
>  
> -	lock_mount_hash();

What about adding a lockdep_assert_lock_held in all the functions
that used to take the lock to document the assumptions?
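
As a rough illustration of the pattern (a userspace analogue, not kernel
code): the helper no longer takes the lock itself, it asserts that its caller
already holds it.  In the kernel the assertion would be spelled with the
lockdep helpers such as lockdep_assert_held(); the model_* names and the
boolean "held" flag below are assumptions made so the sketch is
self-contained.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace analogue only: model_* names are stand-ins, not kernel symbols. */
#define MODEL_MNT_READONLY 0x40

struct model_lock  { bool held; };
struct model_mount { int mnt_flags; };

static struct model_lock model_mount_lock;

static void model_lock_mount_hash(void)
{
	assert(!model_mount_lock.held);  /* no recursive locking in this model */
	model_mount_lock.held = true;
}

static void model_unlock_mount_hash(void)
{
	assert(model_mount_lock.held);
	model_mount_lock.held = false;
}

/* Caller must hold the mount lock; the assert documents that assumption. */
static void model_unmake_readonly(struct model_mount *mnt)
{
	assert(model_mount_lock.held);
	mnt->mnt_flags &= ~MODEL_MNT_READONLY;
}
```

A caller then brackets one or more such helpers with a single lock/unlock
pair, which is the point of moving the locking out of them.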

>  static int __mnt_unmake_readonly(struct mount *mnt)
>  {
> -	lock_mount_hash();
>  	mnt->mnt.mnt_flags &= ~MNT_READONLY;
> -	unlock_mount_hash();
>  	return 0;

This helper is rather pointless now.

>  static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
>  {
> -	lock_mount_hash();
>  	mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
>  	mnt->mnt.mnt_flags = mnt_flags;
>  	touch_mnt_namespace(mnt->mnt_ns);
> -	unlock_mount_hash();

In linux-next there is an additional notify_mount after the unlock here.

Also while you touch this lock_mount_hash/unlock_mount_hash could be
moved to namespace.c and made static now.

* Re: [PATCH 03/34] fs: add mount_setattr()
  2020-10-29  0:32 ` [PATCH 03/34] fs: add mount_setattr() Christian Brauner
@ 2020-11-01 14:42   ` Christoph Hellwig
  2020-11-02 13:34     ` Christian Brauner
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2020-11-01 14:42 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Christoph Hellwig, linux-fsdevel, John Johansen,
	James Morris, Mimi Zohar, Dmitry Kasatkin, Stephen Smalley,
	Casey Schaufler, Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi,
	Geoffrey Thomas, Mrunal Patel, Josh Triplett, Andy Lutomirski,
	Amir Goldstein, Miklos Szeredi, Theodore Tso, Alban Crequy,
	Tycho Andersen, David Howells, James Bottomley, Jann Horn,
	Seth Forshee, Stéphane Graber, Aleksa Sarai, Lennart Poettering,
	Eric W. Biederman, smbarber, Phil Estes, Serge Hallyn, Kees Cook,
	Todd Kjos, Jonathan Corbet, containers, linux-security-module,
	linux-api, linux-ext4, linux-unionfs, linux-audit,
	linux-integrity, selinux

This has a bunch of crazy long lines.

Also some of the refactoring might be worth splitting into prep patches.

* Re: [PATCH 05/34] fs: introduce MOUNT_ATTR_IDMAP
  2020-10-29  0:32 ` [PATCH 05/34] fs: introduce MOUNT_ATTR_IDMAP Christian Brauner
@ 2020-11-01 14:45   ` Christoph Hellwig
  2020-11-02 13:29     ` Christian Brauner
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2020-11-01 14:45 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Christoph Hellwig, linux-fsdevel, John Johansen,
	James Morris, Mimi Zohar, Dmitry Kasatkin, Stephen Smalley,
	Casey Schaufler, Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi,
	Geoffrey Thomas, Mrunal Patel, Josh Triplett, Andy Lutomirski,
	Amir Goldstein, Miklos Szeredi, Theodore Tso, Alban Crequy,
	Tycho Andersen, David Howells, James Bottomley, Jann Horn,
	Seth Forshee, Stéphane Graber, Aleksa Sarai, Lennart Poettering,
	Eric W. Biederman, smbarber, Phil Estes, Serge Hallyn, Kees Cook,
	Todd Kjos, Jonathan Corbet, containers, linux-security-module,
	linux-api, linux-ext4, linux-unionfs, linux-audit,
	linux-integrity, selinux

On Thu, Oct 29, 2020 at 01:32:23AM +0100, Christian Brauner wrote:
> Introduce a new bind mount property to allow idmapping mounts. The
> MOUNT_ATTR_IDMAP flag can be set via the new mount_setattr() syscall
> together with a file descriptor referring to a user namespace.

Shouldn't this go to the end of the series once all the infrastructure
is in place?

> +config IDMAP_MOUNTS
> +	bool "Support id mappings per mount"
> +	default n

n is the default default.

But why do we need a config option here anyway?

> +#ifdef CONFIG_IDMAP_MOUNTS
> +		if (kattr->attr_set & MNT_IDMAPPED) {
> +			struct user_namespace *user_ns;
> +			struct vfsmount *vmnt;

All the code here looks like it should go into a helper.

> +				struct user_namespace *user_ns = READ_ONCE(m->mnt.mnt_user_ns);
> +				WRITE_ONCE(m->mnt.mnt_user_ns, get_user_ns(kattr->userns));

More unreadable long lines.

> +	if (attr->attr_set & MOUNT_ATTR_IDMAP) {
> +		struct ns_common *ns;
> +		struct user_namespace *user_ns;
> +		struct file *file;
> +
> +		file = fget(attr->userns);

The code here looks like another candidate for a self contained helper.

> +
> +static inline struct user_namespace *mnt_user_ns(const struct vfsmount *mnt)
> +{
> +#ifdef CONFIG_IDMAP_MOUNTS
> +	return READ_ONCE(mnt->mnt_user_ns);
> +#else
> +	return &init_user_ns;
> +#endif

How is the READ_ONCE on a pointer going to work?

* Re: [PATCH 06/34] fs: add id translation helpers
  2020-10-29  0:32 ` [PATCH 06/34] fs: add id translation helpers Christian Brauner
@ 2020-11-01 14:46   ` Christoph Hellwig
  2020-11-02 13:25     ` Christian Brauner
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2020-11-01 14:46 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Christoph Hellwig, linux-fsdevel, John Johansen,
	James Morris, Mimi Zohar, Dmitry Kasatkin, Stephen Smalley,
	Casey Schaufler, Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi,
	Geoffrey Thomas, Mrunal Patel, Josh Triplett, Andy Lutomirski,
	Amir Goldstein, Miklos Szeredi, Theodore Tso, Alban Crequy,
	Tycho Andersen, David Howells, James Bottomley, Jann Horn,
	Seth Forshee, Stéphane Graber, Aleksa Sarai, Lennart Poettering,
	Eric W. Biederman, smbarber, Phil Estes, Serge Hallyn, Kees Cook,
	Todd Kjos, Jonathan Corbet, containers, linux-security-module,
	linux-api, linux-ext4, linux-unionfs, linux-audit,
	linux-integrity, selinux

> +static inline kuid_t kuid_into_mnt(struct user_namespace *to, kuid_t kuid)
> +{
> +#ifdef CONFIG_IDMAP_MOUNTS
> +	return make_kuid(to, __kuid_val(kuid));
> +#else
> +	return kuid;
> +#endif
> +}
> +
> +static inline kgid_t kgid_into_mnt(struct user_namespace *to, kgid_t kgid)
> +{
> +#ifdef CONFIG_IDMAP_MOUNTS
> +	return make_kgid(to, __kgid_val(kgid));
> +#else
> +	return kgid;
> +#endif

If you want to keep the config option please at least have one
#ifdef/#else/#endif instead of this mess.
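
Something along these lines, shown as a userspace sketch so it compiles on
its own.  The types, the single-extent mapping and model_map_id_down() are
simplified stand-ins (assumptions), and CONFIG_IDMAP_MOUNTS is defined by
hand just for the demo; only the shape of the one #ifdef/#else/#endif around
both helpers is the point.

```c
#include <assert.h>   /* only needed by callers that self-check the model */
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel types (assumptions). */
typedef struct { uint32_t val; } kuid_t;
typedef struct { uint32_t val; } kgid_t;

/* One mapping extent: ids [first, first + count) map to lower_first + offset. */
struct user_namespace {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

#define MODEL_INVALID_ID ((uint32_t)-1)

/* Stand-in for the lookup that make_kuid()/make_kgid() perform. */
static uint32_t model_map_id_down(const struct user_namespace *ns, uint32_t id)
{
	if (id >= ns->first && id - ns->first < ns->count)
		return ns->lower_first + (id - ns->first);
	return MODEL_INVALID_ID;
}

#define CONFIG_IDMAP_MOUNTS 1   /* set by hand for this demo */

#ifdef CONFIG_IDMAP_MOUNTS
static inline kuid_t kuid_into_mnt(struct user_namespace *to, kuid_t kuid)
{
	return (kuid_t){ model_map_id_down(to, kuid.val) };
}

static inline kgid_t kgid_into_mnt(struct user_namespace *to, kgid_t kgid)
{
	return (kgid_t){ model_map_id_down(to, kgid.val) };
}
#else
/* Without idmapped mounts both helpers are the identity. */
static inline kuid_t kuid_into_mnt(struct user_namespace *to, kuid_t kuid)
{
	(void)to;
	return kuid;
}

static inline kgid_t kgid_into_mnt(struct user_namespace *to, kgid_t kgid)
{
	(void)to;
	return kgid;
}
#endif
```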

* Re: [PATCH 07/34] capability: handle idmapped mounts
  2020-10-29  0:32 ` [PATCH 07/34] capability: handle idmapped mounts Christian Brauner
@ 2020-11-01 14:48   ` Christoph Hellwig
  2020-11-02 13:23     ` Christian Brauner
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2020-11-01 14:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Christoph Hellwig, linux-fsdevel, John Johansen,
	James Morris, Mimi Zohar, Dmitry Kasatkin, Stephen Smalley,
	Casey Schaufler, Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi,
	Geoffrey Thomas, Mrunal Patel, Josh Triplett, Andy Lutomirski,
	Amir Goldstein, Miklos Szeredi, Theodore Tso, Alban Crequy,
	Tycho Andersen, David Howells, James Bottomley, Jann Horn,
	Seth Forshee, Stéphane Graber, Aleksa Sarai, Lennart Poettering,
	Eric W. Biederman, smbarber, Phil Estes, Serge Hallyn, Kees Cook,
	Todd Kjos, Jonathan Corbet, containers, linux-security-module,
	linux-api, linux-ext4, linux-unionfs, linux-audit,
	linux-integrity, selinux

>  /**
>   * capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped
>   * @inode: The inode in question
> @@ -501,9 +513,7 @@ bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *
>   */
>  bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
>  {
> +	return capable_wrt_mapped_inode_uidgid(&init_user_ns, inode, cap);
>  }
>  EXPORT_SYMBOL(capable_wrt_inode_uidgid);

Please avoid these silly wrappers and just switch all callers to pass
the namespaces instead of creating boilerplate code.  Same for the other
functions where you do this and, even worse, the method calls.

* Re: [PATCH 07/34] capability: handle idmapped mounts
  2020-11-01 14:48   ` Christoph Hellwig
@ 2020-11-02 13:23     ` Christian Brauner
  0 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-11-02 13:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alexander Viro, linux-fsdevel, John Johansen, James Morris,
	Mimi Zohar, Dmitry Kasatkin, Stephen Smalley, Casey Schaufler,
	Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi, Geoffrey Thomas,
	Mrunal Patel, Josh Triplett, Andy Lutomirski, Amir Goldstein,
	Miklos Szeredi, Theodore Tso, Alban Crequy, Tycho Andersen,
	David Howells, James Bottomley, Jann Horn, Seth Forshee,
	Stéphane Graber, Aleksa Sarai, Lennart Poettering,
	Eric W. Biederman, smbarber, Phil Estes, Serge Hallyn, Kees Cook,
	Todd Kjos, Jonathan Corbet, containers, linux-security-module,
	linux-api, linux-ext4, linux-unionfs, linux-audit,
	linux-integrity, selinux

On Sun, Nov 01, 2020 at 02:48:09PM +0000, Christoph Hellwig wrote:
> >  /**
> >   * capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped
> >   * @inode: The inode in question
> > @@ -501,9 +513,7 @@ bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *
> >   */
> >  bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
> >  {
> > +	return capable_wrt_mapped_inode_uidgid(&init_user_ns, inode, cap);
> >  }
> >  EXPORT_SYMBOL(capable_wrt_inode_uidgid);
> 
> Please avoid these silly wrappers and just switch all callers to pass
> the namespaces instead of creating boilerplate code.  Same for the other
> functions where you do this and, even worse, the method calls.

Christoph,

Thanks for the review!  

Ok, so I'll switch:
- all helpers to take an additional argument
  (capable_wrt_inode_uidgid()/inode_permission()/vfs_*() etc.)
- all inode method calls to take an additional argument (I assume that's
  what you're referring to: ->create()/->mknod()/->mkdir() etc.)
  I've always assumed that this is what we'd be doing in the end anyway
  (I've mentioned it in the commit message for the inode_operations
  methods). This will be a bit of work but we can get that done!

* Re: [PATCH 06/34] fs: add id translation helpers
  2020-11-01 14:46   ` Christoph Hellwig
@ 2020-11-02 13:25     ` Christian Brauner
  0 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-11-02 13:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alexander Viro, linux-fsdevel, John Johansen, James Morris,
	Mimi Zohar, Dmitry Kasatkin, Stephen Smalley, Casey Schaufler,
	Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi, Geoffrey Thomas,
	Mrunal Patel, Josh Triplett, Andy Lutomirski, Amir Goldstein,
	Miklos Szeredi, Theodore Tso, Alban Crequy, Tycho Andersen,
	David Howells, James Bottomley, Jann Horn, Seth Forshee,
	Stéphane Graber, Aleksa Sarai, Lennart Poettering,
	Eric W. Biederman, smbarber, Phil Estes, Serge Hallyn, Kees Cook,
	Todd Kjos, Jonathan Corbet, containers, linux-security-module,
	linux-api, linux-ext4, linux-unionfs, linux-audit,
	linux-integrity, selinux

On Sun, Nov 01, 2020 at 02:46:32PM +0000, Christoph Hellwig wrote:
> > +static inline kuid_t kuid_into_mnt(struct user_namespace *to, kuid_t kuid)
> > +{
> > +#ifdef CONFIG_IDMAP_MOUNTS
> > +	return make_kuid(to, __kuid_val(kuid));
> > +#else
> > +	return kuid;
> > +#endif
> > +}
> > +
> > +static inline kgid_t kgid_into_mnt(struct user_namespace *to, kgid_t kgid)
> > +{
> > +#ifdef CONFIG_IDMAP_MOUNTS
> > +	return make_kgid(to, __kgid_val(kgid));
> > +#else
> > +	return kgid;
> > +#endif
> 
> If you want to keep the config option please at least have one
> #ifdef/#else/#endif instead of this mess.

Understood.

* Re: [PATCH 05/34] fs: introduce MOUNT_ATTR_IDMAP
  2020-11-01 14:45   ` Christoph Hellwig
@ 2020-11-02 13:29     ` Christian Brauner
  0 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-11-02 13:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alexander Viro, linux-fsdevel, John Johansen, James Morris,
	Mimi Zohar, Dmitry Kasatkin, Stephen Smalley, Casey Schaufler,
	Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi, Geoffrey Thomas,
	Mrunal Patel, Josh Triplett, Andy Lutomirski, Amir Goldstein,
	Miklos Szeredi, Theodore Tso, Alban Crequy, Tycho Andersen,
	David Howells, James Bottomley, Jann Horn, Seth Forshee,
	Stéphane Graber, Aleksa Sarai, Lennart Poettering,
	Eric W. Biederman, smbarber, Phil Estes, Serge Hallyn, Kees Cook,
	Todd Kjos, Jonathan Corbet, containers, linux-security-module,
	linux-api, linux-ext4, linux-unionfs, linux-audit,
	linux-integrity, selinux

On Sun, Nov 01, 2020 at 02:45:44PM +0000, Christoph Hellwig wrote:
> On Thu, Oct 29, 2020 at 01:32:23AM +0100, Christian Brauner wrote:
> > Introduce a new bind mount property to allow idmapping mounts. The
> > MOUNT_ATTR_IDMAP flag can be set via the new mount_setattr() syscall
> > together with a file descriptor referring to a user namespace.
> 
> Shouldn't this go to the end of the series once all the infrastructure
> is in place?

Yeah, good idea. (I mostly did it to keep compile-times short when
rebasing.)

> 
> > +config IDMAP_MOUNTS
> > +	bool "Support id mappings per mount"
> > +	default n
> 
> n is the default default.

Ah, thanks.

> 
> But why do we need a config option here anyway?

My main concern was people complaining about code they want to compile
out. I've been burnt by that before but I'm happy to remove the config
option and make this unconditional.

> 
> > +#ifdef CONFIG_IDMAP_MOUNTS
> > +		if (kattr->attr_set & MNT_IDMAPPED) {
> > +			struct user_namespace *user_ns;
> > +			struct vfsmount *vmnt;
> 
> All the code here looks like it should go into a helper.

Will do.

> 
> > +				struct user_namespace *user_ns = READ_ONCE(m->mnt.mnt_user_ns);
> > +				WRITE_ONCE(m->mnt.mnt_user_ns, get_user_ns(kattr->userns));
> 
> More unreadable long lines.

Will wrap. I have somewhat adapted to the more lax 100 limit but I'm
happy to stick to 80.

> 
> > +	if (attr->attr_set & MOUNT_ATTR_IDMAP) {
> > +		struct ns_common *ns;
> > +		struct user_namespace *user_ns;
> > +		struct file *file;
> > +
> > +		file = fget(attr->userns);
> 
> The code here looks like another candidate for a self contained helper.

Noted.

> 
> > +
> > +static inline struct user_namespace *mnt_user_ns(const struct vfsmount *mnt)
> > +{
> > +#ifdef CONFIG_IDMAP_MOUNTS
> > +	return READ_ONCE(mnt->mnt_user_ns);
> > +#else
> > +	return &init_user_ns;
> > +#endif
> 
> How is the READ_ONCE on a pointer going to work?

Honestly, this is me following a pattern I've seen in other places and
it's mostly about visually indicating concurrency but I'll drop it.

Christian

* Re: [PATCH 01/34] namespace: take lock_mount_hash() directly when changing flags
  2020-11-01 14:41   ` Christoph Hellwig
@ 2020-11-02 13:33     ` Christian Brauner
  0 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-11-02 13:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alexander Viro, linux-fsdevel, John Johansen, James Morris,
	Mimi Zohar, Dmitry Kasatkin, Stephen Smalley, Casey Schaufler,
	Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi, Geoffrey Thomas,
	Mrunal Patel, Josh Triplett, Andy Lutomirski, Amir Goldstein,
	Miklos Szeredi, Theodore Tso, Alban Crequy, Tycho Andersen,
	David Howells, James Bottomley, Jann Horn, Seth Forshee,
	Stéphane Graber, Aleksa Sarai, Lennart Poettering,
	Eric W. Biederman, smbarber, Phil Estes, Serge Hallyn, Kees Cook,
	Todd Kjos, Jonathan Corbet, containers, linux-security-module,
	linux-api, linux-ext4, linux-unionfs, linux-audit,
	linux-integrity, selinux

On Sun, Nov 01, 2020 at 02:41:08PM +0000, Christoph Hellwig wrote:
> > index cebaa3e81794..20ee291a7af4 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -463,7 +463,6 @@ static int mnt_make_readonly(struct mount *mnt)
> >  {
> >  	int ret = 0;
> >  
> > -	lock_mount_hash();
> 
> What about adding a lockdep_assert_lock_held in all the functions
> that used to take the lock to document the assumptions?

Good idea and will do. I wanted to do this but then didn't because I
haven't seen widespread use of lockdep assert in fs/namespace.c.

> 
> >  static int __mnt_unmake_readonly(struct mount *mnt)
> >  {
> > -	lock_mount_hash();
> >  	mnt->mnt.mnt_flags &= ~MNT_READONLY;
> > -	unlock_mount_hash();
> >  	return 0;
> 
> This helper is rather pointless now.

Ok, will remove.

> 
> >  static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
> >  {
> > -	lock_mount_hash();
> >  	mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
> >  	mnt->mnt.mnt_flags = mnt_flags;
> >  	touch_mnt_namespace(mnt->mnt_ns);
> > -	unlock_mount_hash();
> 
> In linux-next there is an additional notify_mount after the unlock here.

Thanks! I can try rebasing on -next.

> 
> Also while you touch this lock_mount_hash/unlock_mount_hash could be
> moved to namespace.c and made static now.

Ok, will try to do that.

* Re: [PATCH 03/34] fs: add mount_setattr()
  2020-11-01 14:42   ` Christoph Hellwig
@ 2020-11-02 13:34     ` Christian Brauner
  0 siblings, 0 replies; 68+ messages in thread
From: Christian Brauner @ 2020-11-02 13:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alexander Viro, linux-fsdevel, John Johansen, James Morris,
	Mimi Zohar, Dmitry Kasatkin, Stephen Smalley, Casey Schaufler,
	Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi, Geoffrey Thomas,
	Mrunal Patel, Josh Triplett, Andy Lutomirski, Amir Goldstein,
	Miklos Szeredi, Theodore Tso, Alban Crequy, Tycho Andersen,
	David Howells, James Bottomley, Jann Horn, Seth Forshee,
	Stéphane Graber, Aleksa Sarai, Lennart Poettering,
	Eric W. Biederman, smbarber, Phil Estes, Serge Hallyn, Kees Cook,
	Todd Kjos, Jonathan Corbet, containers, linux-security-module,
	linux-api, linux-ext4, linux-unionfs, linux-audit,
	linux-integrity, selinux

On Sun, Nov 01, 2020 at 02:42:13PM +0000, Christoph Hellwig wrote:
> This has a bunch of crazy long lines.

Ok, will stick to 80 columns instead of the accepted but more lax 100.

> 
> Also some of the refactoring might be worth splitting into prep patches.

Ok, will try to do that.

* Re: [PATCH 00/34] fs: idmapped mounts
  2020-10-29 16:37     ` Eric W. Biederman
  2020-10-30  2:18       ` Serge E. Hallyn
  2020-10-30 15:07       ` Seth Forshee
@ 2020-11-03 14:10       ` Alban Crequy
  2 siblings, 0 replies; 68+ messages in thread
From: Alban Crequy @ 2020-11-03 14:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aleksa Sarai, Christian Brauner, Alexander Viro,
	Christoph Hellwig, linux-fsdevel, John Johansen, James Morris,
	Mimi Zohar, Dmitry Kasatkin, Stephen Smalley, Casey Schaufler,
	Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi, Geoffrey Thomas,
	Mrunal Patel, Josh Triplett, Andy Lutomirski, Amir Goldstein,
	Miklos Szeredi, Theodore Tso, Tycho Andersen, David Howells,
	James Bottomley, Jann Horn, Seth Forshee, Stéphane Graber,
	Lennart Poettering, smbarber, Phil Estes, Serge Hallyn,
	Kees Cook, Todd Kjos, Jonathan Corbet, Linux Containers, LSM,
	linux-api, linux-ext4, linux-unionfs, linux-audit,
	linux-integrity, selinux

On Thu, Oct 29, 2020 at 5:37 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Aleksa Sarai <cyphar@cyphar.com> writes:
>
> > On 2020-10-29, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> Christian Brauner <christian.brauner@ubuntu.com> writes:
> >>
> >> > Hey everyone,
> >> >
> >> > I vanished for a little while to focus on this work here so sorry for
> >> > not being available by mail for a while.
> >> >
> >> > Since quite a long time we have issues with sharing mounts between
> >> > multiple unprivileged containers with different id mappings, sharing a
> >> > rootfs between multiple containers with different id mappings, and also
> >> > sharing regular directories and filesystems between users with different
> >> > uids and gids. The latter use-cases have become even more important with
> >> > the availability and adoption of systemd-homed (cf. [1]) to implement
> >> > portable home directories.
> >>
> >> Can you walk us through the motivating use case?
> >>
> >> As of this year's LPC I had the distinct impression that the primary use
> >> case for such a feature was due to the RLIMIT_NPROC problem where two
> >> containers with the same users still wanted different uid mappings to
> >> the disk because the users were conflicting with each other because of
> >> the per user rlimits.
> >>
> >> Fixing rlimits is straightforward to implement, and easier to manage
> >> for implementations and administrators.
> >
> > This is separate to the question of "isolated user namespaces" and
> > managing different mappings between containers. This patchset is solving
> > the same problem that shiftfs solved -- sharing a single directory tree
> > between containers that have different ID mappings. Neither rlimits nor
> > any of the other proposals we discussed at LPC will help with this problem.
>
> First and foremost: A uid shift on write to a filesystem is a security
> bug waiting to happen.  This is especially true in the context of
> facilities like iouring, that play very aggressive games with how process
> context makes it to system calls.
>
> The only reason containers were not immediately exploitable when iouring
> was introduced is because the mechanisms are built so that even if
> something escapes containment the security properties still apply.
> Changing the uid when writing to the filesystem does not have that
> property.  The tiniest slip in containment will be a security issue.
>
> This is not even the least bit theoretical.  I have seen reports of how
> shiftfs+overlayfs created a situation where anyone could read
> /etc/shadow.
>
> If you are going to write using the same uid to disk from different
> containers the question becomes why can't those containers configure
> those users to use the same kuid?
>
> What fixing rlimits does is it fixes one of the reasons that different
> containers could not share the same kuid for users that want to write to
> disk with the same uid.
>
>
> I humbly suggest that it will be more secure, and easier to maintain for
> both developers and users if we fix the reasons people want different
> containers to have the same user running with different kuids.
>
> If not what are the reasons we fundamentally need the same on-disk user
> using multiple kuids in the kernel?

I would like to use this patch set in the context of Kubernetes. I
described my two possible setups in
https://www.spinics.net/lists/linux-containers/msg36537.html:

1. Each Kubernetes pod has its own userns but with the same user id mapping
2. Each Kubernetes pod has its own userns with non-overlapping user id
mapping (providing additional isolation between pods)

But even in the setup where all pods run with the same id mappings,
this patch set is still useful to me for 2 reasons:

1. To avoid the expensive recursive chown of the rootfs. We cannot
necessarily extract the tarball directly with the right uids because
we might use the same container image for privileged containers (with
the host userns) and unprivileged containers (with a new userns), so
we have at least 2 “mappings” (taking more time and resulting in more
storage space). Although the “metacopy” mount option in overlayfs
helps to make the recursive chown faster, it can still take time with
large container images with lots of files. I’d like to use this patch
set to set up the root fs in constant time.

2. To manage large external volumes (NFS or other filesystems). Even
if admins can decide to use the same kuid on all the nodes of the
Kubernetes cluster, this is impractical for migration. People can have
existing Kubernetes clusters (currently without using user namespaces)
and large NFS volumes. If they want to switch to a new version of
Kubernetes with the user namespace feature enabled, they would need to
recursively chown all the files on the NFS shares. This could take
time on large filesystems and realistically, we want to support
rolling updates where some nodes use the previous version without user
namespaces and new nodes are progressively migrated to the new userns
with the new id mapping. If both sets of nodes use the same NFS share,
that can’t work.

Alban


Thread overview: 68+ messages
2020-10-29  0:32 [PATCH 00/34] fs: idmapped mounts Christian Brauner
2020-10-29  0:32 ` [PATCH 01/34] namespace: take lock_mount_hash() directly when changing flags Christian Brauner
2020-11-01 14:41   ` Christoph Hellwig
2020-11-02 13:33     ` Christian Brauner
2020-10-29  0:32 ` [PATCH 02/34] namespace: only take read lock in do_reconfigure_mnt() Christian Brauner
2020-10-29  0:32 ` [PATCH 03/34] fs: add mount_setattr() Christian Brauner
2020-11-01 14:42   ` Christoph Hellwig
2020-11-02 13:34     ` Christian Brauner
2020-10-29  0:32 ` [PATCH 04/34] tests: add mount_setattr() selftests Christian Brauner
2020-10-29  0:32 ` [PATCH 05/34] fs: introduce MOUNT_ATTR_IDMAP Christian Brauner
2020-11-01 14:45   ` Christoph Hellwig
2020-11-02 13:29     ` Christian Brauner
2020-10-29  0:32 ` [PATCH 06/34] fs: add id translation helpers Christian Brauner
2020-11-01 14:46   ` Christoph Hellwig
2020-11-02 13:25     ` Christian Brauner
2020-10-29  0:32 ` [PATCH 07/34] capability: handle idmapped mounts Christian Brauner
2020-11-01 14:48   ` Christoph Hellwig
2020-11-02 13:23     ` Christian Brauner
2020-10-29  0:32 ` [PATCH 08/34] namei: add idmapped mount aware permission helpers Christian Brauner
2020-10-29  0:32 ` [PATCH 09/34] inode: add idmapped mount aware init and " Christian Brauner
2020-10-29  0:32 ` [PATCH 10/34] attr: handle idmapped mounts Christian Brauner
2020-10-29  0:32 ` [PATCH 11/34] acl: " Christian Brauner
2020-10-29  0:32 ` [PATCH 12/34] xattr: " Christian Brauner
2020-10-29  0:32 ` [PATCH 13/34] selftests: add idmapped mounts xattr selftest Christian Brauner
2020-10-29  0:32 ` [PATCH 14/34] commoncap: handle idmapped mounts Christian Brauner
2020-10-29  0:32 ` [PATCH 15/34] stat: add mapped_generic_fillattr() Christian Brauner
2020-10-29  0:32 ` [PATCH 16/34] namei: handle idmapped mounts in may_*() helpers Christian Brauner
2020-10-29  0:32 ` [PATCH 17/34] namei: introduce struct renamedata Christian Brauner
2020-10-29  0:32 ` [PATCH 18/34] namei: prepare for idmapped mounts Christian Brauner
2020-10-29  0:32 ` [PATCH 19/34] namei: add lookup helpers with idmapped mounts aware permission checking Christian Brauner
2020-10-29  0:32 ` [PATCH 20/34] open: handle idmapped mounts in do_truncate() Christian Brauner
2020-10-29  0:32 ` [PATCH 21/34] open: handle idmapped mounts Christian Brauner
2020-10-29  0:32 ` [PATCH 22/34] af_unix: " Christian Brauner
2020-10-29  0:32 ` [PATCH 23/34] utimes: " Christian Brauner
2020-10-29  0:32 ` [PATCH 24/34] would_dump: " Christian Brauner
2020-10-29  0:32 ` [PATCH 25/34] exec: " Christian Brauner
2020-10-29  0:32 ` [PATCH 26/34] fs: add helpers for idmap mounts Christian Brauner
2020-10-29  0:32 ` [PATCH 27/34] apparmor: handle idmapped mounts Christian Brauner
2020-10-29  0:32 ` [PATCH 28/34] audit: " Christian Brauner
2020-10-29  0:32 ` [PATCH 29/34] ima: " Christian Brauner
2020-10-29  0:32 ` [PATCH 30/34] ext4: support " Christian Brauner
2020-10-29  0:32 ` [PATCH 31/34] expfs: handle " Christian Brauner
2020-10-29  0:32 ` [PATCH 32/34] overlayfs: handle idmapped lower directories Christian Brauner
2020-10-30 11:10   ` Amir Goldstein
2020-10-30 11:52     ` Christian Brauner
2020-10-29  0:32 ` [PATCH 33/34] overlayfs: handle idmapped merged mounts Christian Brauner
2020-10-30  9:57   ` Amir Goldstein
2020-10-29  0:32 ` [PATCH 34/34] fat: handle idmapped mounts Christian Brauner
2020-10-29  2:27 ` [PATCH 00/34] fs: " Dave Chinner
2020-10-29 16:19   ` Christian Brauner
2020-10-29  7:20 ` Sargun Dhillon
2020-10-29 15:47 ` Eric W. Biederman
2020-10-29 15:51   ` Aleksa Sarai
2020-10-29 16:37     ` Eric W. Biederman
2020-10-30  2:18       ` Serge E. Hallyn
2020-10-30 15:07       ` Seth Forshee
2020-10-30 16:03         ` Serge E. Hallyn
2020-11-03 14:10       ` Alban Crequy
2020-10-29 16:05   ` Lennart Poettering
2020-10-29 16:36     ` Sargun Dhillon
2020-10-29 16:54     ` Eric W. Biederman
     [not found]   ` <20201029161231.GA108315@cisco>
2020-10-29 16:23     ` Serge E. Hallyn
2020-10-29 16:44     ` Eric W. Biederman
2020-10-29 18:04       ` Stéphane Graber
2020-10-29 21:58 ` Andy Lutomirski
2020-10-30 12:01   ` Christian Brauner
2020-10-30 16:17     ` Serge E. Hallyn
2020-10-31 17:43     ` Andy Lutomirski
