linux-kernel.vger.kernel.org archive mirror
* [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
From: Djalal Harouni @ 2016-05-04 14:26 UTC (permalink / raw)
  To: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy
  Cc: Djalal Harouni

This is version 2 of the VFS:userns support portable root filesystems
RFC. Changes since version 1:

* Update documentation and remove some ambiguity about the feature,
  based on Josh Triplett's comments.
* Use a new email address to send the RFC :-)


This RFC explores how to support filesystem operations inside user
namespaces using only the VFS and a per mount namespace solution. This
makes it possible to take advantage of user namespace separation without
introducing any change at the filesystem level. All of this is handled
through the virtual view of mount namespaces.


1) Presentation:
================

The main aim is to support portable root filesystems and allow containers,
virtual machines and other cases to use the same root filesystem.
For security reasons, filesystems can't be mounted inside user
namespaces, and mounting them outside does not solve the problem since
files will show up with the wrong UIDs/GIDs, read and write operations
will fail, and so on.

The current userspace solution is to automatically chown the whole root
filesystem before starting a container, for example:
(host) init_user_ns  1000000:1065536  => (container) user_ns_X1 0:65535
(host) init_user_ns  2000000:2065536  => (container) user_ns_Y1 0:65535
(host) init_user_ns  3000000:3065536  => (container) user_ns_Z1 0:65535
...

Every time chown is run, the files are modified, and so on... This
prevents having portable filesystems that you can drop anywhere
and boot. The extra step of adapting the filesystem to the current
mapping and persisting it makes it impossible to verify the filesystem's
integrity, makes snapshots and migration a bit harder, and probably has
other limitations...

There seem to be multiple ways to make user namespaces combine nicely
with filesystems, but none of them is that easy. Bind mounting and
pinning the user namespace at mount time will not work: bind mounts
share the same super block, hence you may end up working on the wrong
vfsmount context and there is no easy way to get out of that...

Using the user namespace in the super block seems the way to go, and
there are the "Support fuse mounts in user namespaces" [1] patches, which
seem nice but perhaps too complex!? There is also the overlayfs solution,
and finally the VFS layer solution.


We present here a simple VFS solution: everything is packed inside the VFS
and filesystems don't need to know anything (except probably XFS, and
special operations inside union filesystems). Currently it supports ext4,
btrfs and overlayfs. Changes to filesystems are small: just parse the
vfs_shift_uids and vfs_shift_gids options during mount and set the
appropriate flags in the super_block structure.

1) Filesystems don't need the FS_USERNS_MOUNT flag, so there is no user
namespace mounting; they stay secure and nothing changes.

2) The solution is based on the VFS and mount namespaces: we use the user
namespace of the containing mount namespace to check if we should shift
UIDs/GIDs from/to virtual <=> on-disk view.
If a filesystem was mounted with the "vfs_shift_uids" and "vfs_shift_gids"
options, and if it shows up inside a mount namespace that supports VFS
UID/GID shifts, then during each access we remap the UID/GID either
to the virtual or to the on-disk view using simple helper functions to
allow the access. If the mount or the current mount namespace does not
support VFS UID/GID shifts, we fall back to the old behaviour and no shift
is performed.

3) The existing user namespace interface is the one used to do the
translation from virtual to on-disk mapping.

4) inodes will always keep their original values, which reflect the
mapping inside init_user_ns that we consider the on-disk mapping.

 4.1) During access we map to the virtual view, and if
      inode->{i_uid|i_gid} do not have a mapping in the mount namespace
      we construct one for them.

 4.2) For on-disk writes we construct the appropriate kuid/kgid that
      should be stored on disk. If they have a mapping in the mount
      namespace, we use the corresponding uid_t/gid_t values of that
      mapping inside the mount namespace and construct the kuid from
      the pair init_user_ns and uid_t. This covers cases where the
      mapping inside should be the one stored on disk. If they
      don't have a mapping in the mount namespace, we fall back to the
      old behaviour: the global kuid inside init_user_ns is the one
      used to update inode->i_uid.


As an example, if the mapping inside the mount namespace is 0:65535 and
outside it is 1000000:1065536, then 0:65535 is the range that we use to
construct the UID/GID mapping into init_user_ns and use for on-disk
data. These are the persistent values that we want to write to
disk. Therefore, we don't keep track of any UID/GID shift that was applied
before; this gives portability and allows reusing a previous mapping
that was freed for another root filesystem...

If the mapping inside the mount namespace is 1000:65535 and outside
is 2000:65535, then the range used to construct the UID/GID mapping that
updates inode->{i_uid|i_gid} is the one inside the container. We
always use that one to construct the kuid/kgid from uid_t/gid_t and
init_user_ns.

$ cat /proc/self/uid_map 
      1000      2000      65536
$ stat -c '%u:%g' mountpoint/etc/fedora-release 
65534:65534
$ stat -c '%u:%g' mountpoint/home/tixxdz/
1000:1000
$ touch mountpoint/newuser  
touch: cannot touch ‘mountpoint/newuser’: Permission denied
$ stat -c '%u:%g' mountpoint/home/tixxdz/newuser
1000:1000
[ outside of namespaces] $ stat -c '%u:%g' mountpoint/home/tixxdz/newuser
1000:1000
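
The on-disk write rule described above can be modelled in userspace
roughly as follows. This is only a sketch under stated assumptions, not
the kernel code: struct id_extent, global_id_is_mapped() and
id_to_on_disk() are hypothetical names, a single uid_map extent is
assumed, and plain integers stand in for kuid_t.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * One uid_map extent (hypothetical model): IDs [inside, inside + count)
 * in the user namespace map to [outside, outside + count) in init_user_ns.
 */
struct id_extent {
	uint32_t inside;	/* first ID as seen inside the user namespace */
	uint32_t outside;	/* first ID as seen in init_user_ns */
	uint32_t count;		/* length of the range */
};

/* True if the global (init_user_ns) ID falls inside the mapped range. */
static bool global_id_is_mapped(const struct id_extent *e, uint32_t global)
{
	return global >= e->outside && global - e->outside < e->count;
}

/*
 * Model of the write path: given the caller's global ID, return the
 * value that would be stored in inode->i_uid. A mapped ID is stored as
 * its inside value; an unmapped ID is stored unchanged (old behaviour).
 */
static uint32_t id_to_on_disk(const struct id_extent *e, uint32_t global)
{
	if (global_id_is_mapped(e, global))
		return e->inside + (global - e->outside);
	return global;
}
```

With the "1000 2000 65536" map shown above, a process whose global UID
is 2000 would store 1000, which matches the stat output seen from
outside the namespaces.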

Please note that the range here is not hardcoded to 65535; it can be any
value set by the creator of the user namespace. These patches use the
only interface user namespaces provide. 2**16 was used here just to show
how filesystems can be made portable by making the most used UIDs/GIDs
available inside containers.

A simple demo with overlayfs on btrfs mounted with vfs_shift_uids and
vfs_shift_gids. The overlayfs mounts share the same upperdir. We
create two user namespaces, each with its own mapping, where
container-uid-2000000 automatically picks up changes from the
container-uid-1000000 upperdir.

[tixxdz@fedora-kvm btrfs_root]$ mount | grep btrfs
/dev/mapper/fedora-btrfs_root on /mnt/btrfs_root type btrfs (rw,relatime,seclabel,space_cache,vfs_shift_uids,vfs_shift_gids,subvolid=5,subvol=/)
[tixxdz@fedora-kvm btrfs_root]$ sudo mount -t overlay overlay \
-o,lowerdir=/mnt/btrfs_root/rootfs/fedora-tree,upperdir=/mnt/btrfs_root/container-uid-1000000/upperdir,workdir=/mnt/btrfs_root/container-uid-1000000/workdir \
/mnt/btrfs_root/container-uid-1000000/merged
[tixxdz@fedora-kvm btrfs_root]$ sudo mount -t overlay overlay \
-o,lowerdir=/mnt/btrfs_root/rootfs/fedora-tree,upperdir=/mnt/btrfs_root/container-uid-1000000/upperdir,workdir=/mnt/btrfs_root/container-uid-2000000/workdir \
/mnt/btrfs_root/container-uid-2000000/merged
[tixxdz@fedora-kvm btrfs_root]$ sudo chown -R 1000000.1000000 /mnt/btrfs_root/container-uid-1000000/workdir/work/
[tixxdz@fedora-kvm btrfs_root]$ sudo chown -R 2000000.2000000 /mnt/btrfs_root/container-uid-2000000/workdir/work/

[ Term 1 ]
[tixxdz@fedora-kvm container-uid-1000000]$ sudo ~/bin/mountns-uidshift -u 1000000 
bash: /root/.bashrc: Permission denied
bash-4.3# cat /proc/self/uid_map 
         0    1000000      65536
bash-4.3# touch container-uid-1000000/merged/rootfile
bash-4.3# stat -c '%u:%g' container-uid-1000000/merged/rootfile
0:0

[ Term 2 ]
[tixxdz@fedora-kvm btrfs_root]$ sudo ~/bin/mountns-uidshift -u 2000000 
[sudo] password for tixxdz: 
bash: /root/.bashrc: Permission denied
bash-4.3# cat /proc/self/uid_map 
         0    2000000      65536
bash-4.3# stat -c '%u:%g' container-uid-2000000/merged/rootfile
0:0

[ Term 3 ] (outside of all namespaces)
[tixxdz@fedora-kvm btrfs_root]$ stat -c '%u:%g' container-uid-1000000/upperdir/rootfile 
0:0


This means that root in a user namespace or inside a container is able to
write inodes with uid/gid == 0 to disk. This may sound strange and
dangerous and, yes, care must be taken, which is why we have added
the following:

1) Filesystems when mounted must explicitly support "vfs_shift_uids"
and "vfs_shift_gids", we don't require mounting inside user namespaces.

2) Containers or mounts can have their parent directory as 0700, and
even before mounting clean the mount namespace, set the appropriate
propagation flags and so on...

3) To be able to set the CLONE_MNTNS_SHIFT_UIDGID flag on the new mount
namespace, either the caller has to be real root in init_user_ns, or the
parent of the new mount namespace already has that flag set. This allows
nesting, which I discussed briefly with Serge Hallyn, who suggested
that it should be supported. Preventing nesting is doomed to fail. This
way we have security and nesting at the same time. Of course, if you clear
that flag you won't be able to set it again unless you are capable
in init_user_ns.

4) If the mount namespace has the CLONE_MNTNS_SHIFT_UIDGID flag set but
the filesystem was mounted without "vfs_shift_uids" and "vfs_shift_gids",
or does not support these options, then no shifting is performed. Both
conditions have to be met at each access, otherwise we fall back to the
current behaviour.

5) Only the creator of the mount namespace, or one with similar
privileges, is able to change the mapping rules of the user namespace of
that mount namespace. This ensures that only a more privileged process is
able to change the mapping, and at the same time it gives some flexibility
since the rules can be changed. We never persist the virtual
UIDs/GIDs to disk; only the view in init_user_ns is ever stored on
disk.
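
The "both conditions at each access" rule from point 4 boils down to a
pair of bit tests. The sketch below is a userspace illustration only;
the DEMO_* constants and should_shift_uid()/should_shift_gid() are
hypothetical stand-ins for the mount namespace flag and the per-mount
options, and their values are made up for the example.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the flag set on the mount namespace. */
#define DEMO_MNTNS_SHIFT_UIDGID	0x1u
/* Hypothetical stand-ins for the vfs_shift_uids/vfs_shift_gids options. */
#define DEMO_SB_SHIFT_UIDS	0x1u
#define DEMO_SB_SHIFT_GIDS	0x2u

/* UIDs are shifted only if the namespace AND the mount both opt in. */
static bool should_shift_uid(unsigned int mntns_flags, unsigned int sb_flags)
{
	return (mntns_flags & DEMO_MNTNS_SHIFT_UIDGID) &&
	       (sb_flags & DEMO_SB_SHIFT_UIDS);
}

/* Same rule for GIDs, checked against the GID mount option. */
static bool should_shift_gid(unsigned int mntns_flags, unsigned int sb_flags)
{
	return (mntns_flags & DEMO_MNTNS_SHIFT_UIDGID) &&
	       (sb_flags & DEMO_SB_SHIFT_GIDS);
}
```

If either side is missing, both functions return false and the caller
would keep the current, unshifted behaviour.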


To complete this solution there is currently one blocker: we need a way to
control mount namespaces, which requires a new flag, however all flags of
the current clone() syscall are consumed, yes 32 bits, no luck! In this RFC
I didn't include the new clone4() syscall [2], which was already requested
in the past and for which patches already exist. This way this RFC
stays minimal.

The flag we use here is just for demonstration, please see patch 0001
and the program mountns-uidshift.c [3] for that. Future versions
will include the new clone4() syscall.


2) TEST:
========

Apply on top of Linux 4.6-rc6 HEAD 04974df8049fc4240d2275, and use this
program mountns-uidshift.c to test the shifted mount namespaces.
https://raw.githubusercontent.com/OpenDZ/research/master/kernel/mountns-uidshift.c

With current mapping rules init_user_ns:
[1000000:1065536] => new_user_ns: [0:65536]
# cat /proc/self/uid_map 
         0    1000000      65536
# cat /proc/self/gid_map 
         0    1000000      65536
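
For tools that want to check the mapping before running the tests, each
line of /proc/self/uid_map or gid_map holds three numbers: the first ID
inside the namespace, the first ID outside, and the range length. A
minimal parsing sketch, where struct id_extent and parse_id_map_line()
are hypothetical names introduced only for this illustration:

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One parsed line of a uid_map/gid_map file (hypothetical helper type). */
struct id_extent {
	uint32_t inside;	/* first ID inside the user namespace */
	uint32_t outside;	/* first ID outside of it */
	uint32_t count;		/* number of IDs in the range */
};

/*
 * Parse one "inside outside count" line. Leading whitespace is skipped
 * by sscanf(). Returns true only if all three fields were read.
 */
static bool parse_id_map_line(const char *line, struct id_extent *out)
{
	return sscanf(line, "%" SCNu32 " %" SCNu32 " %" SCNu32,
		      &out->inside, &out->outside, &out->count) == 3;
}
```

Feeding it the line shown above yields inside = 0, outside = 1000000 and
count = 65536.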


2.1) ext4:
==========

Setup:
/   		on ext4 without vfs_shift_uids, vfs_shift_gids
/mnt/ext4_root	on ext4 with vfs_shift_uids, vfs_shift_gids
/mnt/ext4_root/rootfs/fedora-tree		(Another fedora rootfs)
/mnt/ext4_root/container-uid-1000000		(container files with uid 1000000)
/mnt/ext4_root/container-uid-1000000/mountpoint	(bind mount of fedora-tree)

$ sudo mount -t ext4 -ovfs_shift_uids,vfs_shift_gids \
/dev/fedora/ext4_root /mnt/ext4_root/
$ mount | grep ext4 -
/dev/mapper/fedora-root on / type ext4 (rw,relatime,seclabel,data=ordered)
/dev/sda1 on /boot type ext4 (rw,relatime,seclabel,data=ordered)
/dev/mapper/fedora-ext4_root on /mnt/ext4_root type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)
$ sudo mkdir /mnt/ext4_root/rootfs/
$ sudo yum -y --releasever=23 --installroot=/mnt/ext4_root/rootfs/fedora-tree \
--disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim
$ sudo mkdir /mnt/ext4_root/container-uid-1000000/
$ sudo mkdir /mnt/ext4_root/container-uid-1000000/mountpoint
$ sudo chown -R 1000000.1000000 /mnt/ext4_root/container-uid-1000000/
$ sudo mount --bind -ovfs_shift_uids,vfs_shift_gids \
/mnt/ext4_root/rootfs/fedora-tree/ mountpoint/
$ mount | grep vfs_shift_uids -
/dev/mapper/fedora-ext4_root on /mnt/ext4_root type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)
/dev/mapper/fedora-ext4_root on /mnt/ext4_root/container-uid-1000000/mountpoint type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)
$ sudo ~/bin/mountns-uidshift -u 1000000
...
bash-4.3# id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
bash-4.3# cat /proc/self/uid_map 
         0    1000000      65536
bash-4.3# mount | grep shift -           
/dev/mapper/fedora-ext4_root on /mnt/ext4_root type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)
/dev/mapper/fedora-ext4_root on /mnt/ext4_root/container-uid-1000000/mountpoint type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)
bash-4.3# stat -c '%u:%g' /etc/motd 
65534:65534
bash-4.3# stat -c '%u:%g' /mnt/ext4_root/rootfs/fedora-tree/etc/motd 
0:0
bash-4.3# stat -c '%u:%g' mountpoint/etc/motd                        
0:0
bash-4.3# stat -c '%u:%g' /etc/machine-id 
65534:65534
bash-4.3# echo "blabla" > /etc/machine-id                                                                      
bash: /etc/machine-id: Permission denied
bash-4.3# stat -c '%u:%g' mountpoint/etc/machine-id 
0:0
bash-4.3# sha1sum mountpoint/etc/machine-id 
edb24591988f0f003cd397704f49e92208b3015f  mountpoint/etc/machine-id
bash-4.3# m=$(cat /dev/urandom | tr -cd 'a-f0-9' | head -c 32); echo $m | sha1sum; echo $m > mountpoint/etc/machine-id 
f256a796b1f2ed09c4107f1f5aff2568fb2d79cc  -
bash-4.3# sha1sum mountpoint/etc/machine-id                                                                            
f256a796b1f2ed09c4107f1f5aff2568fb2d79cc  mountpoint/etc/machine-id
bash-4.3# stat -c '%u:%g' mountpoint/etc/machine-id                                                                    
0:0
[outside of namespaces]$ stat -c '%u:%g' /mnt/ext4_root/container-uid-1000000/mountpoint/etc/machine-id
0:0

Test with unprivileged user inside the new mount and user namespaces:
---------------------------------------------------------------------

Test with uid tixxdz == 1000, the user exists on both:
(1) /
(2) /mnt/ext4_root/rootfs/fedora-tree which is bind mounted into /mnt/ext4_root/container-uid-1000000/mountpoint

-bash-4.3$ touch /home/tixxdz/newfile
touch: cannot touch /home/tixxdz/newfile: Permission denied
-bash-4.3$ stat -c '%u:%g' /home/tixxdz/
65534:65534
-bash-4.3$ stat -c '%u:%g' mountpoint/home/tixxdz/
1000:1000
-bash-4.3$ touch mountpoint/home/tixxdz/newfile
-bash-4.3$ stat -c '%u:%g' mountpoint/home/tixxdz/newfile 
1000:1000
[outside of namespaces]$ stat -c '%u:%g' /mnt/ext4_root/container-uid-1000000/mountpoint/home/tixxdz/newfile 
1000:1000


2.2) btrfs:
===========
Same steps as ext4.


2.3) overlayfs:
===============

2.3.1) Native support using VFS:

Overlayfs is natively supported if lowerdir, upperdir and workdir are all
on a mount that supports vfs_shift_uids and vfs_shift_gids flags and we
are in a mount namespace that also supports that.

$ mount | grep btrfs
/dev/mapper/fedora-btrfs_root on /mnt/btrfs_root type btrfs (rw,relatime,seclabel,space_cache,vfs_shift_uids,vfs_shift_gids,subvolid=5,subvol=/)
$ cd /mnt/btrfs_root/
$ sudo mkdir -p container-uid-2000000/{upperdir,workdir,merged}
$ sudo chown -R 2000000.2000000 container-uid-2000000/
$ cd container-uid-2000000/
$ sudo mount -t overlay overlay -o,lowerdir=/mnt/btrfs_root/rootfs/fedora-tree,upperdir=upperdir,workdir=workdir merged
$ sudo chown -R 2000000.2000000 workdir/work/
$ sudo ~/bin/mountns-uidshift -u 2000000
...
bash-4.3# stat -c '%u:%g' merged/etc/passwd
0:0
bash-4.3# touch merged/overlayfs-file
bash-4.3# stat -c '%u:%g' merged/overlayfs-file 
0:0
[outside of namespaces]# stat -c '%u:%g' /mnt/btrfs_root/container-uid-2000000/merged/overlayfs-file 
0:0
[outside of namespaces]# stat -c '%u:%g' /mnt/btrfs_root/container-uid-2000000/upperdir/overlayfs-file 
0:0


2.3.2) Complex support or union filesystems:

If the overlayfs lowerdir and upperdir are not on a filesystem that natively
supports vfs_shift_uids and vfs_shift_gids, then to support VFS UID/GID
shifts we must adapt the helper functions that were introduced in this
series to also take a super_block struct and test whether the appropriate
flags were set on overlayfs instead of on the other filesystem to which the
inode belongs. The on-disk <=> virtual translation should then happen inside
overlayfs.

I think this will always be the case for union mounts which fetch an inode
from another mount. I think that solution (2.3.2) can also be implemented;
I had some ugly patches implementing it on top of overlayfs, but I'm not
sure, better to see what others think about VFS UID/GID shifts first.

IMO solution (2.3.1), if done correctly, is the way to go. In the end all
of this relates to the virtual view of UIDs/GIDs inside the kernel and how
resources are translated to them; it's not related to overlayfs.


3) ROADMAP:
===========
* Confirm current design, and make sure that the mapping is done
  correctly.

* Add clone4() syscall [2]

* Investigate if current setns() checks to enter new mount namespaces
  are sufficient ?

* Add POSIX ACL support ?

* Check if all filesystem operations are correctly supported and recheck
  permission access.

* Do filesystems provide some operations to control disk or host resources?
  In other words, are there inodes on filesystems that allow access to
  host resources? If so, then maybe these inodes should either be marked
  safe only in init_user_ns or get the appropriate capable() check in
  init_user_ns if it is missing. Needs investigation.

* Add XFS support.



References:
===========
[1] https://www.redhat.com/archives/dm-devel/2016-April/msg00368.html
[2] https://lkml.org/lkml/2015/3/15/10
[3] https://raw.githubusercontent.com/OpenDZ/research/master/kernel/mountns-uidshift.c

Thanks!

Patches:
[RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
[RFC v2 PATCH 1/8] VFS: add CLONE_MNTNS_SHIFT_UIDGID flag to allow mounts to shift their UIDs/GIDs
[RFC v2 PATCH 2/8] VFS:uidshift: add flags and helpers to shift UIDs and GIDs to virtual view
[RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid
[RFC v2 PATCH 4/8] VFS:userns: shift UID/GID to virtual view during permission access
[RFC v2 PATCH 5/8] VFS:userns: add helpers to shift UIDs and GIDs into on-disk view
[RFC v2 PATCH 6/8] VFS:userns: shift UID/GID to on-disk view before any write to disk
[RFC v2 PATCH 7/8] ext4: add support for vfs_shift_uids and vfs_shift_gids mount options
[RFC v2 PATCH 8/8] btrfs: add support for vfs_shift_uids and vfs_shift_gids mount options


Diffstat for this RFC
fs/attr.c                      |  44 +++++++++++++++++++++++--------
fs/btrfs/super.c               |  15 ++++++++++-
fs/exec.c                      |   2 +-
fs/ext4/super.c                |  14 ++++++++++
fs/inode.c                     |   9 ++++---
fs/mount.h                     |   1 +
fs/namei.c                     |   6 +++--
fs/namespace.c                 | 190 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/stat.c                      |   4 +--
include/linux/fs.h             |  14 ++++++++++
include/linux/mount.h          |   1 +
include/linux/user_namespace.h |   8 ++++++
include/uapi/linux/sched.h     |   1 +
kernel/capability.c            |  14 ++++++++--
kernel/fork.c                  |   4 +++
kernel/user_namespace.c        |  13 ++++++++++
security/commoncap.c           |   2 +-
security/selinux/hooks.c       |   2 +-
18 files changed, 319 insertions(+), 25 deletions(-)


* [RFC v2 PATCH 1/8] VFS: add CLONE_MNTNS_SHIFT_UIDGID flag to allow mounts to shift their UIDs/GIDs
From: Djalal Harouni @ 2016-05-04 14:26 UTC (permalink / raw)
  To: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy
  Cc: Djalal Harouni, Djalal Harouni

Add the CLONE_MNTNS_SHIFT_UIDGID flag, which is a mount namespace flag. When
it is set, mount points on filesystems that support UID/GID shifts will have
their UIDs and GIDs shifted by the VFS. The UID and GID mapping rules are per
mount namespace: they follow the rules of the user namespace of the containing
mount namespace. The UID/GID of inodes are supposed to always contain
the on-disk values, hence the shift is done inside the VFS, and it is a
read shift applied when we access the inodes.

This is a preparation patch.

Goal:

	/* (1) */
	clone4(CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID, ...)
	/*
		Setup container base mount namespace, rootfs and mount all
		necessary mount points and filesystems that can't be mounted
		in user namespaces. Filesystems that support uid/gid shifts
		should set the mount parameters.
		mount(..., mount_options=[vfs_shift_uids, vfs_shift_gids])
	*/

	/* (2) */
	/*
		Setup new mount and user namespaces and inherit the
		CLONE_MNTNS_SHIFT_UIDGID flag from (1) into the new mount
		namespace (2).
	*/
	clone4(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID, ...)
	/*
	   inodes of mount points here that support UID/GID shifts will have
	   automatically their UID/GID shifted according to the user
	   namespace rules of the current mount namespace (2).
	*/

We create the new user and mount namespaces where:
1) The mount namespace allows mounts inside it that support UID and GID
   shifting to perform the shifts if the CLONE_MNTNS_SHIFT_UIDGID is set
   in the current mount namespace.

2) The UID and GID mapping is done according to the rules of the user
   namespace of the containing mount namespace. The CLONE_MNTNS_SHIFT_UIDGID
   follows the CLONE_NEWUSER|CLONE_NEWNS combination. This ensures that
   only the creator of the mount namespace is able to adjust the user
   namespace mapping rules.

The flag CLONE_MNTNS_SHIFT_UIDGID can be set on the mount namespace
only if:

1) The parent namespace has already CLONE_MNTNS_SHIFT_UIDGID set on
   its mount namespace.

2) The caller has CAP_SYS_ADMIN in the init_user_ns namespace, since we
   start from that namespace and we inherit some mount points we have to
   protect files from privileged userns doing:
   clone(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID...)
   This is blocked.
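
The two checks in this patch (the flag is only valid together with
CLONE_NEWNS, and setting it needs either an already flagged parent mount
namespace or CAP_SYS_ADMIN in init_user_ns) can be summarised by the
userspace sketch below. DEMO_NEWNS, DEMO_SHIFT_UIDGID and
check_shift_flag() are hypothetical names with made-up values; the two
booleans stand in for the parent namespace flag test and the capable()
call.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for CLONE_NEWNS and CLONE_MNTNS_SHIFT_UIDGID. */
#define DEMO_NEWNS		0x1u
#define DEMO_SHIFT_UIDGID	0x2u

/*
 * Returns 0 if the request is acceptable, -EINVAL if the shift flag is
 * passed without a new mount namespace, and -EPERM if neither the parent
 * mount namespace has the flag nor the caller is capable in init_user_ns.
 */
static int check_shift_flag(unsigned int clone_flags, bool parent_has_flag,
			    bool capable_in_init_user_ns)
{
	/* Nothing to validate when the shift flag is not requested. */
	if (!(clone_flags & DEMO_SHIFT_UIDGID))
		return 0;

	/* The flag only makes sense on a newly created mount namespace. */
	if (!(clone_flags & DEMO_NEWNS))
		return -EINVAL;

	/* Inherit from a flagged parent, or require privilege. */
	if (parent_has_flag || capable_in_init_user_ns)
		return 0;

	return -EPERM;
}
```

An unprivileged caller whose parent mount namespace lacks the flag gets
-EPERM, which is the blocked case described in point 2.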

If a filesystem was mounted with "vfs_shift_uids" and "vfs_shift_gids"
and shows up in a mount namespace that does not include the
CLONE_MNTNS_SHIFT_UIDGID, then no shift is done. UIDs and GIDs will
not be changed at all, and things will continue to work as they are now.

Signed-off-by: Dongsu Park <dongsu@endocode.com>
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
 fs/mount.h                 |  1 +
 fs/namespace.c             | 20 ++++++++++++++++++++
 include/uapi/linux/sched.h |  1 +
 kernel/fork.c              |  4 ++++
 4 files changed, 26 insertions(+)

diff --git a/fs/mount.h b/fs/mount.h
index 14db05d..1e317eb 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -6,6 +6,7 @@
 
 struct mnt_namespace {
 	atomic_t		count;
+	int			flags;
 	struct ns_common	ns;
 	struct mount *	root;
 	struct list_head	list;
diff --git a/fs/namespace.c b/fs/namespace.c
index 4fb1691..940ecfc 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2774,6 +2774,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns)
 	INIT_LIST_HEAD(&new_ns->list);
 	init_waitqueue_head(&new_ns->poll);
 	new_ns->event = 0;
+	new_ns->flags = 0;
 	new_ns->user_ns = get_user_ns(user_ns);
 	return new_ns;
 }
@@ -2801,6 +2802,25 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
 	if (IS_ERR(new_ns))
 		return new_ns;
 
+	if (flags & CLONE_MNTNS_SHIFT_UIDGID) {
+		/*
+		 * If parent has the CLONE_MNTNS_SHIFT_UIDGID flag set
+		 * or current is capable in init_user_ns, then we set the
+		 * CLONE_MNTNS_SHIFT_UIDGID flag and allow mounts inside
+		 * this namespace to shift their UID and GID.
+		 *
+		 * We check the init_user_ns here since we always start from
+		 * that user namespace and mounts are by default available to all
+		 * users. In this regard, only CAP_SYS_ADMIN in init_user_ns is
+		 * allowed to start and propagate the CLONE_MNTNS_SHIFT_UIDGID
+		 * flag to new mount namespaces.
+		 */
+		if ((ns->flags & CLONE_MNTNS_SHIFT_UIDGID) || capable(CAP_SYS_ADMIN))
+			new_ns->flags |= CLONE_MNTNS_SHIFT_UIDGID;
+		else
+			return ERR_PTR(-EPERM);
+	}
+
 	namespace_lock();
 	/* First pass: copy the tree topology */
 	copy_flags = CL_COPY_UNBINDABLE | CL_EXPIRE;
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5f0fe01..9ba2124 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -19,6 +19,7 @@
 #define CLONE_PARENT_SETTID	0x00100000	/* set the TID in the parent */
 #define CLONE_CHILD_CLEARTID	0x00200000	/* clear the TID in the child */
 #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
+#define CLONE_MNTNS_SHIFT_UIDGID     0x00400000      /* If set allows to shift UID and GID for mounts that support it */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
 #define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
diff --git a/kernel/fork.c b/kernel/fork.c
index d277e83..41223cd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1264,6 +1264,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
 
+	if ((clone_flags & CLONE_MNTNS_SHIFT_UIDGID) &&
+	    !(clone_flags & CLONE_NEWNS))
+		return ERR_PTR(-EINVAL);
+
 	if ((clone_flags & (CLONE_NEWUSER|CLONE_FS)) == (CLONE_NEWUSER|CLONE_FS))
 		return ERR_PTR(-EINVAL);
 
-- 
2.5.5


* [RFC v2 PATCH 2/8] VFS:uidshift: add flags and helpers to shift UIDs and GIDs to virtual view
From: Djalal Harouni @ 2016-05-04 14:26 UTC (permalink / raw)
  To: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy
  Cc: Djalal Harouni, Djalal Harouni

Add helpers to allow the VFS to shift UIDs and GIDs according to the
user namespace of the containing mount namespace. All decisions are
taken by VFS.

In this regard, to allow the shift of UIDs and GIDs, filesystems must set
the "vfs_shift_uids" and "vfs_shift_gids" options when mounted;
therefore add the appropriate flags to distinguish between mounts that
support UID/GID shifts and others. The flags are "SB_I_VFS_SHIFT_UIDS"
and "SB_I_VFS_SHIFT_GIDS" and should be set by filesystems in
super_block->s_iflags when they are mounted.

If a mount that supports VFS UID/GID shifts shows up in a mount namespace
that allows UID/GID shifts, VFS helpers will handle the shift
and translate it according to that mount namespace. The user namespace
of that containing mount namespace will be used to perform the
translation. This gives the advantage that only a process with the right
privileges or an ancestor of the mount namespace is able to set user
namespace mappings.

As an example, we do the shift to the virtual view when we stat() inodes;
the VFS handles that and inode->{i_uid|i_gid} will always contain the
on-disk view.

Returned stat()->{uid|gid} example inside user_ns_X
---------------------------------------------------

Without this patch:
-------------------------------------------------------------------------
inode->uid on Disk | init_user_ns uid | user_ns_X uid   | stat->uid
-------------------------------------------------------------------------
0                  | 1000000          | 0               | 65534
-------------------------------------------------------------------------
999                | 1000999          | 999             | 65534
-------------------------------------------------------------------------
1000               | 1001000          | 1000            | 65534
-------------------------------------------------------------------------
1000000            | 1000000          | 0               | 0
-------------------------------------------------------------------------
1000999            | 1000999          | 999             | 999
-------------------------------------------------------------------------
1001000            | 1001000          | 1000            | 1000
-------------------------------------------------------------------------

With this patch:
-------------------------------------------------------------------------
inode->uid on Disk | init_user_ns uid | user_ns_X uid   | stat->uid
-------------------------------------------------------------------------
0                  | 1000000          | 0               | 0
-------------------------------------------------------------------------
999                | 1000999          | 999             | 999
-------------------------------------------------------------------------
1000               | 1001000          | 1000            | 1000
-------------------------------------------------------------------------
1000000            | 1000000          | 0               | 0
-------------------------------------------------------------------------
1000999            | 1000999          | 999             | 999
-------------------------------------------------------------------------
1001000            | 1001000          | 1000            | 1000
-------------------------------------------------------------------------
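
The stat->uid column of the tables above can be reproduced with a small
userspace model of the read path. This is a sketch under assumptions,
not the kernel code: a single extent "0 1000000 65536" is assumed,
plain integers stand in for kuid_t, and global_mapped(), make_global(),
from_global_munged() and stat_uid() are hypothetical names that loosely
mirror kuid_has_mapping(), make_kuid() and from_kuid_munged().

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_OVERFLOW_ID 65534u		/* default overflow UID/GID */
#define DEMO_INVALID_ID  UINT32_MAX	/* models an invalid kuid */

struct id_extent {
	uint32_t inside;	/* first ID inside the user namespace */
	uint32_t outside;	/* first ID in init_user_ns */
	uint32_t count;		/* length of the range */
};

/* Does the global ID have a mapping in the namespace? */
static bool global_mapped(const struct id_extent *e, uint32_t global)
{
	return global >= e->outside && global - e->outside < e->count;
}

/* Inside ID -> global ID, or DEMO_INVALID_ID when out of range. */
static uint32_t make_global(const struct id_extent *e, uint32_t inside)
{
	if (inside >= e->inside && inside - e->inside < e->count)
		return e->outside + (inside - e->inside);
	return DEMO_INVALID_ID;
}

/* Global ID -> inside ID, or the overflow ID when unmapped. */
static uint32_t from_global_munged(const struct id_extent *e, uint32_t global)
{
	if (global_mapped(e, global))
		return e->inside + (global - e->outside);
	return DEMO_OVERFLOW_ID;
}

/* Value stat() would report inside the namespace for an on-disk ID. */
static uint32_t stat_uid(const struct id_extent *e, uint32_t on_disk, bool shift)
{
	uint32_t id = on_disk;

	/* With shifting enabled, unmapped on-disk IDs get a mapping built. */
	if (shift && !global_mapped(e, id))
		id = make_global(e, on_disk);
	return from_global_munged(e, id);
}
```

With shift set to false an on-disk UID of 0 is reported as 65534; with
shift set to true it is reported as 0, as in the second table.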

Signed-off-by: Dongsu Park <dongsu@endocode.com>
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
 fs/namespace.c     | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/stat.c          |  4 ++--
 include/linux/fs.h | 11 ++++++++++
 3 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 940ecfc..de02b39 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1652,6 +1652,22 @@ SYSCALL_DEFINE1(oldumount, char __user *, name)
 
 #endif
 
+/* Returns true if the VFS should shift inode's UID */
+static bool vfs_mount_shift_i_uid(struct mnt_namespace *ns,
+				  const struct inode *inode)
+{
+	return (ns->flags & CLONE_MNTNS_SHIFT_UIDGID) &&
+		(inode->i_sb->s_iflags & SB_I_VFS_SHIFT_UIDS);
+}
+
+/* Returns true if the VFS should shift inode's GID */
+static bool vfs_mount_shift_i_gid(struct mnt_namespace *ns,
+				  const struct inode *inode)
+{
+	return (ns->flags & CLONE_MNTNS_SHIFT_UIDGID) &&
+		(inode->i_sb->s_iflags & SB_I_VFS_SHIFT_GIDS);
+}
+
 static bool is_mnt_ns_file(struct dentry *dentry)
 {
 	/* Is this a proxy for a mount namespace? */
@@ -1664,6 +1680,52 @@ struct mnt_namespace *to_mnt_ns(struct ns_common *ns)
 	return container_of(ns, struct mnt_namespace, ns);
 }
 
+/*
+ * Returns the virtual UID view of the inode's uid
+ * If UID shifts are enabled on the mount namespace and the filesystem,
+ * the VFS will return the shifted UID according to the rules of the
+ * user namespace of the containing mount namespace. If no shift is
+ * performed, inode->i_uid is returned.
+ */
+kuid_t vfs_shift_i_uid_to_virtual(const struct inode *inode)
+{
+	kuid_t i_uid = inode->i_uid;
+	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+
+	/* Nothing to do */
+	if (!vfs_mount_shift_i_uid(ns, inode))
+		return i_uid;
+
+	/* If there is no mapping construct one in the current mountns */
+	if (!kuid_has_mapping(ns->user_ns, i_uid))
+		return make_kuid(ns->user_ns, i_uid.val);
+
+	return i_uid;
+}
+
+/*
+ * Returns the virtual GID view of the inode's gid
+ * If GID shifts are enabled on the mount namespace and the filesystem,
+ * the VFS will return the shifted GID according to the rules of the
+ * user namespace of the containing mount namespace. If no shift is
+ * performed, inode->i_gid is returned.
+ */
+kgid_t vfs_shift_i_gid_to_virtual(const struct inode *inode)
+{
+	kgid_t i_gid = inode->i_gid;
+	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+
+	/* Nothing to do */
+	if (!vfs_mount_shift_i_gid(ns, inode))
+		return i_gid;
+
+	/* If there is no mapping construct one in the current mountns */
+	if (!kgid_has_mapping(ns->user_ns, i_gid))
+		return make_kgid(ns->user_ns, i_gid.val);
+
+	return i_gid;
+}
+
 static bool mnt_ns_loop(struct dentry *dentry)
 {
 	/* Could bind mounting the mount namespace inode cause a
diff --git a/fs/stat.c b/fs/stat.c
index bc045c7..d00622b 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -24,8 +24,8 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
 	stat->ino = inode->i_ino;
 	stat->mode = inode->i_mode;
 	stat->nlink = inode->i_nlink;
-	stat->uid = inode->i_uid;
-	stat->gid = inode->i_gid;
+	stat->uid = vfs_shift_i_uid_to_virtual(inode);
+	stat->gid = vfs_shift_i_gid_to_virtual(inode);
 	stat->rdev = inode->i_rdev;
 	stat->size = i_size_read(inode);
 	stat->atime = inode->i_atime;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 70e61b5..a9efc5a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1298,6 +1298,8 @@ struct mm_struct;
 /* sb->s_iflags */
 #define SB_I_CGROUPWB	0x00000001	/* cgroup-aware writeback enabled */
 #define SB_I_NOEXEC	0x00000002	/* Ignore executables on this fs */
+#define SB_I_VFS_SHIFT_UIDS	0x00000004	/* FS allows VFS to do UID shifts */
+#define SB_I_VFS_SHIFT_GIDS	0x00000008	/* FS allows VFS to do GID shifts */
 
 /* Possible states of 'frozen' field */
 enum {
@@ -1563,6 +1565,15 @@ extern int vfs_whiteout(struct inode *, struct dentry *);
  */
 extern void inode_init_owner(struct inode *inode, const struct inode *dir,
 			umode_t mode);
+
+/*
+ * VFS helpers to shift inode's uid/gid and passed values to either virtual
+ * or on-disk view. The shift is done according to rules of the user namespace
+ * of the containing mount namespace.
+ */
+extern kuid_t vfs_shift_i_uid_to_virtual(const struct inode *inode);
+extern kgid_t vfs_shift_i_gid_to_virtual(const struct inode *inode);
+
 /*
  * VFS FS_IOC_FIEMAP helper definitions.
  */
-- 
2.5.5

* [RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid
  2016-05-04 14:26 [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Djalal Harouni
  2016-05-04 14:26 ` [RFC v2 PATCH 1/8] VFS: add CLONE_MNTNS_SHIFT_UIDGID flag to allow mounts to shift their UIDs/GIDs Djalal Harouni
  2016-05-04 14:26 ` [RFC v2 PATCH 2/8] VFS:uidshift: add flags and helpers to shift UIDs and GIDs to virtual view Djalal Harouni
@ 2016-05-04 14:26 ` Djalal Harouni
  2016-05-04 23:19   ` Serge Hallyn
  2016-05-04 14:26 ` [RFC v2 PATCH 4/8] VFS:userns: shift UID/GID to virtual view during permission access Djalal Harouni
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 49+ messages in thread
From: Djalal Harouni @ 2016-05-04 14:26 UTC (permalink / raw)
  To: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy
  Cc: Djalal Harouni, Djalal Harouni

If a process gets access to a mount from a different user
namespace, that process should not be able to take advantage of
setuid files or selinux entrypoints from that filesystem.  Prevent
this by treating mounts from other mount namespaces and those not
owned by current_user_ns() or an ancestor as nosuid.
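
For illustration only (this is not part of the patch), the rule above can
be modelled in userspace. The toy_* names, the struct layouts and the
integer mount namespace id are assumptions made for this sketch; only the
parent walk mirrors what in_userns() does in the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct user_namespace: only the parent link matters. */
struct toy_userns {
	const struct toy_userns *parent;	/* NULL for the initial namespace */
};

/* Toy stand-in for a mount (hypothetical layout, not the kernel's). */
struct toy_mount {
	bool nosuid;				/* models MNT_NOSUID */
	int mnt_ns_id;				/* id of the owning mount namespace */
	const struct toy_userns *owner;		/* userns owning that mount namespace */
};

/* True if ns is target itself or one of its descendants. */
static bool toy_in_userns(const struct toy_userns *ns,
			  const struct toy_userns *target)
{
	for (; ns; ns = ns->parent) {
		if (ns == target)
			return true;
	}
	return false;
}

/*
 * A mount may honour suid bits only if it is not nosuid, belongs to the
 * caller's mount namespace, and is owned by the caller's user namespace
 * or by one of its ancestors.
 */
bool toy_mnt_may_suid(const struct toy_mount *m, int cur_mnt_ns_id,
		      const struct toy_userns *cur_userns)
{
	assert(m != NULL);
	return !m->nosuid && m->mnt_ns_id == cur_mnt_ns_id &&
	       toy_in_userns(cur_userns, m->owner);
}
```

In this model a mount owned by a child user namespace is treated as nosuid
when seen from the initial namespace or from a sibling, while a mount
owned by the initial namespace stays usable from its descendants.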

This patch was just adapted from the original one that was written
by Andy Lutomirski <luto@amacapital.net>
https://www.redhat.com/archives/dm-devel/2016-April/msg00374.html

Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
 fs/exec.c                      |  2 +-
 fs/namespace.c                 | 15 +++++++++++++++
 include/linux/mount.h          |  1 +
 include/linux/user_namespace.h |  8 ++++++++
 kernel/user_namespace.c        | 13 +++++++++++++
 security/commoncap.c           |  2 +-
 security/selinux/hooks.c       |  2 +-
 7 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index c4010b8..706088d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1391,7 +1391,7 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
 	bprm->cred->euid = current_euid();
 	bprm->cred->egid = current_egid();
 
-	if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+	if (!mnt_may_suid(bprm->file->f_path.mnt))
 		return;
 
 	if (task_no_new_privs(current))
diff --git a/fs/namespace.c b/fs/namespace.c
index de02b39..a8820fb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3374,6 +3374,21 @@ found:
 	return visible;
 }
 
+bool mnt_may_suid(struct vfsmount *mnt)
+{
+	struct mount *m = real_mount(mnt);
+
+	/*
+	 * Foreign mounts (accessed via fchdir or through /proc
+	 * symlinks) are always treated as if they are nosuid. This
+	 * prevents namespaces from trusting potentially unsafe
+	 * suid/sgid bits, file caps, or security labels that originate
+	 * in other namespaces.
+	 */
+	return !(mnt->mnt_flags & MNT_NOSUID) && check_mnt(m) &&
+		 in_userns(current_user_ns(), m->mnt_ns->user_ns);
+}
+
 static struct ns_common *mntns_get(struct task_struct *task)
 {
 	struct ns_common *ns = NULL;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index f822c3c..54a594d 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -81,6 +81,7 @@ extern void mntput(struct vfsmount *mnt);
 extern struct vfsmount *mntget(struct vfsmount *mnt);
 extern struct vfsmount *mnt_clone_internal(struct path *path);
 extern int __mnt_is_readonly(struct vfsmount *mnt);
+extern bool mnt_may_suid(struct vfsmount *mnt);
 
 struct path;
 extern struct vfsmount *clone_private_mount(struct path *path);
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 8297e5b..a43faa7 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -72,6 +72,8 @@ extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t,
 extern ssize_t proc_setgroups_write(struct file *, const char __user *, size_t, loff_t *);
 extern int proc_setgroups_show(struct seq_file *m, void *v);
 extern bool userns_may_setgroups(const struct user_namespace *ns);
+extern bool in_userns(const struct user_namespace *ns,
+		      const struct user_namespace *target_ns);
 #else
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -100,6 +102,12 @@ static inline bool userns_may_setgroups(const struct user_namespace *ns)
 {
 	return true;
 }
+
+static inline bool in_userns(const struct user_namespace *ns,
+			     const struct user_namespace *target_ns)
+{
+	return true;
+}
 #endif
 
 #endif /* _LINUX_USER_H */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 9bafc21..9a496a8 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -938,6 +938,20 @@ bool userns_may_setgroups(const struct user_namespace *ns)
 	return allowed;
 }
 
+/*
+ * Returns true if @ns is the same namespace as or a descendant of
+ * @target_ns.
+ */
+bool in_userns(const struct user_namespace *ns,
+	       const struct user_namespace *target_ns)
+{
+	for (; ns; ns = ns->parent) {
+		if (ns == target_ns)
+			return true;
+	}
+	return false;
+}
+
 static inline struct user_namespace *to_user_ns(struct ns_common *ns)
 {
 	return container_of(ns, struct user_namespace, ns);
diff --git a/security/commoncap.c b/security/commoncap.c
index 48071ed..6c082d2 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -453,7 +453,7 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
 	if (!file_caps_enabled)
 		return 0;
 
-	if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+	if (!mnt_may_suid(bprm->file->f_path.mnt))
 		return 0;
 
 	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 912deee..1350167 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2234,7 +2234,7 @@ static int check_nnp_nosuid(const struct linux_binprm *bprm,
 			    const struct task_security_struct *new_tsec)
 {
 	int nnp = (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS);
-	int nosuid = (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID);
+	int nosuid = !mnt_may_suid(bprm->file->f_path.mnt);
 	int rc;
 
 	if (!nnp && !nosuid)
-- 
2.5.5

* [RFC v2 PATCH 4/8] VFS:userns: shift UID/GID to virtual view during permission access
  2016-05-04 14:26 [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Djalal Harouni
                   ` (2 preceding siblings ...)
  2016-05-04 14:26 ` [RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid Djalal Harouni
@ 2016-05-04 14:26 ` Djalal Harouni
  2016-05-04 14:26 ` [RFC v2 PATCH 5/8] VFS:userns: add helpers to shift UIDs and GIDs into on-disk view Djalal Harouni
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Djalal Harouni @ 2016-05-04 14:26 UTC (permalink / raw)
  To: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy
  Cc: Djalal Harouni, Djalal Harouni

If both the mount namespace and the mount point support UID/GID shifts,
then before doing any permission check, translate inode->{i_uid|i_gid}
into the kernel virtual view and use the result for the permission
checks. If there is no support for UID/GID shifts, we fall back to the
on-disk inode->{i_uid|i_gid} values.

The VFS shifts these values to the virtual view, and the result is
compared with current's fsuid and fsgid and used to perform the classic
or capable checks. Since inode->{i_uid|i_gid} always contain the on-disk
values, we do the virtual translation whenever an access check is
needed.

This solves the problem of privileged userns root or other users inside
containers that want to access files but are denied because the VFS uses
their global kuid/kgid.

Permission checks inside user_ns_X
----------------------------------

Without this patch:
-------------------------------------------------------------------------
inode->uid on Disk | init_user_ns uid | user_ns_X uid           | Access
-------------------------------------------------------------------------
0                  | 1000000          | 0 (userns root)         | Denied
-------------------------------------------------------------------------
999                | 1000999          | 999                     | Denied
-------------------------------------------------------------------------
1000               | 1001000          | 1000                    | Denied
-------------------------------------------------------------------------
1000               | 1000000          | 0 (userns root CAPS)    | Denied
-------------------------------------------------------------------------
0                  | 1001000          | 1000                    | Denied
-------------------------------------------------------------------------

With this patch:
--------------------------------------------------------------------------
inode->uid on Disk | init_user_ns uid | user_ns_X uid           | Access
--------------------------------------------------------------------------
0                  | 1000000          | 0 (userns root)         | Granted
--------------------------------------------------------------------------
999                | 1000999          | 999                     | Granted
--------------------------------------------------------------------------
1000               | 1001000          | 1000                    | Granted
--------------------------------------------------------------------------
1000               | 1000000          | 0 (userns root CAPS)    | Granted
--------------------------------------------------------------------------
999                | 1000000          | 0 (userns root CAPS)    | Granted
--------------------------------------------------------------------------
0                  | 1001000          | 1000                    | Denied
--------------------------------------------------------------------------
0                  | 1000999          | 999                     | Denied
--------------------------------------------------------------------------
1000               | 1000999          | 999                     | Denied
--------------------------------------------------------------------------

* CAPS: capabilities. The access was granted because of the caller's
  capabilities inside user_ns_X, and because the shifted UID/GID of the
  inode are also mapped in that user_ns_X.

A privileged user namespace root (uid 0 inside the container) will be
able to access inodes whose on-disk i_uid == 0, provided the inode is on
a filesystem that supports VFS UID/GID shifts and the caller is inside a
mount namespace that also supports them.

Signed-off-by: Dongsu Park <dongsu@endocode.com>
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
 fs/inode.c          |  5 +++--
 fs/namei.c          |  6 ++++--
 kernel/capability.c | 14 ++++++++++++--
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 69b8b52..07daf5f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1961,12 +1961,13 @@ EXPORT_SYMBOL(inode_init_owner);
 bool inode_owner_or_capable(const struct inode *inode)
 {
 	struct user_namespace *ns;
+	kuid_t i_uid = vfs_shift_i_uid_to_virtual(inode);
 
-	if (uid_eq(current_fsuid(), inode->i_uid))
+	if (uid_eq(current_fsuid(), i_uid))
 		return true;
 
 	ns = current_user_ns();
-	if (ns_capable(ns, CAP_FOWNER) && kuid_has_mapping(ns, inode->i_uid))
+	if (ns_capable(ns, CAP_FOWNER) && kuid_has_mapping(ns, i_uid))
 		return true;
 	return false;
 }
diff --git a/fs/namei.c b/fs/namei.c
index 1d9ca2d..f7ee498 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -289,8 +289,10 @@ static int check_acl(struct inode *inode, int mask)
 static int acl_permission_check(struct inode *inode, int mask)
 {
 	unsigned int mode = inode->i_mode;
+	kuid_t i_uid = vfs_shift_i_uid_to_virtual(inode);
+	kgid_t i_gid = vfs_shift_i_gid_to_virtual(inode);
 
-	if (likely(uid_eq(current_fsuid(), inode->i_uid)))
+	if (likely(uid_eq(current_fsuid(), i_uid)))
 		mode >>= 6;
 	else {
 		if (IS_POSIXACL(inode) && (mode & S_IRWXG)) {
@@ -299,7 +301,7 @@ static int acl_permission_check(struct inode *inode, int mask)
 				return error;
 		}
 
-		if (in_group_p(inode->i_gid))
+		if (in_group_p(i_gid))
 			mode >>= 3;
 	}
 
diff --git a/kernel/capability.c b/kernel/capability.c
index 45432b5..fdc8afb 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -441,9 +441,19 @@ EXPORT_SYMBOL(file_ns_capable);
  */
 bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
 {
+	kuid_t i_uid;
+	kgid_t i_gid;
 	struct user_namespace *ns = current_user_ns();
 
-	return ns_capable(ns, cap) && kuid_has_mapping(ns, inode->i_uid) &&
-		kgid_has_mapping(ns, inode->i_gid);
+	/*
+	 * Check if inode's UID/GID are meant to be shifted into the current
+	 * mount namespace, if so we use the result to check if the shifted
+	 * UID/GID have a mapping in current's user namespace.
+	 */
+	i_uid = vfs_shift_i_uid_to_virtual(inode);
+	i_gid = vfs_shift_i_gid_to_virtual(inode);
+
+	return ns_capable(ns, cap) && kuid_has_mapping(ns, i_uid) &&
+		kgid_has_mapping(ns, i_gid);
 }
 EXPORT_SYMBOL(capable_wrt_inode_uidgid);
-- 
2.5.5

* [RFC v2 PATCH 5/8] VFS:userns: add helpers to shift UIDs and GIDs into on-disk view
  2016-05-04 14:26 [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Djalal Harouni
                   ` (3 preceding siblings ...)
  2016-05-04 14:26 ` [RFC v2 PATCH 4/8] VFS:userns: shift UID/GID to virtual view during permission access Djalal Harouni
@ 2016-05-04 14:26 ` Djalal Harouni
  2016-05-04 14:26 ` [RFC v2 PATCH 6/8] VFS:userns: shift UID/GID to on-disk view before any write to disk Djalal Harouni
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Djalal Harouni @ 2016-05-04 14:26 UTC (permalink / raw)
  To: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy
  Cc: Djalal Harouni, Djalal Harouni

Add helpers to allow the VFS to shift UIDs and GIDs into the on-disk view
according to the user namespace of the containing mount namespace.
All decisions are taken by the VFS. This is a preparation patch for the
next one, where we convert the kuid and kgid that are about to be written
to disk.

To allow the shift of UIDs and GIDs, filesystems must be mounted with the
"vfs_shift_uids" and "vfs_shift_gids" options; otherwise no shift is
performed at all.

vfs_shift_kuid_to_disk() and vfs_shift_kgid_to_disk() take two
arguments: the inode that we are trying to update on disk, and the
kuid or kgid that should be used to update the inode->{i_uid|i_gid}
values. To convert to the on-disk value we perform:

1) First check whether the UID/GID shift is enabled for the inode, which
   means on both the filesystem and the current mount namespace.

2) If the passed kuid has a mapping in the user namespace of the
   containing mount namespace, then get the corresponding uid_t value;
   otherwise we fall back to init_user_ns.

3) Finally construct the kuid from the pair of init_user_ns and uid_t
   values. This always reflects what should be put into
   inode->{i_uid|i_gid} on disk.

If a mount that supports VFS UID/GID shifts shows up in a mount namespace
that allows UID/GID shifts, these helpers will handle the shift to the
on-disk view and perform the translation accordingly.

Signed-off-by: Dongsu Park <dongsu@endocode.com>
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
 fs/namespace.c     | 93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  3 ++
 2 files changed, 96 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index a8820fb..7df896b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1726,6 +1726,99 @@ kgid_t vfs_shift_i_gid_to_virtual(const struct inode *inode)
 	return i_gid;
 }
 
+/*
+ * Returns the on-disk UID view of the passed kuid that is supposed to be
+ * used to update the inode's uid.
+ * If UID shifts are enabled on the mount namespace and the filesystem,
+ * the VFS will return the on-disk view of the passed kuid. If no shift is
+ * performed, kuid is returned without any change.
+ */
+kuid_t vfs_shift_kuid_to_disk(const struct inode *inode, kuid_t kuid)
+{
+	uid_t uid;
+	struct mnt_namespace *ns;
+
+	if (!current->mm)
+		return kuid;
+
+	ns = current->nsproxy->mnt_ns;
+
+	/* Nothing to do */
+	if (!vfs_mount_shift_i_uid(ns, inode))
+		return kuid;
+
+	/*
+	 * If kuid has a mapping in the mountns get its uid_t otherwise get it
+	 * from init_user_ns
+	 */
+	if (kuid_has_mapping(ns->user_ns, kuid))
+		uid = from_kuid(ns->user_ns, kuid);
+	else
+		uid = from_kuid(&init_user_ns, kuid);
+
+	if (uid == (uid_t) -1)
+		return kuid;
+
+	/* Get the final kuid that will be used for on-disk writes */
+	return make_kuid(&init_user_ns, uid);
+}
+
+/*
+ * Returns the on-disk GID view of the passed kgid that is supposed to be
+ * used to update the inode's gid.
+ * If GID shifts are enabled on the mount namespace and the filesystem,
+ * the VFS will return the on-disk view of the passed kgid. If no shift is
+ * performed, kgid is returned without any change.
+ */
+kgid_t vfs_shift_kgid_to_disk(const struct inode *inode, kgid_t kgid)
+{
+	gid_t gid;
+	struct mnt_namespace *ns;
+
+	if (!current->mm)
+		return kgid;
+
+	ns = current->nsproxy->mnt_ns;
+
+	/* Nothing to do ? */
+	if (!vfs_mount_shift_i_gid(ns, inode))
+		return kgid;
+
+	/*
+	 * If kgid has a mapping in the mountns get its gid_t otherwise get it
+	 * from init_user_ns
+	 */
+	if (kgid_has_mapping(ns->user_ns, kgid))
+		gid = from_kgid(ns->user_ns, kgid);
+	else
+		gid = from_kgid(&init_user_ns, kgid);
+
+	if (gid == (gid_t) -1)
+		return kgid;
+
+	/* Get the final kgid that will be used for on-disk writes */
+	return make_kgid(&init_user_ns, gid);
+}
+
+/*
+ * Converts the passed kgid from on-disk view into the virtual one and returns it.
+ * It takes two arguments, the inode where the related kgid is supposed to be
+ * used, and the kgid.
+ */
+kgid_t vfs_kgid_disk_to_virtual(const struct inode *inode, kgid_t kgid)
+{
+	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+
+	/* shift kgid if necessary */
+	if (!vfs_mount_shift_i_gid(ns, inode))
+		return kgid;
+
+	if (!kgid_has_mapping(ns->user_ns, kgid))
+		return make_kgid(ns->user_ns, kgid.val);
+
+	return kgid;
+}
+
 static bool mnt_ns_loop(struct dentry *dentry)
 {
 	/* Could bind mounting the mount namespace inode cause a
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a9efc5a..e957474 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1573,6 +1573,9 @@ extern void inode_init_owner(struct inode *inode, const struct inode *dir,
  */
 extern kuid_t vfs_shift_i_uid_to_virtual(const struct inode *inode);
 extern kgid_t vfs_shift_i_gid_to_virtual(const struct inode *inode);
+extern kgid_t vfs_kgid_disk_to_virtual(const struct inode *inode, kgid_t kgid);
+extern kuid_t vfs_shift_kuid_to_disk(const struct inode *inode, kuid_t kuid);
+extern kgid_t vfs_shift_kgid_to_disk(const struct inode *inode, kgid_t kgid);
 
 /*
  * VFS FS_IOC_FIEMAP helper definitions.
-- 
2.5.5

* [RFC v2 PATCH 6/8] VFS:userns: shift UID/GID to on-disk view before any write to disk
  2016-05-04 14:26 [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Djalal Harouni
                   ` (4 preceding siblings ...)
  2016-05-04 14:26 ` [RFC v2 PATCH 5/8] VFS:userns: add helpers to shift UIDs and GIDs into on-disk view Djalal Harouni
@ 2016-05-04 14:26 ` Djalal Harouni
  2016-05-04 14:26 ` [RFC v2 PATCH 7/8] ext4: add support for vfs_shift_uids and vfs_shift_gids mount options Djalal Harouni
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Djalal Harouni @ 2016-05-04 14:26 UTC (permalink / raw)
  To: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy
  Cc: Djalal Harouni, Djalal Harouni

If both the mount namespace and the mount point support UID/GID shifts,
then during inode creation or during a chown call on an inode, make sure
that the kuid and kgid used to set inode->{i_uid|i_gid} are in the
on-disk view.

Perform the shift to the on-disk view during inode initialization or
during notify_change() calls. Usually in this case the inode's uid/gid
will contain a kuid and kgid that are valid in the context of the caller
and its view inside the global init_user_ns user namespace. They always
end up with either the current_fsuid() value or the attr->ia_uid of the
struct iattr.

inode->{i_uid|i_gid} on-disk writes inside user_ns_X
----------------------------------------------------

Without this patch:
------------------------------------------------------------
user_ns_X uid   | init_user_ns uid    | inode->i_uid on-disk
------------------------------------------------------------
0               | 1000000             | 1000000
------------------------------------------------------------
999             | 1000999             | 1000999
------------------------------------------------------------
1000            | 1001000             | 1001000
------------------------------------------------------------

inode->{i_uid|i_gid} always end up with global kuid/kgid of the caller
in the init_user_ns.

With this patch:
------------------------------------------------------------
user_ns_X uid   | init_user_ns uid    | inode->i_uid on-disk
------------------------------------------------------------
0               | 1000000             | 0
------------------------------------------------------------
999             | 1000999             | 999
------------------------------------------------------------
1000            | 1001000             | 1000
------------------------------------------------------------

inode->{i_uid|i_gid} will have the values of the uid_t and gid_t that
are shown inside the user namespace of the caller.

Of course this works only on mounts that support VFS UID/GID shifts and
are inside a mount namespace that also supports them. The shift into the
on-disk view is done inside notify_change() to give inode_change_ok() a
chance to catch permission errors. At the same time we adapt
inode_change_ok() to translate from virtual to on-disk and vice versa
where needed.

The approach is to always keep inode->{i_uid|i_gid}, even in memory, at
their on-disk values. The virtual translation is only done when needed
for permission checks or stat() calls.

Signed-off-by: Dongsu Park <dongsu@endocode.com>
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
 fs/attr.c  | 44 +++++++++++++++++++++++++++++++++-----------
 fs/inode.c |  4 ++--
 2 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 25b24d0..c476257 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -47,26 +47,38 @@ int inode_change_ok(const struct inode *inode, struct iattr *attr)
 		return 0;
 
 	/* Make sure a caller can chown. */
-	if ((ia_valid & ATTR_UID) &&
-	    (!uid_eq(current_fsuid(), inode->i_uid) ||
-	     !uid_eq(attr->ia_uid, inode->i_uid)) &&
-	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
-		return -EPERM;
+	if (ia_valid & ATTR_UID) {
+		/* Shift to virtual if necessary */
+		kuid_t i_uid = vfs_shift_i_uid_to_virtual(inode);
+
+		if ((!uid_eq(current_fsuid(), i_uid) ||
+		     !uid_eq(attr->ia_uid, inode->i_uid)) &&
+		    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+			return -EPERM;
+	}
 
 	/* Make sure caller can chgrp. */
-	if ((ia_valid & ATTR_GID) &&
-	    (!uid_eq(current_fsuid(), inode->i_uid) ||
-	    (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
-	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+	if (ia_valid & ATTR_GID) {
+		/* Shift to virtual if necessary */
+		kuid_t i_uid = vfs_shift_i_uid_to_virtual(inode);
+		/* Shift it back to virtual if necessary */
+		kgid_t ia_gid = vfs_kgid_disk_to_virtual(inode, attr->ia_gid);
+
+		if ((!uid_eq(current_fsuid(), i_uid) ||
+		     (!in_group_p(ia_gid) &&
+		      !gid_eq(attr->ia_gid, inode->i_gid))) &&
+		    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
 		return -EPERM;
+	}
 
 	/* Make sure a caller can chmod. */
 	if (ia_valid & ATTR_MODE) {
 		if (!inode_owner_or_capable(inode))
 			return -EPERM;
 		/* Also check the setgid bit! */
-		if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
-				inode->i_gid) &&
+		if (!in_group_p((ia_valid & ATTR_GID) ?
+				vfs_kgid_disk_to_virtual(inode, attr->ia_gid) :
+				vfs_shift_i_gid_to_virtual(inode)) &&
 		    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
 			attr->ia_mode &= ~S_ISGID;
 	}
@@ -209,6 +221,16 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 			inode->i_flags &= ~S_NOSEC;
 	}
 
+	/*
+	 * Shift if necessary the UID and GID that are meant to be written
+	 * into inode's uid/gid to on-disk view. Do that as early as
+	 * possible.
+	 */
+	if ((ia_valid & ATTR_UID))
+		attr->ia_uid = vfs_shift_kuid_to_disk(inode, attr->ia_uid);
+	if ((ia_valid & ATTR_GID))
+		attr->ia_gid = vfs_shift_kgid_to_disk(inode, attr->ia_gid);
+
 	now = current_fs_time(inode->i_sb);
 
 	attr->ia_ctime = now;
diff --git a/fs/inode.c b/fs/inode.c
index 07daf5f..e6ee56a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1940,13 +1940,13 @@ EXPORT_SYMBOL(init_special_inode);
 void inode_init_owner(struct inode *inode, const struct inode *dir,
 			umode_t mode)
 {
-	inode->i_uid = current_fsuid();
+	inode->i_uid = vfs_shift_kuid_to_disk(inode, current_fsuid());
 	if (dir && dir->i_mode & S_ISGID) {
 		inode->i_gid = dir->i_gid;
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
 	} else
-		inode->i_gid = current_fsgid();
+		inode->i_gid = vfs_shift_kgid_to_disk(inode, current_fsgid());
 	inode->i_mode = mode;
 }
 EXPORT_SYMBOL(inode_init_owner);
-- 
2.5.5

* [RFC v2 PATCH 7/8] ext4: add support for vfs_shift_uids and vfs_shift_gids mount options
  2016-05-04 14:26 [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Djalal Harouni
                   ` (5 preceding siblings ...)
  2016-05-04 14:26 ` [RFC v2 PATCH 6/8] VFS:userns: shift UID/GID to on-disk view before any write to disk Djalal Harouni
@ 2016-05-04 14:26 ` Djalal Harouni
  2016-05-04 14:26 ` [RFC v2 PATCH 8/8] btrfs: " Djalal Harouni
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Djalal Harouni @ 2016-05-04 14:26 UTC (permalink / raw)
  To: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy
  Cc: Djalal Harouni, Djalal Harouni

Make ext4 able to parse the vfs_shift_uids and vfs_shift_gids options and
set the appropriate flags in the super_block structure.

vfs_shift_uids and vfs_shift_gids allow mounts that show up in a mount
namespace that supports VFS UID/GID shifts on inodes to translate from
the in-kernel virtual view into the on-disk view and vice versa. This
allows users inside user namespaces to access, read and write inodes
on these mounts.
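
As a minimal userspace sketch of what the option handling amounts to
(illustration only; ext4 itself uses its match_table_t tokens as in the
diff below): the hypothetical toy_parse_shift_opts() scans a
comma-separated option string and returns the s_iflags bits it would set.
The TOY_SB_I_* values are assumed to mirror the SB_I_VFS_SHIFT_* flags
from patch 2/8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_SB_I_VFS_SHIFT_UIDS	0x00000004UL
#define TOY_SB_I_VFS_SHIFT_GIDS	0x00000008UL

/* True if the len-byte token at opt is exactly the option called name. */
static bool toy_opt_is(const char *opt, size_t len, const char *name)
{
	return len == strlen(name) && strncmp(opt, name, len) == 0;
}

/* Return the shift-related flag bits enabled by a mount option string. */
unsigned long toy_parse_shift_opts(const char *opts)
{
	unsigned long iflags = 0;

	assert(opts != NULL);

	while (*opts) {
		/* Length of the current token, up to the next comma or NUL. */
		size_t len = strcspn(opts, ",");

		if (toy_opt_is(opts, len, "vfs_shift_uids"))
			iflags |= TOY_SB_I_VFS_SHIFT_UIDS;
		else if (toy_opt_is(opts, len, "vfs_shift_gids"))
			iflags |= TOY_SB_I_VFS_SHIFT_GIDS;

		opts += len;
		if (*opts == ',')
			opts++;
	}
	return iflags;
}
```

Each option is independent, so a mount can enable UID shifts, GID shifts,
both, or neither; unrelated options are ignored by this sketch.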

Signed-off-by: Dongsu Park <dongsu@endocode.com>
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
 fs/ext4/super.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 304c712..c9a20c0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1194,6 +1194,7 @@ enum {
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
 	Opt_max_dir_size_kb, Opt_nojournal_checksum,
+	Opt_vfs_shift_uids, Opt_vfs_shift_gids,
 };
 
 static const match_table_t tokens = {
@@ -1279,6 +1280,8 @@ static const match_table_t tokens = {
 	{Opt_removed, "reservation"},	/* mount option from ext2/3 */
 	{Opt_removed, "noreservation"}, /* mount option from ext2/3 */
 	{Opt_removed, "journal=%u"},	/* mount option from ext2/3 */
+	{Opt_vfs_shift_uids, "vfs_shift_uids"},
+	{Opt_vfs_shift_gids, "vfs_shift_gids"},
 	{Opt_err, NULL},
 };
 
@@ -1521,8 +1524,15 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 	case Opt_nolazytime:
 		sb->s_flags &= ~MS_LAZYTIME;
 		return 1;
+	case Opt_vfs_shift_uids:
+		sb->s_iflags |= SB_I_VFS_SHIFT_UIDS;
+		return 1;
+	case Opt_vfs_shift_gids:
+		sb->s_iflags |= SB_I_VFS_SHIFT_GIDS;
+		return 1;
 	}
 
+
 	for (m = ext4_mount_opts; m->token != Opt_err; m++)
 		if (token == m->token)
 			break;
@@ -1920,6 +1930,10 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
 		SEQ_OPTS_PRINT("max_dir_size_kb=%u", sbi->s_max_dir_size_kb);
 	if (test_opt(sb, DATA_ERR_ABORT))
 		SEQ_OPTS_PUTS("data_err=abort");
+	if (sb->s_iflags & SB_I_VFS_SHIFT_UIDS)
+		SEQ_OPTS_PUTS("vfs_shift_uids");
+	if (sb->s_iflags & SB_I_VFS_SHIFT_GIDS)
+		SEQ_OPTS_PUTS("vfs_shift_gids");
 
 	ext4_show_quota_options(seq, sb);
 	return 0;
-- 
2.5.5
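
The pattern this patch follows (recognize the option name, set a bit in
s_iflags, report the bit again when options are shown) can be modelled in
userspace. The following is only a sketch under stated assumptions: the
demo_* names, the struct demo_sb type and the flag values are hypothetical
stand-ins, and the real SB_I_VFS_SHIFT_UIDS/SB_I_VFS_SHIFT_GIDS values are
defined elsewhere in the series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the superblock flag bits; the real values
 * are defined elsewhere in this series and are not shown in this patch. */
#define DEMO_SB_I_VFS_SHIFT_UIDS 0x1u
#define DEMO_SB_I_VFS_SHIFT_GIDS 0x2u

/* Reduced stand-in for struct super_block: only the internal flags. */
struct demo_sb {
	unsigned int s_iflags;
};

/* Return 1 if the option of length len starting at s equals name. */
static int demo_opt_is(const char *s, size_t len, const char *name)
{
	return len == strlen(name) && strncmp(s, name, len) == 0;
}

/* Parse a comma-separated option string. Returns 0 on success and -1 on
 * the first option this sketch does not know about. */
static int demo_parse_options(struct demo_sb *sb, const char *options)
{
	assert(sb != NULL && options != NULL);

	while (*options) {
		size_t len = strcspn(options, ",");

		if (demo_opt_is(options, len, "vfs_shift_uids"))
			sb->s_iflags |= DEMO_SB_I_VFS_SHIFT_UIDS;
		else if (demo_opt_is(options, len, "vfs_shift_gids"))
			sb->s_iflags |= DEMO_SB_I_VFS_SHIFT_GIDS;
		else if (len != 0)
			return -1;

		options += len;
		if (*options == ',')
			options++;
	}
	return 0;
}

/* Report the active options the way a show_options hook would append them. */
static void demo_show_options(const struct demo_sb *sb, char *out, size_t size)
{
	assert(sb != NULL && out != NULL && size > 0);

	snprintf(out, size, "%s%s",
		 (sb->s_iflags & DEMO_SB_I_VFS_SHIFT_UIDS) ? ",vfs_shift_uids" : "",
		 (sb->s_iflags & DEMO_SB_I_VFS_SHIFT_GIDS) ? ",vfs_shift_gids" : "");
}
```

Calling demo_parse_options() with "vfs_shift_uids,vfs_shift_gids" sets both
bits, and demo_show_options() then reports both names, which mirrors the
round trip through /proc/mounts that the show_options hunk is meant to give.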

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC v2 PATCH 8/8] btrfs: add support for vfs_shift_uids and vfs_shift_gids mount options
  2016-05-04 14:26 [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Djalal Harouni
                   ` (6 preceding siblings ...)
  2016-05-04 14:26 ` [RFC v2 PATCH 7/8] ext4: add support for vfs_shift_uids and vfs_shift_gids mount options Djalal Harouni
@ 2016-05-04 14:26 ` Djalal Harouni
  2016-05-04 16:34 ` [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Josh Triplett
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Djalal Harouni @ 2016-05-04 14:26 UTC (permalink / raw)
  To: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy
  Cc: Djalal Harouni, Djalal Harouni

Make btrfs able to parse vfs_shift_uids and vfs_shift_gids options and
set the appropriate flags into the super_block structure.

vfs_shift_uids and vfs_shift_gids allow mounts that appear in a mount
namespace supporting VFS UID/GID shifts on inodes to translate between
the in-kernel virtual view and the on-disk view. This lets users inside
user namespaces access, read and write inodes on these mounts.

Signed-off-by: Dongsu Park <dongsu@endocode.com>
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
 fs/btrfs/super.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 00b8f37..20e7223 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -304,7 +304,8 @@ enum {
 	Opt_commit_interval, Opt_barrier, Opt_nodefrag, Opt_nodiscard,
 	Opt_noenospc_debug, Opt_noflushoncommit, Opt_acl, Opt_datacow,
 	Opt_datasum, Opt_treelog, Opt_noinode_cache, Opt_usebackuproot,
-	Opt_nologreplay, Opt_norecovery,
+	Opt_nologreplay, Opt_norecovery, Opt_vfs_shift_uids,
+	Opt_vfs_shift_gids,
 #ifdef CONFIG_BTRFS_DEBUG
 	Opt_fragment_data, Opt_fragment_metadata, Opt_fragment_all,
 #endif
@@ -364,6 +365,8 @@ static const match_table_t tokens = {
 	{Opt_rescan_uuid_tree, "rescan_uuid_tree"},
 	{Opt_fatal_errors, "fatal_errors=%s"},
 	{Opt_commit_interval, "commit=%d"},
+	{Opt_vfs_shift_uids, "vfs_shift_uids"},
+	{Opt_vfs_shift_gids, "vfs_shift_gids"},
 #ifdef CONFIG_BTRFS_DEBUG
 	{Opt_fragment_data, "fragment=data"},
 	{Opt_fragment_metadata, "fragment=metadata"},
@@ -786,6 +789,12 @@ int btrfs_parse_options(struct btrfs_root *root, char *options,
 				info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
 			}
 			break;
+		case Opt_vfs_shift_uids:
+			root->fs_info->sb->s_iflags |= SB_I_VFS_SHIFT_UIDS;
+			break;
+		case Opt_vfs_shift_gids:
+			root->fs_info->sb->s_iflags |= SB_I_VFS_SHIFT_GIDS;
+			break;
 #ifdef CONFIG_BTRFS_DEBUG
 		case Opt_fragment_all:
 			btrfs_info(root->fs_info, "fragmenting all space");
@@ -1279,6 +1288,10 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 	if (btrfs_test_opt(root, FRAGMENT_METADATA))
 		seq_puts(seq, ",fragment=metadata");
 #endif
+	if (root->fs_info->sb->s_iflags & SB_I_VFS_SHIFT_UIDS)
+		seq_puts(seq, ",vfs_shift_uids");
+	if (root->fs_info->sb->s_iflags & SB_I_VFS_SHIFT_GIDS)
+		seq_puts(seq, ",vfs_shift_gids");
 	seq_printf(seq, ",subvolid=%llu",
 		  BTRFS_I(d_inode(dentry))->root->root_key.objectid);
 	seq_puts(seq, ",subvol=");
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-04 14:26 [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Djalal Harouni
                   ` (7 preceding siblings ...)
  2016-05-04 14:26 ` [RFC v2 PATCH 8/8] btrfs: " Djalal Harouni
@ 2016-05-04 16:34 ` Josh Triplett
  2016-05-04 21:06 ` James Bottomley
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Josh Triplett @ 2016-05-04 16:34 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy

On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> This is version 2 of the VFS:userns support portable root filesystems
> RFC. Changes since version 1:
> 
> * Update documentation and remove some ambiguity about the feature.
>   Based on Josh Triplett comments.

Thanks for the clarifications.

> 3) The existing user namespace interface is the one used to do the
> translation from virtual to on-disk mapping.

This makes sense.  Even if in the future we had a way to supply an
arbitrary VFS UID/GID mapping for a mount, independent of the userns,
what you've proposed would still make sense as a shorthand for the
common case of using the same mapping for both userns and VFS.

- Josh Triplett

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-04 14:26 [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Djalal Harouni
                   ` (8 preceding siblings ...)
  2016-05-04 16:34 ` [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Josh Triplett
@ 2016-05-04 21:06 ` James Bottomley
  2016-05-05  7:36   ` Djalal Harouni
  2016-05-04 23:30 ` Serge Hallyn
  2016-05-05  0:23 ` Dave Chinner
  11 siblings, 1 reply; 49+ messages in thread
From: James Bottomley @ 2016-05-04 21:06 UTC (permalink / raw)
  To: Djalal Harouni, Alexander Viro, Chris Mason, tytso, Serge Hallyn,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy

On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> This is version 2 of the VFS:userns support portable root filesystems
> RFC. Changes since version 1:
> 
> * Update documentation and remove some ambiguity about the feature.
>   Based on Josh Triplett comments.
> * Use a new email address to send the RFC :-)
> 
> 
> This RFC tries to explore how to support filesystem operations inside
> user namespace using only VFS and a per mount namespace solution. This
> allows to take advantage of user namespace separations without
> introducing any change at the filesystems level. All this is handled
> with the virtual view of mount namespaces.
> 
> 
> 1) Presentation:
> ================
> 
> The main aim is to support portable root filesystems and allow 
> containers, virtual machines and other cases to use the same root 
> filesystem. Due to security reasons, filesystems can't be mounted 
> inside user namespaces, and mounting them outside will not solve the 
> problem since they will show up with the wrong UIDs/GIDs. Read and 
> write operations will also fail and so on.
> 
> The current userspace solution is to automatically chown the whole 
> root filesystem before starting a container, example:
> (host) init_user_ns  1000000:1065536  => (container) user_ns_X1 0:65535
> (host) init_user_ns  2000000:2065536  => (container) user_ns_Y1 0:65535
> (host) init_user_ns  3000000:3065536  => (container) user_ns_Z1 0:65535
> ...
> 
> Every time a chown is called, files are changed and so on... This
> prevents to have portable filesystems where you can throw anywhere
> and boot. Having an extra step to adapt the filesystem to the current
> mapping and persist it will not allow to verify its integrity, it 
> makes snapshots and migration a bit harder, and probably other
> limitations...
> 
> It seems that there are multiple ways to allow user namespaces 
> combine nicely with filesystems, but none of them is that easy. The 
> bind mount and pin the user namespace during mount time will not 
> work, bind mounts share the same super block, hence you may end up
> working on the wrong vfsmount context and there is no easy way to get
> out of that...

So this option was discussed at the recent LSF/MM summit.  The most
supported suggestion was that you'd use a new internal fs type that had
a struct mount with a new superblock and would copy the underlying
inodes but substitute its own with modified ->getattr/->setattr calls
that did the uid shift.  In many ways it would be a remapping bind
which would look similar to overlayfs but be a lot simpler.
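
To make the idea concrete, here is a minimal userspace sketch of a
stacked inode whose getattr calls the lower inode and then shifts the ids.
Everything in it is an assumption for illustration: the demo_* types, the
single fixed shift value and the function-pointer layout are hypothetical
and are not an existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Attributes a getattr call would fill in (ids only, for brevity). */
struct demo_attr {
	uint32_t uid;
	uint32_t gid;
};

/* Hypothetical inode with a getattr hook and an optional lower inode. */
struct demo_inode {
	int (*getattr)(const struct demo_inode *inode, struct demo_attr *attr);
	const struct demo_inode *lower;	/* underlying inode, NULL if none */
	uint32_t uid;			/* ids as stored by the lower fs */
	uint32_t gid;
	uint32_t shift;			/* offset applied by the remapping layer */
};

/* getattr of the underlying filesystem: report the stored ids unchanged. */
static int demo_lower_getattr(const struct demo_inode *inode,
			      struct demo_attr *attr)
{
	attr->uid = inode->uid;
	attr->gid = inode->gid;
	return 0;
}

/* getattr of the remapping layer: ask the lower inode, then shift. */
static int demo_shift_getattr(const struct demo_inode *inode,
			      struct demo_attr *attr)
{
	int err;

	assert(inode->lower != NULL);

	err = inode->lower->getattr(inode->lower, attr);
	if (err)
		return err;

	attr->uid += inode->shift;
	attr->gid += inode->shift;
	return 0;
}
```

A setattr counterpart would subtract the same shift before handing the ids
to the lower inode; it is left out to keep the sketch short.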

> Using the user namespace in the super block seems the way to go, and
> there is the "Support fuse mounts in user namespaces" [1] patches 
> which seem nice but perhaps too complex!?

So I don't think that does what you want.  The fuse project I've used
before to do uid/gid shifts for build containers is bindfs

https://github.com/mpartel/bindfs/

It allows a --map argument where you specify pairs of uids/gids to map
(tedious for large ranges, but the map can be fixed to use uid:range
instead of individual pairs).

>  there is also the overlayfs solution, and finally the VFS layer
> solution.
> 
> We present here a simple VFS solution, everything is packed inside 
> VFS, filesystems don't need to know anything (except probably XFS, 
> and special operations inside union filesystems). Currently it 
> supports ext4, btrfs and overlayfs. Changes into filesystems are 
> small, just parse the vfs_shift_uids and vfs_shift_gids options 
> during mount and set the appropriate flags into the super_block
> structure.

So this looks a little daunting.  It sprays the VFS with knowledge
about the shifts and requires support from every underlying filesystem.
A simple remapping bind filesystem would be a lot simpler and require
no underlying filesystem support.

James

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid
  2016-05-04 14:26 ` [RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid Djalal Harouni
@ 2016-05-04 23:19   ` Serge Hallyn
  2016-05-05 13:05     ` Seth Forshee
  0 siblings, 1 reply; 49+ messages in thread
From: Serge Hallyn @ 2016-05-04 23:19 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Djalal Harouni

Quoting Djalal Harouni (tixxdz@gmail.com):
> If a process gets access to a mount from a different user
> namespace, that process should not be able to take advantage of
> setuid files or selinux entrypoints from that filesystem.  Prevent
> this by treating mounts from other mount namespaces and those not
> owned by current_user_ns() or an ancestor as nosuid.
> 
> This patch was just adapted from the original one that was written
> by Andy Lutomirski <luto@amacapital.net>
> https://www.redhat.com/archives/dm-devel/2016-April/msg00374.html

I'm not sure that this makes sense given what you're doing.  In the
case of Seth's set, a filesystem is mounted specifically (and privately)
in a user namespace.  We don't want for instance the initial user ns
to find a link to a setuid-root exploit left in the container-mounted
filesystem.

But you are having a parent user namespace mount the fs so that its
children can all access the fs, uid-shifted for convenience.  Not
allowing the child namespaces to make use of setuid-root does not
seem applicable here.

> Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
> ---
>  fs/exec.c                      |  2 +-
>  fs/namespace.c                 | 15 +++++++++++++++
>  include/linux/mount.h          |  1 +
>  include/linux/user_namespace.h |  8 ++++++++
>  kernel/user_namespace.c        | 13 +++++++++++++
>  security/commoncap.c           |  2 +-
>  security/selinux/hooks.c       |  2 +-
>  7 files changed, 40 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index c4010b8..706088d 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1391,7 +1391,7 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
>  	bprm->cred->euid = current_euid();
>  	bprm->cred->egid = current_egid();
>  
> -	if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
> +	if (!mnt_may_suid(bprm->file->f_path.mnt))
>  		return;
>  
>  	if (task_no_new_privs(current))
> diff --git a/fs/namespace.c b/fs/namespace.c
> index de02b39..a8820fb 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3374,6 +3374,21 @@ found:
>  	return visible;
>  }
>  
> +bool mnt_may_suid(struct vfsmount *mnt)
> +{
> +	struct mount *m = real_mount(mnt);
> +
> +	/*
> +	 * Foreign mounts (accessed via fchdir or through /proc
> +	 * symlinks) are always treated as if they are nosuid. This
> +	 * prevents namespaces from trusting potentially unsafe
> +	 * suid/sgid bits, file caps, or security labels that originate
> +	 * in other namespaces.
> +	 */
> +	return !(mnt->mnt_flags & MNT_NOSUID) && check_mnt(m) &&
> +		 in_userns(current_user_ns(), m->mnt_ns->user_ns);
> +}
> +
>  static struct ns_common *mntns_get(struct task_struct *task)
>  {
>  	struct ns_common *ns = NULL;
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index f822c3c..54a594d 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -81,6 +81,7 @@ extern void mntput(struct vfsmount *mnt);
>  extern struct vfsmount *mntget(struct vfsmount *mnt);
>  extern struct vfsmount *mnt_clone_internal(struct path *path);
>  extern int __mnt_is_readonly(struct vfsmount *mnt);
> +extern bool mnt_may_suid(struct vfsmount *mnt);
>  
>  struct path;
>  extern struct vfsmount *clone_private_mount(struct path *path);
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 8297e5b..a43faa7 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -72,6 +72,8 @@ extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t,
>  extern ssize_t proc_setgroups_write(struct file *, const char __user *, size_t, loff_t *);
>  extern int proc_setgroups_show(struct seq_file *m, void *v);
>  extern bool userns_may_setgroups(const struct user_namespace *ns);
> +extern bool in_userns(const struct user_namespace *ns,
> +		      const struct user_namespace *target_ns);
>  #else
>  
>  static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
> @@ -100,6 +102,12 @@ static inline bool userns_may_setgroups(const struct user_namespace *ns)
>  {
>  	return true;
>  }
> +
> +static inline bool in_userns(const struct user_namespace *ns,
> +			     const struct user_namespace *target_ns)
> +{
> +	return true;
> +}
>  #endif
>  
>  #endif /* _LINUX_USER_H */
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 9bafc21..9a496a8 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -938,6 +938,20 @@ bool userns_may_setgroups(const struct user_namespace *ns)
>  	return allowed;
>  }
>  
> +/*
> + * Returns true if @ns is the same namespace as or a descendant of
> + * @target_ns.
> + */
> +bool in_userns(const struct user_namespace *ns,
> +	       const struct user_namespace *target_ns)
> +{
> +	for (; ns; ns = ns->parent) {
> +		if (ns == target_ns)
> +			return true;
> +	}
> +	return false;
> +}
> +
>  static inline struct user_namespace *to_user_ns(struct ns_common *ns)
>  {
>  	return container_of(ns, struct user_namespace, ns);
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 48071ed..6c082d2 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -453,7 +453,7 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
>  	if (!file_caps_enabled)
>  		return 0;
>  
> -	if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
> +	if (!mnt_may_suid(bprm->file->f_path.mnt))
>  		return 0;
>  
>  	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 912deee..1350167 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -2234,7 +2234,7 @@ static int check_nnp_nosuid(const struct linux_binprm *bprm,
>  			    const struct task_security_struct *new_tsec)
>  {
>  	int nnp = (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS);
> -	int nosuid = (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID);
> +	int nosuid = !mnt_may_suid(bprm->file->f_path.mnt);
>  	int rc;
>  
>  	if (!nnp && !nosuid)
> -- 
> 2.5.5
> 
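
For reference, the ancestry walk that in_userns() is documented to perform
can be modelled in userspace. This is a minimal sketch, assuming a reduced
demo_userns type that carries only a parent pointer (a hypothetical
stand-in, not the kernel's struct user_namespace). It returns false
explicitly once the parent chain is exhausted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced stand-in for a user namespace: only the parent link. */
struct demo_userns {
	const struct demo_userns *parent;	/* NULL for the initial namespace */
};

/*
 * Returns true if ns is the same namespace as target_ns or one of its
 * descendants, false otherwise.
 */
static bool demo_in_userns(const struct demo_userns *ns,
			   const struct demo_userns *target_ns)
{
	for (; ns; ns = ns->parent) {
		if (ns == target_ns)
			return true;
	}
	return false;
}

/* Small self-check: init -> child; a namespace is "in" itself and in its
 * ancestors, but an ancestor is not "in" its descendant. */
static void demo_in_userns_selfcheck(void)
{
	struct demo_userns init_ns = { NULL };
	struct demo_userns child = { &init_ns };

	assert(demo_in_userns(&child, &child));
	assert(demo_in_userns(&child, &init_ns));
	assert(!demo_in_userns(&init_ns, &child));
}
```

With this relation, a mount owned by a parent namespace passes the check
for a task in a child namespace, while a mount owned by a sibling or a
child namespace does not.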

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-04 14:26 [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Djalal Harouni
                   ` (9 preceding siblings ...)
  2016-05-04 21:06 ` James Bottomley
@ 2016-05-04 23:30 ` Serge Hallyn
  2016-05-06 14:38   ` Djalal Harouni
  2016-05-05  0:23 ` Dave Chinner
  11 siblings, 1 reply; 49+ messages in thread
From: Serge Hallyn @ 2016-05-04 23:30 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy

Quoting Djalal Harouni (tixxdz@gmail.com):
> This is version 2 of the VFS:userns support portable root filesystems
> RFC. Changes since version 1:
> 
> * Update documentation and remove some ambiguity about the feature.
>   Based on Josh Triplett comments.
> * Use a new email address to send the RFC :-)
> 
> 
> This RFC tries to explore how to support filesystem operations inside
> user namespace using only VFS and a per mount namespace solution. This
> allows to take advantage of user namespace separations without
> introducing any change at the filesystems level. All this is handled
> with the virtual view of mount namespaces.

Given your use case, is there any way we could work in some tradeoffs
to protect the host?  What I'm thinking is that containers can all
share devices uid-mapped at will, however any device mounted with
uid shifting cannot be used by the initial user namespace.  Or maybe
just non-executable in that case, as you'll need enough access to
the fs to set up the containers you want to run.

So if /dev/sda1 is your host /, you have to use /dev/sda2 as the
container rootfs source.  Mount it under /containers with uid
shifting.  Now all containers regardless of uid mappings see
the shifted fs contents.  But the host root cannot be tricked by
files on it, as /dev/sda2 is non-executable as far as it is
concerned.

Just a thought.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-04 14:26 [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Djalal Harouni
                   ` (10 preceding siblings ...)
  2016-05-04 23:30 ` Serge Hallyn
@ 2016-05-05  0:23 ` Dave Chinner
  2016-05-05  1:44   ` Andy Lutomirski
  2016-05-05 22:24   ` Djalal Harouni
  11 siblings, 2 replies; 49+ messages in thread
From: Dave Chinner @ 2016-05-05  0:23 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy

On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> This is version 2 of the VFS:userns support portable root filesystems
> RFC. Changes since version 1:
> 
> * Update documentation and remove some ambiguity about the feature.
>   Based on Josh Triplett comments.
> * Use a new email address to send the RFC :-)
> 
> 
> This RFC tries to explore how to support filesystem operations inside
> user namespace using only VFS and a per mount namespace solution. This
> allows to take advantage of user namespace separations without
> introducing any change at the filesystems level. All this is handled
> with the virtual view of mount namespaces.

[...]

> As an example if the mapping 0:65535 inside mount namespace and outside
> is 1000000:1065536, then 0:65535 will be the range that we use to
> construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> data. They represent the persistent values that we want to write to the
> disk. Therefore, we don't keep track of any UID/GID shift that was applied
> before, it gives portability and allows to use the previous mapping
> which was freed for another root filesystem...

So let me get this straight. Two /isolated/ containers, different
UID/GID mappings, sharing the same files and directories. Create a
new file in a writeable directory in container 1, namespace
information gets stripped from on-disk uid/gid representation.

Container 2 then reads that shared directory, finds the file written
by container 1. As there is no namespace component to the uid:gid
stored in the inode, we apply the current namespace shift to the VFS
inode uid/gid and so it maps to root in container 2 and we are
allowed to read it?

Unless I've misunderstood something in this crazy mapping scheme,
isn't this just a vector for unintentional containment breaches?
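
The scenario can be traced with a small userspace model. This is only a
sketch under stated assumptions: a single-extent mapping per container,
65536 ids per extent, and an on-disk value equal to the id as seen inside
the namespace, which is how I read the quoted description. All demo_*
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_INVALID_ID UINT32_MAX

/* One extent: ids [ns_first, ns_first + count) inside the namespace map
 * to [host_first, host_first + count) in the initial namespace. */
struct demo_idmap {
	uint32_t ns_first;
	uint32_t host_first;
	uint32_t count;
};

/* Namespace id -> host id, or DEMO_INVALID_ID if unmapped. */
static uint32_t demo_to_host(const struct demo_idmap *m, uint32_t ns_id)
{
	assert(m != NULL);

	if (ns_id < m->ns_first || ns_id - m->ns_first >= m->count)
		return DEMO_INVALID_ID;
	return m->host_first + (ns_id - m->ns_first);
}

/* Host id -> namespace id, or DEMO_INVALID_ID if unmapped. */
static uint32_t demo_from_host(const struct demo_idmap *m, uint32_t host_id)
{
	assert(m != NULL);

	if (host_id < m->host_first || host_id - m->host_first >= m->count)
		return DEMO_INVALID_ID;
	return m->ns_first + (host_id - m->host_first);
}

/* Write path: the host offset is dropped before the id reaches the disk. */
static uint32_t demo_host_to_disk(const struct demo_idmap *m, uint32_t host_id)
{
	return demo_from_host(m, host_id);
}

/* Read path: the current namespace's offset is applied to the disk id. */
static uint32_t demo_disk_to_host(const struct demo_idmap *m, uint32_t disk_id)
{
	return demo_to_host(m, disk_id);
}
```

With container 1 mapped at 1000000 and container 2 at 2000000, a file
created by root in container 1 is stored with uid 0, and reading it back
through container 2's mapping yields host uid 2000000, which is root in
container 2.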

[...]

> Simple demo overlayfs, and  btrfs mounted with vfs_shift_uids and
> vfs_shift_gids. The overlayfs mounts will share the same upperdir. We
> create two user namespaces, each with its own mapping, where
> container-uid-2000000 will pull changes from container-uid-1000000
> upperdir automatically.

Ok, forget I asked - it's clearly intentional. This is beyond
crazy, IMO.

> 3) ROADMAP:
> ===========
> * Confirm current design, and make sure that the mapping is done
>   correctly.

How are you going to ensure that all filesystems behave the same,
and it doesn't get broken by people who really don't care about this
sort of crazy?

FWIW, having the VFS convert things to "on-disk format" is an
oxymoron - the "V" in VFS means "virtual" and has nothing to do with
disks or persistent storage formats. Indeed, let's convert the UID
to "on-disk" format for a network filesystem client....

.....
> * Add XFS support.

What is the problem here?

Next question: how does this work with uid/gid based quotas?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-05  0:23 ` Dave Chinner
@ 2016-05-05  1:44   ` Andy Lutomirski
  2016-05-05  2:25     ` Dave Chinner
  2016-05-05 22:34     ` Djalal Harouni
  2016-05-05 22:24   ` Djalal Harouni
  1 sibling, 2 replies; 49+ messages in thread
From: Andy Lutomirski @ 2016-05-05  1:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Djalal Harouni, Alexander Viro, Chris Mason, Theodore Ts'o,
	Serge Hallyn, Josh Triplett, Eric W. Biederman, Andy Lutomirski,
	Seth Forshee, Linux FS Devel, linux-kernel, LSM List,
	Dongsu Park, David Herrmann, Miklos Szeredi, Alban Crequy

On Wed, May 4, 2016 at 5:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
>> This is version 2 of the VFS:userns support portable root filesystems
>> RFC. Changes since version 1:
>>
>> * Update documentation and remove some ambiguity about the feature.
>>   Based on Josh Triplett comments.
>> * Use a new email address to send the RFC :-)
>>
>>
>> This RFC tries to explore how to support filesystem operations inside
>> user namespace using only VFS and a per mount namespace solution. This
>> allows to take advantage of user namespace separations without
>> introducing any change at the filesystems level. All this is handled
>> with the virtual view of mount namespaces.
>
> [...]
>
>> As an example if the mapping 0:65535 inside mount namespace and outside
>> is 1000000:1065536, then 0:65535 will be the range that we use to
>> construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
>> data. They represent the persistent values that we want to write to the
>> disk. Therefore, we don't keep track of any UID/GID shift that was applied
>> before, it gives portability and allows to use the previous mapping
>> which was freed for another root filesystem...
>
> So let me get this straight. Two /isolated/ containers, different
> UID/GID mappings, sharing the same files and directories. Create a
> new file in a writeable directory in container 1, namespace
> information gets stripped from on-disk uid/gid representation.

I think the intent is a totally separate superblock for each
container.  Djalal, am I right?

The feature that seems to me to be missing is the ability to squash
uids.  I can imagine desktop distros wanting to mount removable
storage such that everything shows up (to permission checks and
stat()) as the logged-in user's uid but that the filesystem sees 0:0.
That can be done by shifting, but the distro would want everything
else on the filesystem to show up as the logged-in user as well.

That use case could also be handled by adding a way to tell a given
filesystem to completely opt out of normal access control rules and
just let a given user act as root wrt that filesystem (and be nosuid,
of course).  This would be a much greater departure from current
behavior, but would let normal users chown things on a removable
device, which is potentially nice.

--Andy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-05  1:44   ` Andy Lutomirski
@ 2016-05-05  2:25     ` Dave Chinner
  2016-05-05  3:29       ` Andy Lutomirski
  2016-05-05 22:34     ` Djalal Harouni
  1 sibling, 1 reply; 49+ messages in thread
From: Dave Chinner @ 2016-05-05  2:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Djalal Harouni, Alexander Viro, Chris Mason, Theodore Ts'o,
	Serge Hallyn, Josh Triplett, Eric W. Biederman, Andy Lutomirski,
	Seth Forshee, Linux FS Devel, linux-kernel, LSM List,
	Dongsu Park, David Herrmann, Miklos Szeredi, Alban Crequy

On Wed, May 04, 2016 at 06:44:14PM -0700, Andy Lutomirski wrote:
> On Wed, May 4, 2016 at 5:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> >> This is version 2 of the VFS:userns support portable root filesystems
> >> RFC. Changes since version 1:
> >>
> >> * Update documentation and remove some ambiguity about the feature.
> >>   Based on Josh Triplett comments.
> >> * Use a new email address to send the RFC :-)
> >>
> >>
> >> This RFC tries to explore how to support filesystem operations inside
> >> user namespace using only VFS and a per mount namespace solution. This
> >> allows to take advantage of user namespace separations without
> >> introducing any change at the filesystems level. All this is handled
> >> with the virtual view of mount namespaces.
> >
> > [...]
> >
> >> As an example if the mapping 0:65535 inside mount namespace and outside
> >> is 1000000:1065536, then 0:65535 will be the range that we use to
> >> construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> >> data. They represent the persistent values that we want to write to the
> >> disk. Therefore, we don't keep track of any UID/GID shift that was applied
> >> before, it gives portability and allows to use the previous mapping
> >> which was freed for another root filesystem...
> >
> > So let me get this straight. Two /isolated/ containers, different
> > UID/GID mappings, sharing the same files and directories. Create a
> > new file in a writeable directory in container 1, namespace
> > information gets stripped from on-disk uid/gid representation.
> 
> I think the intent is a totally separate superblock for each
> container.  Djalal, am I right?

I'm pretty sure you can't have multiple superblocks point to the
same backing device. Each superblock would then think it's the sole
owner of the filesystem and all we get out of that is incoherent
caching and a corrupt on-disk filesystem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-05  2:25     ` Dave Chinner
@ 2016-05-05  3:29       ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2016-05-05  3:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: LSM List, Serge Hallyn, Djalal Harouni, Alban Crequy,
	Chris Mason, Dongsu Park, Theodore Ts'o, Eric W. Biederman,
	Alexander Viro, Miklos Szeredi, Josh Triplett, David Herrmann,
	Linux FS Devel, linux-kernel, Seth Forshee

On May 4, 2016 7:25 PM, "Dave Chinner" <david@fromorbit.com> wrote:
>
> On Wed, May 04, 2016 at 06:44:14PM -0700, Andy Lutomirski wrote:
> > On Wed, May 4, 2016 at 5:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> > >> This is version 2 of the VFS:userns support portable root filesystems
> > >> RFC. Changes since version 1:
> > >>
> > >> * Update documentation and remove some ambiguity about the feature.
> > >>   Based on Josh Triplett comments.
> > >> * Use a new email address to send the RFC :-)
> > >>
> > >>
> > >> This RFC tries to explore how to support filesystem operations inside
> > >> user namespace using only VFS and a per mount namespace solution. This
> > >> allows to take advantage of user namespace separations without
> > >> introducing any change at the filesystems level. All this is handled
> > >> with the virtual view of mount namespaces.
> > >
> > > [...]
> > >
> > >> As an example if the mapping 0:65535 inside mount namespace and outside
> > >> is 1000000:1065536, then 0:65535 will be the range that we use to
> > >> construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> > >> data. They represent the persistent values that we want to write to the
> > >> disk. Therefore, we don't keep track of any UID/GID shift that was applied
> > >> before, it gives portability and allows to use the previous mapping
> > >> which was freed for another root filesystem...
> > >
> > > So let me get this straight. Two /isolated/ containers, different
> > > UID/GID mappings, sharing the same files and directories. Create a
> > > new file in a writeable directory in container 1, namespace
> > > information gets stripped from on-disk uid/gid representation.
> >
> > I think the intent is a totally separate superblock for each
> > container.  Djalal, am I right?
>
> I'm pretty sure you can't have multiple superblocks point to the
> same backing device. Each superblock would then think it's the sole
> owner of the filesystem and all we get out of that is incoherent
> caching and a corrupt on-disk filesystem.

I meant separate backing stores, too.

--Andy

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com


* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-04 21:06 ` James Bottomley
@ 2016-05-05  7:36   ` Djalal Harouni
  2016-05-05 11:56     ` James Bottomley
  0 siblings, 1 reply; 49+ messages in thread
From: Djalal Harouni @ 2016-05-05  7:36 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy

On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley wrote:
> On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> > This is version 2 of the VFS:userns support portable root filesystems
> > RFC. Changes since version 1:
> > 
> > * Update documentation and remove some ambiguity about the feature.
> >   Based on Josh Triplett comments.
> > * Use a new email address to send the RFC :-)
> > 
> > 
> > This RFC tries to explore how to support filesystem operations inside
> > user namespace using only VFS and a per mount namespace solution.
> > This
> > allows to take advantage of user namespace separations without
> > introducing any change at the filesystems level. All this is handled
> > with the virtual view of mount namespaces.
> > 
> > 
> > 1) Presentation:
> > ================
> > 
> > The main aim is to support portable root filesystems and allow 
> > containers, virtual machines and other cases to use the same root 
> > filesystem. Due to security reasons, filesystems can't be mounted 
> > inside user namespaces, and mounting them outside will not solve the 
> > problem since they will show up with the wrong UIDs/GIDs. Read and 
> > write operations will also fail and so on.
> > 
> > The current userspace solution is to automatically chown the whole 
> > root filesystem before starting a container, example:
> > (host) init_user_ns  1000000:1065536  => (container) user_ns_X1
> > 0:65535
> > (host) init_user_ns  2000000:2065536  => (container) user_ns_Y1
> > 0:65535
> > (host) init_user_ns  3000000:3065536  => (container) user_ns_Z1
> > 0:65535
> > ...
> > 
> > Every time a chown is called, files are changed and so on... This
> > prevents to have portable filesystems where you can throw anywhere
> > and boot. Having an extra step to adapt the filesystem to the current
> > mapping and persist it will not allow to verify its integrity, it 
> > makes snapshots and migration a bit harder, and probably other
> > limitations...
> > 
> > It seems that there are multiple ways to allow user namespaces 
> > combine nicely with filesystems, but none of them is that easy. The 
> > bind mount and pin the user namespace during mount time will not 
> > work, bind mounts share the same super block, hence you may end up 
> > working on the wrong vfsmount context and there is no easy way to get
> > out of that...
> 
> So this option was discussed at the recent LSF/MM summit.  The most
> supported suggestion was that you'd use a new internal fs type that had
> a struct mount with a new superblock and would copy the underlying
> inodes but substitute its own with modified ->getattr/->setattr calls
> that did the uid shift.  In many ways it would be a remapping bind
> which would look similar to overlayfs but be a lot simpler.

Hmm, it's not only about ->getattr and ->setattr, you have all the other
file system operations that need access too... which brings two points:

1) This new internal fs may end up doing what this RFC does...

2) or, quoting "new internal fs + its own super block + copy underlying
inodes...", it looks like another overlayfs where you also need to decide
what to copy, etc. So, will this really be that light compared to the
current overlayfs? Not to mention that you need to hook up basically the
same logic, or something else, inside overlayfs.

> > Using the user namespace in the super block seems the way to go, and
> > there is the "Support fuse mounts in user namespaces" [1] patches 
> > which seem nice but perhaps too complex!?
> 
> So I don't think that does what you want.  The fuse project I've used
> before to do uid/gid shifts for build containers is bindfs
> 
> https://github.com/mpartel/bindfs/
> 
> It allows a --map argument where you specify pairs of uids/gids to map
> (tedious for large ranges, but the map can be fixed to use uid:range
> instead of individual).

Ok, thanks for the link, I will try to take a deep look, but bindfs seems
really big!

> >  there is also the overlayfs solution, and finally the VFS layer 
> > solution.
> > 
> > We present here a simple VFS solution, everything is packed inside 
> > VFS, filesystems don't need to know anything (except probably XFS, 
> > and special operations inside union filesystems). Currently it 
> > supports ext4, btrfs and overlayfs. Changes into filesystems are 
> > small, just parse the vfs_shift_uids and vfs_shift_gids options 
> > during mount and set the appropriate flags into the super_block
> > structure.
> 
> So this looks a little daunting.  It sprays the VFS with knowledge
> about the shifts and requires support from every underlying filesystem.

Well, from my angle, shifts are just user namespace mappings which
follow certain rules, and currently the VFS and all filesystems are
*already* doing some kind of shifting... This RFC uses mount namespaces,
which are the standard way to deal with mounts. The mapping inside the
mount namespace can just as well be "inside: 0:1000" => "outside: 0:1000"
and the current implementation will simply use it, so I'm not sure this
mapping qualifies to be named "shift". I think some folks here came up
with the "shift" name to describe one of the use cases from a user
interface perspective, that's all... maybe I should do
s/vfs_shift_*/vfs_remap_*/ ?
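
To make the "shift is just a mapping" point concrete, here is a minimal
userspace sketch. It is not code from this patch set: the struct and the
helper names (id_extent, ns_to_host, host_to_ns) are made up for
illustration, and the extent layout only mirrors the "inside outside
count" triple of a /proc/<pid>/uid_map line. An identity map and a
shifted map go through exactly the same lookup.

```c
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* One extent of an ID map: "inside_first outside_first count". */
struct id_extent {
	uint32_t inside_first;   /* first ID as seen inside the namespace */
	uint32_t outside_first;  /* first ID as seen outside (host side)  */
	uint32_t count;          /* number of consecutive IDs covered     */
};

/* Translate an ID seen inside the namespace to the outside view. */
static uint32_t ns_to_host(const struct id_extent *map, size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++) {
		if (id >= map[i].inside_first &&
		    id - map[i].inside_first < map[i].count)
			return map[i].outside_first + (id - map[i].inside_first);
	}
	return INVALID_ID; /* no extent covers this ID */
}

/* Translate an outside (host) ID back to the inside view. */
static uint32_t host_to_ns(const struct id_extent *map, size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++) {
		if (id >= map[i].outside_first &&
		    id - map[i].outside_first < map[i].count)
			return map[i].inside_first + (id - map[i].outside_first);
	}
	return INVALID_ID;
}
```

With the single extent { 0, 1000000, 65536 }, inside ID 0 becomes 1000000
outside; with { 0, 0, 1000 } the lookup returns the ID unchanged, which is
why "remap" may describe the operation better than "shift".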


>  A simple remapping bind filesystem would be a lot simpler and require
> no underlying filesystem support.

Yes probably, you still need to parse parameters but not at the
filesystem level, and sure this RFC can do the same of course, but maybe
it's not safe to shift/remap filesystems and their inodes on behalf of
filesystems... and what about virtual filesystems which can share inodes?


> James
> 

Thank you!

-- 
Djalal Harouni
http://opendz.org


* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-05  7:36   ` Djalal Harouni
@ 2016-05-05 11:56     ` James Bottomley
  2016-05-05 21:49       ` Djalal Harouni
  0 siblings, 1 reply; 49+ messages in thread
From: James Bottomley @ 2016-05-05 11:56 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy

On Thu, 2016-05-05 at 08:36 +0100, Djalal Harouni wrote:
> On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley wrote:
> > On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> > > This is version 2 of the VFS:userns support portable root
> > > filesystems
> > > RFC. Changes since version 1:
> > > 
> > > * Update documentation and remove some ambiguity about the
> > > feature.
> > >   Based on Josh Triplett comments.
> > > * Use a new email address to send the RFC :-)
> > > 
> > > 
> > > This RFC tries to explore how to support filesystem operations 
> > > inside user namespace using only VFS and a per mount namespace 
> > > solution. This allows to take advantage of user namespace 
> > > separations without introducing any change at the filesystems 
> > > level. All this is handled with the virtual view of mount
> > > namespaces.
> > > 
> > > 
> > > 1) Presentation:
> > > ================
> > > 
> > > The main aim is to support portable root filesystems and allow 
> > > containers, virtual machines and other cases to use the same root
> > > filesystem. Due to security reasons, filesystems can't be mounted
> > > inside user namespaces, and mounting them outside will not solve 
> > > the problem since they will show up with the wrong UIDs/GIDs. 
> > > Read and write operations will also fail and so on.
> > > 
> > > The current userspace solution is to automatically chown the 
> > > whole root filesystem before starting a container, example:
> > > (host) init_user_ns  1000000:1065536  => (container) user_ns_X1
> > > 0:65535
> > > (host) init_user_ns  2000000:2065536  => (container) user_ns_Y1
> > > 0:65535
> > > (host) init_user_ns  3000000:3065536  => (container) user_ns_Z1
> > > 0:65535
> > > ...
> > > 
> > > Every time a chown is called, files are changed and so on... This
> > > prevents to have portable filesystems where you can throw 
> > > anywhere and boot. Having an extra step to adapt the filesystem 
> > > to the current mapping and persist it will not allow to verify 
> > > its integrity, it makes snapshots and migration a bit harder, and 
> > > probably other limitations...
> > > 
> > > It seems that there are multiple ways to allow user namespaces 
> > > combine nicely with filesystems, but none of them is that easy. 
> > > The bind mount and pin the user namespace during mount time will 
> > > not work, bind mounts share the same super block, hence you may 
> > > end up working on the wrong vfsmount context and there is no easy 
> > > way to get out of that...
> > 
> > So this option was discussed at the recent LSF/MM summit.  The most
> > supported suggestion was that you'd use a new internal fs type that 
> > had a struct mount with a new superblock and would copy the 
> > underlying inodes but substitute its own with modified ->getattr/
> > ->setattr calls that did the uid shift.  In many ways it would be a 
> > remapping bind which would look similar to overlayfs but be a lot
> > simpler.
> 
> Hmm, it's not only about ->getattr and ->setattr, you have all the 
> other file system operations that need access too...

Why?  Or perhaps we should more cogently define the actual problem.  My
problem is simply mounting image volumes that were created with real
uids at user namespace shifted uids because I'm downshifting the
privileged ids in the container.  I actually *only* need the uid/gids
on the attributes shifted because that's what I need to manipulate the
volumes.  I actually think that other operations, like the file ioctl
ones should, for security reasons, not be uid shifted.  For instance
with xfs you could set the panic mask and error tags and bring down the
whole host.  What extra things do you need access to and why?

>  which brings two points:
> 
> 1) This new internal fs may end up doing what this RFC does...

Well that was why I brought it up, yes.

> 2) or by quoting "new internal fs + its own super block + copy
> underlying inodes..." it seems like another overlayfs where you also
> need some decisions to copy what, etc. So, will this be really
> that light compared to current overlayfs ? not to mention that you 
> need to hook up basically the same logic or something else inside
> overlayfs..

OK, so forget overlayfs, perhaps that was a bad example.  It's like a
uid shifting bind.  Unlike bind, it uses shadow inodes, because you have
to intercept the operations, so it's not a simple subtree operation, but
there's no file copying.  The shadow points to the real inode.
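
As a rough illustration of the shadow idea (a userspace model only, not
kernel code; the struct and function names shadow_inode, shadow_getattr
and shadow_setattr are hypothetical, and a single additive shift stands in
for whatever mapping the mount would carry): the shadow owns no data, it
holds a pointer to the real inode and translates ownership on the way in
and out.

```c
#include <stdint.h>

/* Stand-in for the underlying filesystem's inode. */
struct real_inode {
	uint32_t uid;
	uint32_t gid;
	uint64_t size;
};

/* Shadow inode: no file data of its own, only a pointer plus the shift. */
struct shadow_inode {
	struct real_inode *real;
	uint32_t shift;
};

struct attrs {
	uint32_t uid;
	uint32_t gid;
	uint64_t size;
};

/* Report attributes through the shadow: owner and group are shifted. */
static struct attrs shadow_getattr(const struct shadow_inode *s)
{
	struct attrs a = {
		s->real->uid + s->shift,
		s->real->gid + s->shift,
		s->real->size,
	};
	return a;
}

/*
 * chown through the shadow: undo the shift before storing, so the real
 * inode keeps unshifted values. Returns 0 on success, -1 if the
 * requested IDs fall below the shifted range.
 */
static int shadow_setattr(struct shadow_inode *s, uint32_t uid, uint32_t gid)
{
	if (uid < s->shift || gid < s->shift)
		return -1;
	s->real->uid = uid - s->shift;
	s->real->gid = gid - s->shift;
	return 0;
}
```

Two shadows with different shifts can point at the same real inode; each
sees its own uid/gid view while the size and the stored ownership are
shared, which is the "no file copying" property.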

> > > Using the user namespace in the super block seems the way to go, 
> > > and there is the "Support fuse mounts in user namespaces" [1] 
> > > patches which seem nice but perhaps too complex!?
> > 
> > So I don't think that does what you want.  The fuse project I've 
> > used before to do uid/gid shifts for build containers is bindfs
> > 
> > https://github.com/mpartel/bindfs/
> > 
> > It allows a --map argument where you specify pairs of uids/gids to 
> > map (tedious for large ranges, but the map can be fixed to use
> > uid:range instead of individual).
> 
> Ok, thanks for the link, will try to take a deep look but bindfs seem
> really big!

Well, it does a lot more than just uid/gid shift.

> > >  there is also the overlayfs solution, and finally the VFS layer 
> > > solution.
> > > 
> > > We present here a simple VFS solution, everything is packed 
> > > inside VFS, filesystems don't need to know anything (except 
> > > probably XFS, and special operations inside union filesystems). 
> > > Currently it supports ext4, btrfs and overlayfs. Changes into 
> > > filesystems are small, just parse the vfs_shift_uids and 
> > > vfs_shift_gids options during mount and set the appropriate flags 
> > > into the super_block structure.
> > 
> > So this looks a little daunting.  It sprays the VFS with knowledge
> > about the shifts and requires support from every underlying 
> > filesystem. 

> Well, from my angle, shifts are just user namespace mappings which
> follow certain rules, and currently VFS and all filesystems are 
> *already* doing some kind of shifting... This RFC uses mount 
> namespaces which are the standard way to deal with mounts, now the 
> mapping inside mount namespace can just be "inside: 0:1000" => 
> "outside: 0:1000" and current implementation will just use it, at the 
> same time I'm not sure if this mapping qualifies to be named "shift". 
> I think that some folks here came up with the "shift" name to 
> describe one of the use cases from a user interface that's it... 
> maybe I should do s/vfs_shift_*/vfs_remap_*/ ?

I don't think the naming is the issue ... it's the spread inside the
vfs code (and in the underlying fs code).  The vfs is very well
layered, so touching all that code makes it look like there's a
layering problem with the patch.  Touching the underlying fs code looks
even more problematic, but that may be necessary if you have a reason
for wanting the file ioctls, because they're pass through and usually
where the from_kuid() calls are in filesystems.

> >  A simple remapping bind filesystem would be a lot simpler and 
> > require no underlying filesystem support. 
>  
> Yes probably, you still need to parse parameters but not at the
> filesystem level,

They'd just be mount options.  Basically instead of mount --bind source
target, you'd do mount -t uidshift -o <shift options> source target.

>  and sure this RFC can do the same of course, but maybe it's not safe
> to shift/remap filesystems and their inodes on behalf of
> filesystems... and virtual filesystems which can share inodes ?

That depends who you allow to do the shift.  Each fstype in the kernel
decides access to mount.  For the uidshift, I was planning to allow
only a capable admin in the initial namespace, meaning that only the
admin in the host could set up the shifts.  As long as the shifted
filesystem is present, the container can then bind it wherever it wants
in its mount namespace.

James


* Re: [RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid
  2016-05-04 23:19   ` Serge Hallyn
@ 2016-05-05 13:05     ` Seth Forshee
  2016-05-05 22:40       ` Djalal Harouni
  0 siblings, 1 reply; 49+ messages in thread
From: Seth Forshee @ 2016-05-05 13:05 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Djalal Harouni, Alexander Viro, Chris Mason, tytso, Serge Hallyn,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Djalal Harouni

On Wed, May 04, 2016 at 11:19:04PM +0000, Serge Hallyn wrote:
> Quoting Djalal Harouni (tixxdz@gmail.com):
> > If a process gets access to a mount from a different user
> > namespace, that process should not be able to take advantage of
> > setuid files or selinux entrypoints from that filesystem.  Prevent
> > this by treating mounts from other mount namespaces and those not
> > owned by current_user_ns() or an ancestor as nosuid.
> > 
> > This patch was just adapted from the original one that was written
> > by Andy Lutomirski <luto@amacapital.net>
> > https://www.redhat.com/archives/dm-devel/2016-April/msg00374.html
> 
> I'm not sure that this makes sense given what you're doing.  In the
> case of Seth's set, a filesystem is mounted specifically (and privately)
> in a user namespace.  We don't want for instance the initial user ns
> to find a link to a setuid-root exploit left in the container-mounted
> filesystem.
> 
> But you are having a parent user namespace mount the fs so that its
> children can all access the fs, uid-shifted for convenience.  Not
> allowing the child namespaces to make use of setuid-root does not
> seem applicable here.

Right, the problem addressed by this patch probably isn't relevant to
this sort of uid shifting.

But I think there's another problem that needs to be addressed.
bprm_fill_uid() still gets the ids for sxid files unshifted from the
inode. We already protect against sxid to any user not in
bprm->cred->user_ns, so it will just ignore the sxid instead of e.g.
suid as global root from the id shifted mount, which is good. What would
be wanted though is to use the shifted ids so that something like
suid-root ping in the container rootfs would work.
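
A simplified userspace model of that behaviour, for illustration only: it
is not the kernel's bprm_fill_uid(), the names (user_ns_model,
exec_euid, mount_shift) are hypothetical, and the namespace is reduced to
a single extent. It assumes the rule described above, that the setuid bit
is honoured only when the file owner has a mapping in the caller's user
namespace.

```c
#include <stdbool.h>
#include <stdint.h>

/* Single-extent user namespace: ns IDs 0..count-1 map to host_first.. */
struct user_ns_model {
	uint32_t host_first;
	uint32_t count;
};

/* True if the host-side UID is mapped into the namespace. */
static bool ns_has_host_uid(const struct user_ns_model *ns, uint32_t host_uid)
{
	return host_uid >= ns->host_first &&
	       host_uid - ns->host_first < ns->count;
}

/*
 * Effective host UID after exec'ing a file.
 * inode_uid is the on-disk owner; mount_shift is the shift applied by
 * the mount (0 models today's behaviour, where the ID is read unshifted).
 * A setuid bit whose owner is not mapped in the caller's namespace is
 * ignored and the caller keeps its UID.
 */
static uint32_t exec_euid(const struct user_ns_model *ns,
			  uint32_t caller_host_uid, uint32_t inode_uid,
			  bool setuid_bit, uint32_t mount_shift)
{
	uint32_t owner = inode_uid + mount_shift;

	if (setuid_bit && ns_has_host_uid(ns, owner))
		return owner;
	return caller_host_uid;
}
```

For a container mapped at 1000000 and a setuid binary owned by on-disk
UID 0, mount_shift = 0 leaves the caller's UID untouched (the safe case),
while mount_shift = 1000000 yields host UID 1000000, i.e. root inside the
container, which is what suid-root ping would need.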

Seth


* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-05 11:56     ` James Bottomley
@ 2016-05-05 21:49       ` Djalal Harouni
  2016-05-05 22:08         ` James Bottomley
  0 siblings, 1 reply; 49+ messages in thread
From: Djalal Harouni @ 2016-05-05 21:49 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy

On Thu, May 05, 2016 at 07:56:28AM -0400, James Bottomley wrote:
> On Thu, 2016-05-05 at 08:36 +0100, Djalal Harouni wrote:
> > On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley wrote:
> > > On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> > > > This is version 2 of the VFS:userns support portable root
> > > > filesystems
> > > > RFC. Changes since version 1:
> > > > 
> > > > * Update documentation and remove some ambiguity about the
> > > > feature.
> > > >   Based on Josh Triplett comments.
> > > > * Use a new email address to send the RFC :-)
> > > > 
> > > > 
> > > > This RFC tries to explore how to support filesystem operations 
> > > > inside user namespace using only VFS and a per mount namespace 
> > > > solution. This allows to take advantage of user namespace 
> > > > separations without introducing any change at the filesystems 
> > > > level. All this is handled with the virtual view of mount
> > > > namespaces.
> > > > 
> > > > 
> > > > 1) Presentation:
> > > > ================
> > > > 
> > > > The main aim is to support portable root filesystems and allow 
> > > > containers, virtual machines and other cases to use the same root
> > > > filesystem. Due to security reasons, filesystems can't be mounted
> > > > inside user namespaces, and mounting them outside will not solve 
> > > > the problem since they will show up with the wrong UIDs/GIDs. 
> > > > Read and write operations will also fail and so on.
> > > > 
> > > > The current userspace solution is to automatically chown the 
> > > > whole root filesystem before starting a container, example:
> > > > (host) init_user_ns  1000000:1065536  => (container) user_ns_X1
> > > > 0:65535
> > > > (host) init_user_ns  2000000:2065536  => (container) user_ns_Y1
> > > > 0:65535
> > > > (host) init_user_ns  3000000:3065536  => (container) user_ns_Z1
> > > > 0:65535
> > > > ...
> > > > 
> > > > Every time a chown is called, files are changed and so on... This
> > > > prevents to have portable filesystems where you can throw 
> > > > anywhere and boot. Having an extra step to adapt the filesystem 
> > > > to the current mapping and persist it will not allow to verify 
> > > > its integrity, it makes snapshots and migration a bit harder, and 
> > > > probably other limitations...
> > > > 
> > > > It seems that there are multiple ways to allow user namespaces 
> > > > combine nicely with filesystems, but none of them is that easy. 
> > > > The bind mount and pin the user namespace during mount time will 
> > > > not work, bind mounts share the same super block, hence you may 
> > > > end up working on the wrong vfsmount context and there is no easy 
> > > > way to get out of that...
> > > 
> > > So this option was discussed at the recent LSF/MM summit.  The most
> > > supported suggestion was that you'd use a new internal fs type that 
> > > had a struct mount with a new superblock and would copy the 
> > > underlying inodes but substitute its own with modified ->getattr/
> > > ->setattr calls that did the uid shift.  In many ways it would be a 
> > > remapping bind which would look similar to overlayfs but be a lot
> > > simpler.
> > 
> > Hmm, it's not only about ->getattr and ->setattr, you have all the 
> > other file system operations that need access too...
> 
> Why?  Or perhaps we should more cogently define the actual problem.  My
> problem is simply mounting image volumes that were created with real
> uids at user namespace shifted uids because I'm downshifting the
> privileged ids in the container.  I actually *only* need the uid/gids
> on the attributes shifted because that's what I need to manipulate the

We obviously need them for read/write/creation... ?! We also want to
handle stock filesystems that were never edited, without depending on any
module or third party solution, by mounting them outside user namespaces
and accessing them inside.

> volumes.  I actually think that other operations, like the file ioctl
> ones should, for security reasons, not be uid shifted.  For instance
> with xfs you could set the panic mask and error tags and bring down the
> whole host.  What extra things do you need access to and why?

That's precisely why I said that if the mount options are handled not
*inside* the filesystems, which means behind their backs and on behalf of
container managers etc., then you are exposed to such scenarios... Some
virtual filesystems can also be mounted by unprivileged users; how will
you deal with something like a bind mount on them ?


> >  which brings two points:
> > 
> > 1) This new internal fs may end up doing what this RFC does...
> 
> Well that was why I brought it up, yes.

Yes, but *with* extra code! That was my point. I'm not sure we need to
bother with any *new* internal fs type, nor to hack around dir and file
operations... and that still has to be shown, defined, coded ... ?


> > 2) or by quoting "new internal fs + its own super block + copy
> > underlying inodes..." it seems like another overlayfs where you also
> > need some decisions to copy what, etc. So, will this be really
> > that light compared to current overlayfs ? not to mention that you 
> > need to hook up basically the same logic or something else inside
> > overlayfs..
> 
> OK, so forget overlayfs, perhaps that was a bad example.  It's like a
> uid shifting bind.  The way it works is to use shadow inodes (unlike
> bind, but because you have to intercept the operations, so it's not a
> simple subtree operation) but there's no file copying.  The shadow
> points to the real inode.

For that you need a super block struct for every mount... now if you
also need a new internal fs + super block + shadow inodes... it
seems like you are going in the overlayfs direction...

I'm taking overlayfs as an example here, because it's just nice and
really dead simple!

At the same time I'm not at all sure about what you are describing,
and how you will deal with the current mount and bind mount trees and all
the internals...



> > > > Using the user namespace in the super block seems the way to go, 
> > > > and there is the "Support fuse mounts in user namespaces" [1] 
> > > > patches which seem nice but perhaps too complex!?
> > > 
> > > So I don't think that does what you want.  The fuse project I've 
> > > used before to do uid/gid shifts for build containers is bindfs
> > > 
> > > https://github.com/mpartel/bindfs/
> > > 
> > > It allows a --map argument where you specify pairs of uids/gids to 
> > > map (tedious for large ranges, but the map can be fixed to use
> > > uid:range instead of individual).
> > 
> > Ok, thanks for the link, will try to take a deep look but bindfs seem
> > really big!
> 
> Well, it does a lot more than just uid/gid shift.
> 
> > > >  there is also the overlayfs solution, and finally the VFS layer 
> > > > solution.
> > > > 
> > > > We present here a simple VFS solution, everything is packed 
> > > > inside VFS, filesystems don't need to know anything (except 
> > > > probably XFS, and special operations inside union filesystems). 
> > > > Currently it supports ext4, btrfs and overlayfs. Changes into 
> > > > filesystems are small, just parse the vfs_shift_uids and 
> > > > vfs_shift_gids options during mount and set the appropriate flags 
> > > > into the super_block structure.
> > > 
> > > So this looks a little daunting.  It sprays the VFS with knowledge
> > > about the shifts and requires support from every underlying 
> > > filesystem. 
> 
> > Well, from my angle, shifts are just user namespace mappings which
> > follow certain rules, and currently VFS and all filesystems are 
> > *already* doing some kind of shifting... This RFC uses mount 
> > namespaces which are the standard way to deal with mounts, now the 
> > mapping inside mount namespace can just be "inside: 0:1000" => 
> > "outside: 0:1000" and current implementation will just use it, at the 
> > same time I'm not sure if this mapping qualifies to be named "shift". 
> > I think that some folks here came up with the "shift" name to 
> > describe one of the use cases from a user interface that's it... 
> > maybe I should do s/vfs_shift_*/vfs_remap_*/ ?
> 
> I don't think the naming is the issue ... it's the spread inside the
> vfs code (and in the underlying fs code).  The vfs is very well

Currently the underlying file systems just parse vfs_shift_uids and
vfs_shift_gids.


> layered, so touching all that code makes it look like there's a
> layering problem with the patch.  Touching the underlying fs code looks

Hmm, not sure I follow here ? We make use of the mount namespace which
is part of the whole layer. Actually it's the *standard* way to control
mounts. What do you mean here please ?


> even more problematic, but that may be necessary if you have a reason
> for wanting the file ioctls, because they're pass through and usually
> where the from_kuid() calls are in filesystems.

Hmm sorry, I'm not sure I'm following you here ?


> > >  A simple remapping bind filesystem would be a lot simpler and 
> > > require no underlying filesystem support. 
> >  
> > Yes probably, you still need to parse parameters but not at the
> > filesystem level,
> 
> They'd just be mount options.  Basically instead of mount --bind source
> target, you'd do mount -t uidshift -o <shift options> source target.
> 
> >  and sure this RFC can do the same of course, but maybe it's not safe
> > to shift/remap filesystems and their inodes on behalf of
> > filesystems... and virtual filesystems which can share inodes ?
> 
> That depends who you allow to do the shift.  Each fstype in the kernel
> decides access to mount.  For the uidshift, I was planning to allow
> only a capable admin in the initial namespace, meaning that only the
> admin in the host could set up the shifts.  As long as the shifted
> filesystem is present, the container can then bind it wherever it wants
> in its mount namespace.

Ah I see, admin in the initial namespace. Yes, that sounds reasonable for
security reasons. And how will you be able to achieve the user namespace
shift ?


> James
>

Thank you!


-- 
Djalal Harouni
http://opendz.org


* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-05 21:49       ` Djalal Harouni
@ 2016-05-05 22:08         ` James Bottomley
  2016-05-10 23:36           ` James Bottomley
  0 siblings, 1 reply; 49+ messages in thread
From: James Bottomley @ 2016-05-05 22:08 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy

On Thu, 2016-05-05 at 22:49 +0100, Djalal Harouni wrote:
> On Thu, May 05, 2016 at 07:56:28AM -0400, James Bottomley wrote:
> > On Thu, 2016-05-05 at 08:36 +0100, Djalal Harouni wrote:
> > > On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley wrote:
> > > > On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> > > > > This is version 2 of the VFS:userns support portable root
> > > > > filesystems
> > > > > RFC. Changes since version 1:
> > > > > 
> > > > > * Update documentation and remove some ambiguity about the
> > > > > feature.   Based on Josh Triplett comments.
> > > > > * Use a new email address to send the RFC :-)
> > > > > 
> > > > > 
> > > > > This RFC tries to explore how to support filesystem 
> > > > > operations inside user namespace using only VFS and a per 
> > > > > mount namespace solution. This allows to take advantage of 
> > > > > user namespace separations without introducing any change at 
> > > > > the filesystems level. All this is handled with the virtual 
> > > > > view of mount namespaces.
> > > > > 
> > > > > 
> > > > > 1) Presentation:
> > > > > ================
> > > > > 
> > > > > The main aim is to support portable root filesystems and 
> > > > > allow containers, virtual machines and other cases to use the 
> > > > > same root filesystem. Due to security reasons, filesystems 
> > > > > can't be mounted inside user namespaces, and mounting them 
> > > > > outside will not solve the problem since they will show up 
> > > > > with the wrong UIDs/GIDs. Read and write operations will also
> > > > > fail and so on.
> > > > > 
> > > > > The current userspace solution is to automatically chown the 
> > > > > whole root filesystem before starting a container, example:
> > > > > (host) init_user_ns  1000000:1065536  => (container)
> > > > > user_ns_X1
> > > > > 0:65535
> > > > > (host) init_user_ns  2000000:2065536  => (container)
> > > > > user_ns_Y1
> > > > > 0:65535
> > > > > (host) init_user_ns  3000000:3065536  => (container)
> > > > > user_ns_Z1
> > > > > 0:65535
> > > > > ...
> > > > > 
> > > > > Every time a chown is called, files are changed and so on... 
> > > > > This prevents to have portable filesystems where you can 
> > > > > throw anywhere and boot. Having an extra step to adapt the
> > > > > filesystem to the current mapping and persist it will not 
> > > > > allow to verify its integrity, it makes snapshots and 
> > > > > migration a bit harder, and probably other limitations...
> > > > > 
> > > > > It seems that there are multiple ways to allow user 
> > > > > namespaces combine nicely with filesystems, but none of them 
> > > > > is that easy. The bind mount and pin the user namespace 
> > > > > during mount time will not work, bind mounts share the same 
> > > > > super block, hence you may end up working on the wrong 
> > > > > vfsmount context and there is no easy way to get out of
> > > > > that...
> > > > 
> > > > So this option was discussed at the recent LSF/MM summit.  The 
> > > > most supported suggestion was that you'd use a new internal fs 
> > > > type that had a struct mount with a new superblock and would 
> > > > copy the underlying inodes but substitute its own with 
> > > >  modified  ->getattr/->setattr calls that did the uid shift. 
> > > >  In many ways it would be a remapping bind which would look 
> > > > similar to overlayfs but be a lot simpler.
> > > 
> > > Hmm, it's not only about ->getattr and ->setattr, you have all 
> > > the other file system operations that need access too...
> > 
> > Why?  Or perhaps we should more cogently define the actual problem.
> >   My problem is simply mounting image volumes that were created 
> > with real uids at user namespace shifted uids because I'm
> >  downshifting the privileged ids in the container.  I actually 
> > *only* need the uid/gids on the attributes shifted because that's 
> > what I need to manipulate the
> >   
> We need them obviously for read/write/creation... ?!

OK, so the way attributes are populated on an inode is via getattr. 
 You intercept that, you change the inode owner and group that are
installed on the inode.  That means that when you list the directory,
you see the shift and the shifted uid/gid are used to check permissions
for vfs_open().
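
As a rough illustration only (a userspace Python model, not kernel code;
the Inode class, shifted_getattr() and may_open_read() are names made up
for this sketch, and the permission check ignores capabilities and ACLs),
the effect of intercepting getattr would look something like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inode:
    """Minimal stand-in for the attributes installed on an inode."""
    uid: int
    gid: int
    mode: int


def shifted_getattr(inode, uid_shift, gid_shift):
    """Return a shadow view of the inode with owner and group shifted."""
    return Inode(inode.uid + uid_shift, inode.gid + gid_shift, inode.mode)


def may_open_read(view, caller_uid, caller_gid):
    """Simplified owner/group/other read check on the given view."""
    if caller_uid == view.uid:
        return bool(view.mode & 0o400)
    if caller_gid == view.gid:
        return bool(view.mode & 0o040)
    return bool(view.mode & 0o004)


# A file owned by root on the image, mode 0600.
on_disk = Inode(uid=0, gid=0, mode=0o600)

# The container's root is uid/gid 1000000 on the host.
view = shifted_getattr(on_disk, 1000000, 1000000)

print(view.uid, view.gid)
print(may_open_read(view, 1000000, 1000000))     # checked against shifted ids
print(may_open_read(on_disk, 1000000, 1000000))  # checked against raw ids
```

The point is only that the permission check sees whatever owner the
intercepted getattr installed, so nothing else has to know about the shift.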

>  We want to handle also stock filesystems that were never edited
> without depending on any module or third party solution, mounting
> them outside user namespaces, and access inside.

OK, but that's basically my requirements ... you didn't mention any of
the esoteric filesystem ioctls, so I assume from the below you're not
interested in shifting the uids there either?

> > volumes.  I actually think that other operations, like the file 
> > ioctl ones should, for security reasons, not be uid shifted.  For
> > instance with xfs you could set the panic mask and error tags and 
> > bring down the whole host.  What extra things do you need access to
> > and why?
> 
> That's why precisely I said that mounting options not *inside*
> filesystems which means on their back, and on behalf of container
> managers, etc then you are exposed to such scenarios... some virtual
> file systems can also be mounted by unprivileged, how you will deal
> with something like a bind mount on them ?
> 
> 
> > >  which brings two points:
> > > 
> > > 1) This new internal fs may end up doing what this RFC does...
> > 
> > Well that was why I brought it up, yes.
> 
> yes but *with* extra code! that was my point. I'm not sure we need to
> bother with any *new* internal fs type nor hack around dir, file
> operations... yet that has to be shown, defined, coded ... ?

Either way requires patching the kernel.  The question I was asking is:
is it better to confine the patch to a new fs type or to change the vfs
directly?

> > > 2) or by quoting "new internal fs + its own super block + copy
> > > underlying inodes..." it seems like another overlayfs where you 
> > > also need some decisions to copy what, etc. So, will this be 
> > > really that light compared to current overlayfs ? not to mention 
> > > that you need to hook up basically the same logic or something 
> > > else inside overlayfs..
> > 
> > OK, so forget overlayfs, perhaps that was a bad example.  It's like 
> > a uid shifting bind.  The way it works is to use shadow inodes
> > (unlike bind, but because you have to intercept the operations, so 
> > it's not a simple subtree operation) but there's no file copying. 
> >  The shadow points to the real inode.
> 
> For that you need a super block struct for every mount... now if you
> also need a new internal fs + super block + shadowing inodes... it
> seems like you are going into overlayfs direction... 

Well, that's the way you build a shadowing fs.  I'm not sure you need
one sb per struct vfsmount, but it's certainly possible to code it
that way.

> I'm taking overlayfs as an example here, cause it's just nice and
> really dead simple!
> 
> At the same time I'm not at all sure about what you are describing!
> and how you will deal with current mount and bind mounts tree and all
> the internals...

You mean would MS_REC functionality be supported?  There's no reason
why not, but there's no reason you have to either (it could even be
optional, like it is for bind).

> > > > Using the user namespace in the super block seems the way to 
> > > > go, and there is the "Support fuse mounts in user namespaces" 
> > > > [1] patches which seem nice but perhaps too complex!?
> > > > 
> > > > So I don't think that does what you want.  The fuse project 
> > > > I've used before to do uid/gid shifts for build containers is 
> > > > bindfs https://github.com/mpartel/bindfs/
> > > > 
> > > > It allows a --map argument where you specify pairs of uids/gids 
> > > > to map (tedious for large ranges, but the map can be fixed to 
> > > > use uid:range instead of individual).
> > > 
> > > Ok, thanks for the link, will try to take a deep look but bindfs 
> > > seem really big!
> > 
> > Well, it does a lot more than just uid/gid shift.
> > 
> > > > >  there is also the overlayfs solution, and finally the VFS 
> > > > > layer solution.
> > > > > 
> > > > > We present here a simple VFS solution, everything is packed 
> > > > > inside VFS, filesystems don't need to know anything (except 
> > > > > probably XFS, and special operations inside union 
> > > > > filesystems). Currently it supports ext4, btrfs and 
> > > > > overlayfs. Changes into filesystems are small, just parse the 
> > > > > vfs_shift_uids and vfs_shift_gids options during mount and 
> > > > > set the appropriate flags into the super_block structure.
> > > > 
> > > > So this looks a little daunting.  It sprays the VFS with 
> > > > knowledge about the shifts and requires support from every 
> > > > underlying filesystem.
> > 
> > > Well, from my angle, shifts are just user namespace mappings 
> > > which follow certain rules, and currently VFS and all filesystems 
> > > are *already* doing some kind of shifting... This RFC uses mount
> > > namespaces which are the standard way to deal with mounts, now 
> > > the mapping inside mount namespace can just be "inside: 0:1000" 
> > > => "outside: 0:1000" and current implementation will just use it, 
> > > at the same time I'm not sure if this mapping qualifies to be 
> > > named "shift". I think that some folks here came up with the 
> > > "shift" name to describe one of the use cases from a user 
> > > interface that's it... maybe I should do
> > > s/vfs_shift_*/vfs_remap_*/ ?
> > 
> > I don't think the naming is the issue ... it's the spread inside 
> > the vfs code (and in the underlying fs code).  The vfs is very well
> 
> Currently the underlying file systems just parse vfs_shift_uids and
> vfs_shift_gids 
> 
> > layered, so touching all that code makes it look like there's a
> > layering problem with the patch.  Touching the underlying fs code 
> > looks
> >  
> Hmm, not sure I follow here ? We make use of the mount namespace 
> which is part of the whole layer. Actually it's the *standard* way to
> control mounts. What do you mean here please ?

The patch touches a lot of the vfs.

> > even more problematic, but that may be necessary if you have a 
> > reason for wanting the file ioctls, because they're pass through 
> > and usually where the from_kuid() calls are in filesystems.
> 
> Hmm sorry, I'm not sure I'm following you here ?

An ideal solution, given both our requirements, shouldn't require
touching any underlying fs code.

> > > >  A simple remapping bind filesystem would be a lot simpler and 
> > > > require no underlying filesystem support. 
> > >  
> > > Yes probably, you still need to parse parameters but not at the
> > > filesystem level,
> > 
> > They'd just be mount options.  Basically instead of mount --bind 
> > source target, you'd do mount -t uidshift -o <shift options> source
> > target.
> > 
> > >  and sure this RFC can do the same of course, but maybe it's not 
> > > safe to shift/remap filesystems and their inodes on behalf of
> > > filesystems... and virtual filesystems which can share inodes ?
> > 
> > That depends who you allow to do the shift.  Each fstype in the 
> > kernel decides access to mount.  For the uidshift, I was planning 
> > to allow only a capable admin in the initial namespace, meaning 
> > that only the admin in the host could set up the shifts.  As long 
> > as the shifted filesystem is present, the container can then bind 
> > it wherever it wants in its mount namespace.
> 
> Ah I see admin in initial namespace, yes sounds reasonable for 
> security reasons, and how you will be able to achieve the user 
> namespace shift?

As I said, it would be in the mount options of the command.  The <shift
options> above. Probably parametrised the way uid_map and gid_map are
today.

James

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-05  0:23 ` Dave Chinner
  2016-05-05  1:44   ` Andy Lutomirski
@ 2016-05-05 22:24   ` Djalal Harouni
  2016-05-06  2:50     ` Dave Chinner
  1 sibling, 1 reply; 49+ messages in thread
From: Djalal Harouni @ 2016-05-05 22:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy

On Thu, May 05, 2016 at 10:23:14AM +1000, Dave Chinner wrote:
> On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> > This is version 2 of the VFS:userns support portable root filesystems
> > RFC. Changes since version 1:
> > 
> > * Update documentation and remove some ambiguity about the feature.
> >   Based on Josh Triplett comments.
> > * Use a new email address to send the RFC :-)
> > 
> > 
> > This RFC tries to explore how to support filesystem operations inside
> > user namespace using only VFS and a per mount namespace solution. This
> > allows to take advantage of user namespace separations without
> > introducing any change at the filesystems level. All this is handled
> > with the virtual view of mount namespaces.
> 
> [...]
> 
> > As an example if the mapping 0:65535 inside mount namespace and outside
> > is 1000000:1065536, then 0:65535 will be the range that we use to
> > construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> > data. They represent the persistent values that we want to write to the
> > disk. Therefore, we don't keep track of any UID/GID shift that was applied
> > before, it gives portability and allows to use the previous mapping
> > which was freed for another root filesystem...
> 
> So let me get this straight. Two /isolated/ containers, different
> UID/GID mappings, sharing the same files and directories. Create a
> new file in a writeable directory in container 1, namespace
> information gets stripped from on-disk uid/gid representation.
> 
> Container 2 then reads that shared directory, finds the file written
> by container 1. As there is no namespace component to the uid:gid
> stored in the inode, we apply the current namespace shift to the VFS
> inode uid/gid and so it maps to root in container 2 and we are
> allowed to read it?

Only if container 2 has the CLONE_MNTNS_SHIFT_UIDGID flag set in its own
mount namespace (only root can set it, unless it was already set in the
parent), and has access to the shared dir, which the container manager
must also have configured beforehand... Even then, if the shift flag is
not set in container 2, there is no mapping and things work as they do
now. But yes, this setup is flawed! Containers should not share a rootfs;
maybe, in rare cases, some user data, and that's it.


> Unless I've misunderstood something in this crazy mapping scheme,
> isn't this just a vector for unintentional containment breaches?
> 
> [...]
> 
> > Simple demo overlayfs, and  btrfs mounted with vfs_shift_uids and
> > vfs_shift_gids. The overlayfs mounts will share the same upperdir. We
> > create two user namespaces, each with its own mapping, where
> > container-uid-2000000 will pull changes from container-uid-1000000
> > upperdir automatically.
> 
> Ok, forget I asked - it's clearly intentional. This is beyond
> crazy, IMO.

This setup is flawed! That example was meant to show that files show up
with the right mapping in two different user namespaces. As Andy noted
they should have a backing device...

Anyway, what I meant by the previous paragraph is that when the container
terminates it releases the UID shift range, which can be re-used later
on another filesystem or on the same previous fs... whatever. Now if
the range is already in use, userspace should grab a new range, give it
a new filesystem or a previous one which doesn't need to be shared, and
everything should continue to work...
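
The range handling described above is purely a userspace job. A minimal
sketch of what a container manager could do (Python; the ShiftRangePool
class, its defaults and the container names are all invented for this
example, nothing here is part of the RFC):

```python
RANGE_SIZE = 65536  # number of IDs handed to each container


class ShiftRangePool:
    """Hand out non-overlapping host ID ranges and take them back."""

    def __init__(self, first_base=1000000, step=1000000, slots=3):
        self._free = [first_base + i * step for i in range(slots)]
        self._in_use = {}

    def acquire(self, container):
        """Reserve the lowest free range for a container."""
        if not self._free:
            raise RuntimeError("no free UID shift range")
        base = self._free.pop(0)
        self._in_use[container] = base
        return base, RANGE_SIZE

    def release(self, container):
        """Return a terminated container's range to the pool."""
        base = self._in_use.pop(container)
        self._free.append(base)
        self._free.sort()


pool = ShiftRangePool()
first = pool.acquire("container-a")
second = pool.acquire("container-b")

# container-a terminates; its range can back another filesystem later.
pool.release("container-a")
third = pool.acquire("container-c")

print(first, second, third)
```

Since no shift is recorded on disk, the filesystem used with the re-used
range does not have to be the one the range was used with before.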


A simple example with loop devices follows; however, the image should be
a GPT (GUID partition table) or an MBR one...

$ dd if=/dev/zero of=/tmp/fedora-newtree.raw bs=10M count=100
$ mkfs.ext4 /tmp/fedora-newtree.raw
...
$ sudo mount -t ext4 -oloop,rw,sync /var/lib/machines/fedora-newtree.raw /mnt/fedora-tree
$ sudo yum -y --releasever=23 --installroot=/mnt/fedora-tree --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim 
$ sudo mount -t ext4 -oloop,vfs_shift_uids,vfs_shift_gids /var/lib/machines/fedora-newtree.raw  /mnt/fedora-tree
$ sudo ~/container --uidmap [1000000:1065536 or
			     2000000:2065536 or
			     3000000:3065536 ...]
  (That's the mapping outside of the container)



> > 3) ROADMAP:
> > ===========
> > * Confirm current design, and make sure that the mapping is done
> >   correctly.
> 
> How are you going to ensure that all filesystems behave the same,
> and it doesn't get broken by people who really don't care about this
> sort of crazy?

By trying to make this a VFS mount namespace parameter. So if the
shift is not set on the mount namespace then we just fall back to
the current behaviour: no shift is performed.

Later, of course, I'll run xfstests and several other tests...

Does this answer your question ?


> FWIW, having the VFS convert things to "on-disk format" is an
> oxymoron - the "V" in VFS means "virtual" and has nothing to do with
> disks or persistent storage formats. Indeed, let's convert the UID
> to "on-disk" format for a network filesystem client....

Hehe! Are you sure it's not already done? It can be changed to "to-fs"!


> .....
> > * Add XFS support.
> 
> What is the problem here?

Yep, sorry! Just lack of time on my part! XFS is currently a bit aware
of the kuid/kgid mapping on its own, and I just didn't have the time
to handle it. Will try to fix it next time.

> Next question: how does this work with uid/gid based quotas?

If you do a shift you should know that you will share quota on disk. In
all cases, to activate the behaviour you have to set the options during
mount too... but it will be documented and recommended to use
different device nodes: loop devices, MBR or GPT partitions, block
devices, lvm or anything else, mounted with xfs or any other filesystem
that supports this shift, with the flags set at mount time.


> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


Thank you!

-- 
Djalal Harouni
http://opendz.org

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-05  1:44   ` Andy Lutomirski
  2016-05-05  2:25     ` Dave Chinner
@ 2016-05-05 22:34     ` Djalal Harouni
  1 sibling, 0 replies; 49+ messages in thread
From: Djalal Harouni @ 2016-05-05 22:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Chinner, Alexander Viro, Chris Mason, Theodore Ts'o,
	Serge Hallyn, Josh Triplett, Eric W. Biederman, Andy Lutomirski,
	Seth Forshee, Linux FS Devel, linux-kernel, LSM List,
	Dongsu Park, David Herrmann, Miklos Szeredi, Alban Crequy

On Wed, May 04, 2016 at 06:44:14PM -0700, Andy Lutomirski wrote:
> On Wed, May 4, 2016 at 5:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> >> This is version 2 of the VFS:userns support portable root filesystems
> >> RFC. Changes since version 1:
> >>
> >> * Update documentation and remove some ambiguity about the feature.
> >>   Based on Josh Triplett comments.
> >> * Use a new email address to send the RFC :-)
> >>
> >>
> >> This RFC tries to explore how to support filesystem operations inside
> >> user namespace using only VFS and a per mount namespace solution. This
> >> allows to take advantage of user namespace separations without
> >> introducing any change at the filesystems level. All this is handled
> >> with the virtual view of mount namespaces.
> >
> > [...]
> >
> >> As an example if the mapping 0:65535 inside mount namespace and outside
> >> is 1000000:1065536, then 0:65535 will be the range that we use to
> >> construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> >> data. They represent the persistent values that we want to write to the
> >> disk. Therefore, we don't keep track of any UID/GID shift that was applied
> >> before, it gives portability and allows to use the previous mapping
> >> which was freed for another root filesystem...
> >
> > So let me get this straight. Two /isolated/ containers, different
> > UID/GID mappings, sharing the same files and directories. Create a
> > new file in a writeable directory in container 1, namespace
> > information gets stripped from on-disk uid/gid representation.
> 
> I think the intent is a totally separate superblock for each
> container.  Djalal, am I right?

Absolutely that would be ideal, each container will mount its image
device into the new mount namespace, setting up the right private/slave
flags, no propagation into host... using GPT, lvm, loop or any other
backing device, the mount will show up only into the container...

Now as you know we can't prevent all flawed solutions. The thing that
I made sure is that the flag CLONE_MNTNS_SHIFT_UIDGID could only be
set by real root.


> The feature that seems to me to be missing is the ability to squash
> uids.  I can imagine desktop distros wanting to mount removable
> storage such that everything shows up (to permission checks and
> stat()) as the logged-in user's uid but that the filesystem sees 0:0.
> That can be done by shifting, but the distro would want everything
> else on the filesystem to show up as the logged-in user as well.
> 
> That use case could also be handled by adding a way to tell a given
> filesystem to completely opt out of normal access control rules and
> just let a given user act as root wrt that filesystem (and be nosuid,
> of course).  This would be a much greater departure from current
> behavior, but would let normal users chown things on a removable
> device, which is potentially nice.

Ok Andy, this one is hard... I gave it some thought; what do you
think of the following?
It will work only if you are referring to some high-level software
in distros, which seems perfect of course for normal users.

So the software should do:

1) mount the removable storage with vfs_shift_uids and vfs_shift_gids
2) Now the software should act as a container, make a
    clone4(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID, ...)

    => Setup the right mapping so we are able to access files...

   The mount will show up into the new mount namespace.

3) Now inside new namespaces we are able to access all files.

4) Use stat() returned values, and shift back to logged-in user
   values...

The software set up the mapping, so it already knows who maps to whom!

This allows showing the results of stat() as if the files belonged to
the normal logged-in user, where everything works as you have described.
So maybe this has its place in a small userspace helper library that any
software can use?! Thoughts?
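
To make the difference between a plain shift and the squash Andy asks
for concrete, here is how such a helper could present stat() owners in
step 4 (a Python sketch; shift_view() and squash_view() are hypothetical
helper names and the uid values are only examples):

```python
def shift_view(disk_id, user_id):
    """Plain shift: on-disk 0 appears as user_id, 1 as user_id + 1, ..."""
    return user_id + disk_id


def squash_view(disk_id, user_id):
    """Squash: every on-disk ID is presented as the logged-in user."""
    return user_id


logged_in_uid = 1000
disk_uids = [0, 1, 500]  # owners as stored on the removable media

shifted = [shift_view(uid, logged_in_uid) for uid in disk_uids]
squashed = [squash_view(uid, logged_in_uid) for uid in disk_uids]

print(shifted)
print(squashed)
```

With only the shift, files owned by anything other than on-disk 0 still
show up as some other user; the helper would have to squash them itself.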

> --Andy

Thanks!

-- 
Djalal Harouni
http://opendz.org

* Re: [RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid
  2016-05-05 13:05     ` Seth Forshee
@ 2016-05-05 22:40       ` Djalal Harouni
  0 siblings, 0 replies; 49+ messages in thread
From: Djalal Harouni @ 2016-05-05 22:40 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Serge Hallyn, Alexander Viro, Chris Mason, tytso, Serge Hallyn,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Djalal Harouni

Hi,

On Thu, May 05, 2016 at 08:05:08AM -0500, Seth Forshee wrote:
> On Wed, May 04, 2016 at 11:19:04PM +0000, Serge Hallyn wrote:
> > Quoting Djalal Harouni (tixxdz@gmail.com):
> > > If a process gets access to a mount from a different user
> > > namespace, that process should not be able to take advantage of
> > > setuid files or selinux entrypoints from that filesystem.  Prevent
> > > this by treating mounts from other mount namespaces and those not
> > > owned by current_user_ns() or an ancestor as nosuid.
> > > 
> > > This patch was just adapted from the original one that was written
> > > by Andy Lutomirski <luto@amacapital.net>
> > > https://www.redhat.com/archives/dm-devel/2016-April/msg00374.html
> > 
> > I'm not sure that this makes sense given what you're doing.  In the
> > case of Seth's set, a filesystem is mounted specifically (and privately)
> > in a user namespace.  We don't want for instance the initial user ns
> > to find a link to a setuid-root exploit left in the container-mounted
> > filesystem.
> > 
> > But you are having a parent user namespace mount the fs so that its
> > children can all access the fs, uid-shifted for convenience.  Not
> > allowing the child namespaces to make use of setuid-root does not
> > seem applicable here.
> 
> Right, the problem addressed by this patch probably isn't relevant to
> this sort of uid shifting.

I'll take another deep look into it. My aim when I ported this was to
cover cases I was not sure about: setns(), or getting a handle to a
mount namespace through /proc or anything else, and then calling into
it from an external user namespace.


> But I think there's another problem that needs to be addressed.
> bprm_fill_uid() still gets the ids for sxid files unshifted from the
> inode. We already protect against sxid to any user not in
> bprm->cred->user_ns, so it will just ignore the sxid instead of e.g.
> suid as global root from the id shifted mount, which is good. What would
> be wanted though is to use the shifted ids so that something like
> suid-root ping in the container rootfs would work.
> 
> Seth
Ok thank you Seth! I'll note it and try to fix it.


-- 
Djalal Harouni
http://opendz.org

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-05 22:24   ` Djalal Harouni
@ 2016-05-06  2:50     ` Dave Chinner
  2016-05-12 19:47       ` Djalal Harouni
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Chinner @ 2016-05-06  2:50 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy

On Thu, May 05, 2016 at 11:24:35PM +0100, Djalal Harouni wrote:
> On Thu, May 05, 2016 at 10:23:14AM +1000, Dave Chinner wrote:
> > On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> > > This is version 2 of the VFS:userns support portable root filesystems
> > > RFC. Changes since version 1:
> > > 
> > > * Update documentation and remove some ambiguity about the feature.
> > >   Based on Josh Triplett comments.
> > > * Use a new email address to send the RFC :-)
> > > 
> > > 
> > > This RFC tries to explore how to support filesystem operations inside
> > > user namespace using only VFS and a per mount namespace solution. This
> > > allows to take advantage of user namespace separations without
> > > introducing any change at the filesystems level. All this is handled
> > > with the virtual view of mount namespaces.
> > 
> > [...]
> > 
> > > As an example if the mapping 0:65535 inside mount namespace and outside
> > > is 1000000:1065536, then 0:65535 will be the range that we use to
> > > construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> > > data. They represent the persistent values that we want to write to the
> > > disk. Therefore, we don't keep track of any UID/GID shift that was applied
> > > before, it gives portability and allows to use the previous mapping
> > > which was freed for another root filesystem...
> > 
> > So let me get this straight. Two /isolated/ containers, different
> > UID/GID mappings, sharing the same files and directories. Create a
> > new file in a writeable directory in container 1, namespace
> > information gets stripped from on-disk uid/gid representation.
> > 
> > Container 2 then reads that shared directory, finds the file written
> > > by container 1. As there is no namespace component to the uid:gid
> > stored in the inode, we apply the current namespace shift to the VFS
> > inode uid/gid and so it maps to root in container 2 and we are
> > allowed to read it?
> 
> Only if container 2 has the flag CLONE_MNTNS_SHIFT_UIDGID set in its own
> mount namespace which only root can set or if it was already set in
> parent, and have access to the shared dir which the container manager
> should also configure before... even with that if in container 2 the
> shift flag is not set then there is no mapping and things work as they
> are now, but yes this setup is flawed! they should not share rootfs,
> maybe in rare cases, some user data that's it.

<head explodes>

I can't follow any of the logic you're explaining - you just
confused me even more.  I thought this was to allow namespaces with
different uid/gid mappings all to use the same backing store? And
now you're saying that "no, they'll all have separate backing
stores"?

I suspect you need to describe the layering in a way a stupid dummy
can understand, because trying to be clever with wacky examples is
not working.

> > Unless I've misunderstood something in this crazy mapping scheme,
> > isn't this just a vector for unintentional containment breaches?
> > 
> > [...]
> > 
> > > Simple demo overlayfs, and  btrfs mounted with vfs_shift_uids and
> > > vfs_shift_gids. The overlayfs mounts will share the same upperdir. We
> > > > create two user namespaces, each with its own mapping, where
> > > container-uid-2000000 will pull changes from container-uid-1000000
> > > upperdir automatically.
> > 
> > Ok, forget I asked - it's clearly intentional. This is beyond
> > crazy, IMO.
> 
> This setup is flawed! that example was to show that files show up with
> the right mapping with two different user namespaces. As Andy noted
> they should have a backing device...

Did you mean "should have different backing devices" here? If not,
I'm even more confused now...

> Anyway by the previous paragraph what I mean is that when the container
> terminates it releases the UID shift range which can be re-used later
> on another filesystem or on the same previous fs... whatever. Now if
> the range is already in use, userspace should grab a new range give it
> a new filesystem or a previous one which doesn't need to be shared and
> everything should continue to work...

This sounds like you're talking about a set of single, sequential
uses of a single filesystem image across multiple different
container lifecycles? Maybe that's where I'm getting confused,
because I'm assuming multiple concurrent uses of a single filesystem
by all the running containers that are running the same distro
image....

> simple example with loop devices..., however the image should be a GPT
> (GUID partition table) or an MBR one...
> 
> $ dd if=/dev/zero of=/tmp/fedora-newtree.raw bs=10M count=100
> $ mkfs.ext4 /tmp/fedora-newtree.raw
> ...
> $ sudo mount -t ext4 -oloop,rw,sync /var/lib/machines/fedora-newtree.raw /mnt/fedora-tree
> $ sudo yum -y --releasever=23 --installroot=/mnt/fedora-tree --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim 
> $ sudo mount -t ext4 -oloop,vfs_shift_uids,vfs_shift_gids /var/lib/machines/fedora-newtree.raw  /mnt/fedora-tree
> $ sudo ~/container --uidmap [1000000:1065536 or
> 			     2000000:2065536 or
> 			     3000000:3065536 ...]
>   (That's the mapping outside of the container)

This doesn't match your above comments about separate backing
stores. Here we have two mounts sharing the same image file, both
mounted read/write - there's no separate backing store here. The
fact you hide the initial mount that was populated by yum by
overmounting the same mount point doesn't stop the original mount
from modifying the image file independently of the container you
started.

I'm getting the impression that there's a missing step in all your
examples here - that you create a writable snapshot or overlay of
the original fs image to create separate backing devices for each
container. In that case, the uid/gid shifting avoids needing to make
uid/gid modifications to the snapshot/overlay to match the
container's mapped uid/gids.

Similarly, if the use case given was read-only sharing of trees
between containers, there's no need for separate snapshots or
overlays, just a bunch of read-only (bind?) mounts with shifts
specified for the intended container.

These seem like a pretty sane use case for wanting to shift
uids/gids in this manner, but if that's the case then I'm struggling
to understand where the complexity in the description is coming
from.

> > > 3) ROADMAP:
> > > ===========
> > > * Confirm current design, and make sure that the mapping is done
> > >   correctly.
> > 
> > How are you going to ensure that all filesystems behave the same,
> > and it doesn't get broken by people who really don't care about this
> > sort of crazy?
> 
> By trying to make this a VFS mount namespace parameter. So if the
> shift is not set on the mount namespace then we just fall back to
> the current behaviour! no shift is performed.

That wasn't what I was asking - I was asking a code maintenance
question. i.e. someone will come along who doesn't quite understand
WTF all this convoluted namespace ID mapping is doing and they will
accidentally break it in a subtle way that nobody notices because they
didn't directly change anything to do with ID shifting. What's the
plan for preventing that from happening?

> later of course I'll try xfstests and several tests...
> 
> Does this answer your question ?

That's closer, but ambiguous. ;) Were you planning on just running
some existing tests or writing a set of regression tests that
explicitly encode expected usage and behaviour, as well as what is
expected to fail?

> > .....
> > > * Add XFS support.
> > 
> > What is the problem here?
> 
> Yep, sorry! just lack of time on my part! XFS currently is a bit aware
> of kuid/kgid mapping on its own, and I just didn't have the appropriate
> time! Will try to fix it next time.

You'd be talking about the xfs_kuid_to_uid/xfs_uid_to_kuid()
wrappers, right?

It comes down to the kuid/kgid being kernel internal representations of
an ID, not an on-disk format representation.  Like all other kernel
internal types they can change size and structure at any time, while
the persistent on-disk format cannot change without lots of hassle
(and then we really need conversion functions!). For clean layering,
abstraction and self-documenting code, internal types are always
converted to/from a persistent, on-disk format representation in
this manner.
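
As a userspace analogy of that layering (Python, not the XFS code; the
KUid class and both helpers are invented for this sketch, and the 32-bit
big-endian field is simply the layout assumed here):

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class KUid:
    """Stand-in for an opaque in-memory ID type that may change shape."""
    val: int


def kuid_to_disk(kuid):
    """Encode the internal ID as a fixed 32-bit big-endian on-disk field."""
    return struct.pack(">I", kuid.val)


def disk_to_kuid(raw):
    """Decode the fixed on-disk field back into the internal type."""
    (val,) = struct.unpack(">I", raw)
    return KUid(val)


raw = kuid_to_disk(KUid(1000))
print(raw.hex())
print(disk_to_kuid(raw))
```

Only the two conversion helpers know the persistent layout, so the
internal type can change without touching what is written to disk.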

> > Next question: how does this work with uid/gid based quotas?
> 
> If you do a shift you should know that you will share quota on
> disk.

Yes, and this means you can't account for individual container space
usage on such mapped devices. Also, don't you need to shift
uids/gids for the quota syscalls like you do elsewhere?

I also wonder about the fact that the quota interfaces are likely to 
return uids/gids that may not exist in a given container...
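
A toy model of that last concern (Python; container_quota_view() is a
made-up helper, and it assumes on-disk IDs 0..65535 are the ones visible
inside the container, as in the RFC's example mapping):

```python
def container_quota_view(disk_usage, mapped_count=65536):
    """Split per-ID usage into IDs the container can name and the rest."""
    visible, unmapped = {}, {}
    for disk_id, blocks in disk_usage.items():
        if 0 <= disk_id < mapped_count:
            visible[disk_id] = blocks
        else:
            unmapped[disk_id] = blocks
    return visible, unmapped


# Usage as recorded on disk, keyed by on-disk uid.
usage = {0: 120, 1000: 40, 70000: 7}
visible, unmapped = container_quota_view(usage)

print(visible)
print(unmapped)
```

What the quota interfaces should report for the second group is exactly
the open question.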

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-04 23:30 ` Serge Hallyn
@ 2016-05-06 14:38   ` Djalal Harouni
  2016-05-09 16:26     ` Serge Hallyn
  0 siblings, 1 reply; 49+ messages in thread
From: Djalal Harouni @ 2016-05-06 14:38 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner

Hi,

On Wed, May 04, 2016 at 11:30:09PM +0000, Serge Hallyn wrote:
> Quoting Djalal Harouni (tixxdz@gmail.com):
> > This is version 2 of the VFS:userns support portable root filesystems
> > RFC. Changes since version 1:
> > 
> > * Update documentation and remove some ambiguity about the feature.
> >   Based on Josh Triplett comments.
> > * Use a new email address to send the RFC :-)
> > 
> > 
> > This RFC tries to explore how to support filesystem operations inside
> > user namespace using only VFS and a per mount namespace solution. This
> > allows to take advantage of user namespace separations without
> > introducing any change at the filesystems level. All this is handled
> > with the virtual view of mount namespaces.
> 
> Given your use case, is there any way we could work in some tradeoffs
> to protect the host?  What I'm thinking is that containers can all
> share devices uid-mapped at will, however any device mounted with
> uid shifting cannot be used by the initial user namespace.  Or maybe
> just non-executable in that case, as you'll need enough access to
> the fs to set up the containers you want to run.
> 
> So if /dev/sda1 is your host /, you have to use /dev/sda2 as the
> container rootfs source.  Mount it under /containers with uid
> shifting.  Now all containers regardless of uid mappings see
> the shifted fs contents.  But the host root cannot be tricked by
> files on it, as /dev/sda2 is non-executable as far as it is
> concerned.

Of course the whole setup relies on the container manager to set up
the right mount namespace, clean mounts, etc., then pivot root, boot or
whatever...

Now I guess we can achieve what you want with MS_SLAVE|MS_REC on / ?

You create new mount/pid... namespaces with the shift flag while you are
still in init_user_ns, you remount your / with MS_SLAVE|MS_REC, then you
create new mount/pid namespaces with the shift flag again (two mount
namespaces here if you don't want to race between setting the MS_SLAVE
flag and creating the mount namespace and you don't trust other
processes... or you want the same nested setup...)

This second, secure mount namespace is the one that you will use to set
up the container: device nodes, loops... filesystems that you want in the
container (probably with shift options) and also filesystems that you can't
mount inside user namespaces and don't want to show up in or propagate to
the host. You may also want to umount stuff or remount to change mount
options, etc. Anyway, call this phase the cleaning of the mount namespace.

Now during this phase, when you mount and prepare these file systems,
mount them with the noexec flag first, then remount later with exec, or
delay the mounting until just before you do a new clone(CLONE_NEWUSER...).
During this phase the container manager should get the device that you
want to be shared from input or an argument, and it will only mount and
prepare it inside new mount namespaces or containers and make sure that
it will never be propagated back...

After clone(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID), set up
the user namespace mapping, I guess you drop capabilities, do setuid()
or whatever and start the PID 1 or the app of the container.
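
The mount part of the steps above could be sketched roughly as follows.
This is only an outline under stated assumptions: the caller is already
in a fresh mount namespace and has CAP_SYS_ADMIN, the function names are
hypothetical, and the RFC's shift flag and mount options are left out
because they do not exist in mainline kernels.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mount.h>

/* noexec while the filesystem is being prepared, exec once it is ready */
unsigned long demo_stage_flags(bool preparing)
{
	return preparing ? MS_NOEXEC : 0;
}

/*
 * Assumes the caller already runs in a new mount namespace and has
 * CAP_SYS_ADMIN. Returns 0 on success or a negative errno value.
 */
int demo_prepare_container_fs(const char *dev, const char *target,
			      const char *fstype)
{
	/* Make sure nothing mounted here propagates back to the host. */
	if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) < 0)
		return -errno;

	/* Mount noexec while the tree is being set up. */
	if (mount(dev, target, fstype, demo_stage_flags(true), NULL) < 0)
		return -errno;

	/* ... prepare the tree here (device nodes, cleanup, ...) ... */

	/* Allow execution only once the setup is complete. */
	if (mount(NULL, target, NULL,
		  MS_REMOUNT | demo_stage_flags(false), NULL) < 0)
		return -errno;

	return 0;
}
```

The clone() into the new user namespace and the uid/gid mapping setup
would follow after this function returns.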

Now, so as not to confuse Dave further, since he doesn't like the idea of
a shared backing device, and neither do I for obvious reasons: the shared
device should not be used for a rootfs, maybe for read-only user shared
data, or shared config, that's it... but real rootfs should each have
their own *different* backing device, unless you know what you are doing!
Hehe, I don't want to confuse people, and I just lack time. I will also
respond to Dave's email.


> Just a thought.

Do you think it will solve the case?


Thanks for your comments!

-- 
Djalal Harouni
http://opendz.org


* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-06 14:38   ` Djalal Harouni
@ 2016-05-09 16:26     ` Serge Hallyn
  2016-05-10 10:33       ` Djalal Harouni
  0 siblings, 1 reply; 49+ messages in thread
From: Serge Hallyn @ 2016-05-09 16:26 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner

Quoting Djalal Harouni (tixxdz@gmail.com):
> Hi,
> 
> On Wed, May 04, 2016 at 11:30:09PM +0000, Serge Hallyn wrote:
> > Quoting Djalal Harouni (tixxdz@gmail.com):
> > > This is version 2 of the VFS:userns support portable root filesystems
> > > RFC. Changes since version 1:
> > > 
> > > * Update documentation and remove some ambiguity about the feature.
> > >   Based on Josh Triplett comments.
> > > * Use a new email address to send the RFC :-)
> > > 
> > > 
> > > This RFC tries to explore how to support filesystem operations inside
> > > user namespace using only VFS and a per mount namespace solution. This
> > > allows to take advantage of user namespace separations without
> > > introducing any change at the filesystems level. All this is handled
> > > with the virtual view of mount namespaces.
> > 
> > Given your use case, is there any way we could work in some tradeoffs
> > to protect the host?  What I'm thinking is that containers can all
> > share devices uid-mapped at will, however any device mounted with
> > uid shifting cannot be used by the initial user namespace.  Or maybe
> > just non-executable in that case, as you'll need enough access to
> > the fs to set up the containers you want to run.
> > 
> > So if /dev/sda1 is your host /, you have to use /dev/sda2 as the
> > container rootfs source.  Mount it under /containers with uid
> > shifting.  Now all containers regardless of uid mappings see
> > the shifted fs contents.  But the host root cannot be tricked by
> > files on it, as /dev/sda2 is non-executable as far as it is
> > concerned.
> Of course the whole setup is based on the container manager to setup
> the right mount namespace, clean mounts, etc then pivot root, boot or
> whatever...
> 
> Now I guess we can achieve what you want with MS_SLAVE|MS_REC on / ?
> 
> You create a new mount/pid... namespaces with shift flags, but you are still
> in init_user_ns, you remount your / with MS_SLAVE|MS_REC, then you
> create new mount/pid namespaces with shift flag (two mount namespaces
> here if you don't want to race setting MS_SLAVE flag and creating mount
> namespace and you don't trust other processes... or you want the same nested
> setup...)
> 
> This second new secure mount namespace will be the one that you will use
> to setup the container, device nodes, loops...  fs that you want into the
> container (probably with shift options) and also filesystems that you can't
> mount inside user namespaces nor want them to show up or propagate into
> host, you may also want to umount stuff too or remount to change mount
> options too.., etc anyway here call it the cleaning of the mount namespace.
> 
> Now during this phase, when you mount and prepare these file systems,
> mount them with noexec flag first, then remount later with exec, or delay
> the mounting just before you do a new clone(CLONE_NEWUSER...). During this
> phase the container manager should get the device that you want to be
> shared from input or argument, and it will only mount it and prepare
> it inside new mount namespaces or containers and make sure that it will
> never be propagated back...
> 
> After clone(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID), setup
> the user namespace mapping, I guess you drop capabilities, do setuid()
> or whatever and start the PID 1 or the app of the container.
> 
> Now and to not confuse more Dave, since he doesn't like the idea of
> a shared backing device, and me neither for obvious reasons! the shared
> device should not be used for a rootfs, maybe for read-only user shared
> data, or shared config, that's it... but for real rootfs they should have
> their own *different* backing device! unless you know what you are doing
> hehe I don't want to confuse people, and I just lack time, will also
> respond to Dave email.

Yes.  We're saying slightly different things.  You're saying that the admin
should assign different backing stores for containers.  I'm saying perhaps
the kernel should enforce that, because $leaks.  Let's say the host admin
did a perfect setup of a container with shifted uids.  Now he wants to
run a quick ps in the container...  he does it in a way that leaks a
/proc/pid reference into the container so that (evil) container root can
use /proc/pid/root/ to get a toehold into the host /.  Does he now have
shifted access to that?

I think if we say "this blockdev will have shifted uids in /proc/$pid/ns/user",
then immediately that blockdev becomes not-readable (or not-executable)
in any namespace which does not have /proc/$pid/ns/user as an ancestor.
With obvious check as in write-versus-execute exclusion that you cannot
mark the blockdev shifted if ancestor user_ns already has a file open for
execute.
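
The ancestor rule I have in mind could be modelled like this in plain
userspace C. It is only a sketch of the check, not kernel code: the
struct and function names are hypothetical, and counting the namespace
itself as allowed is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a user namespace: only the parent link matters. */
struct demo_userns {
	const struct demo_userns *parent;	/* NULL for the initial namespace */
};

/*
 * Return true if @ns is @owner itself or has @owner as an ancestor.
 * Treating @owner as its own ancestor is an assumption of this sketch.
 */
bool demo_may_access_shifted(const struct demo_userns *owner,
			     const struct demo_userns *ns)
{
	for (; ns != NULL; ns = ns->parent)
		if (ns == owner)
			return true;
	return false;
}
```

With this rule, the initial namespace and sibling containers are refused,
while the owning namespace and anything nested below it are allowed.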

BTW, perhaps I should do this in a separate email, but here is how I would
expect to use this:

1. Using zfs: I create a bare (unshifted) rootfs fs1.   When I want to
create a new container, I zfs clone fs1 to fs2, and let the container
use fs2 shifted.  No danger to fs1 since fs2 is cow.  Same with btrfs.

2. Using overlay: I create a bare (unshifted) rootfs fs1.  When I want
to create a new container, I mount fs1 read-only and shifted as base
layer, then fs2 as the rw layer.

The point here is that the zfs clone plus container start takes (for a
600-800M rootfs) about .5 seconds on my laptop, while the act of shifting
all the uids takes another 2 seconds.  So being able to do this without
manually shifting would be a huge improvement for cases (e.g. docker)
where you do lots and lots of quick deploys.

> > Just a thought.
> 
> You think it will solve the case ?
> 
> 
> Thanks for your comments!
> 
> -- 
> Djalal Harouni
> http://opendz.org


* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-09 16:26     ` Serge Hallyn
@ 2016-05-10 10:33       ` Djalal Harouni
  0 siblings, 0 replies; 49+ messages in thread
From: Djalal Harouni @ 2016-05-10 10:33 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner

On Mon, May 09, 2016 at 04:26:30PM +0000, Serge Hallyn wrote:
> Quoting Djalal Harouni (tixxdz@gmail.com):
> > Hi,
[...]
> > 
> > After clone(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID), setup
> > the user namespace mapping, I guess you drop capabilities, do setuid()
> > or whatever and start the PID 1 or the app of the container.
> > 
> > Now and to not confuse more Dave, since he doesn't like the idea of
> > a shared backing device, and me neither for obvious reasons! the shared
> > device should not be used for a rootfs, maybe for read-only user shared
> > data, or shared config, that's it... but for real rootfs they should have
> > their own *different* backing device! unless you know what you are doing
> > hehe I don't want to confuse people, and I just lack time, will also
> > respond to Dave email.
> 
> Yes.  We're saying slightly different things.  You're saying that the admin
> should assign different backing stores for containers.  I'm saying perhaps
> the kernel should enforce that, because $leaks.  Let's say the host admin
> did a perfect setup of a container with shifted uids.  Now he wants to
> run a quick ps in the container...  he does it in a way that leaks a
> /proc/pid reference into the container so that (evil) container root can
> use /proc/pid/root/ to get a toehold into the host /.  Does he now have
> shifted access to that?

No, assuming host / or its other mount points are not mounted with the
vfs_shift_uids and vfs_shift_gids options. In that case no shift is
performed at all.

1) If you mount host / with vfs_shift_uids and vfs_shift_gids it's
like real root in init_user_ns doing "chmod -R o+rwx /"... It does not
make sense, and since no one can edit/remount mounts to change their
options in the mount namespace of init_user_ns, it's safe, and not
available by default.

2) That's also why filesystems must support this explicitly; it is not
done on their behalf.

IMO the kernel is already enforcing this, so even if you assign different
backing stores to containers, you can't have shifted access there unless
you explicitly tell the kernel that the mount is meant to be shifted by
adding the vfs_shift_uids and vfs_shift_gids mount options.


> I think if we say "this blockdev will have shifted uids in /proc/$pid/ns/user",
> then immediately that blockdev becomes not-readable (or not-executable)
> in any namespace which does not have /proc/$pid/ns/user as an ancestor.

Hmm,

(1) This won't work since to do that you have to know /proc/$pid/ns/user
in advance, and since file systems can't be mounted inside a user
namespace this brings us to the same blocker...! In our use case
we want to shift UIDs/GIDs just to access inodes, no need to expose
the whole filesystem; root is responsible and filesystems stay safe.

(2) Why complicate things? The kernel already supports this, and it's a
generic solution.

As said, you can just create new mount namespaces, mount things there
as private, slave... mount your blockdev that will be shifted for
processes that inherit that mount. You can even have intermediate mount
namespaces that you forget/unref at any moment, that are only used to
perform setup, and that no other process/code can enter... You don't
have any leaks, nothing! You control that piece of code.

If you want that blockdev to become not-readable or noexec in any
namespace which does not have /proc/$pid/ns/user as an ancestor,
the kernel offers a better interface: it allows that blockdev to not
even show up in any ancestor, by making use of mount namespaces and
MS_PRIVATE, MS_SLAVE... no one will even notice that the mount exists.

However if you want to access that blockdev for whatever reason, then
create a new mount namespace and use MS_PRIVATE, MS_SLAVE and all the
noexec flags and mount it.

Yes slightly different things, but I don't want to add complexity where
the interface already exists in the kernel...


> With obvious check as in write-versus-execute exclusion that you cannot
> mark the blockdev shifted if ancestor user_ns already has a file open for
> execute.

Please note here that it's the same ancestor who will mark the blockdev
to be shifted, but why would the ancestor at the same time keep a file
open in that filesystem that is meant to be shifted, and later execute
through that fd a program that was just crafted by an untrusted container?!


For me the kernel already offers the interfaces, so there is no need to
complicate things or enforce it... As said in other responses, the design
of these patches is to just use what the kernel already provides.



> BTW, perhaps I should do this in a separate email, but here is how I would
> expect to use this:
> 
> 1. Using zfs: I create a bare (unshifted) rootfs fs1.   When I want to
> create a new container, I zfs clone fs1 to fs2, and let the container
> use fs2 shifted.  No danger to fs1 since fs2 is cow.  Same with btrfs.

Yes that would work, since fs1 is unshifted. The only requirement is
that fs2 should not reside on the same backing store as fs1, so as not to
share quota with fs1 (I'm not a ZFS user...), and you can always make the
parent of the fs2 mount point or the containers directories 0700... and
root should not go there and exec programs, just like it's not safe to go
to /home/$user... and exec...


> 2. Using overlay: I create a bare (unshifted) rootfs fs1.  When I want
> to create a new container, I mount fs1 read-only and shifted as base
> layer, then fs2 as the rw layer.

Yes, here you may share quota if all the fs2 rw layers of all containers
reside on the same backing store... but here the requirement is that fs1
should be mounted the first time with shifted uids/gids, where fs1 resides
on ext4, btrfs, xfs or any other filesystem that supports shifting. This
means you may have to mount fs1 on a different backing store, say on
/root-fs0/lib/container-image-fs1/ with vfs_shift_uids/gids, then use it
as a shared read-only lower layer.
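
For the overlay step itself, a container manager could build the option
string along these lines. This is a sketch only: demo_overlay_opts() is
a hypothetical helper and the upper/work paths are placeholders;
lowerdir, upperdir and workdir are the regular overlayfs mount options.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the overlayfs option string for one container.
 * Returns 0 on success, -1 if a path is unusable or the buffer is
 * too small.
 */
int demo_overlay_opts(char *buf, size_t len, const char *lower,
		      const char *upper, const char *work)
{
	int n;

	/* A comma inside a path would be parsed as an option separator. */
	if (strchr(lower, ',') || strchr(upper, ',') || strchr(work, ','))
		return -1;

	n = snprintf(buf, len, "lowerdir=%s,upperdir=%s,workdir=%s",
		     lower, upper, work);
	if (n < 0 || (size_t)n >= len)
		return -1;

	return 0;
}
```

The resulting string would then be passed as the data argument of
mount("overlay", target, "overlay", 0, buf), with the shared read-only
image as lowerdir and a per-container directory as upperdir.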

Of course you may just use your host / as a readonly layer where you
mount it the first time with vfs_shift_uids/gids but as discussed above
that's not really safe unless that's not a shared user system, or you
know what you are doing...

These patches do not edit overlayfs. overlayfs support is transparent if
the underlying filesystems, i.e. the ones holding the upper and lower
directories, support vfs_shift_uids/vfs_shift_gids.

If we go with overlayfs, we make it an overlayfs problem where it needs
a different approach related to union mounts, which I noted in the cover
letter of these patches.


> The point here is that the zfs clone plus container start takes (for a
> 600-800M rootfs) about .5 seconds on my laptop, while the act of shifting
> all the uids takes another 2 seconds.  So being able to do this without
> manually shifting would be a huge improvement for cases (e.g. docker)
> where you do lots and lots of quick deploys.
> 

That's one of the use cases of course! You can also verify the
integrity... and you are able to really make the containers' fs read-only
without the recursive chown...

Thank you for your feedback!

-- 
Djalal Harouni
http://opendz.org


* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-05 22:08         ` James Bottomley
@ 2016-05-10 23:36           ` James Bottomley
  2016-05-11  0:38             ` Al Viro
                               ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: James Bottomley @ 2016-05-10 23:36 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy

On Thu, 2016-05-05 at 18:08 -0400, James Bottomley wrote:
> On Thu, 2016-05-05 at 22:49 +0100, Djalal Harouni wrote:
> > On Thu, May 05, 2016 at 07:56:28AM -0400, James Bottomley wrote:
> > > On Thu, 2016-05-05 at 08:36 +0100, Djalal Harouni wrote:
> > > > On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley
> > > > wrote:
[...]
> > > > > So this option was discussed at the recent LSF/MM summit.  The
> > > > > most supported suggestion was that you'd use a new internal fs
> > > > > type that had a struct mount with a new superblock and would
> > > > > copy the underlying inodes but substitute its own with modified
> > > > > ->getattr/->setattr calls that did the uid shift.  In many ways
> > > > > it would be a remapping bind which would look similar to
> > > > > overlayfs but be a lot simpler.
> > > > 
> > > > Hmm, it's not only about ->getattr and ->setattr, you have all 
> > > > the other file system operations that need access too...
> > > 
> > > Why?  Or perhaps we should more cogently define the actual 
> > > problem.   My problem is simply mounting image volumes that were 
> > > created with real uids at user namespace shifted uids because I'm
> > >  downshifting the privileged ids in the container.  I actually 
> > > *only* need the uid/gids on the attributes shifted because that's
> > > what I need to manipulate the
> > >   
> > We need them obviously for read/write/creation... ?!
> 
> OK, so the way attributes are populated on an inode is via getattr. 
>  You intercept that, you change the inode owner and group that are
> installed on the inode.  That means that when you list the directory,
> you see the shift and the shifted uid/gid are used to check 
> permissions for vfs_open().

Just to illustrate how this could be done, here's a functional proof of
concept for a uid/gid shifting bind mount equivalent.  It's not
actually a proper bind mount because it has to manufacture its own
inodes.  As you can see, it can only be used by root, it will shift all
the uid/gid bits as well as the permission comparisons.  It operates on
subtrees, so it can shift the uids/gids on any filesystem or part of
one and because the shifts are per superblock, it could actually shift
the same subtree for multiple users on different shifts.  Best of all,
it requires no vfs changes at all, being entirely implemented inside
its own filesystem type.

You use it just like bind mount:

mount -t shiftfs <source> <target>

except that it takes uidmap=x:y:z and gidmap=x:y:z multiple times
as options.  It's currently not recursive and it definitely needs
polishing to show things like mount options and to be properly wired
into Kconfig.
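
To show how one x:y:z triple is meant to behave, here is a small
userspace sketch that mirrors the uidmap token and the map_id_down()
logic from the patch below. The demo_* names are hypothetical and the
sketch handles a single extent only; ids outside the extent are returned
unchanged, as in the patch.

```c
#include <stdint.h>
#include <stdio.h>

/* One shift extent: ids [first, first + count) on the source tree are
 * shown as [lower_first, lower_first + count) on the shifted mount. */
struct demo_shift_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Parse a single "uidmap=from:to:count" option; 0 on success, -1 on error. */
int demo_parse_uidmap(const char *opt, struct demo_shift_extent *ext)
{
	unsigned int from, to, count;
	char tail;

	/* Exactly three numbers and no trailing characters are accepted. */
	if (sscanf(opt, "uidmap=%u:%u:%u%c", &from, &to, &count, &tail) != 3)
		return -1;
	if (count == 0)
		return -1;

	ext->first = from;
	ext->lower_first = to;
	ext->count = count;
	return 0;
}

/* Shift an id through the extent; unmapped ids are returned unchanged. */
uint32_t demo_map_id_down(const struct demo_shift_extent *ext, uint32_t id)
{
	if (id >= ext->first && id - ext->first < ext->count)
		return ext->lower_first + (id - ext->first);
	return id;
}
```

So with uidmap=0:1000000:65536, a file owned by uid 0 on the source tree
shows up as owned by 1000000 on the shifted mount, and uid 70000 is left
as it is.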

There's a bit of an open question of whether it should have vfs
changes: the way the struct file f_inode and f_ops are hijacked is a
bit nasty and perhaps d_select_inode() could be made a bit cleverer to
help us here instead.

James

---

 fs/Makefile                |   1 +
 fs/shiftfs.c               | 790 +++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/magic.h |   2 +
 3 files changed, 793 insertions(+)

diff --git a/fs/Makefile b/fs/Makefile
index 85b6e13..bad03b2 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -128,3 +128,4 @@ obj-y				+= exofs/ # Multiple modules
 obj-$(CONFIG_CEPH_FS)		+= ceph/
 obj-$(CONFIG_PSTORE)		+= pstore/
 obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
+obj-m				+= shiftfs.o
diff --git a/fs/shiftfs.c b/fs/shiftfs.c
new file mode 100644
index 0000000..b40cdfe
--- /dev/null
+++ b/fs/shiftfs.c
@@ -0,0 +1,790 @@
+#include <linux/cred.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/magic.h>
+#include <linux/parser.h>
+#include <linux/statfs.h>
+#include <linux/slab.h>
+#include <linux/user_namespace.h>
+#include <linux/uidgid.h>
+
+struct shiftfs_super_info {
+	struct vfsmount *mnt;
+	struct uid_gid_map uid_map, gid_map;
+};
+
+static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
+				       struct dentry *dentry);
+
+enum {
+	OPT_UIDMAP,
+	OPT_GIDMAP,
+	OPT_LAST,
+};
+
+/* global filesystem options */
+static const match_table_t tokens = {
+	{ OPT_UIDMAP, "uidmap=%u:%u:%u" },
+	{ OPT_GIDMAP, "gidmap=%u:%u:%u" },
+	{ OPT_LAST, NULL }
+};
+
+/*
+ * code stolen from user_namespace.c ... except that these functions
+ * return the same id back if unmapped ... should probably have a
+ * library?
+ */
+static u32 map_id_down(struct uid_gid_map *map, u32 id)
+{
+	unsigned idx, extents;
+	u32 first, last;
+
+	/* Find the matching extent */
+	extents = map->nr_extents;
+	smp_rmb();
+	for (idx = 0; idx < extents; idx++) {
+		first = map->extent[idx].first;
+		last = first + map->extent[idx].count - 1;
+		if (id >= first && id <= last)
+			break;
+	}
+	/* Map the id; an unmapped id is returned unchanged */
+	if (idx < extents)
+		id = (id - first) + map->extent[idx].lower_first;
+
+	return id;
+}
+
+static u32 map_id_up(struct uid_gid_map *map, u32 id)
+{
+	unsigned idx, extents;
+	u32 first, last;
+
+	/* Find the matching extent */
+	extents = map->nr_extents;
+	smp_rmb();
+	for (idx = 0; idx < extents; idx++) {
+		first = map->extent[idx].lower_first;
+		last = first + map->extent[idx].count - 1;
+		if (id >= first && id <= last)
+			break;
+	}
+	/* Map the id; an unmapped id is returned unchanged */
+	if (idx < extents)
+		id = (id - first) + map->extent[idx].first;
+
+	return id;
+}
+
+static bool mappings_overlap(struct uid_gid_map *new_map,
+			     struct uid_gid_extent *extent)
+{
+	u32 upper_first, lower_first, upper_last, lower_last;
+	unsigned idx;
+
+	upper_first = extent->first;
+	lower_first = extent->lower_first;
+	upper_last = upper_first + extent->count - 1;
+	lower_last = lower_first + extent->count - 1;
+
+	for (idx = 0; idx < new_map->nr_extents; idx++) {
+		u32 prev_upper_first, prev_lower_first;
+		u32 prev_upper_last, prev_lower_last;
+		struct uid_gid_extent *prev;
+
+		prev = &new_map->extent[idx];
+
+		prev_upper_first = prev->first;
+		prev_lower_first = prev->lower_first;
+		prev_upper_last = prev_upper_first + prev->count - 1;
+		prev_lower_last = prev_lower_first + prev->count - 1;
+
+		/* Does the upper range intersect a previous extent? */
+		if ((prev_upper_first <= upper_last) &&
+		    (prev_upper_last >= upper_first))
+			return true;
+
+		/* Does the lower range intersect a previous extent? */
+		if ((prev_lower_first <= lower_last) &&
+		    (prev_lower_last >= lower_first))
+			return true;
+	}
+	return false;
+}
+/* end code stolen from user_namespace.c */
+
+static const struct cred *shiftfs_get_up_creds(struct super_block *sb)
+{
+	struct cred *cred = prepare_creds();
+	struct shiftfs_super_info *ssi = sb->s_fs_info;
+
+	if (!cred)
+		return NULL;
+
+	cred->fsuid = KUIDT_INIT(map_id_up(&ssi->uid_map, __kuid_val(cred->fsuid)));
+	cred->fsgid = KGIDT_INIT(map_id_up(&ssi->gid_map, __kgid_val(cred->fsgid)));
+
+	return cred;
+}
+
+static const struct cred *shiftfs_new_creds(const struct cred **newcred,
+					    struct super_block *sb)
+{
+	const struct cred *cred = shiftfs_get_up_creds(sb);
+
+	*newcred = cred;
+
+	if (cred)
+		cred = override_creds(cred);
+	else
+		printk(KERN_ERR "Credential override failed: no memory\n");
+
+	return cred;
+}
+
+static void shiftfs_old_creds(const struct cred *oldcred,
+			      const struct cred **newcred)
+{
+	if (!*newcred)
+		return;
+
+	revert_creds(oldcred);
+	put_cred(*newcred);
+}
+
+static int shiftfs_parse_options(struct shiftfs_super_info *ssi, char *options)
+{
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int from, to, count;
+	struct uid_gid_map *map, *maps[2] = {
+		[OPT_UIDMAP] = &ssi->uid_map,
+		[OPT_GIDMAP] = &ssi->gid_map,
+	};
+
+	while ((p = strsep(&options, ",")) != NULL) {
+		int token;
+		struct uid_gid_extent ext;
+
+		if (!*p)
+			continue;
+
+		token = match_token(p, tokens, args);
+		if (token != OPT_UIDMAP && token != OPT_GIDMAP)
+			return -EINVAL;
+		if (match_int(&args[0], &from) ||
+		    match_int(&args[1], &to) ||
+		    match_int(&args[2], &count))
+			return -EINVAL;
+		map = maps[token];
+		if (map->nr_extents >= UID_GID_MAP_MAX_EXTENTS)
+			return -EINVAL;
+		ext.first = from;
+		ext.lower_first = to;
+		ext.count = count;
+		if (mappings_overlap(map, &ext))
+			return -EINVAL;
+		map->extent[map->nr_extents++] = ext;
+	}
+	return 0;
+}
+
+static void shiftfs_d_iput(struct dentry *dentry, struct inode *inode)
+{
+	struct dentry *real = inode->i_private;
+
+	dput(real);
+	iput(inode);
+}
+
+static const struct dentry_operations shiftfs_dentry_ops = {
+	.d_iput		= shiftfs_d_iput,
+};
+
+static int shiftfs_readlink(struct dentry *dentry, char __user *data,
+			    int flags)
+{
+	struct dentry *real = dentry->d_inode->i_private;
+	const struct inode_operations *iop = real->d_inode->i_op;
+
+	if (iop->readlink)
+		return iop->readlink(real, data, flags);
+
+	return -EINVAL;
+}
+
+static const char *shiftfs_get_link(struct dentry *dentry, struct inode *inode,
+				    struct delayed_call *done)
+{
+	if (dentry) {
+		struct dentry *real = dentry->d_inode->i_private;
+		struct inode *reali = real->d_inode;
+		const struct inode_operations *iop = reali->i_op;
+		const char *res = ERR_PTR(-EPERM);
+
+		if (iop->get_link)
+			res = iop->get_link(real, reali, done);
+
+		return res;
+	} else {
+		/* RCU lookup not supported */
+		return ERR_PTR(-ECHILD);
+	}
+}
+
+static int shiftfs_setxattr(struct dentry *dentry, const char *name,
+			    const void *value, size_t size, int flags)
+{
+	struct dentry *real = dentry->d_inode->i_private;
+	const struct inode_operations *iop = real->d_inode->i_op;
+	int err = -EOPNOTSUPP;
+
+	if (iop->setxattr) {
+		const struct cred *oldcred, *newcred;
+
+		oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+		err = iop->setxattr(real, name, value, size, flags);
+		shiftfs_old_creds(oldcred, &newcred);
+	}
+
+	return err;
+}
+
+static ssize_t shiftfs_getxattr(struct dentry *dentry, const char *name,
+				void *value, size_t size)
+{
+	struct dentry *real = dentry->d_inode->i_private;
+	const struct inode_operations *iop = real->d_inode->i_op;
+	int err = -EOPNOTSUPP;
+
+	if (iop->getxattr) {
+		const struct cred *oldcred, *newcred;
+
+		oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+		err = iop->getxattr(real, name, value, size);
+		shiftfs_old_creds(oldcred, &newcred);
+	}
+
+	return err;
+}
+
+static ssize_t shiftfs_listxattr(struct dentry *dentry, char *list,
+				 size_t size)
+{
+	struct dentry *real = dentry->d_inode->i_private;
+	const struct inode_operations *iop = real->d_inode->i_op;
+
+	if (iop->listxattr)
+		return iop->listxattr(real, list, size);
+
+	return -EINVAL;
+}
+
+static int shiftfs_removexattr(struct dentry *dentry, const char *name)
+{
+	struct dentry *real = dentry->d_inode->i_private;
+	const struct inode_operations *iop = real->d_inode->i_op;
+
+	if (iop->removexattr)
+		return iop->removexattr(real, name);
+
+	return -EINVAL;
+}
+
+static void shiftfs_fill_inode(struct inode *inode, struct dentry *dentry)
+{
+	struct inode *reali;
+	struct shiftfs_super_info *ssi = inode->i_sb->s_fs_info;
+
+	if (!dentry)
+		return;
+
+	reali = dentry->d_inode;
+
+	if (!reali->i_op->get_link)
+		inode->i_opflags |= IOP_NOFOLLOW;
+
+	inode->i_mapping = reali->i_mapping;
+	inode->i_private = dentry;
+
+	inode->i_uid = KUIDT_INIT(map_id_down(&ssi->uid_map, __kuid_val(reali->i_uid)));
+	inode->i_gid = KGIDT_INIT(map_id_down(&ssi->gid_map, __kgid_val(reali->i_gid)));
+}
+
+static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
+			       umode_t mode, const char *symlink,
+			       struct dentry *hardlink, bool excl)
+{
+	struct dentry *real = dir->i_private, *new;
+	struct inode *reali = real->d_inode, *newi;
+	const struct inode_operations *iop = reali->i_op;
+	int err;
+	const struct cred *oldcred, *newcred;
+	bool op_ok = false;
+
+	if (hardlink) {
+		op_ok = iop->link;
+	} else {
+		switch (mode & S_IFMT) {
+		case S_IFDIR:
+			op_ok = iop->mkdir;
+			break;
+		case S_IFREG:
+			op_ok = iop->create;
+			break;
+		case S_IFLNK:
+			op_ok = iop->symlink;
+		}
+	}
+	if (!op_ok)
+		return -EINVAL;
+
+
+	newi = shiftfs_new_inode(dentry->d_sb, mode, NULL);
+	if (!newi)
+		return -ENOMEM;
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+
+	inode_lock_nested(reali, I_MUTEX_PARENT);
+	new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
+	err = PTR_ERR(new);
+	if (IS_ERR(new))
+		goto out_unlock;
+
+	if (hardlink) {
+		struct dentry *realhardlink = hardlink->d_inode->i_private;
+
+		err = vfs_link(new, reali, realhardlink, NULL);
+	} else {
+		switch (mode & S_IFMT) {
+		case S_IFDIR:
+			err = vfs_mkdir(reali, new, mode);
+			break;
+		case S_IFREG:
+			err = vfs_create(reali, new, mode, excl);
+			break;
+		case S_IFLNK:
+			err = vfs_symlink(reali, new, symlink);
+		}
+	}
+
+	if (err)
+		goto out_dput;
+
+	shiftfs_fill_inode(newi, new);
+
+	d_instantiate(dentry, newi);
+
+	new = NULL;
+	newi = NULL;
+
+ out_dput:
+	dput(new);
+ out_unlock:
+	/* restore the caller's credentials on every exit path */
+	shiftfs_old_creds(oldcred, &newcred);
+	iput(newi);
+	inode_unlock(reali);
+
+	return err;
+}
+
+static int shiftfs_create(struct inode *dir, struct dentry *dentry,
+			  umode_t mode,  bool excl)
+{
+	mode |= S_IFREG;
+
+	return shiftfs_make_object(dir, dentry, mode, NULL, NULL, excl);
+}
+
+static int shiftfs_mkdir(struct inode *dir, struct dentry *dentry,
+			 umode_t mode)
+{
+	mode |= S_IFDIR;
+
+	return shiftfs_make_object(dir, dentry, mode, NULL, NULL, false);
+}
+
+static int shiftfs_link(struct dentry *dentry, struct inode *dir,
+			struct dentry *hardlink)
+{
+	return shiftfs_make_object(dir, dentry, 0, NULL, hardlink, false);
+}
+
+static int shiftfs_symlink(struct inode *dir, struct dentry *dentry,
+			   const char *symlink)
+{
+	return shiftfs_make_object(dir, dentry, S_IFLNK, symlink, NULL, false);
+}
+
+static int shiftfs_rm(struct inode *dir, struct dentry *dentry, bool rmdir)
+{
+	struct dentry *real = dir->i_private, *new;
+	struct inode *reali = real->d_inode;
+	int err;
+	const struct cred *oldcred, *newcred;
+
+	inode_lock_nested(reali, I_MUTEX_PARENT);
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+
+	new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
+	err = PTR_ERR(new);
+	if (IS_ERR(new))
+		goto out_unlock;
+
+	if (rmdir)
+		err = vfs_rmdir(reali, new);
+	else
+		err = vfs_unlink(reali, new, NULL);
+
+	dput(new);
+
+ out_unlock:
+	shiftfs_old_creds(oldcred, &newcred);
+	inode_unlock(reali);
+
+	return err;
+}
+
+static int shiftfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	return shiftfs_rm(dir, dentry, false);
+}
+
+static int shiftfs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	return shiftfs_rm(dir, dentry, true);
+}
+
+static int shiftfs_rename2(struct inode *olddir, struct dentry *old,
+			   struct inode *newdir, struct dentry *new,
+			   unsigned int flags)
+{
+	struct dentry *rodd = olddir->i_private, *rndd = newdir->i_private,
+		*realold = old->d_inode->i_private,
+		*realnew = new->d_inode->i_private;
+	struct inode *realolddir = rodd->d_inode, *realnewdir = rndd->d_inode;
+	const struct inode_operations *iop = realolddir->i_op;
+	int err;
+	const struct cred *oldcred, *newcred;
+
+	oldcred = shiftfs_new_creds(&newcred, old->d_sb);
+	err = iop->rename2(realolddir, realold, realnewdir, realnew, flags);
+	shiftfs_old_creds(oldcred, &newcred);
+
+	return err;
+}
+
+static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
+				     unsigned int flags)
+{
+	struct dentry *real = dir->i_private, *new;
+	struct inode *reali = real->d_inode, *newi;
+	const struct cred *oldcred, *newcred;
+
+	/* note: violation of usual fs rules here: dentries are never
+	 * added with d_add.  This is because we want no dentry cache
+	 * for shiftfs.  All lookups proceed through the dentry cache
+	 * of the underlying filesystem, meaning we always see any
+	 * changes in the underlying */
+
+	inode_lock(reali);
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+	new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
+	shiftfs_old_creds(oldcred, &newcred);
+	inode_unlock(reali);
+
+	if (IS_ERR(new) || !new)
+		return new;
+
+	if (!new->d_inode) {
+		dput(new);
+		return NULL;
+	}
+
+	newi = shiftfs_new_inode(dentry->d_sb, new->d_inode->i_mode, new);
+	if (!newi) {
+		dput(new);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	d_instantiate(dentry, newi);
+
+	return NULL;
+}
+
+static int shiftfs_permission(struct inode *inode, int mask)
+{
+	struct dentry *real = inode->i_private;
+	struct inode *reali = real->d_inode;
+	const struct inode_operations *iop = reali->i_op;
+	int err;
+	const struct cred *oldcred, *newcred;
+
+	oldcred = shiftfs_new_creds(&newcred, inode->i_sb);
+	if (iop->permission)
+		err = iop->permission(reali, mask);
+	else
+		err = generic_permission(reali, mask);
+	shiftfs_old_creds(oldcred, &newcred);
+
+	return err;
+}
+
+static int shiftfs_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	struct dentry *real = dentry->d_inode->i_private;
+	struct inode *reali = real->d_inode;
+	const struct inode_operations *iop = reali->i_op;
+	struct iattr newattr = *attr;
+	const struct cred *oldcred, *newcred;
+	struct shiftfs_super_info *ssi = dentry->d_sb->s_fs_info;
+	int err;
+
+	newattr.ia_uid = KUIDT_INIT(map_id_up(&ssi->uid_map, __kuid_val(attr->ia_uid)));
+	newattr.ia_gid = KGIDT_INIT(map_id_up(&ssi->gid_map, __kgid_val(attr->ia_gid)));
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+	if (iop->setattr)
+		err = iop->setattr(real, &newattr);
+	else
+		err = simple_setattr(real, &newattr);
+	shiftfs_old_creds(oldcred, &newcred);
+
+	return err;
+}
+
+static int shiftfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
+			   struct kstat *stat)
+{
+	struct inode *inode = dentry->d_inode;
+	struct dentry *real = inode->i_private;
+	struct inode *reali = real->d_inode;
+	const struct inode_operations *iop = reali->i_op;
+	int err = 0;
+
+	mnt = ((struct shiftfs_super_info *)dentry->d_sb->s_fs_info)->mnt;
+
+	if (iop->getattr)
+		err = iop->getattr(mnt, real, stat);
+	else
+		generic_fillattr(reali, stat);
+
+	if (err)
+		return err;
+
+	stat->uid = inode->i_uid;
+	stat->gid = inode->i_gid;
+	return 0;
+}
+
+struct shiftfs_fop_carrier {
+	struct inode *inode;
+	int (*release)(struct inode *, struct file *);
+	struct file_operations fop;
+};
+
+static int shiftfs_release(struct inode *inode, struct file *file)
+{
+	struct shiftfs_fop_carrier *sfc;
+	int err = 0;
+
+	sfc = container_of(file->f_op, struct shiftfs_fop_carrier, fop);
+
+	if (sfc->release)
+		err = sfc->release(inode, file);
+
+	file->f_inode = sfc->inode;
+	file->f_op = sfc->inode->i_fop;
+
+	kfree(sfc);
+
+	return err;
+}
+
+static int shiftfs_open(struct inode *inode, struct file *file)
+{
+	struct dentry *real = inode->i_private;
+	struct inode *reali = real->d_inode;
+	const struct file_operations *fop;
+	struct shiftfs_fop_carrier *sfc;
+	int err = 0;
+
+	sfc = kmalloc(sizeof(*sfc), GFP_KERNEL);
+	if (!sfc)
+		return -ENOMEM;
+
+	if (real->d_flags & DCACHE_OP_SELECT_INODE)
+		reali = real->d_op->d_select_inode(real, file->f_flags);
+
+	fop = reali->i_fop;
+	sfc->inode = inode;
+	memcpy(&sfc->fop, fop, sizeof(*fop));
+	sfc->release = sfc->fop.release;
+	sfc->fop.release = shiftfs_release;
+
+	file->f_op = &sfc->fop;
+	file->f_inode = reali;
+
+	if (fop->open)
+		err = fop->open(reali, file);
+
+	return err;
+}
+
+static const struct inode_operations shiftfs_inode_ops = {
+	/* intercepted */
+	.lookup		= shiftfs_lookup,
+	.getattr	= shiftfs_getattr,
+	.setattr	= shiftfs_setattr,
+	.permission	= shiftfs_permission,
+
+	/* pass through */
+	.mkdir		= shiftfs_mkdir,
+	.symlink	= shiftfs_symlink,
+	.get_link	= shiftfs_get_link,
+	.readlink	= shiftfs_readlink,
+	.unlink		= shiftfs_unlink,
+	.rmdir		= shiftfs_rmdir,
+	.rename2	= shiftfs_rename2,
+	.link		= shiftfs_link,
+	.create		= shiftfs_create,
+	.mknod		= NULL,	/* no special files currently */
+	.setxattr	= shiftfs_setxattr,
+	.getxattr	= shiftfs_getxattr,
+	.listxattr	= shiftfs_listxattr,
+	.removexattr	= shiftfs_removexattr,
+};
+
+static const struct file_operations shiftfs_file_ops = {
+	.open		= shiftfs_open,
+};
+
+static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
+				       struct dentry *dentry)
+{
+	struct inode *inode;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return NULL;
+
+	mode &= S_IFMT;
+
+	inode->i_ino = get_next_ino();
+	inode->i_mode = mode;
+	inode->i_flags |= S_NOATIME | S_NOCMTIME;
+
+	inode->i_op = &shiftfs_inode_ops;
+	inode->i_fop = &shiftfs_file_ops;
+
+	shiftfs_fill_inode(inode, dentry);
+
+	return inode;
+}
+
+static void shiftfs_put_super(struct super_block *sb)
+{
+	struct shiftfs_super_info *ssi = sb->s_fs_info;
+
+	mntput(ssi->mnt);
+	kfree(ssi);
+}
+
+static const struct super_operations shiftfs_super_ops = {
+	.put_super	= shiftfs_put_super,
+};
+
+struct shiftfs_data {
+	void *data;
+	const char *path;
+};
+
+static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
+			      int silent)
+{
+	struct shiftfs_data *data = raw_data;
+	char *name = kstrdup(data->path, GFP_KERNEL);
+	int err = -ENOMEM;
+	struct shiftfs_super_info *ssi = NULL;
+	struct path path;
+
+	if (!name)
+		goto out;
+
+	ssi = kzalloc(sizeof(*ssi), GFP_KERNEL);
+	if (!ssi)
+		goto out;
+
+	err = -EPERM;
+	if (!capable(CAP_SYS_ADMIN))
+		goto out;
+
+	err = shiftfs_parse_options(ssi, data->data);
+	if (err)
+		goto out;
+
+	err = kern_path(name, LOOKUP_FOLLOW, &path);
+	if (err)
+		goto out;
+
+	if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
+		err = -ENOTDIR;
+		goto out_put;
+	}
+	ssi->mnt = path.mnt;
+
+	sb->s_fs_info = ssi;
+	sb->s_magic = SHIFTFS_MAGIC;
+	sb->s_op = &shiftfs_super_ops;
+	sb->s_d_op = &shiftfs_dentry_ops;
+	sb->s_root = d_make_root(shiftfs_new_inode(sb, S_IFDIR, path.dentry));
+
+	return 0;
+
+ out_put:
+	path_put(&path);
+ out:
+	kfree(name);
+	if (err)
+		kfree(ssi);
+	return err;
+}
+
+static struct dentry *shiftfs_mount(struct file_system_type *fs_type,
+				    int flags, const char *dev_name, void *data)
+{
+	struct shiftfs_data d = { data, dev_name };
+
+	return mount_nodev(fs_type, flags, &d, shiftfs_fill_super);
+}
+
+static struct file_system_type shiftfs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "shiftfs",
+	.mount		= shiftfs_mount,
+	.kill_sb	= kill_anon_super,
+};
+
+static int __init shiftfs_init(void)
+{
+	return register_filesystem(&shiftfs_type);
+}
+
+static void __exit shiftfs_exit(void)
+{
+	unregister_filesystem(&shiftfs_type);
+}
+
+MODULE_ALIAS_FS("shiftfs");
+MODULE_AUTHOR("James Bottomley");
+MODULE_DESCRIPTION("uid/gid shifting bind filesystem");
+MODULE_LICENSE("GPL v2");
+module_init(shiftfs_init)
+module_exit(shiftfs_exit)
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 0de181a..d7992f5 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -79,4 +79,6 @@
 #define NSFS_MAGIC		0x6e736673
 #define BPF_FS_MAGIC		0xcafe4a11
 
+#define SHIFTFS_MAGIC		0x6a656a62
+
 #endif /* __LINUX_MAGIC_H__ */

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-10 23:36           ` James Bottomley
@ 2016-05-11  0:38             ` Al Viro
  2016-05-11  0:53             ` Al Viro
  2016-05-11 16:42             ` Djalal Harouni
  2 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2016-05-11  0:38 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy

On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:

> mount -t shiftfs <source> <target>

Note to self: do not eat while reading l-k...

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-10 23:36           ` James Bottomley
  2016-05-11  0:38             ` Al Viro
@ 2016-05-11  0:53             ` Al Viro
  2016-05-11  3:47               ` James Bottomley
  2016-05-11 16:42             ` Djalal Harouni
  2 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2016-05-11  0:53 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy

On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> +static int shiftfs_rename2(struct inode *olddir, struct dentry *old,
> +			   struct inode *newdir, struct dentry *new,
> +			   unsigned int flags)
> +{
> +	struct dentry *rodd = olddir->i_private, *rndd = newdir->i_private,
> +		*realold = old->d_inode->i_private,
> +		*realnew = new->d_inode->i_private;
> +	struct inode *realolddir = rodd->d_inode, *realnewdir = rndd->d_inode;
> +	const struct inode_operations *iop = realolddir->i_op;
> +	int err;
> +	const struct cred *oldcred, *newcred;
> +
> +	oldcred = shiftfs_new_creds(&newcred, old->d_sb);
> +	err = iop->rename2(realolddir, realold, realnewdir, realnew, flags);
> +	shiftfs_old_creds(oldcred, &newcred);

... and you've just violated all locking rules for ->rename2().

> +static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
> +				     unsigned int flags)
> +{
> +	struct dentry *real = dir->i_private, *new;
> +	struct inode *reali = real->d_inode, *newi;
> +	const struct cred *oldcred, *newcred;
> +
> +	/* note: violation of usual fs rules here: dentries are never
> +	 * added with d_add.  This is because we want no dentry cache
> +	 * for shiftfs.  All lookups proceed through the dentry cache
> +	 * of the underlying filesystem, meaning we always see any
> +	 * changes in the underlying */

Bloody wonderful.  So
	* we lose caching the negative lookups
	* we've got buggered hardlinks (different inodes for those)
	* it has never, ever been tried on -next (would do rather nasty
things on that d_instantiate())

> +
> +	kfree(sfc);
> +
> +	return err;
> +}

> +	file->f_op = &sfc->fop;

Lovely - now try that with underlying fs something built modular.

Or try to use it on top of something with non-trivial dentry_operations
(hell, on top of itself, for starters).

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-11  0:53             ` Al Viro
@ 2016-05-11  3:47               ` James Bottomley
  0 siblings, 0 replies; 49+ messages in thread
From: James Bottomley @ 2016-05-11  3:47 UTC (permalink / raw)
  To: Al Viro
  Cc: Djalal Harouni, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy

On Wed, 2016-05-11 at 01:53 +0100, Al Viro wrote:
> On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> > +static int shiftfs_rename2(struct inode *olddir, struct dentry
> > *old,
> > +			   struct inode *newdir, struct dentry
> > *new,
> > +			   unsigned int flags)
> > +{
> > +	struct dentry *rodd = olddir->i_private, *rndd = newdir
> > ->i_private,
> > +		*realold = old->d_inode->i_private,
> > +		*realnew = new->d_inode->i_private;
> > +	struct inode *realolddir = rodd->d_inode, *realnewdir =
> > rndd->d_inode;
> > +	const struct inode_operations *iop = realolddir->i_op;
> > +	int err;
> > +	const struct cred *oldcred, *newcred;
> > +
> > +	oldcred = shiftfs_new_creds(&newcred, old->d_sb);
> > +	err = iop->rename2(realolddir, realold, realnewdir,
> > realnew, flags);
> > +	shiftfs_old_creds(oldcred, &newcred);
> 
> ... and you've just violated all locking rules for ->rename2().

Yes, sorry, somehow I missed that when I converted everything else to
the vfs_ functions.

> > +static struct dentry *shiftfs_lookup(struct inode *dir, struct
> > dentry *dentry,
> > +				     unsigned int flags)
> > +{
> > +	struct dentry *real = dir->i_private, *new;
> > +	struct inode *reali = real->d_inode, *newi;
> > +	const struct cred *oldcred, *newcred;
> > +
> > +	/* note: violation of usual fs rules here: dentries are
> > never
> > +	 * added with d_add.  This is because we want no dentry
> > cache
> > +	 * for shiftfs.  All lookups proceed through the dentry
> > cache
> > +	 * of the underlying filesystem, meaning we always see any
> > +	 * changes in the underlying */
> 
> Bloody wonderful.  So
> 	* we lose caching the negative lookups

We do?  They should be cached in the underlying layer's dcache. If
that's not enough, I can hash them, but I was trying to avoid doubling
the dcache size.

> 	* we've got buggered hardlinks (different inodes for those)

Yes, had a note to do the lookup, but forgot.

> 	* it has never, ever been tried on -next (would do rather nasty
> things on that d_instantiate())

So this is just a proof of concept; I figured it was best to do it
against current rather than have people who wanted to try it pull in
your tree.  I can respin it after the merge window closes.

> 
> > +
> > +	kfree(sfc);
> > +
> > +	return err;
> > +}
> 
> > +	file->f_op = &sfc->fop;
> 
> Lovely - now try that with underlying fs something built modular.
> 
> Or try to use it on top of something with non-trivial
> dentry_operations
> (hell, on top of itself, for starters).

So if I add the missing fops_get/put, you're happy with the way this
hijacks f_op and f_inode?

James

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-10 23:36           ` James Bottomley
  2016-05-11  0:38             ` Al Viro
  2016-05-11  0:53             ` Al Viro
@ 2016-05-11 16:42             ` Djalal Harouni
  2016-05-11 18:33               ` James Bottomley
  2 siblings, 1 reply; 49+ messages in thread
From: Djalal Harouni @ 2016-05-11 16:42 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner

On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> On Thu, 2016-05-05 at 18:08 -0400, James Bottomley wrote:
[...]
> > 
> > OK, so the way attributes are populated on an inode is via getattr. 
> >  You intercept that, you change the inode owner and group that are
> > installed on the inode.  That means that when you list the directory,
> > you see the shift and the shifted uid/gid are used to check 
> > permissions for vfs_open().
> 
> Just to illustrate how this could be done, here's a functional proof of
> concept for a uid/gid shifting bind mount equivalent.  It's not
> actually a proper bind mount because it has to manufacture its own
> inodes.  As you can see, it can only be used by root, it will shift all
> the uid/gid bits as well as the permission comparisons.  It operates on
> subtrees, so it can shift the uids/gids on any filesystem or part of
> one and because the shifts are per superblock, it could actually shift
> the same subtree for multiple users on different shifts.  Best of all,
> it requires no vfs changes at all, being entirely implemented inside
> its own filesystem type.

First, I guess this should be in a separate thread.. this way this RFC
was just hijacked!

Obviously as you say later in your response it may require a VFS
change... 

You have just consumed all inodes... what about containers or small apps
that are spawned quickly... it could maybe even be used as a DoS... maybe
you end up reporting different inode numbers...?


> You use it just like bind mount:
> 
> mount -t shiftfs <source> <target>
> 
> except that it takes uidshift=x:y:z and gidshift=x:y:z multiple times
> as options.  It's currently not recursive and it definitely needs
> polishing to show things like mount options and be properly Kconfig
> using.

Why is it not recursive? And what if you have circular bind mounts?

Hmm anyway you are mounting this on behalf of filesystems, so if you add
the recursive thing, you will just probably make everything worse, by
making any /proc, /sys dentry that's under that path shiftable, and
unprivileged users can just create user namespaces and read /proc/*
and all the other stuff that doesn't have capable() related to the
init_user_ns host...

  what if you have paths like /filesystem0/uidshiftedY/dir,
/filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir 
where some of them are also bind mounts that point to same dentry ?


Also, you create a totally new user namespace interface here! by making
your own new interface we just lose the notion of init_user_ns and its
children and mapping ?

I'm not sure of the implications of all this... your user namespace
mapping is not related at all to init_user_ns! It seems that it has
its own init_user_ns? Does a capable() check on a shifted filesystem
now relate to that, and hence to your mapping, or to the real
init_user_ns?


> There's a bit of an open question of whether it should have vfs
> changes: the way the struct file f_inode and f_ops are hijacked is a
> bit nasty and perhaps d_select_inode() could be made a bit cleverer to
> help us here instead.

I'm not sure if this PoC works... but are you sure you didn't introduce
a serious vulnerability here? You use a new mapping and you update the
current_fsuid() creds, which are global for any fs operation, so maybe:
let's operate on any inode, update our current_fsuid()... and
access the rest of the *unshifted filesystems*...!?

The worst thing is that current_fsuid() does not follow now the
/proc/self/uid_map interface! this is a serious vulnerability and a mix
of the current semantics... it's updated but using other rules...?

For overlayfs I did write an experiment, but for me it's not an overlayfs
or another new filesystem problem... we are manipulating UID/GID
identities...

It would have been better if you had sent this as a separate thread.
This was a vfs:userns RFC, and if we continue like this we turn it into
a complicated thing: implementing another new light filesystem with
userns... (overlayfs...)

Will follow up if the appropriate thread is created, not here, I guess
it's ok ?

> James
> 

Thank you for your feedback!


-- 
Djalal Harouni
http://opendz.org

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-11 16:42             ` Djalal Harouni
@ 2016-05-11 18:33               ` James Bottomley
  2016-05-12 19:55                 ` Djalal Harouni
  0 siblings, 1 reply; 49+ messages in thread
From: James Bottomley @ 2016-05-11 18:33 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner

On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> > On Thu, 2016-05-05 at 18:08 -0400, James Bottomley wrote:
> [...]
> > > 
> > > OK, so the way attributes are populated on an inode is via 
> > > getattr.  You intercept that, you change the inode owner and 
> > > group that are installed on the inode.  That means that when you 
> > > list the directory, you see the shift and the shifted uid/gid are 
> > > used to check permissions for vfs_open().
> > 
> > Just to illustrate how this could be done, here's a functional 
> > proof of concept for a uid/gid shifting bind mount equivalent. 
> >  It's not actually a proper bind mount because it has to 
> > manufacture its own inodes.  As you can see, it can only be used by 
> > root, it will shift all the uid/gid bits as well as the permission 
> > comparisons.  It operates on subtrees, so it can shift the 
> > uids/gids on any filesystem or part of one and because the shifts 
> > are per superblock, it could actually shift the same subtree for 
> > multiple users on different shifts.  Best of all, it requires no 
> > vfs changes at all, being entirely implemented inside its own
> > filesystem type.
> 
> First, I guess this should be in a separate thread.. this way this 
> RFC was just hijacked!
> 
> Obviously as you say later in your response it may require a VFS
> change... 

I thought it may but viro didn't rip my head off for shifting the file
operations and inode, so perhaps it's OK as is.

> You have just consumed all inodes... what about containers or small 
> apps that are spawned quickly... it could maybe even be used as a 
> DoS... maybe you end up reporting different inode numbers...?

Please explain?  Shiftfs deliberately doesn't populate its dentry
cache, so it basically has the same number of inodes and dentries in use
as the lower filesystem would ordinarily have.

> 
> > You use it just like bind mount:
> > 
> > mount -t shiftfs <source> <target>
> > 
> > except that it takes uidshift=x:y:z and gidshift=x:y:z multiple
> > times
> > as options.  It's currently not recursive and it definitely needs
> > polishing to show things like mount options and be properly Kconfig
> > using.
> 
> Why is it not recursive? And what if you have circular bind mounts?

Because, as I said, it's a proof of concept.  It can easily have MS_REC
semantics added.

> Hmm anyway you are mounting this on behalf of filesystems, so if you 
> add the recursive thing, you will just probably make everything 
> worse, by making any /proc, /sys dentry that's under that path 
> shiftable, and unprivileged users can just create user namespaces and 
> read /proc/* and all the other stuff that doesn't have capable() 
> related to the init_user_ns host...

That's up to the admin who does the shifting.  Recursive would be an
option if added.

>   what if you have paths like /filesystem0/uidshiftedY/dir,
> /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir 
> where some of them are also bind mounts that point to same dentry ?

Without recursive semantics, you see the underlying inode.  With them,
you see the upper vfsmnts.  Shiftfs isn't idempotent, so you would need
to be careful about nesting.  However, that's an admin problem.

> Also, you create a totally new user namespace interface here! by 
> making your own new interface we just lose the notion of init_user_ns 
> and its children and mapping ?

I don't quite understand this; the only use of the init_user_ns is the
capable(CAP_SYS_ADMIN) in fill_super which is how only the real admin
can mount at a shifted uid/gid.  Otherwise, there's no need to see into
the userns because filesystems see the kuid_t/kgid_t which is what I'm
shifting.

> I'm not sure of the implications of all this... your user namespace
> mapping is not related at all to init_user_ns! It seems that it has
> its own init_user_ns? Does a capable() check on a shifted filesystem
> now relate to that, and hence to your mapping, or to the real
> init_user_ns?

capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN)

Or is there a misunderstanding here about how user namespaces work
inside the kernel?  The design is that the ID shift is done as you
cross the kernel boundary, so a filesystem, being usually all in-kernel
operating via the VFS interfaces, ideally never needs to make any
from_kuid/make_kuid calls.  However, there are ways filesystems can
send data across the kernel/user boundary outside of the usual vfs
interfaces (ioctls being the most usual one) so in that specific code,
they have to do the kuid_t to uid_t changes themselves.  Shiftfs never
sends data to the user outside of the VFS so it never needs to do this
and can operate entirely on kuid_ts.

> > There's a bit of an open question of whether it should have vfs
> > changes: the way the struct file f_inode and f_ops are hijacked is 
> > a bit nasty and perhaps d_select_inode() could be made a bit 
> > cleverer to help us here instead.
> 
> I'm not sure if this PoC works... but are you sure you didn't introduce
> a serious vulnerability here? You use a new mapping and you update the
> current_fsuid() creds, which are global for any fs operation, so maybe:
> let's operate on any inode, update our current_fsuid()... and
> access the rest of the *unshifted filesystems*...!?

The credentials are per thread, so it's a standard way of doing
credential shifting and no other threads of execution in the same task
get access. As long as you bound the override_creds()/revert_creds()
pairs within the kernel, you're safe.
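
The bracketing described above can be modelled in userspace. This is only a toy sketch, not kernel code: `struct toy_cred`, `toy_override_creds()`, `toy_revert_creds()` and `toy_shifted_op()` are hypothetical names invented for illustration, and a C11 `_Thread_local` pointer stands in for the per-thread credential pointer. It shows the shifted credentials being visible only between the override and the revert, and only on the calling thread.

```c
#include <stdint.h>

/* Toy stand-in for struct cred; only the field this example needs. */
struct toy_cred {
	uint32_t fsuid;
};

/* Each thread has its own notion of "current" credentials. */
static _Thread_local const struct toy_cred *toy_current;

/* Install new creds for the calling thread only; hand back the old ones. */
const struct toy_cred *toy_override_creds(const struct toy_cred *new_cred)
{
	const struct toy_cred *old = toy_current;

	toy_current = new_cred;
	return old;
}

/* Restore whatever toy_override_creds() returned earlier. */
void toy_revert_creds(const struct toy_cred *old)
{
	toy_current = old;
}

/* fsuid of the calling thread, or 0 if no creds were installed yet. */
uint32_t toy_current_fsuid(void)
{
	return toy_current ? toy_current->fsuid : 0;
}

/*
 * One "filesystem operation": the shifted creds exist only between the
 * override and the revert, and the function never returns while they
 * are still installed.
 */
uint32_t toy_shifted_op(uint32_t lower_fsuid)
{
	struct toy_cred shifted = { lower_fsuid };
	const struct toy_cred *old = toy_override_creds(&shifted);
	uint32_t seen = toy_current_fsuid();	/* what the lower fs would see */

	toy_revert_creds(old);
	return seen;
}
```

A thread that starts with fsuid 100000 and calls `toy_shifted_op(0)` gets 0 back from the call, and `toy_current_fsuid()` reports 100000 again once the call has returned.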

> The worst thing is that current_fsuid() does not follow now the
> /proc/self/uid_map interface! this is a serious vulnerability and a 
> mix of the current semantics... it's updated but using other
> rules...?

current_fsuid() is already mapped via the userns; it's already a kuid_t
at its final value.  Shifting that is what you want to remap underlying
volume uid/gid's.  The uidmap/gidmap inputs to this are shifts on the
final underlying uid/gids.

So, if I've got a uid_map in a userns of 0:100000:1000 which remaps all
the privileged ids down to 100000, but I have a volume which still has
real ids, I can mount that volume using shiftfs with
uidmap=0:100000:1000 and it will allow this userns to read and write
the volume through its remapped ids.
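
To make the arithmetic of that 0:100000:1000 example concrete, here is a minimal userspace sketch. It assumes a single extent written as first:lower_first:count, and the names `struct shift_extent`, `shift_down()` and `shift_up()` are hypothetical ones chosen for illustration. The direction (volume ids mapped "down" when an inode is filled in, and back "up" on setattr) is one reading of the patch, not a statement of its exact behaviour.

```c
#include <stdint.h>

/* Hypothetical single-extent map, written as first:lower_first:count. */
struct shift_extent {
	uint32_t first;		/* first id as stored on the volume */
	uint32_t lower_first;	/* first id as seen through the shifted mount */
	uint32_t count;		/* number of consecutive ids covered */
};

/* Returned for ids that fall outside the extent. */
#define SHIFT_UNMAPPED ((uint32_t)-1)

/* volume id -> shifted id (assumed direction when filling in an inode) */
uint32_t shift_down(const struct shift_extent *e, uint32_t id)
{
	if (id < e->first || id - e->first >= e->count)
		return SHIFT_UNMAPPED;
	return e->lower_first + (id - e->first);
}

/* shifted id -> volume id (assumed direction for setattr/chown) */
uint32_t shift_up(const struct shift_extent *e, uint32_t id)
{
	if (id < e->lower_first || id - e->lower_first >= e->count)
		return SHIFT_UNMAPPED;
	return e->first + (id - e->lower_first);
}
```

With the extent { 0, 100000, 1000 }, a file owned by id 0 on the volume shows up as 100000 through the mount, a chown to 100500 is stored as 500, and ids outside the 1000-id window come back as SHIFT_UNMAPPED.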

> For overlayfs I did write an experiment, but for me it's not an 
> overlayfs or another new filesystem problem... we are manipulating 
> UID/GID identities...
> 
> It would have been better if you had sent this as a separate thread.
> This was a vfs:userns RFC, and if we continue like this we turn it into
> a complicated thing: implementing another new light filesystem with
> userns... (overlayfs...)
> 
> Will follow up if the appropriate thread is created, not here, I 
> guess it's ok ?

Well, I can resend the patch as a separate thread when I've fixed some
of the problems viro pointed out.

James

> > James
> > 
> 
> Thank you for your feedback!
> 
> 

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-06  2:50     ` Dave Chinner
@ 2016-05-12 19:47       ` Djalal Harouni
  0 siblings, 0 replies; 49+ messages in thread
From: Djalal Harouni @ 2016-05-12 19:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy

Hi Dave,

Tried to do my xfs homework first!

On Fri, May 06, 2016 at 12:50:36PM +1000, Dave Chinner wrote:
> On Thu, May 05, 2016 at 11:24:35PM +0100, Djalal Harouni wrote:
> > On Thu, May 05, 2016 at 10:23:14AM +1000, Dave Chinner wrote:
> > > On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> > > > This is version 2 of the VFS:userns support portable root filesystems
> > > > RFC. Changes since version 1:
> > > > 
> > > > * Update documentation and remove some ambiguity about the feature.
> > > >   Based on Josh Triplett comments.
> > > > * Use a new email address to send the RFC :-)
> > > > 
> > > > 
> > > > This RFC tries to explore how to support filesystem operations inside
> > > > user namespace using only VFS and a per mount namespace solution. This
> > > > allows to take advantage of user namespace separations without
> > > > introducing any change at the filesystems level. All this is handled
> > > > with the virtual view of mount namespaces.
> > > 
> > > [...]
> > > 
> > > > As an example if the mapping 0:65535 inside mount namespace and outside
> > > > is 1000000:1065536, then 0:65535 will be the range that we use to
> > > > construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> > > > data. They represent the persistent values that we want to write to the
> > > > disk. Therefore, we don't keep track of any UID/GID shift that was applied
> > > > before, it gives portability and allows to use the previous mapping
> > > > which was freed for another root filesystem...
> > > 
> > > So let me get this straight. Two /isolated/ containers, different
> > > UID/GID mappings, sharing the same files and directories. Create a
> > > new file in a writeable directory in container 1, namespace
> > > information gets stripped from on-disk uid/gid representation.
> > > 
> > > Container 2 then reads that shared directory, finds the file written
> > > by container 1. As there is no namespace component to the uid:gid
> > > stored in the inode, we apply the current namespace shift to the VFS
> > > inode uid/gid and so it maps to root in container 2 and we are
> > > allowed to read it?
> > 
> > Only if container 2 has the flag CLONE_MNTNS_SHIFT_UIDGID set in its own
> > mount namespace which only root can set or if it was already set in
> > parent, and have access to the shared dir which the container manager
> > should also configure before... even with that if in container 2 the
> > shift flag is not set then there is no mapping and things work as they
> > are now, but yes this setup is flawed! they should not share rootfs,
> > maybe in rare cases, some user data that's it.
> 
> <head explods>
> 
> I can't follow any of the logic you're explaining - you just
> confused me even more.  I thought this was to allow namespaces with
> different uid/gid mappings all to use the same backing store? And
> now you're saying that "no, they'll all have separate backing
> stores"?

Dave, absolutely: for root filesystems, and probably for most if not all
use cases, they should have *separate* backing devices, for (1) obvious
security reasons and (2) quota, if they are writing to the filesystem.
Otherwise the whole thing is useless.


> I suspect you need to describe the layering in a way a stupid dummy
> can understand, because trying to be clever with wacky examples is
> not working.

OK, see below please.


> > > Unless I've misunderstood something in this crazy mapping scheme,
> > > isn't this just a vector for unintentional containment breaches?
> > > 
> > > [...]
> > > 
> > > > Simple demo overlayfs, and  btrfs mounted with vfs_shift_uids and
> > > > vfs_shift_gids. The overlayfs mounts will share the same upperdir. We
> > > > > create two user namespaces, each with its own mapping, where
> > > > container-uid-2000000 will pull changes from container-uid-1000000
> > > > upperdir automatically.
> > > 
> > > Ok, forget I asked - it's clearly intentional. This is beyond
> > > crazy, IMO.
> > 
> > This setup is flawed! that example was to show that files show up with
> > the right mapping with two different user namespaces. As Andy noted
> > they should have a backing device...
> 
> Did you mean "should have different backing devices" here? If not,
> I'm even more confused now...

Yes, I mean a separate, different backing device.

Now some use cases may share *some* backing devices, but then it should
not be the same backing store as the host /.

The container manager should mount a new backing device, maybe make a
snapshot of host / on it and use it for containers.


> > Anyway by the previous paragraph what I mean is that when the container
> > terminates it releases the UID shift range which can be re-used later
> > on another filesystem or on the same previous fs... whatever. Now if
> > the range is already in use, userspace should grab a new range give it
> > a new filesystem or a previous one which doesn't need to be shared and
> > everything should continue to work...
> 
> This sounds like you're talking about a set of single, sequential
> uses of a single filesystem image across multiple different
> container lifecycles? Maybe that's where I'm getting confused,
> because I'm assuming multiple concurrent uses of a single filesystem
> by all the running containers that are running the same distro
> image....

OK I see, yes: a sequential usage of the single filesystem according to
available mappings and resources. As noted above, if multiple
concurrent containers are using the same single filesystem, then it
should be mounted read-only, with the writable data on another
backing store for every container, or a backing store for all
containers... it depends on the use case, but at least they should not
write to the host / backing store...

Serge Hallyn already noted this... by default this is not possible,
and solutions exist in the kernel to make sure that they can't do it
unless root allows it...

But maybe I should just add a patch so that you can't set shifts on the
host /: disable it there and allow it only for new mounts.


> > simple example with loop devices..., however the image should be a GPT
> > (GUID partition table) or an MBR one...
> > 
> > $ dd if=/dev/zero of=/tmp/fedora-newtree.raw bs=10M count=100
> > $ mkfs.ext4 /tmp/fedora-newtree.raw
> > ...
> > $ sudo mount -t ext4 -oloop,rw,sync /var/lib/machines/fedora-newtree.raw /mnt/fedora-tree
> > $ sudo yum -y --releasever=23 --installroot=/mnt/fedora-tree --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim 
> > $ sudo mount -t ext4 -oloop,vfs_shift_uids,vfs_shift_gids /var/lib/machines/fedora-newtree.raw  /mnt/fedora-tree
> > $ sudo ~/container --uidmap [1000000:1065536 or
> > 			     2000000:2065536 or
> > 			     3000000:3065536 ....]
> >   (That's the mapping outside of the container)
> 
> This doesn't match your above comments about separate backing
> stores. Here we have two mounts sharing the same image file, both
> mounted read/write - there's no separate backing store here. The
> fact you hide the initial mount that was populated by yum by
> overmounting the same mount point doesn't stop the original mount
> from modifying the image file independently of the container you
> started.
> 
> I'm getting the impression that there's a missing step in all your
> examples here - that you create a writable snapshot or overlay of
> the original fs image to create separate backing devices for each
> container. In that case, the uid/gid shifting avoids needing to make
> uid/gid modifications to the snapshot/overlay to match the
> container's mapped uid/gids.

Yes exactly! We avoid writing to the snapshot or the image... the
mapping is done automatically by the kernel, and the snapshot can be
verified and used later by other containers with a different mapping...

And yes, there are some missing steps in my explanation. The container
manager should create a new mount namespace, remount / as MS_SLAVE|MS_REC,
mount things there that can be shared as a tree, then create another
mount namespace, clean it, *create* private backing devices, mount
loops, mount other private filesystems on the new backing devices,
or mount images... then create new mount/user namespaces for the
container and spawn it.

Now containers automatically have access to their private mounts and
also to the shared read-only tree, the mapping is done virtually, no
need to adapt the private mount or the shared tree, no need to chown,
nothing.

After the container finishes, the mapping 0:1000000:1065536 is released
and can be used for another container, and another container can easily
use the previous filesystem and shared tree with a new mapping
0:2000000:2065536; everything works out of the box...
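
A minimal userspace sketch of the reuse described above, not the code from
the patches. The struct and function names are hypothetical, and it assumes
the notation 0:1000000:1065536 means on-disk IDs starting at 0 paired with
kernel-side IDs in [1000000, 1065536), i.e. 65536 IDs with an exclusive upper
bound. The point it illustrates is that the value stored on disk does not
depend on which range a container was given:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one container's shift range.
 * On-disk IDs [inside_first, inside_first + count) correspond to
 * kernel-side IDs [outside_first, outside_first + count). */
struct shift_range {
	uint32_t inside_first;
	uint32_t outside_first;
	uint32_t count;
};

/* Kernel-side ID -> value to store on disk.
 * Returns false (and leaves *disk alone) when the ID is outside the range,
 * which models "no mapping, so no shift is performed". */
bool shift_kuid_to_disk(const struct shift_range *r, uint32_t kuid,
			uint32_t *disk)
{
	if (kuid < r->outside_first || kuid - r->outside_first >= r->count)
		return false;
	*disk = r->inside_first + (kuid - r->outside_first);
	return true;
}

/* Value read from disk -> kernel-side ID for this container's range. */
bool shift_disk_to_kuid(const struct shift_range *r, uint32_t disk,
			uint32_t *kuid)
{
	if (disk < r->inside_first || disk - r->inside_first >= r->count)
		return false;
	*kuid = r->outside_first + (disk - r->inside_first);
	return true;
}
```

With a first range of {0, 1000000, 65536}, kernel-side ID 1000000 is stored
as 0. A later container using {0, 2000000, 65536} reads that same 0 back as
2000000, so nothing on disk has to be chowned between the two runs.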

Now if you want the host of the container manager to not even access
container mounts, then when creating the new mount namespace we remount /
as MS_SLAVE, and even the host or container manager won't see the
container's mounts again...


Another example that we want to support:

  (1) The vendor OS files are all under the /usr directory; you have one
      snapshot that you use for all containers.
  (2) The container manager prepares the /usr snapshot, verifies that it
      is trusted, creates the mount namespace, backing devices, etc., and
      mounts it as a shared read-only tree on a different backing device
      with the shift options for the containers' namespaces...
  (3) The container manager mounts /etc and other directories in private
      tmpfs => this allows spawning containers, serving and forgetting
      about them; it may allow factory reset and all the other stuff..


> Similarly, if the use case given was read-only sharing of trees
> between containers, there's no need for separate snapshots or
> overlays, just a bunch of read-only (bind?) mounts with shifts
> specified for the intended container.
> 
> These seem like a pretty sane use case for wanting to shift
> uids/gids in this manner, but if that's the case then I'm struggling

Indeed!



> > > > 3) ROADMAP:
> > > > ===========
> > > > * Confirm current design, and make sure that the mapping is done
> > > >   correctly.
> > > 
> > > How are you going to ensure that all filesystems behave the same,
> > > and it doesn't get broken by people who really don't care about this
> > > sort of crazy?
> > 
> > By trying to make this a VFS mount namespace parameter. So if the
> > shift is not set on on the mount namespace then we just fallback to
> > the current behaviour! no shift is performed.
> 
> That wasn't what I was asking - I was asking a code maintenance
> question. i.e. someone will come along who doesn't quite understand
> WTF all this convoluted namespace ID mapping is doing and they will
> accidently break it in a subtle way that nobody notices because they
> didn't directly change anything to do with ID shifting. What's the
> plan for preventing that from happening?

Of course I'm not sure if it will be accepted, but I guess this is the
*safest* solution so far, which can be used to support all filesystems.
So if it's applied I won't throw the code at you and move on; everyone
is responsible for their own mess... but before that I'll add regression
tests and security tests to make sure that we don't break the current
behaviour, then the new one... IOW as it should be done with all
merged code...

Now I'm not sure if it will be accepted; I see that this thing is being
discussed every year during the summit... maybe it's a favorite topic ?!
Other patches about shifts break setfsuid, make current_fsuid() cross
user namespaces, add another userns interface... so not sure. Anyway
I'll try to improve the RFC later or after the merge window, include
XFS support, take all feedback, let the mount namespace and the VFS do
the work, and see!


> > later of course I'll try xfstests and several tests...
> > 
> > Does this answer your question ?
> 
> That's closer, but ambiguous. ;) Were you planning on just running
> some existing tests or writing a set of regression tests that
> explicitly encode expected usage and behaviour, as well as what is
> expected to fail?

Of course yes!


> > > .....
> > > > * Add XFS support.
> > > 
> > > What is the problem here?
> > 
> > Yep, sorry! just lack of time from my part! XFS currently is a bit aware
> > of kuid/kgid mapping on its own, and I just didn't had the appropriate
> > time! Will try to fix it next time.
> 
> You'd be talking about the xfs_kuid_to_uid/xfs_uid_to_kuid()
> wrappers, right?

Yes!


> It comes to the kuid/kgid being kernel internal representations of
> an ID, not an on-disk format representation.  Like all other kernel
> internal types they can change size and structure at any time, while
> the persistent on-disk format cannot change without lots of hassle
> (and then we really need conversion functions!). For clean layering,
> abstraction and self-documenting code, internal types are always
> converted to/from a persistent, on-disk format representation in
> this manner.

I see, thank you for the explanation!

So yes we have to update all these callers with:
1) xfs_ialloc()
     when creating a new inode:
     ip->i_d.di_uid = xfs_kuid_to_uid(current_fsuid());
        should be:
     ip->i_d.di_uid = xfs_kuid_to_uid(vfs_shift_kuid_to_disk(inode, current_fsuid()));


or a new VFS function like inode_init_owner() hmm inode_init_iuid()
to give you the appropriate inode->i_uid on disk

ip->i_d.di_uid = xfs_kuid_to_uid(inode_init_iuid());


This way inside container, you create an inode on disk with iuid == 0
and not using the global mapping of current_fsuid()

2) xfs_kuid_to_uid() for quota should also be updated.
   To make it suitable for XFS it should be:
   vfs_shift_kuid_to_disk(super_block, kuid) or
   vfs_shift_kuid_to_fs(super_block, kuid);
    =>
   xfs_kuid_to_uid(vfs_shift_kuid_to_fs(xfs_super_block, kuid));

These are the only changes needed for XFS to make it really work, plus
parsing the vfs_shift_uids/gids mount options; it should then work out
of the box... maybe later look at ACLs... but from the VFS side,
not the fs.



> > > Next question: how does this work with uid/gid based quotas?
> > 
> > If you do a shift you should know that you will share quota on
> > disk.
> 
> Yes, and this means you can't account for individual container space
> usage on such mapped devices. Also, don't you need to shift
> uids/gids for the quota syscalls like you do elsewhere?

Yes, ok!


> I also wonder about the fact that the quota interfaces are likely to 
> return uids/gids that may not exist in a given container...

Well if they do not exist or do not have a mapping we don't perform any
shift, so I guess they will return the global kuid which happens to
*always* have a mapping in init_user_ns and that's the one we use to
operate from the filesystem or disk side. 

Now inside the container, for normal use cases we try at least to map
the obvious range of 65536 IDs to let a complete system work... if some
uids/gids inside the container are not mapped, they share 65534, the
nobody user..
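
As a rough illustration of the last point, here is a small userspace model
of how an unmapped ID ends up looking like "nobody" from inside a container.
It is only a sketch: the struct and function names are made up for this
example, and 65534 is the kernel's default overflow UID. In the kernel
itself, from_kuid_munged() gives this behaviour.

```c
#include <stdint.h>

/* Default value of the kernel's overflow UID, shown as "nobody". */
#define TOY_OVERFLOW_UID 65534u

/* Hypothetical single-extent mapping: container IDs
 * [inside_first, inside_first + count) correspond to kernel-side IDs
 * [outside_first, outside_first + count). */
struct toy_id_range {
	uint32_t inside_first;
	uint32_t outside_first;
	uint32_t count;
};

/* What a process inside the container would see for a kernel-side ID:
 * the mapped value if the ID falls in the range, the overflow UID if not. */
uint32_t toy_container_view(const struct toy_id_range *r, uint32_t kuid)
{
	if (kuid >= r->outside_first && kuid - r->outside_first < r->count)
		return r->inside_first + (kuid - r->outside_first);
	return TOY_OVERFLOW_UID;
}
```

With a range of {0, 1000000, 65536}, kernel-side ID 1000999 shows up as 999
inside the container, while host UID 5 and anything at or above 1065536 all
collapse onto 65534.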


> Cheers,
> 
> Dave.

Thanks for the feedback!


> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Djalal Harouni
http://opendz.org


* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-11 18:33               ` James Bottomley
@ 2016-05-12 19:55                 ` Djalal Harouni
  2016-05-12 22:24                   ` James Bottomley
  0 siblings, 1 reply; 49+ messages in thread
From: Djalal Harouni @ 2016-05-12 19:55 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner

On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> > On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
[...]
> > Hmm anyway you are mounting this on behalf of filesystems, so if you 
> > add the recursive thing, you will just probably make everything 
> > worse, by making any /proc, /sys dentry that's under that path 
> > shiftable, and unprivileged users can just create user namespaces and 
> > read /proc/* and all the other stuff that doesn't have capable() 
> > related to the init_user_ns host...
> 
> That's up to the admin who does the shifting.  Recursive would be an
> option if added.

Hmm, not sure if you get my point... you just made it an admin problem,
where admins want to mount a downloaded image, verify it and use it for
their container with /proc...! That's another problem!


> >   what if you have paths like /filesystem0/uidshiftedY/dir,
> > /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir 
> > where some of them are also bind mounts that point to same dentry ?
> 
> Without recursive semantics, you see the underlying inode.  With them,
> you see the upper vfsmnts.  Shiftfs isn't idempotent, so you would need
> to be careful about nesting.  However, that's an admin problem.
> 
> > Also, you create a totally new user namespace interface here! by 
> > making your own new interface we just lose the notion of init_user_ns 
> > and its children and mapping ?
> 
> I don't quite understand this; the only use of the init_user_ns is the
> capable(CAP_SYS_ADMIN) in fill_super which is how only the real admin
> can mount at a shifted uid/gid.  Otherwise, there's no need to see into
> the userns because filesystems see the kuid_t/kgid_t which is what I'm
> shifting.
> 
> > I'm not sure of the implication of all this... your user namespace
> > mapping is not related at all to init_user_ns! it seems that it has
> > its own init_user_ns ?   does a capable() check now on a shifted
> > filesystem relates to that and hence to your mapping or to the real
> > init_user_ns ?
> 
> capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN)
> 
> Or is there a misunderstanding here about how user namespaces work
> inside the kernel?  The design is that the ID shift is done as you
> cross the kernel boundary, so a filesystem, being usually all in-kernel
> operating via the VFS interfaces, ideally never needs to make any
> from_kuid/make_kuid calls.  However, there are ways filesystems can
> send data across the kernel/user bounary outside of the usual vfs
> interfaces (ioctls being the most usual one) so in that specific code,
> they have to do the kuid_t to uid_t changes themselves.  Shiftfs never
> sends data to the user outside of the VFS so it never needs to do this
> and can operate entirely on kuid_ts.
> 
> > > There's a bit of an open question of whether it should have vfs
> > > changes: the way the struct file f_inode and f_ops are hijacked is 
> > > a bit nasty and perhaps d_select_inode() could be made a bit 
> > > cleverer to help us here instead.
> > 
> > I'm not sure if this PoC works... but you sure you didn't introduce
> > a serious vulnerability here ? you use a new mapping and you update
> > current_fsuid() creds up, which is global on any fs operation, so may
> > be: lets operate on any inode, update our current_fsuid()... and
> > access the rest of *unshifted filesystems*... !?
> 
> The credentials are per thread, so it's a standard way of doing
> credential shifting and no other threads of execution in the same task
> get access. As long as you bound the override_creds()/revert_creds()
> pairs within the kernel, you're safe.

No, and here sorry I mean shifted.

current_fsuid() is global through all fs operations, which means it
crosses user namespaces... it was safe in the days of only init_user_ns,
not anymore... You give a mapping inside containers to fsuid where they
don't want to have it... this allows operating on inodes inside other
containers... update current_fsuid() even if we want that user to be
nobody inside the container... and later it can access the inodes of
the shifted fs... and by the same current of course...



> > The worst thing is that current_fsuid() does not follow now the
> > /proc/self/uid_map interface! this is a serious vulnerability and a 
> > mix of the current semantics... it's updated but using other
> > rules...?
> 
> current_fsuid() is aready mapped via the userns; it's already a kuid_t
> at its final value.  Shifting that is what you want to remap underlying
> volume uid/gid's.  The uidmap/gidmap inputs to this are shifts on the
> final underlying uid/gids.

=> some points:
Changing setfsuid(), its interfaces and rules... or an indirect way to
break another syscall...

The userns used for *mapping* is totally different and not standard...
losing "init_user_ns and its descendant userns *semantics*...", and
yet a totally unlinked mapping...


Breaking current_uid(),current_euid(),current_fsuid() which are mapped
but in *different* user namespaces... hence different values inside
namespaces... you can change your userns mapping but that current_fsuid
specific one will always be remapped to some other value inside
even if you don't want it...
It crosses user namespaces...  uid and euid are remapped according to
/proc/self/uid_map, fsuid is remapped according to this new interface...

Hard coding the mapping, nested containers/apps may *share* fsuid and
can't get rid of it even if they change the inside userns mapping to
disable, split, reduce mapped users or offer better isolation they
can't... no way to make private inodes inside containers if they share
the final fsuid, inside container mapping is ignored...

...

> the privileged ids down to 100000, but I have a volume which still has
> realids, I can mount that volume using shiftfs with
> uidmap=0:100000:1000 and it will allow this userns to read and write
> the volume through its remapped ids.
> 
> > For overlayfs I did write an expriment but for me it's not an 
> > overlayfs or another new filesystem problem... we are manipulating 
> > UID/GID identities...
> > 
> > It would have been better if you did send this as a separate thread.
> > It was a vfs:userns RFC fix which if we continue we turn it into a
> > complicated thing! implement another new light filesystem with
> > userns... (overlayfs...)
> > 
> > Will follow up if the appropriate thread is created, not here, I 
> > guess it's ok ?
> 
> Well, I can resend the patch as a separate thread when I've fixed some
> of the problems viro pointed out.
> 
> James
> 
> > > James
> > > 
> > 
> > Thank you for your feedback!
> > 
> > 
> 

Thanks!

-- 
Djalal Harouni
http://opendz.org


* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-12 19:55                 ` Djalal Harouni
@ 2016-05-12 22:24                   ` James Bottomley
  2016-05-14  9:53                     ` Djalal Harouni
  0 siblings, 1 reply; 49+ messages in thread
From: James Bottomley @ 2016-05-12 22:24 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner

On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> > > On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> [...]
> > > Hmm anyway you are mounting this on behalf of filesystems, so if
> > > you 
> > > add the recursive thing, you will just probably make everything 
> > > worse, by making any /proc, /sys dentry that's under that path 
> > > shiftable, and unprivileged users can just create user namespaces
> > > and 
> > > read /proc/* and all the other stuff that doesn't have capable() 
> > > related to the init_user_ns host...
> > 
> > That's up to the admin who does the shifting.  Recursive would be
> > an
> > option if added.
> 
> Hmm, not sure if you get my point... you just made it an admin 
> problem where admins want to mount an image downloaded verify it and 
> use it for their container with /proc...! that's another problem!

You can't allow unprivileged containers to shift uids on arbitrary
filesystems, so the admin always has to do something for the initial
setup.

> > >   what if you have paths like /filesystem0/uidshiftedY/dir,
> > > /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir 
> > > where some of them are also bind mounts that point to same dentry
> > > ?
> > 
> > Without recursive semantics, you see the underlying inode.  With 
> > them, you see the upper vfsmnts.  Shiftfs isn't idempotent, so you 
> > would need to be careful about nesting.  However, that's an admin
> > problem.
> > 
> > > Also, you create a totally new user namespace interface here! by 
> > > making your own new interface we just lose the notion of 
> > > init_user_ns and its children and mapping ?
> > 
> > I don't quite understand this; the only use of the init_user_ns is 
> > the capable(CAP_SYS_ADMIN) in fill_super which is how only the real
> > admin can mount at a shifted uid/gid.  Otherwise, there's no need 
> > to see into the userns because filesystems see the kuid_t/kgid_t 
> > which is what I'm shifting.
> > 
> > > I'm not sure of the implication of all this... your user 
> > > namespace mapping is not related at all to init_user_ns! it seems 
> > > that it has its own init_user_ns ?   does a capable() check now 
> > > on a shifted filesystem relates to that and hence to your mapping 
> > > or to the real init_user_ns ?
> > 
> > capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN)
> > 
> > Or is there a misunderstanding here about how user namespaces work
> > inside the kernel?  The design is that the ID shift is done as you
> > > cross the kernel boundary, so a filesystem, being usually all
> > > in-kernel operating via the VFS interfaces, ideally never needs to 
> > make any from_kuid/make_kuid calls.  However, there are ways 
> > filesystems can send data across the kernel/user bounary outside of 
> > the usual vfs interfaces (ioctls being the most usual one) so in 
> > that specific code, they have to do the kuid_t to uid_t changes 
> > themselves.  Shiftfs never sends data to the user outside of the 
> > VFS so it never needs to do this and can operate entirely on
> > kuid_ts.
> > 
> > > > There's a bit of an open question of whether it should have vfs
> > > > changes: the way the struct file f_inode and f_ops are hijacked 
> > > > is a bit nasty and perhaps d_select_inode() could be made a bit
> > > > cleverer to help us here instead.
> > > 
> > > I'm not sure if this PoC works... but you sure you didn't 
> > > introduce a serious vulnerability here ? you use a new mapping 
> > > and you update current_fsuid() creds up, which is global on any 
> > > fs operation, so may be: lets operate on any inode, update our 
> > > current_fsuid()... and access the rest of *unshifted filesystems*
> > > ... !?
> > 
> > The credentials are per thread, so it's a standard way of doing
> > credential shifting and no other threads of execution in the same 
> > > task get access. As long as you bound the
> > > override_creds()/revert_creds() pairs within the kernel, you're safe.
> 
> No, and here sorry I mean shifted.
> 
> current_fsuid() is global through all fs operations which means it
> crosses user namespaces... it was safe the days of only init_user_ns,
> not anymore... You give a mapping inside containers to fsuid where 
> they don't want to have it... this allows to operate on inodes inside
> other containers... update current_fsuid() even if we want that user 
> to be nobody inside the container... and later it can access the 
> inodes of the shifted fs... and by same current of course...

OK, I still don't understand what you're getting at.  There are three
per-thread uids: uid, euid and fsuid (real, effective and filesystem).
They're all either settable via syscall or inherited on fork.  They're
all kernel side, meaning they're kuid_t.  Their values stay invariant
as you move through namespaces.  They change (and get mapped according
to the current user namespace setting) when you call set[fe]uid().  So
when I enter a user namespace with mapping

0 100000 1000

and call setuid(0) (which sets all three), they all pick up the kuid_t
of 100000.  This means that writing a file inside the user namespace
after calling setuid(0) appears as real uid 100000 on the medium even
though if I call getuid() from the namespace, I get back 0.  What
shiftfs does is temporarily hijack the kernel fsuid/fsgid for
permission checks, so you can remap to any old uid on the medium.
Usually you'd pass in uidmap=0:100000:1000, which maps back
from kuid_t 100000 to kuid_t 0, which is why the container can now read
and write the underlying medium at on-media id 0 even though root
inside the container has kuid_t 100000.  There's no permanent change of
fsuid and it stays at its invariant value for the thread except as a
temporary measure to do the permission checks on the filesystem
underlying the shifted one.
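
The arithmetic in that paragraph can be sketched in plain userspace C. This
is only a toy model under stated assumptions: the toy_* names are invented
for the example, a single extent stands in for the whole map, and the shiftfs
option uidmap=0:100000:1000 is assumed to describe the same three numbers as
the userns line "0 100000 1000".

```c
#include <stdint.h>

#define TOY_INVALID_ID UINT32_MAX

/* One mapping extent, with the same three fields as a line of
 * /proc/self/uid_map: IDs [first, first + count) on the upper side
 * correspond to [lower_first, lower_first + count) on the lower side. */
struct toy_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Upper-side ID -> lower-side ID (what setuid(0) does to reach a kuid). */
uint32_t toy_map_down(const struct toy_extent *e, uint32_t id)
{
	if (id >= e->first && id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return TOY_INVALID_ID;
}

/* Lower-side ID -> upper-side ID (what getuid() reports, and what the
 * shift applies before checking permissions on the underlying medium). */
uint32_t toy_map_up(const struct toy_extent *e, uint32_t id)
{
	if (id >= e->lower_first && id - e->lower_first < e->count)
		return e->first + (id - e->lower_first);
	return TOY_INVALID_ID;
}
```

Taking {0, 100000, 1000} for both the user namespace and the shift,
toy_map_down() turns container uid 0 into kernel-side 100000, and
toy_map_up() on the shift turns 100000 back into on-media id 0. An ID outside
the extent, such as 5, has no image and yields TOY_INVALID_ID.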

> > > The worst thing is that current_fsuid() does not follow now the
> > > /proc/self/uid_map interface! this is a serious vulnerability and 
> > > a mix of the current semantics... it's updated but using other
> > > rules...?
> > 
> > current_fsuid() is aready mapped via the userns; it's already a 
> > kuid_t at its final value.  Shifting that is what you want to remap
> > underlying volume uid/gid's.  The uidmap/gidmap inputs to this are 
> > shifts on the final underlying uid/gids.
> 
> => some points:
> Changing setfsuid() its interfaces and rules... or an indrect way to
> break another syscall...

There is no change to setfsuid().

> The userns used for *mapping* is totatly different and not standard...
> losing "init_user_ns and its decendents userns *semantics*...", a
> yet a totatly unlinked mapping...

There is no user namespace mapping at all.  This is a simple shift,
kernel side, of uids and gids at their kuid_t values.

> Breaking current_uid(),current_euid(),current_fsuid() which are
> mapped but in *different* user namespaces... hence different values
> inside namespaces... you can change your userns mapping but that
> current_fsuid specific one will always be remapped to some other 
> value inside even if you don't want it... It crosses user 
> namespaces...  uid and euid are remapped according to
> /proc/self/uid_map, fsuid is remapped according to this new interface...
> 
> Hard coding the mapping, nested containers/apps may *share* fsuid and
> can't get rid of it even if they change the inside userns mapping to
> disable, split, reduce mapped users or offer better isolation they
> can't... no way to make private inodes inside containers if they 
> share the final fsuid, inside container mapping is ignored...
> ...

OK, I think there's a misunderstanding about how credential overrides
work.  They're not permanent changes to the credentials, they're
temporary ones to get stuff done within the kernel at a temporary
privilege.  You can make credentials permanent if you go through
prepare_creds()/commit_creds(), but for making them temporary you do
prepare_creds()/override_creds() and then revert_creds() once you're
done using them.

If you want to see a current use of this, try fs/open.c:faccessat.
What it's doing is temporarily overriding fsuid with the real uid to
check the permissions before reverting the credentials and returning to
the user.
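
For readers who have not met the pattern, here is a userspace toy model of a
bounded override/revert pair, loosely following the faccessat case. It is
not kernel code: struct toy_cred, toy_current and the toy_* functions are
all invented for this sketch, and a pointer swap stands in for what
override_creds()/revert_creds() do to the task's credentials.

```c
#include <stdint.h>

/* Toy stand-in for struct cred: just the three per-thread uids. */
struct toy_cred {
	uint32_t uid;
	uint32_t euid;
	uint32_t fsuid;
};

/* Toy stand-in for the current thread's credential pointer. */
const struct toy_cred *toy_current;

/* Install temporary credentials and hand back the previous ones. */
const struct toy_cred *toy_override_creds(const struct toy_cred *tmp)
{
	const struct toy_cred *old = toy_current;

	toy_current = tmp;
	return old;
}

/* Put the previous credentials back. */
void toy_revert_creds(const struct toy_cred *old)
{
	toy_current = old;
}

/* faccessat-like check: run an ownership test with fsuid temporarily set
 * to the real uid, then revert before returning to the caller. */
int toy_access_as_real_uid(uint32_t file_owner)
{
	struct toy_cred tmp = *toy_current;	/* work on a private copy */
	const struct toy_cred *old;
	int allowed;

	tmp.fsuid = tmp.uid;
	old = toy_override_creds(&tmp);
	allowed = (toy_current->fsuid == file_owner);
	toy_revert_creds(old);
	return allowed;
}
```

The override lives only between the two calls inside
toy_access_as_real_uid(); once it returns, toy_current points at the original
credentials again and their fsuid is unchanged.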

James

> > the privileged ids down to 100000, but I have a volume which still 
> > has realids, I can mount that volume using shiftfs with
> > uidmap=0:100000:1000 and it will allow this userns to read and 
> > write the volume through its remapped ids.
> > 
> > > For overlayfs I did write an expriment but for me it's not an 
> > > overlayfs or another new filesystem problem... we are 
> > > manipulating UID/GID identities...
> > > 
> > > It would have been better if you did send this as a separate 
> > > thread. It was a vfs:userns RFC fix which if we continue we turn 
> > > it into a complicated thing! implement another new light 
> > > filesystem with userns... (overlayfs...)
> > > 
> > > Will follow up if the appropriate thread is created, not here, I 
> > > guess it's ok ?
> > 
> > Well, I can resend the patch as a separate thread when I've fixed 
> > some of the problems viro pointed out.
> > 
> > James
> > 
> > > > James
> > > > 
> > > 
> > > Thank you for your feedback!
> > > 
> > > 
> > 
> 
> Thanks!
> 


* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-12 22:24                   ` James Bottomley
@ 2016-05-14  9:53                     ` Djalal Harouni
  2016-05-14 13:46                       ` James Bottomley
  0 siblings, 1 reply; 49+ messages in thread
From: Djalal Harouni @ 2016-05-14  9:53 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner

On Thu, May 12, 2016 at 03:24:12PM -0700, James Bottomley wrote:
> On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> > On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
[...]

> > > 
> > > The credentials are per thread, so it's a standard way of doing
> > > credential shifting and no other threads of execution in the same 
> > > > task get access. As long as you bound the
> > > > override_creds()/revert_creds() pairs within the kernel, you're safe.
> > 
> > No, and here sorry I mean shifted.
> > 
> > current_fsuid() is global through all fs operations which means it
> > crosses user namespaces... it was safe the days of only init_user_ns,
> > not anymore... You give a mapping inside containers to fsuid where 
> > they don't want to have it... this allows to operate on inodes inside
> > other containers... update current_fsuid() even if we want that user 
> > to be nobody inside the container... and later it can access the 
> > inodes of the shifted fs... and by same current of course...
> 
> OK, I still don't understand what you're getting at.  There are three
> per-thread uids: uid, euid and fsuid (real, effective and filesystem). 
>  They're all either settable via syscall or inherited on fork.  They're
> all kernel side, meaning they're kuid_t.  Their values stay invariant
> as you move through namespaces.  They change (and get mapped according
> to the current user namespace setting) when you call set[fe]uid().  So
> when I enter a user namespace with mapping
> 
> 0 100000 1000
> 
> and call setuid(0) (which sets all three), they all pick up the kuid_t
> of 100000.  This means that writing a file inside the user namespace
> after calling setuid(0) appears as real uid 100000 on the medium even
> though if I call getuid() from the namespace, I get back 0.  What
> shiftfs does is hijack temporarily the kernel fsuid/fsgid for
> permission checks, so you can remap to any old uid on the medium
> (although usually you'd pass in "uidmap=0:100000:1000"): it maps back
> from kuid_t 100000 to kuid_t 0, which is why the container can now read
> and write the underlying medium at on-media id 0 even though root
> inside the container has kuid_t 100000.  There's no permanent change of
> fsuid and it stays at its invariant value for the thread except as a
> temporary measure to do the permission checks on the underlying of the
> shifted filesystem.
> 
> > > > The worst thing is that current_fsuid() does not follow now the
> > > > /proc/self/uid_map interface! this is a serious vulnerability and 
> > > > a mix of the current semantics... it's updated but using other
> > > > rules...?
> > > 
> > > current_fsuid() is already mapped via the userns; it's already a 
> > > kuid_t at its final value.  Shifting that is what you want to remap
> > > underlying volume uid/gid's.  The uidmap/gidmap inputs to this are 
> > > shifts on the final underlying uid/gids.
> > 
> > => some points:
> > Changing setfsuid() its interfaces and rules... or an indirect way to
> > break another syscall...
> 
> There is no change to setfsuid().
> 
> > The userns used for *mapping* is totally different and not standard...
> > losing "init_user_ns and its descendants userns *semantics*...", yet
> > a totally unlinked mapping...
> 
> There is no user namespace mapping at all.  This is a simple shift,
> kernel side, of uids and gids at their kuid_t values.
> 
> > Breaking current_uid(),current_euid(),current_fsuid() which are
> > mapped but in *different* user namespaces... hence different values
> > inside namespaces... you can change your userns mapping but that
> > current_fsuid specific one will always be remapped to some other 
> > value inside even if you don't want it... It crosses user 
> > namespaces...  uid and euid are remapped according to
> > /proc/self/uid_map, fsuid is remapped according to this new interface...
> > 
> > Hard coding the mapping, nested containers/apps may *share* fsuid and
> > can't get rid of it even if they change the inside userns mapping to
> > disable, split, reduce mapped users or offer better isolation they
> > can't... no way to make private inodes inside containers if they 
> > share the final fsuid, inside container mapping is ignored...
> > ...
> 
> OK, I think there's a misunderstanding about how credential overrides
> work.  They're not permanent changes to the credentials, they're
> temporary ones to get stuff done within the kernel at a temporary
> privilege.  You can make credentials permanent if you go through
> prepare_creds()/commit_creds(), but for making them temporary you do
> prepare_creds()/override_creds() and then revert_creds() once you're
> done using them.
> 
> If you want to see a current use of this, try fs/open.c:faccessat. 
>  What it's doing is temporarily overriding fsuid with the real uid to
> check the permissions before reverting the credentials and returning to
> the user.

Thank you for explaining things, but I think you should take the time to
read this RFC and understand some problems. This is a quick dump of some
problems that it avoids...:

In this series we don't hijack setfsuid() in an indirect way: setfsuid()
maps UIDs into the current userns according to rules set by the parent.
Changing current_fsuid() to some other mapping is a way to allow
processes to bypass that and use it to access other inodes...
This should not change, and fsuid should continue to follow these
rules...

A cred->fsuid solution is safe, or used to be safe, only inside
init_user_ns where there is always a mapping, or in the context of the
current user namespace. In another user namespace with a 0:1000:1
mapping, you can't set it to an arbitrary mapping like 0:4000:1... It
will give confined processes access to inodes that satisfy the kuid_t
4000 mapping and which the app/container wants to deny; they only want
0:1000:1...
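
To make the 0:1000:1 versus 0:4000:1 example concrete, here is a minimal
user-space sketch of how a uid_map-style extent lookup behaves. It is only
an illustration, not kernel code: the Extent type and the map_down() and
map_up() helpers are hypothetical names, loosely modelled on the
"<first id inside> <first id outside> <count>" triples written to
/proc/self/uid_map.

```python
from typing import List, NamedTuple, Optional


class Extent(NamedTuple):
    """One uid_map-style line: <first id inside> <first id outside> <count>."""
    inside: int
    outside: int
    count: int


def map_down(extents: List[Extent], uid: int) -> Optional[int]:
    """Translate an ID seen inside the namespace to the outer ID space."""
    for e in extents:
        if e.inside <= uid < e.inside + e.count:
            return e.outside + (uid - e.inside)
    return None  # no mapping: the ID does not exist in this namespace


def map_up(extents: List[Extent], kuid: int) -> Optional[int]:
    """Translate an outer ID back to the ID seen inside the namespace."""
    for e in extents:
        if e.outside <= kuid < e.outside + e.count:
            return e.inside + (kuid - e.outside)
    return None


# A namespace whose parent granted only the mapping 0:1000:1.
container_map = [Extent(inside=0, outside=1000, count=1)]

print(map_down(container_map, 0))    # 1000
print(map_up(container_map, 1000))   # 0
print(map_up(container_map, 4000))   # None
```

With only 0:1000:1 granted, outer ID 4000 has no representation inside the
namespace at all. That is the property which, as argued above, a separately
configured fsuid shift such as 0:4000:1 would sidestep.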

We don't cross user namespaces, we don't use different mappings for
cred->uid, cred->fsuid...  A clean solution is to shift the inode
UID/GID and not change fsuid to cross namespaces. Not to mention how it
may interact with capabilities...

We follow user namespace rules and we keep "the parent defines a range
that the children can't escape" semantics.  There is a clear relation
between user namespaces that should not be broken.

We explicitly don't define a new user namespace mapping nor add a new
interface, for the simple reason that it's *too complicated*. We can do
that, but no thanks! Maybe in the future, if there is a real need or
things are clear...
The current user namespace interface is getting standard and stable, so
we just keep it that way and make it consistent inside VFS.

We give VFS control of that, and we make mount namespaces the central
part of this whole logic.

We make admins' lives easier: they can pull container images and root
filesystems from container/app hubs... verify the signature and start
them with different mappings according to host resources... We don't
want them to have to do anything else.
The design was planned to make it easier for users; it should work out
of the box, and it can be used to handle complex stuff too, since it's
flexible.

Able to support most filesystems including on-disk filesystems natively.

Able to support disk quota according to the shifted UID/GID on-disk
values. Especially during inode creation...

Able to support ACL if requested.

The user namespace mapping is kept as a runtime configuration option; we
don't pin a special mapping at any time, and of course the parent creator
of the user namespace is the one that can manipulate it. At the same time
the mapping is restricted according to grandpa rules and so on...

It allows unprivileged users to use the VFS UID/GID shift without the
intervention of a privileged process each time.
The real privileged process sets up the filesystem and the mount
namespace the first time, then it should work for all nested namespaces
and containers. It does not need the intervention of init_user_ns root
to set the mapping and make it work; you don't have to go in and out to
set the thing up, etc.

We don't do this on behalf of filesystems, they should explicitly
support it. procfs and other host resource virtual filesystems are safe
and currently they don't need shifting.

We try to fix the problem where it should be fixed, and not hide it...

-- 
Djalal Harouni
http://opendz.org

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-14  9:53                     ` Djalal Harouni
@ 2016-05-14 13:46                       ` James Bottomley
  2016-05-15  2:21                         ` Eric W. Biederman
  2016-05-17 15:42                         ` Djalal Harouni
  0 siblings, 2 replies; 49+ messages in thread
From: James Bottomley @ 2016-05-14 13:46 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner

On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> On Thu, May 12, 2016 at 03:24:12PM -0700, James Bottomley wrote:
> > On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> > > On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > > > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> [...]
> 
> > > > 
> > > > The credentials are per thread, so it's a standard way of doing
> > > > credential shifting and no other threads of execution in the
> > > > same 
> > > > task get access. As long as you bound the
> > > > override_creds()/revert_creds() pairs within the kernel, you're safe.
> > > 
> > > No, and here sorry I mean shifted.
> > > 
> > > current_fsuid() is global through all fs operations which means
> > > it
> > > crosses user namespaces... it was safe the days of only
> > > init_user_ns,
> > > not anymore... You give a mapping inside containers to fsuid
> > > where 
> > > they don't want to have it... this allows to operate on inodes
> > > inside
> > > other containers... update current_fsuid() even if we want that
> > > user 
> > > to be nobody inside the container... and later it can access the 
> > > inodes of the shifted fs... and by same current of course...
> > 
> > OK, I still don't understand what you're getting at.  There are
> > three
> > per-thread uids: uid, euid and fsuid (real, effective and
> > filesystem). 
> >  They're all either settable via syscall or inherited on fork. 
> >  They're
> > all kernel side, meaning they're kuid_t.  Their values stay
> > invariant
> > as you move through namespaces.  They change (and get mapped
> > according
> > > to the current user namespace setting) when you call set[fe]uid().
> > So
> > when I enter a user namespace with mapping
> > 
> > 0 100000 1000
> > 
> > > and call setuid(0) (which sets all three), they all pick up the
> > kuid_t
> > of 100000.  This means that writing a file inside the user
> > namespace
> > after calling setuid(0) appears as real uid 100000 on the medium
> > even
> > though if I call getuid() from the namespace, I get back 0.  What
> > shiftfs does is hijack temporarily the kernel fsuid/fsgid for
> > permission checks, so you can remap to any old uid on the medium
> > > (although usually you'd pass in "uidmap=0:100000:1000"): it maps back
> > from kuid_t 100000 to kuid_t 0, which is why the container can now
> > read
> > > and write the underlying medium at on-media id 0 even though root
> > inside the container has kuid_t 100000.  There's no permanent
> > change of
> > fsuid and it stays at its invariant value for the thread except as
> > a
> > temporary measure to do the permission checks on the underlying of
> > the
> > shifted filesystem.
> > 
> > > > > The worst thing is that current_fsuid() does not follow now
> > > > > the
> > > > > /proc/self/uid_map interface! this is a serious vulnerability
> > > > > and 
> > > > > a mix of the current semantics... it's updated but using
> > > > > other
> > > > > rules...?
> > > > 
> > > > > current_fsuid() is already mapped via the userns; it's already a
> > > > kuid_t at its final value.  Shifting that is what you want to
> > > > remap
> > > > underlying volume uid/gid's.  The uidmap/gidmap inputs to this
> > > > are 
> > > > shifts on the final underlying uid/gids.
> > > 
> > > => some points:
> > > > Changing setfsuid() its interfaces and rules... or an indirect way
> > > to
> > > break another syscall...
> > 
> > There is no change to setfsuid().
> > 
> > > > The userns used for *mapping* is totally different and not
> > > > standard... losing "init_user_ns and its descendants userns
> > > > *semantics*...", yet a totally unlinked mapping...
> > 
> > There is no user namespace mapping at all.  This is a simple shift,
> > kernel side, of uids and gids at their kuid_t values.
> > 
> > > Breaking current_uid(),current_euid(),current_fsuid() which are
> > > mapped but in *different* user namespaces... hence different
> > > values
> > > inside namespaces... you can change your userns mapping but that
> > > current_fsuid specific one will always be remapped to some other 
> > > value inside even if you don't want it... It crosses user 
> > > namespaces...  uid and euid are remapped according to
> > > > /proc/self/uid_map, fsuid is remapped according to this new interface...
> > > 
> > > Hard coding the mapping, nested containers/apps may *share* fsuid
> > > and
> > > can't get rid of it even if they change the inside userns mapping
> > > to
> > > disable, split, reduce mapped users or offer better isolation
> > > they
> > > can't... no way to make private inodes inside containers if they 
> > > share the final fsuid, inside container mapping is ignored...
> > > ...
> > 
> > OK, I think there's a misunderstanding about how credential
> > overrides
> > work.  They're not permanent changes to the credentials, they're
> > temporary ones to get stuff done within the kernel at a temporary
> > privilege.  You can make credentials permanent if you go through
> > prepare_creds()/commit_creds(), but for making them temporary you
> > do
> > prepare_creds()/override_creds() and then revert_creds() once
> > you're
> > done using them.
> > 
> > If you want to see a current use of this, try fs/open.c:faccessat. 
> >  What it's doing is temporarily overriding fsuid with the real uid
> > to
> > check the permissions before reverting the credentials and
> > returning to
> > the user.
> 
> Thank you for explaining things, but I think you should take the time 
> to read this RFC and understand some problems. This is a quick dump 
> of some problems that it avoids...:

I did.  The problem is how to get the userns to read and write files at
the interior not the exterior id.  Your solution is to thread the
mapping through the VFS and even on to the filesystems themselves to
get the mount option.  I already commented that this is a bit ugly and
asked whether it couldn't be encapsulated in a filesystem.  The way I
approached the problem is from the base that I do have build container
roots with shifted uid/gids because I installed them that way.  So, if
it already works, one possible solution is to have a filesystem which
does the shift and mounts the shifted root somewhere in the mount tree
for the namespace to access.  The point about doing it this way is that
the filesystem that does it needs no user namespace knowledge.  All it
does is remap from one on-disk id to another using a map function.  How
it gets the map was left up to the admin in the implementation.
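
The map function described above, combined with the temporary credential
override discussed earlier in the thread, can be pictured with a small
user-space model. This is a sketch under stated assumptions, not the
shiftfs implementation: Cred, override_fsuid(), shift_to_disk() and
create_file() are hypothetical names, the lower filesystem is just a
dictionary, and reading "uidmap=0:100000:1000" as
(on-disk first, kuid first, count) is an assumption made for the example.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Optional, Tuple


class Cred:
    """Toy stand-in for a thread's credentials; only fsuid is modelled."""

    def __init__(self, fsuid: int) -> None:
        self.fsuid = fsuid


@contextmanager
def override_fsuid(cred: Cred, new_fsuid: int) -> Iterator[Cred]:
    """Temporarily replace fsuid and always restore it afterwards."""
    saved = cred.fsuid
    cred.fsuid = new_fsuid
    try:
        yield cred
    finally:
        cred.fsuid = saved


def shift_to_disk(uidmap: Tuple[int, int, int], kuid: int) -> Optional[int]:
    """Map a kernel-side ID to an on-disk ID.

    uidmap is (on_disk_first, kuid_first, count); this reading of
    "uidmap=0:100000:1000" is an assumption made for the sketch.
    """
    disk_first, kuid_first, count = uidmap
    if kuid_first <= kuid < kuid_first + count:
        return disk_first + (kuid - kuid_first)
    return None


def create_file(lower_fs: Dict[str, int], cred: Cred,
                uidmap: Tuple[int, int, int], name: str) -> None:
    """Create `name` on the lower filesystem with the shifted owner."""
    disk_uid = shift_to_disk(uidmap, cred.fsuid)
    if disk_uid is None:
        raise PermissionError(f"fsuid {cred.fsuid} is outside the shift map")
    with override_fsuid(cred, disk_uid):
        # The lower filesystem records the owner from the current fsuid.
        lower_fs[name] = cred.fsuid


lower: Dict[str, int] = {}
root_in_container = Cred(fsuid=100000)
create_file(lower, root_in_container, (0, 100000, 1000), "etc/passwd")
print(lower["etc/passwd"], root_in_container.fsuid)  # 0 100000
```

The file lands on the lower filesystem owned by on-disk id 0, while the
caller's fsuid is back at 100000 as soon as the operation returns. An fsuid
outside the shift range is rejected in this sketch rather than passed
through; that choice is also an assumption.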

> In this series we don't hijack setfsuid() in an indirect way, 
> setfsuid maps UIDs into current userns according to rules set by 
> parent. Changing current_fsuid() to some other mapping is a way to 
> allow processes to bypass that and use it to access other inodes...
> This should not change and fsuid should continue to follow these
> rules...

Both solutions do this.

> A cred->fsuid solution is safe or used to be safe only inside
> init_user_ns where there is always a mapping or in context of current
> user namespace. In an other user namespace with 0:1000:1 mapping, 
>  you can't set it to arbitrary mapping like 0:4000:1... It will give
> confined processes access to inodes that satisfy the kuid_t 4000 
> mapping and which the app/container wants to deny, they only want
> 0:1000:1. ..

OK, so both solutions are safe here too.  Your safety comes from only
remapping in the userns; mine comes from the normal filesystem acl
rules: either the userns for different users all have disjoint ids
regulated by /etc/subuid, or they're all using the same one (like
docker 1.10).  In either case, you could regulate by having the mount
under a directory which is accessible only to the userns owner.

> We don't cross user namespaces, we don't use different mappings for
> cred->uid, cred->fsuid...  A clean solution is to shift inodes 
> UID/GID and not change fsuid to cross namespaces. Not to mention how 
> it may interact with capabilities...

This is a subjective question on what constitutes "clean".  I think we
both think the other solution isn't clean, so that's for others to
adjudicate.

> We follow user namespace rules and we keep "the parent defines a 
> range that the children can't escape" semantics.  There is a clear 
> relation between user namespaces that should not be broken.

OK, so I separated the problem into a userns one, which remaps for the
processes in user space, and a vfs one which remaps the on-disk id.
However, they could be combined by allowing the userns to mount
shiftfs but only on designated filesystems and setting the uidmappings
to the same ones as the userns.

> We explicitly don't define a new user namespace mapping nor add a new
> interface for the simple reason it's: *too complicated*. We can do 
> that, but no thanks! May be in future if there is a real need or 
> things are clear... The current user namespace interface is getting 
> standard and stable, so we just keep it that way and make it
> consistent inside VFS.

I don't accept the too complicated point.  For fully unprivileged
containers, the host admin already has to set up the subuid/subgid map
files, which is most of the complexity.  Once that's done, the same maps
can be used for the shift mount.  Once it's all set up, no further
intervention is required.

> We give VFS control of that, and we make mount namespaces the central
> part of this whole logic.

Right, that's what causes the logic to thread throughout the entire vfs
and into the fs layer.  The fundamental point of difference is that I'd
like a solution which encapsulates the problem rather than exposing it
to the vfs.

> We make admins life easier where they can pull container images, root
> filesystems from containers/apps hubs... verify the signature and 
> start them with different mappings according to host resources... We 
> don't want them to do anything. The design was planned to make it 
> easier for users, it should work out of the box, and it can be used 
> to handle complex stuff too, since it's flexible.

Either works easily for users.  Setting stuff up is always the job of
the admin in both solutions.

> Able to support most filesystems including on-disk filesystems
> natively.

Shiftfs does this.  More importantly it supports subtrees, so I can
unpack an image root on to an existing filesystem and remap it into a
container.

> Able to support disk quota according to the shifted UID/GID on-disk
> values. Especially during inode creation...

Quota can be shifted, I just wasn't sure it was necessary.  If the
usual use case is for unpacked roots, chances are you want the
remapping to use the group quota of the userns owner, which they'd get
naturally, so while it's possible to remap projid, I didn't think it
needed to be done.

> Able to support ACL if requested.

Both do this.

> The user namespace mapping is kept a runtime configure option, we 
> don't pin a special mapping at any time, and of course parent creator 
> of user namespace is the one that can manipulate it, at the same time 
> the mapping is restricted according to grandpa rules and so on...
> 
> It allows unprivileged to use the VFS UID/GID shift without the
> intervention of a privileged process each time. The real privileged 
> process sets the filesystem and the mount namespace the first time, 
> then it should work for all nested namespaces and containers. It does 
> not need the intervention of init_user_ns root to set the mapping and 
> make it work, you don't have to go in and go out to setup the thing,
> etc.

Both solutions work like this.  When I use this for shifted roots of
emulation containers, it's set up once at start of day.  I then build
the containers unprivileged using newuidmap/newgidmap as I'm using
them.  Once the shifts are done at start of day, no other admin support
is required.

James


> We don't do this on behalf of filesystems, they should explicitly
> support it. procfs and other host resource virtual filesystems are 
> safe and currently they don't need shifting.
> 
> We try to fix the problem where it should be fixed, and not hide 
> it...

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-14 13:46                       ` James Bottomley
@ 2016-05-15  2:21                         ` Eric W. Biederman
  2016-05-15 15:04                           ` James Bottomley
                                             ` (3 more replies)
  2016-05-17 15:42                         ` Djalal Harouni
  1 sibling, 4 replies; 49+ messages in thread
From: Eric W. Biederman @ 2016-05-15  2:21 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Alexander Viro, Chris Mason, tytso, Serge Hallyn,
	Josh Triplett, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner

James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:

Just a couple of quick comments from a very high level design point.

- I think a shiftfs is valuable in the same way that overlayfs is
  valuable.

  Especially in the Docker case where a lot of containers want a shared
  base image (for efficiency), but it is desirable to run those
  containers in different user namespaces for safety.

- It is also the plan to make it possible to mount a filesystem where
  the uids and gids of that filesystem on disk do not have a one to one
  mapping to kernel uids and gids.  99% of the work has already been
  done for all filesystems except XFS.

  That said there are some significant issues to work through, before
  something like that can be enabled.

  * Handling of uids/gids on disk that don't map into a kuid/kgid.
  * Safety from poisoned filesystem images.

  I have slowly been working with Seth Forshee on these issues as
  the last thing I want is to introduce more security bugs right now.
  Seth being a braver man than I am has already merged his changes into
  the Ubuntu kernel.

  Right now we are targeting fuse, because fuse is already designed to
  handle poisoned filesystem images.  So to safely enable this kind of
  mapping for fuse is not a giant step.

  The big thing from my point of view is to get the VFS interfaces
  correct so that the VFS handles all of the weird cases that come up
  with uids and gids that don't map, and any other weird cases.  Keeping
  the weird bits out of the filesystems.

James, Djalal, I regret I have not been able to read through either of
your patches closely yet.  From a high level view I believe there are
use cases for both approaches, and the use cases do not necessarily
overlap.

Djalal I think you are seeing the upsides and not the practical dangers
of poisoned filesystem images.

James I think you are missing the fact that all filesystems already have
the make_kuid and make_kgid calls right where the data comes off disk,
and the from_kuid and from_kgid calls right where the on-disk data is
being created just before it goes on disk.  Which means that the actual
impact on filesystems of the translation is trivial.
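
The disk boundary described above can be pictured with a small user-space
model. make_kuid() and from_kuid() are the names of the real kernel
helpers, but the Python functions below are simplified stand-ins with
hypothetical signatures (a single extent instead of a full map), and
UserNS, read_inode() and write_inode() are invented for this sketch.

```python
from typing import NamedTuple

INVALID = -1  # stand-in for the kernel's "no mapping" result


class UserNS(NamedTuple):
    """One extent: IDs [first, first + count) map to kuids starting at kfirst."""
    first: int
    kfirst: int
    count: int


def make_kuid(ns: UserNS, uid: int) -> int:
    """Namespace-relative ID -> kernel ID, or INVALID if unmapped."""
    if ns.first <= uid < ns.first + ns.count:
        return ns.kfirst + (uid - ns.first)
    return INVALID


def from_kuid(ns: UserNS, kuid: int) -> int:
    """Kernel ID -> namespace-relative ID, or INVALID if unmapped."""
    if ns.kfirst <= kuid < ns.kfirst + ns.count:
        return ns.first + (kuid - ns.kfirst)
    return INVALID


def read_inode(fs_ns: UserNS, disk_inode: dict) -> dict:
    """Data comes off disk: on-disk owner -> in-kernel kuid."""
    return {"i_uid": make_kuid(fs_ns, disk_inode["uid"])}


def write_inode(fs_ns: UserNS, inode: dict) -> dict:
    """Data goes to disk: in-kernel kuid -> on-disk owner."""
    return {"uid": from_kuid(fs_ns, inode["i_uid"])}


fs_ns = UserNS(first=0, kfirst=100000, count=1000)
in_core = read_inode(fs_ns, {"uid": 0})
print(in_core)                           # {'i_uid': 100000}
print(write_inode(fs_ns, in_core))       # {'uid': 0}
print(read_inode(fs_ns, {"uid": 5000}))  # {'i_uid': -1}
```

The translation itself is two lookups at the edge of the filesystem. The
last line shows the harder part: an on-disk owner that has no kernel ID,
which is one of the cases that has to be handled somewhere.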

Where the actual impact on filesystems is much higher is the
infrastructure needed to ensure poisoned filesystem images do not cause
a kernel compromise.  That extends beyond the kernel to the filesystem
testing and code review process, so it is more than just a kernel
problem.  Hardening the attack surface of the disk side of filesystems
is difficult, especially without impacting filesystem performance.


So I don't think it makes sense to frame this as an either/or situation.
I think there is a need for both solutions.

Djalal if you could work with Seth I think that would be very useful.  I
know I am dragging my heels there but I really hope I can dig in and get
everything reviewed and merged soonish.

James if you could see shiftfs with a different set of merits than what
Djalal is doing, I think that would be useful, as it would allow
everyone to concentrate on getting the bugs out of their solutions.



That said I am not certain shiftfs makes sense without Seth's patches to
handle the weird cases at the VFS level.  What do you do with uids and
gids that don't map?  You can reinvent how to handle the strange cases
in shiftfs or we can work on solving this problem at the VFS level so
people don't have to go through the error prone work of reinventing
solutions.
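
One way to picture the "uids and gids that don't map" question is the
policy below. It is purely an illustrative sketch, not what shiftfs or any
of the patch sets in this thread implement: reporting unmapped owners as
the overflow ID and refusing to modify their inodes are assumptions chosen
for the example. The value 65534 is the usual default of
/proc/sys/kernel/overflowuid; to_kuid(), display_uid() and may_chmod() are
hypothetical names.

```python
OVERFLOW_UID = 65534  # usual default of /proc/sys/kernel/overflowuid
INVALID = None        # "this on-disk ID has no kuid in this mount"


def to_kuid(extent, disk_uid):
    """extent is (disk_first, kuid_first, count); INVALID if unmapped."""
    disk_first, kuid_first, count = extent
    if disk_first <= disk_uid < disk_first + count:
        return kuid_first + (disk_uid - disk_first)
    return INVALID


def display_uid(kuid):
    """Owner as reported to user space: unmapped owners show as overflow ID."""
    return OVERFLOW_UID if kuid is INVALID else kuid


def may_chmod(caller_kuid, owner_kuid):
    """Assumed policy: nobody may change an inode whose owner does not map."""
    return owner_kuid is not INVALID and caller_kuid == owner_kuid


extent = (0, 100000, 1000)
stray_owner = to_kuid(extent, 70000)               # on-disk uid outside the map
print(display_uid(stray_owner))                    # 65534
print(may_chmod(100000, stray_owner))              # False
print(may_chmod(100000, to_kuid(extent, 0)))       # True
```

Whatever the policy ends up being, doing it once in a shared layer means
each filesystem does not have to carry its own copy of checks like
may_chmod().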


The big ugly nasty in all of this is that we are fundamentally dealing
with uids and gids which are security identifiers.  Practically any bug
is exploitable and CVE worthy.  So it makes sense to tread very
carefully.  Even with care it can take months if not years to get
the number of bugs down to a level where you are not the favorite target
of people looking for exploitable kernel bugs.
 
Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-15  2:21                         ` Eric W. Biederman
@ 2016-05-15 15:04                           ` James Bottomley
  2016-05-16 14:12                           ` Seth Forshee
                                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 49+ messages in thread
From: James Bottomley @ 2016-05-15 15:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Djalal Harouni, Alexander Viro, Chris Mason, tytso, Serge Hallyn,
	Josh Triplett, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner

On Sat, 2016-05-14 at 21:21 -0500, Eric W. Biederman wrote:
> James if you could see shiftfs with a different set of merits than 
> what Djalal is doing I think that would be useful.  As it would 
> allow everyone to concentrate on getting the bugs out of their
> solutions.

Just to reply to this specific point.  Djalal's patches can't actually
work for me because I use subtree based roots rather than whole fs
roots ... it's mostly because I work with image directories, not the
full mounted images themselves.  For stuff I unpack into /home, I could
see having /home on a separate directory and adding the vfs_shift_
flags.  However, I'm not doing (and it would be really unsafe to do)
that for / to get my images that unpack in /var/tmp (like the obs build
roots).

However, half the ugliness of the patch set is that it needs lower
layer FS support because vfs_shift_ are mount flags in the superblock.
If they were made subtree flags instead (so MNT_ flags), I think you
could eliminate the need to modify any underlying filesystems and they
would allow us to mark subtrees for shifting.  The mount command would
need modifying to add them (like it was for --shared and --private) so
we'd need an additional --vfs-shift --ufs-shift to mark the subtree but
then the series would work for bind mounting subtrees, which is what I
need.  And they would work for *any* filesystem without modification.

This would probably be the best of both worlds because it will work
for the docker case as well.

James

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-15  2:21                         ` Eric W. Biederman
  2016-05-15 15:04                           ` James Bottomley
@ 2016-05-16 14:12                           ` Seth Forshee
  2016-05-16 16:42                             ` Eric W. Biederman
  2016-05-16 19:13                           ` James Bottomley
  2016-05-17 11:42                           ` Djalal Harouni
  3 siblings, 1 reply; 49+ messages in thread
From: Seth Forshee @ 2016-05-16 14:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: James Bottomley, Djalal Harouni, Alexander Viro, Chris Mason,
	tytso, Serge Hallyn, Josh Triplett, Andy Lutomirski,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Dave Chinner

On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
>   I have slowly been working with Seth Forshee on these issues as
>   the last thing I want is to introduce more security bugs right now.
>   Seth being a braver man than I am has already merged his changes into
>   the Ubuntu kernel.

Maybe not quite so brave as you think. I also threw on a patch to
disable the feature unless explicitly enabled by a sys admin.

> James I think you are missing the fact that all filesystems already have
> the make_kuid and make_kgid calls right where the data comes off disk,
> and the from_kuid and from_kgid calls right where the on-disk data is
> being created just before it goes on disk.  Which means that the actual
> impact on filesystems of the translation is trivial.

It is fairly simple but there's a bit more than just id conversions to
change. With ext4 I found that there were mount options which needed to
be restricted, some capability checks to update, and access to external
journal devices must be checked. In all it wasn't a whole lot of changes
to the filesystem though. Fuse was a bit more involved, but the
complexities there won't apply to other filesystems.

> Djalal if you could work with Seth I think that would be very useful.  I
> know I am dragging my heels there but I really hope I can dig in and get
> everything reviewed and merged soonish.

That would make me very happy :-)

I'm happy to look with Djalal for commonalities. I did skim his patches
before, and based on that all I really expect to find are things related
to permission checks when ids don't map. The rest seems fundamentally
different.

Seth

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-16 14:12                           ` Seth Forshee
@ 2016-05-16 16:42                             ` Eric W. Biederman
  2016-05-16 18:25                               ` Seth Forshee
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2016-05-16 16:42 UTC (permalink / raw)
  To: Seth Forshee
  Cc: James Bottomley, Djalal Harouni, Alexander Viro, Chris Mason,
	tytso, Serge Hallyn, Josh Triplett, Andy Lutomirski,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Dave Chinner

Seth Forshee <seth.forshee@canonical.com> writes:

> On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
>>   I have slowly been working with Seth Forshee on these issues as
>>   the last thing I want is to introduce more security bugs right now.
>>   Seth being a braver man than I am has already merged his changes into
>>   the Ubuntu kernel.
>
> Maybe not quite so brave as you think. I also threw on a patch to
> disable the feature unless explicitly enabled by a sys admin.
>
>> James I think you are missing the fact that all filesystems already have
>> the make_kuid and make_kgid calls right where the data comes off disk,
>> and the from_kuid and from_kgid calls right where the on-disk data is
>> being created just before it goes on disk.  Which means that the actual
>> impact on filesystems of the translation is trivial.
>
> It is fairly simple but there's a bit more than just id conversions to
> change. With ext4 I found that there were mount options which needed to
> be restricted, some capability checks to update, and access to external
> journal devices must be checked. In all it wasn't a whole lot of changes
> to the filesystem though. Fuse was a bit more involved, but the
> complexities there won't apply to other filesystems.
>
>> Djalal if you could work with Seth I think that would be very useful.  I
>> know I am dragging my heels there but I really hope I can dig in and get
>> everything reviewed and merged soonish.
>
> That would make me very happy :-)

It has missed this merge window :( But I am aiming to review them and
get your patches (or modified versions of your patches) into my tree as
soon after rc1 as humanly possible.

Part of that will have to be the fix for mqueuefs, that Docker just hit.

> I'm happy to look with Djalal for commonalities. I did skim his patches
> before, and based on that all I really expect to find are things related
> to permission checks when ids don't map. The rest seems fundamentally
> different.

Hmm.  Then I may have to look closer at what Djalal is doing.  It
sounded like what you were doing, and if it is not, I will have to
scratch my head.

That said, yes.  The biggy is getting the VFS changes to handle all of
the weird translation corner cases etc. (that are part of your patches).

Eric

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-16 16:42                             ` Eric W. Biederman
@ 2016-05-16 18:25                               ` Seth Forshee
  0 siblings, 0 replies; 49+ messages in thread
From: Seth Forshee @ 2016-05-16 18:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: James Bottomley, Djalal Harouni, Alexander Viro, Chris Mason,
	tytso, Serge Hallyn, Josh Triplett, Andy Lutomirski,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Dave Chinner

On Mon, May 16, 2016 at 11:42:46AM -0500, Eric W. Biederman wrote:
> Seth Forshee <seth.forshee@canonical.com> writes:
> 
> > On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
> >>   I have slowly been working with Seth Forshee on these issues as
> >>   the last thing I want is to introduce more security bugs right now.
> >>   Seth being a braver man than I am has already merged his changes into
> >>   the Ubuntu kernel.
> >
> > Maybe not quite so brave as you think. I also threw on a patch to
> > disable the feature unless explicitly enabled by a sys admin.
> >
> >> James I think you are missing the fact that all filesystems already have
> >> the make_kuid and make_kgid calls right where the data comes off disk,
> >> and the from_kuid and from_kgid calls right where the on-disk data is
> >> being created just before it goes on disk.  Which means that the actual
> >> impact on filesystems of the translation is trivial.
> >
> > It is fairly simple but a there's bit more that just id conversions to
> > change. With ext4 I found that there were mount options which needed to
> > be restricted, some capability checks to update, and access to external
> > journal devices must be checked. In all it wasn't a whole lot of changes
> > to the filesystem though. Fuse was a bit more involved, but the
> > complexities there won't apply to other filesystems.
> >
> >> Djalal if you could work with Seth I think that would be very useful.  I
> >> know I am dragging my heels there but I really hope I can dig in and get
> >> everything reviewed and merged soonish.
> >
> > That would make me very happy :-)
> 
> It has missed this merge window :( But I am hoping with am aiming to
> review them and get your patches (or modified versions of your patches)
> into my tree as soon after rc1 as humanly possible.
> 
> Part of that will have to be the fix for mqueuefs, that Docker just hit.

Yeah, I've got a patch that's been tested to fix the bug, so I'll send
new patches which include that before long.

Seth

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-15  2:21                         ` Eric W. Biederman
  2016-05-15 15:04                           ` James Bottomley
  2016-05-16 14:12                           ` Seth Forshee
@ 2016-05-16 19:13                           ` James Bottomley
  2016-05-17 22:40                             ` Eric W. Biederman
  2016-05-17 11:42                           ` Djalal Harouni
  3 siblings, 1 reply; 49+ messages in thread
From: James Bottomley @ 2016-05-16 19:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Djalal Harouni, Alexander Viro, Chris Mason, tytso, Serge Hallyn,
	Josh Triplett, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner

On Sat, 2016-05-14 at 21:21 -0500, Eric W. Biederman wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> 
> > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> 
> Just a couple of quick comments from a very high level design point.
> 
> - I think a shiftfs is valuable in the same way that overlayfs is
>   valuable.
> 
>   Esepcially in the Docker case where a lot of containers want a shared
>   base image (for efficiency), but it is desirable to run those
>   containers in different user namespaces for safety.
> 
> - It is also the plan to make it possible to mount a filesystem where
>   the uids and gids of that filesystem on disk do not have a one to one
>   mapping to kernel uids and gids.  99% of the work has already be done,
>   for all filesystem except XFS.

Can you elaborate a bit more on why we want to do this?  I think only
having a single shift of uid_t to kuid_t across the kernel to user
boundary is a nice feature of user namespaces.  Architecturally, it's
not such a big thing to do it as the data goes on to the disk as well,
but what's the use case for it?

>   That said there are some significant issues to work through, before
>   something like that can be enabled.
> 
>   * Handling of uids/gids on disk that don't map into a kuid/kgid.

So I think this is nicely handled in the capability checks in
generic_permission() (capable_wrt_inode_uidgid()).  Is there a need to
make it more complex (and thus more error prone)?

>   * Safety from poisoned filesystem images.

By poisoned FS image, you mean an image over whose internal data the
user has control?  The basic problem of how do we give users write
access to data devices they can then cause to be mounted as
filesystems?

>   I have slowly been working with Seth Forshee on these issues as
>   the last thing I want is to introduce more security bugs right now.
>   Seth being a braver man than I am has already merged his changes into
>   the Ubuntu kernel.
> 
>   Right now we are targeting fuse, because fuse is already designed to
>   handle poisoned filesystem images.  So to safely enable this kind of
>   mapping for fuse is not a giant step.
> 
>   The big thing from my point of view is to get the VFS interfaces
>   correct so that the VFS handles all of the weird cases that come up
>   with uids and gids that don't map, and any other weird cases.  Keeping
>   the weird bits out of the filesystems.

If by VFS interfaces, you mean where we've already got the mapping 
confined, absolutely.

> James I think you are missing the fact that all filesystems already 
> have the make_kuid and make_kgid calls right where the data comes off
> disk,

I beg to differ: they certainly don't.  The underlying filesystem
populates the inode in ->lookup with the data off the disk, which goes
into the inode as a kuid_t/kgid_t.  It remains in the inode in that form
forever.  We convert it as it goes out of the kernel in the stat calls
(actually cp_old_stat()/cp_new_stat() in stat.c).

>  and the from_kuid and from_kgid calls right where the on-disk data
> is being created just before it goes on disk.  Which means that the
> actual impact on filesystems of the translation is trivial.

Are you looking at a different tree from me?  I'm actually just looking
at Linus' git head.

James

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-15  2:21                         ` Eric W. Biederman
                                             ` (2 preceding siblings ...)
  2016-05-16 19:13                           ` James Bottomley
@ 2016-05-17 11:42                           ` Djalal Harouni
  3 siblings, 0 replies; 49+ messages in thread
From: Djalal Harouni @ 2016-05-17 11:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: James Bottomley, Alexander Viro, Chris Mason, tytso,
	Serge Hallyn, Josh Triplett, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Dave Chinner

Hi Eric,

On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> 
> > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> 
> Just a couple of quick comments from a very high level design point.
> 
> - I think a shiftfs is valuable in the same way that overlayfs is
>   valuable.
> 
>   Esepcially in the Docker case where a lot of containers want a shared
>   base image (for efficiency), but it is desirable to run those
>   containers in different user namespaces for safety.
> 
> - It is also the plan to make it possible to mount a filesystem where
>   the uids and gids of that filesystem on disk do not have a one to one
>   mapping to kernel uids and gids.  99% of the work has already be done,
>   for all filesystem except XFS.
> 
>   That said there are some significant issues to work through, before
>   something like that can be enabled.
> 
>   * Handling of uids/gids on disk that don't map into a kuid/kgid.
>   * Safety from poisoned filesystem images.
> 
>   I have slowly been working with Seth Forshee on these issues as
>   the last thing I want is to introduce more security bugs right now.
>   Seth being a braver man than I am has already merged his changes into
>   the Ubuntu kernel.
> 
>   Right now we are targeting fuse, because fuse is already designed to
>   handle poisoned filesystem images.  So to safely enable this kind of
>   mapping for fuse is not a giant step.

Alright!

>   The big thing from my point of view is to get the VFS interfaces
>   correct so that the VFS handles all of the weird cases that come up
>   with uids and gids that don't map, and any other weird cases.  Keeping
>   the weird bits out of the filesystems.

Indeed, I totally agree here.


> James, Djalal  I regert I have not been able to read through either of
> your patches cloesely yet.  From a high level view I believe there are
> use cases for both approaches, and the use cases do not necessarily
> overlap.
> 
> Djalal I think you are seeing the upsides and not the practical dangers
> of poisoned filesystem images.

Thanks for your reply Eric, I will let you sleep on the approach. Yes,
it's a totally different thing; I think you can consider it a first
step towards using filesystems inside user namespaces safely. Real root
is still the only one who mounts and sets the mount namespace shift
flag, which can then be inherited by unprivileged userns users. So real
root is *still* in control of things. The solution is flexible. At the
same time you have the fuse patches for those who want unprivileged
mounts, and what comes later depends on how things evolve and improve...

The real problem seems to be poisoned filesystem images, ok I agree.
However, at the moment this series assumes that only real root mounts
the filesystems that will be used inside user namespaces.

So nothing really changes, just consider it like this:
1) root of init_user_ns mounts filesystems with the mount shift flags
and creates the shifted mount namespace.
2) access is then given to inodes whose inode->{uid/gid} match the
inside mapping of the calling process. This is like real root doing a
recursive chown of the files to give rwx permission, but without
touching the real disk. Everything is virtual.
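
As a rough illustration of the two steps above, here is a small
userspace model in Python. It is only a sketch and not code from this
series: the Extent class, shift_to_kuid() and may_access() are
hypothetical names, and the 0:1000000:65536 range is just the example
mapping from the cover letter. It shows the on-disk owner staying
unchanged while the owner check is done against a shifted, virtual view.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Extent:
    """One uid_map style entry, written inside:outside:count."""
    inside: int
    outside: int
    count: int


def shift_to_kuid(extent: Extent, disk_uid: int) -> Optional[int]:
    """Return the virtual (shifted) owner for an on-disk uid, or None
    when the on-disk uid is not covered by the inside range."""
    if extent.inside <= disk_uid < extent.inside + extent.count:
        return extent.outside + (disk_uid - extent.inside)
    return None


def may_access(extent: Extent, caller_kuid: int, disk_uid: int) -> bool:
    """Owner check against the shifted view; nothing is written to disk."""
    return shift_to_kuid(extent, disk_uid) == caller_kuid


# Hypothetical container whose root is global uid 1000000, using a
# root filesystem whose files are owned by uid 0 on disk.
container = Extent(inside=0, outside=1000000, count=65536)
disk_uid = 0

print(may_access(container, 1000000, disk_uid))  # container root: True
print(may_access(container, 1000, disk_uid))     # unrelated host user: False
print(disk_uid)                                  # on-disk owner is still 0
```

The point of the model is only that the shift happens at check time, so
the same image can be reused with another mapping without any chown.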

So nothing really changes for poisoned filesystems since unprivileged
users can't mount them; only real root is able to do so, and root can
verify the image before doing so...

Now, the problem that I can see is if there are some special inodes
related to these filesystems and host resources that are marked 0400
for real root only. In that case we have to add the needed capability
check, i.e. capable() in init_user_ns. ioctls, I guess, are already
safe since they should have the appropriate capable() check, but I will
check that of course.

Now, as Seth has been working with fuse mounts, and I guess they will be
merged, I will of course check with him so everything is synced and that
this approach will continue to work after his patches are merged.


> James I think you are missing the fact that all filesystems already have
> the make_kuid and make_kgid calls right where the data comes off disk,
> and the from_kuid and from_kgid calls right where the on-disk data is
> being created just before it goes on disk.  Which means that the actual
> impact on filesystems of the translation is trivial.
> 
> Where the actual impact of filesystems is much higher is the
> infrastructure needed to ensure poisoned filesystem images do not cause
> a kernel compromise.  That extends to the filesystem testing and code
> review process beyond and is more than just a kernel problem.  Hardening
> that attack surface of the disk side of filesystems is difficult
> especially when not impacting filesystem performance.
> 
> 
> So I don't think it makes sense to frame this as an either/or situation.
> I think there is a need for both solutions.
> 
> Djalal if you could work with Seth I think that would be very useful.  I
> know I am dragging my heels there but I really hope I can dig in and get
> everything reviewed and merged soonish.

Alright!


> James if you could see shiftfs with a different set of merits than what
> to Djalal is doing I think that would be useful.  As it would allow
> everyone to concentrate on getting the bugs out of their solutions.
> 
> 
> 
> That said I am not certain shiftfs makes sense without Seth's patches to
> handle the weird cases at the VFS level.    What do you do with uids and
> gids that don't map?  You can reinvent how to handle the strange cases
> in shfitfs or we can work on solving this problem at the VFS level so
> people don't have to go through the error prone work of reinventing
> solutions.
> 
> 
> The big ugly nasty in all of this is that we are fundamentally dealing
> with uids and gids which are security identifiers.  Practically any bug
> is exploitable and CVE worthy.  So it make sense to tread very
> carefully.  Even with care it can takes months if not years to get
> the number of bugs down to a level where you are not the favorite target
> of people looking for exploitable kernel bugs.

I totally share this concern, that's why this RFC was designed like this,
when you have time please check it, thanks!

Just for the record, I had a series that worked with overlayfs and
updated current_fsuid() to match inodes in order to give access, and I
later dropped it for a better solution. In the end I'm pretty sure that
this should be handled inside VFS, without messing with creds or
current_fsuid(), since they are global values that cross user
namespaces.

> Eric

-- 
Djalal Harouni
http://opendz.org

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-14 13:46                       ` James Bottomley
  2016-05-15  2:21                         ` Eric W. Biederman
@ 2016-05-17 15:42                         ` Djalal Harouni
  1 sibling, 0 replies; 49+ messages in thread
From: Djalal Harouni @ 2016-05-17 15:42 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alexander Viro, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner

On Sat, May 14, 2016 at 06:46:54AM -0700, James Bottomley wrote:
> On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> > On Thu, May 12, 2016 at 03:24:12PM -0700, James Bottomley wrote:
> > > On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> > > > On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:

[...]
> > In this series we don't hijack setfsuid() in an indirect way, 
> > setfsuid maps UIDs into current userns according to rules set by 
> > parent. Changing current_fsuid() to some other mapping is a way to 
> > allow processes to bypass that and use it to access other inodes...
> > This should not change and fsuid should continue to follow these
> > rules...
> 
> Both solutions do this

James, I don't update current_fsuid() nor any other creds field in this
RFC. Here is the reason. Say I have a pinned mapping of 0:100000:65536
that containers or apps want to use for their own purpose. An app X is
started by a privileged process, which sets its global uid to 100000 and
its current user namespace mapping to 0:100000:65536. App X then forks
another app Y with global uid 100000, sandboxes it, hides other
processes from it, and sets its user namespace mapping to 1000:100000:1.
It does the same for app Z with 2000:100000:1, and restricts the set of
syscalls for both Y and Z. Even with all this, they will be able to keep
their access to inode->i_uid == 0, which we don't want since we don't
give them a mapping to 0. We just want them to access
inode->i_uid == 1000 for app Y and 2000 for app Z. They cross user
namespaces; they use another mapping. And if Y forks another app, even
if that app sets a new userns mapping with a new restricted range, it
will continue to use the old range of 65536 and inodes will show up with
real uids instead of nobody.
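
To make the three mappings concrete, here is a small Python model of
how a global uid would be displayed inside X, Y and Z if each namespace
used only its own mapping. It is a sketch under assumptions: the MAPS
table and the show_uid() helper are hypothetical, the outside ids are
treated as global values exactly as written above, and 65534 is the
default overflow uid ("nobody").

```python
OVERFLOW_UID = 65534  # "nobody", the default overflow uid

# inside:outside:count, as in the mappings named above
MAPS = {
    "X": (0, 100000, 65536),
    "Y": (1000, 100000, 1),
    "Z": (2000, 100000, 1),
}


def show_uid(ns: str, kuid: int) -> int:
    """Model of how a global uid is displayed inside namespace `ns`:
    mapped ids are translated, everything else shows up as nobody."""
    inside, outside, count = MAPS[ns]
    if outside <= kuid < outside + count:
        return inside + (kuid - outside)
    return OVERFLOW_UID


for ns in ("X", "Y", "Z"):
    # global uid 100000 is mapped everywhere, 100001 only inside X
    print(ns, show_uid(ns, 100000), show_uid(ns, 100001))
```

With these tables, global uid 100000 is 0 in X, 1000 in Y and 2000 in
Z, while global uid 100001 is visible only in X and is nobody in Y and
Z, which is the restricted view Y and Z are supposed to get.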


> > A cred->fsuid solution is safe or used to be safe only inside
> > init_user_ns where there is always a mapping or in context of current
> > user namespace. In an other user namespace with 0:1000:1 mapping, 
> >  you can't set it to arbitrary mapping like 0:4000:1... It will give
> > confined processes access to inodes that satisfy the kuid_t 4000 
> > mapping and which the app/container wants to deny, they only want
> > 0:1000:1. ..
> 
> OK, so both solutions are safe here too.  Your safety comes from only
> remapping in the userns; mine comes from the normal filesystem acl
> rules: either the userns for different users all have disjoint ids
> regulated by /etc/subuidmap or they're all using the same one (like
> docker 1.10) in either case, you could regulate by having the mount
> under a directory which is accessible only to the userns owner.

Please see the comment above. Nested unprivileged apps may want to
restrict syscall operations and access to inodes. Maybe we don't want
the forked sandboxed app to have access to inodes, and that will be
hard if not impossible if you update global creds each time...


> > We don't cross user namespaces, we don't use different mappings for
> > cred->uid, cred->fsuid...  A clean solution is to shift inodes 
> > UID/GID and not change fsuid to cross namespaces. Not to mention how 
> > it may interact with capabilities...
> 
> This is a subjective question on what constitutes "clean".  I think we
> both think the other solution isn't clean, so that's for others to
> adjudicate.

If you see it that way :-) I just want filesystem access from a user
namespace in the safest way possible. If there is a better solution or
if my patches are buggy, I'll drop them... no problem!


> > We follow user namespace rules and we keep "the parent defines a 
> > range that the children can't escape" semantics.  There is a clear 
> > relation between user namespaces that should not be broken.
> 
> OK, so I separated the problem into a userns one, which remaps for the
> processes in user space, and a vfs one which remaps the on-disk id. 
>  However, they could be combined by allowing the userns to mount
> shiftfs but only on designated filesystems and setting the uidmappings
> to the same ones as the userns.
> 
> > We explicitly don't define a new user namespace mapping nor add a new
> > interface for the simple reason it's: *too complicated*. We can do 
> > that, but no thanks! May be in future if there is a real need or 
> > things are clear... The current user namespace interface is getting 
> > standard and stable, so we just keep it that way and make it
> > consistant inside VFS.
> 
> I don't accept the too complicated point.  For fully unprivileged
> containers, the host admin already has to set up the subuid/subgid map
> files which is most of the complexity.  Once that's done, the same maps
> can be used to shift mount.  Once it's all set up, no further
> intervention is required.

Well, please check my first comment. In this RFC you don't always have
to be real root or a privileged parent to do so... it allows nesting,
since it seems that the maintainers want nesting support.


> > We give VFS control of that, and we make mount namespaces the central
> > part of this whole logic.
> 
> Right, that's what causes the logic to thread throughout the entire vfs
> and into the fs layer.  The fundamental point of difference is that I'd
> like a solution which encapsulates the problem rather than exposing it
> to the vfs.
> 
> > We make admins life easier where they can pull container images, root
> > filesystems from containers/apps hubs... verify the signature and 
> > start them with different mappings according to host resources... We 
> > don't want them to do anything. The design was planned to make it 
> > easier for users, it should work out of the box, and it can be used 
> > to handle complex stuff too, since it's flexible.
> 
> Either works easily for users.  Setting stuff up is always the job of
> the admin in both solutions.


Hmm, I don't agree here, things should be safe by default and work out
of the box without the intervention of the admin.


> > Able to support most filesystems including on-disk filesystems
> > natively.
> 
> Shiftfs does this.  More importantly it supports subtrees, so I can
> unpack an image root on to an existing filesystem and remap it into a
> container.

This RFC should support subtrees of course, the mapping is done in the
context of the mount namespace of the caller.


> > Able to support disk quota according to the shifted UID/GID on-disk
> > values. Especially during inode creation...
> 
> Quota can be shifted, I just wasn't sure it was necessary.  If the
> usual use case is for unpacked roots, chances are you want the
> remapping to use the group quota of the userns owner, which they'd get
> naturally so, while it's possible to remap projid, I didn't think it
> needed to be done.
> 
> > Able to support ACL if requested.
> 
> Both do this.
> 
> > The user namespace mapping is kept a runtime configure option, we 
> > don't pin a special mapping at any time, and of course parent creator 
> > of user namespace is the one that can manipulate it, at the same time 
> > the mapping is restricted according to grandpa rules and so on...
> > 
> > It allows unprivileged to use the VFS UID/GID shift without the
> > intervention of a privileged process each time. The real privileged 
> > process sets the filesystem and the mount namespace the first time, 
> > then it should work for all nested namespaces and containers. It does 
> > not need the intervation of  init_user_ns root to set the mapping and 
> > make it work, you don't have to go in and go out to setup the thing,
> > etc.
> 
> Both solutions work like this.  When I use this for shifted roots of
> emulation containers, it's set up once at start of day.  I then build
> the containers unprivileged using newsubuid/newsubgid as I'm using
> them.  Once the shifts are done at start of day, no other admin support
> is required.

This RFC does not require the intervention of the admin or a real root
process to adapt the mapping. Once the filesystem is mounted with the
shift options and the mount namespace is created, everything is
inherited and you can use real separation for new nested containers/apps
if you want... you don't need the intervention of a privileged entity to
adapt the mapping at the start or afterwards...

Just take a stock rootfs or an image and use it.

Please note that the approach this RFC takes was never discussed... I'll
let everyone sleep on it and see later after the merge window. Thanks!


> James
> 
> 
> > We don't do this on behalf of filesystems, they should explicitly
> > support it. procfs and other host resource virtual filesystems are 
> > safe and currently they don't need shifting.
> > 
> > We try to fix the problem where it should be fixed, and not hide 
> > it...
> 
> 

-- 
Djalal Harouni
http://opendz.org

* Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
  2016-05-16 19:13                           ` James Bottomley
@ 2016-05-17 22:40                             ` Eric W. Biederman
  0 siblings, 0 replies; 49+ messages in thread
From: Eric W. Biederman @ 2016-05-17 22:40 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Alexander Viro, Chris Mason, tytso, Serge Hallyn,
	Josh Triplett, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Dave Chinner

James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> On Sat, 2016-05-14 at 21:21 -0500, Eric W. Biederman wrote:
>> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
>> 
>> > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
>> 
>> Just a couple of quick comments from a very high level design point.
>> 
>> - I think a shiftfs is valuable in the same way that overlayfs is
>>   valuable.
>> 
>>   Esepcially in the Docker case where a lot of containers want a shared
>>   base image (for efficiency), but it is desirable to run those
>>   containers in different user namespaces for safety.
>> 
>> - It is also the plan to make it possible to mount a filesystem where
>>   the uids and gids of that filesystem on disk do not have a one to one
>>   mapping to kernel uids and gids.  99% of the work has already be done,
>>   for all filesystem except XFS.
>
> Can you elaborate a bit more on why we want to do this?  I think only
> having a single shift of uid_t to kuid_t across the kernel to user
> boundary is a nice feature of user namespaces.  Architecturally, it's
> not such a big thing to do it as the data goes on to the disk as well,
> but what's the use case for it?

fuse/nfs or just plain sanity.  As the data comes off disk we convert it
into the kernel internal form kuid_t and kgid_t.   For shiftfs this
would be converting the uids when they come from your underlying
filesystem to the upper level vfs abstractions.

Converting to the kernel form for a filesystem such as fuse, which does
all that is necessary to keep evil users from breaking the kernel, means
that we can allow users in a user namespace to mount fuse themselves and
supply whatever uids and gids they want in the fuse messages.  If the
uids/gids don't map from the mounting user's user namespace into the
kernel then we set inode->i_uid to INVALID_UID.

That is all we ask of a filesystem, and we are sorting out the rest in
the VFS as nothing sets INVALID_UID in inode->i_uid today.
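
A minimal Python model of that rule, for illustration only. The
make_kuid() below merely mimics the kernel helper of the same name over
a list of (inside, outside, count) extents; the Inode class, the -1
stand-in for INVALID_UID and the 0:100000:65536 map are assumptions made
for the example.

```python
INVALID_UID = -1  # stands in for the kernel's INVALID_UID, (uid_t)-1


def make_kuid(uid_map, uid):
    """Map a uid supplied by the filesystem into a global kuid.

    uid_map is a list of (inside, outside, count) extents belonging to
    the user namespace of whoever mounted the filesystem.
    """
    for inside, outside, count in uid_map:
        if inside <= uid < inside + count:
            return outside + (uid - inside)
    return INVALID_UID  # no extent covers this uid


class Inode:
    """Toy inode that only carries an owner."""

    def __init__(self):
        self.i_uid = INVALID_UID


mounter_map = [(0, 100000, 65536)]

mapped = Inode()
mapped.i_uid = make_kuid(mounter_map, 1000)     # inside the mounter's range

unmapped = Inode()
unmapped.i_uid = make_kuid(mounter_map, 70000)  # outside the mounter's range

print(mapped.i_uid)                   # 101000
print(unmapped.i_uid == INVALID_UID)  # True
```

The filesystem only has to store what make_kuid() returns; deciding what
an INVALID_UID owner means for permission checks is the part left to the
VFS.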


>>   That said there are some significant issues to work through, before
>>   something like that can be enabled.
>> 
>>   * Handling of uids/gids on disk that don't map into a kuid/kgid.
>
> So I think this is nicely handled in the capability checks in
> generic_permission() (capable_wrt_inode_uidgid()) is there a need to
> make it more complex (and thus more error prone)?

No, just a need to handle INVALID_UID and INVALID_GID, which we don't
handle today.

>>   * Safety from poisoned filesystem images.
>
> By poisoned FS image, you mean an image over whose internal data the
> user has control?  The basic problem of how do we give users write
> access to data devices they can then cause to be mounted as
> filesystems?

Yes.  For fuse, except for uids and gids, this is already solved.  For
most other filesystems it is a whole new world of horror.

The general case of evil usb devices (think android) that look like
block devices but can return whatever they want already exists in the
wild.

>>   I have slowly been working with Seth Forshee on these issues as
>>   the last thing I want is to introduce more security bugs right now.
>>   Seth being a braver man than I am has already merged his changes into
>>   the Ubuntu kernel.
>> 
>>   Right now we are targeting fuse, because fuse is already designed to
>>   handle poisoned filesystem images.  So to safely enable this kind of
>>   mapping for fuse is not a giant step.
>> 
>>   The big thing from my point of view is to get the VFS interfaces
>>   correct so that the VFS handles all of the weird cases that come up
>>   with uids and gids that don't map, and any other weird cases.  Keeping
>>   the weird bits out of the filesystems.
>
> If by VFS interfaces, you mean where we've already got the mapping 
> confined, absolutely.

Yes.  It is just making certain we handle the INVALID_UID and
INVALID_GID that result from a mapping failure, as we don't handle that
in 4.6.0.

>> James I think you are missing the fact that all filesystems already 
>> have the make_kuid and make_kgid calls right where the data comes off
>> disk,
>
> I beg to differ: they certainly don't.  The underlying filesystem
> populates the inode in ->lookup with the data off the disk which goes
> into the inode as a kuid_t/kgid_t  It remains forever in the inode as
> that.  We convert it as it goes out of the kernel in the stat calls
> (actually stat.c:cp_old/new_stat())

They do.  i_uid_write calls make_kuid to map the incoming uid from
disk into a kuid_t.  That is all I was referring to.

The only thing I am looking at infrastructure-wise is to make it so that
we cleanly handle the case when the first parameter to make_kuid is not
&init_user_ns.  That is the core point of Seth's work.

>>  and the from_kuid and from_kgid calls right where the on-disk data
>> is being created just before it goes on disk.  Which means that the
>> actual impact on filesystems of the translation is trivial.
>
> Are you looking at a different tree from me?  I'm actually just looking
> at Linus git head.

Take a look at i_uid_read and i_gid_read.  They are inline functions in
fs.h.
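
For readers without the tree at hand, the round trip can be modelled in
a few lines of Python. This is a sketch, not the kernel source: the
UserNS and Inode classes are hypothetical, and the namespace is made a
per-inode attribute here only to show what changes when it is not
init_user_ns.

```python
INVALID_UID = -1  # stand-in for the kernel's INVALID_UID


class UserNS:
    """Toy user namespace with a single inside:outside:count extent."""

    def __init__(self, inside, outside, count):
        self.inside, self.outside, self.count = inside, outside, count

    def make_kuid(self, uid):
        # namespace-local uid -> global kuid
        if self.inside <= uid < self.inside + self.count:
            return self.outside + (uid - self.inside)
        return INVALID_UID

    def from_kuid(self, kuid):
        # global kuid -> namespace-local uid
        if self.outside <= kuid < self.outside + self.count:
            return self.inside + (kuid - self.outside)
        return INVALID_UID


class Inode:
    def __init__(self, ns):
        self.ns = ns
        self.i_uid = INVALID_UID


def i_uid_write(inode, uid):
    """Called where the owner comes off disk."""
    inode.i_uid = inode.ns.make_kuid(uid)


def i_uid_read(inode):
    """Called where the owner is about to go back to disk."""
    return inode.ns.from_kuid(inode.i_uid)


init_user_ns = UserNS(0, 0, 2**32 - 1)   # identity mapping
container_ns = UserNS(0, 100000, 65536)  # hypothetical shifted mapping

host_inode = Inode(init_user_ns)
i_uid_write(host_inode, 1000)
print(host_inode.i_uid, i_uid_read(host_inode))            # 1000 1000

container_inode = Inode(container_ns)
i_uid_write(container_inode, 1000)
print(container_inode.i_uid, i_uid_read(container_inode))  # 101000 1000
```

With the identity mapping the conversion is invisible, which is probably
why it is easy to overlook; with any other mapping the in-kernel value
differs from the on-disk one while the round trip still returns the
original uid.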

Eric

Thread overview: 49+ messages
2016-05-04 14:26 [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 1/8] VFS: add CLONE_MNTNS_SHIFT_UIDGID flag to allow mounts to shift their UIDs/GIDs Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 2/8] VFS:uidshift: add flags and helpers to shift UIDs and GIDs to virtual view Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid Djalal Harouni
2016-05-04 23:19   ` Serge Hallyn
2016-05-05 13:05     ` Seth Forshee
2016-05-05 22:40       ` Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 4/8] VFS:userns: shift UID/GID to virtual view during permission access Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 5/8] VFS:userns: add helpers to shift UIDs and GIDs into on-disk view Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 6/8] VFS:userns: shift UID/GID to on-disk view before any write to disk Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 7/8] ext4: add support for vfs_shift_uids and vfs_shift_gids mount options Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 8/8] btrfs: " Djalal Harouni
2016-05-04 16:34 ` [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Josh Triplett
2016-05-04 21:06 ` James Bottomley
2016-05-05  7:36   ` Djalal Harouni
2016-05-05 11:56     ` James Bottomley
2016-05-05 21:49       ` Djalal Harouni
2016-05-05 22:08         ` James Bottomley
2016-05-10 23:36           ` James Bottomley
2016-05-11  0:38             ` Al Viro
2016-05-11  0:53             ` Al Viro
2016-05-11  3:47               ` James Bottomley
2016-05-11 16:42             ` Djalal Harouni
2016-05-11 18:33               ` James Bottomley
2016-05-12 19:55                 ` Djalal Harouni
2016-05-12 22:24                   ` James Bottomley
2016-05-14  9:53                     ` Djalal Harouni
2016-05-14 13:46                       ` James Bottomley
2016-05-15  2:21                         ` Eric W. Biederman
2016-05-15 15:04                           ` James Bottomley
2016-05-16 14:12                           ` Seth Forshee
2016-05-16 16:42                             ` Eric W. Biederman
2016-05-16 18:25                               ` Seth Forshee
2016-05-16 19:13                           ` James Bottomley
2016-05-17 22:40                             ` Eric W. Biederman
2016-05-17 11:42                           ` Djalal Harouni
2016-05-17 15:42                         ` Djalal Harouni
2016-05-04 23:30 ` Serge Hallyn
2016-05-06 14:38   ` Djalal Harouni
2016-05-09 16:26     ` Serge Hallyn
2016-05-10 10:33       ` Djalal Harouni
2016-05-05  0:23 ` Dave Chinner
2016-05-05  1:44   ` Andy Lutomirski
2016-05-05  2:25     ` Dave Chinner
2016-05-05  3:29       ` Andy Lutomirski
2016-05-05 22:34     ` Djalal Harouni
2016-05-05 22:24   ` Djalal Harouni
2016-05-06  2:50     ` Dave Chinner
2016-05-12 19:47       ` Djalal Harouni
