linux-unionfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/3] introduce a uid/gid shifting bind mount
@ 2020-02-17 20:53 James Bottomley
  2020-02-17 20:53 ` [PATCH v3 1/3] fs: rethread notify_change to take a path instead of a dentry James Bottomley
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: James Bottomley @ 2020-02-17 20:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: David Howells, Christian Brauner, Al Viro, Miklos Szeredi,
	Seth Forshee, linux-unionfs, Amir Goldstein,
	Stéphane Graber, Eric Biederman, Aleksa Sarai,
	Serge E . Hallyn, Tycho Andersen, containers

The object of this series is to replace shiftfs with a proper uid/gid
shifting bind mount instead of the shiftfs hack of introducing
something that looks similar to an overlay filesystem to do it.

The VFS still has the problem that in order to tell what vfsmount a
dentry belongs to, struct path would have to be threaded everywhere
struct dentry currently is.  However, this patch is structured only to
require a rethreading of notify_change.  The rest of the knowledge
that a shift is in operation is carried in the task structure by
caching the unshifted credentials.

Note that although it is currently dependent on the new configfd
interface for bind mounts, only patch 3/3 relies on this, and the
whole thing could be redone as a syscall or any other mechanism
(depending on how people eventually want to fix the problem with the
new fsconfig mechanism being unable to reconfigure bind mounts).

The changes from v2 are I've added Amir's reviewed-by for the
notify_change rethreading and I've implemented Serge's request for a
base offset shift for the image.  It turned out to be much harder to
implement a simple linear shift than simply to do it through a
different userns, so that's how I've done it.  The userns you need to
set up for the offset shifted image is one where the interior uid
would see the shifted image as fake root.  I've introduced an
additional "ns" config parameter, which must be specified when
building the allow shift mount point (so it's done by the admin, not
by the unprivileged user).  I've also taken care that the image
shifted to zero (real root) is never visible in the filesystem.  Patch
3/3 explains how to use the additional "ns" parameter.

James

---

James Bottomley (3):
  fs: rethread notify_change to take a path instead of a dentry
  fs: introduce uid/gid shifting bind mount
  fs: expose shifting bind mount to userspace

 drivers/base/devtmpfs.c   |   8 ++-
 fs/attr.c                 | 131 ++++++++++++++++++++++++++++++++++++++--------
 fs/bind.c                 | 105 +++++++++++++++++++++++++++++++++----
 fs/cachefiles/interface.c |   6 ++-
 fs/coredump.c             |   4 +-
 fs/ecryptfs/inode.c       |   9 ++--
 fs/exec.c                 |   3 +-
 fs/inode.c                |  17 +++---
 fs/internal.h             |   2 +
 fs/mount.h                |   3 ++
 fs/namei.c                | 114 +++++++++++++++++++++++++++++++++-------
 fs/namespace.c            |   6 +++
 fs/nfsd/vfs.c             |  13 +++--
 fs/open.c                 |  44 ++++++++++++----
 fs/overlayfs/copy_up.c    |  40 ++++++++------
 fs/overlayfs/dir.c        |  10 +++-
 fs/overlayfs/inode.c      |   6 ++-
 fs/overlayfs/overlayfs.h  |   2 +-
 fs/overlayfs/super.c      |   3 +-
 fs/posix_acl.c            |   4 +-
 fs/proc_namespace.c       |   4 ++
 fs/stat.c                 |  32 +++++++++--
 fs/utimes.c               |   2 +-
 include/linux/cred.h      |  12 +++++
 include/linux/fs.h        |   7 ++-
 include/linux/mount.h     |   4 +-
 include/linux/sched.h     |   5 ++
 kernel/capability.c       |   9 +++-
 kernel/cred.c             |  20 +++++++
 29 files changed, 507 insertions(+), 118 deletions(-)

-- 
2.16.4


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3 1/3] fs: rethread notify_change to take a path instead of a dentry
  2020-02-17 20:53 [PATCH v3 0/3] introduce a uid/gid shifting bind mount James Bottomley
@ 2020-02-17 20:53 ` James Bottomley
  2020-02-17 20:53 ` [PATCH v3 2/3] fs: introduce uid/gid shifting bind mount James Bottomley
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: James Bottomley @ 2020-02-17 20:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: David Howells, Christian Brauner, Al Viro, Miklos Szeredi,
	Seth Forshee, linux-unionfs, Amir Goldstein,
	Stéphane Graber, Eric Biederman, Aleksa Sarai,
	Serge E . Hallyn, Tycho Andersen, containers

In order to prepare for implementing shiftfs as a property changing
bind mount, the path (which contains the vfsmount) must be threaded
through everywhere we are going to do either a permission check or an
attribute get/set so that we can arrange for the credentials for the
operation to be based on the bind mount properties rather than those
of current.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

---

v2: fix issues found by Amir Goldstein
v3: add reviews
---
 drivers/base/devtmpfs.c   |  8 ++++++--
 fs/attr.c                 |  4 +++-
 fs/cachefiles/interface.c |  6 ++++--
 fs/coredump.c             |  4 ++--
 fs/ecryptfs/inode.c       |  9 ++++++---
 fs/inode.c                |  7 ++++---
 fs/namei.c                |  2 +-
 fs/nfsd/vfs.c             | 13 ++++++++-----
 fs/open.c                 | 19 ++++++++++---------
 fs/overlayfs/copy_up.c    | 40 ++++++++++++++++++++++++----------------
 fs/overlayfs/dir.c        | 10 ++++++++--
 fs/overlayfs/inode.c      |  6 ++++--
 fs/overlayfs/overlayfs.h  |  2 +-
 fs/overlayfs/super.c      |  3 ++-
 fs/utimes.c               |  2 +-
 include/linux/fs.h        |  7 +++----
 16 files changed, 87 insertions(+), 55 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index c9017e0584c0..e323b55721a1 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -214,13 +214,17 @@ static int handle_create(const char *nodename, umode_t mode, kuid_t uid,
 	err = vfs_mknod(d_inode(path.dentry), dentry, mode, dev->devt);
 	if (!err) {
 		struct iattr newattrs;
+		struct path newpath = {
+			.mnt = path.mnt,
+			.dentry = dentry,
+		};
 
 		newattrs.ia_mode = mode;
 		newattrs.ia_uid = uid;
 		newattrs.ia_gid = gid;
 		newattrs.ia_valid = ATTR_MODE|ATTR_UID|ATTR_GID;
 		inode_lock(d_inode(dentry));
-		notify_change(dentry, &newattrs, NULL);
+		notify_change(&newpath, &newattrs, NULL);
 		inode_unlock(d_inode(dentry));
 
 		/* mark as kernel-created inode */
@@ -327,7 +331,7 @@ static int handle_remove(const char *nodename, struct device *dev)
 			newattrs.ia_valid =
 				ATTR_UID|ATTR_GID|ATTR_MODE;
 			inode_lock(d_inode(dentry));
-			notify_change(dentry, &newattrs, NULL);
+			notify_change(&p, &newattrs, NULL);
 			inode_unlock(d_inode(dentry));
 			err = vfs_unlink(d_inode(parent.dentry), dentry, NULL);
 			if (!err || err == -ENOENT)
diff --git a/fs/attr.c b/fs/attr.c
index b4bbdbd4c8ca..11201ab7e3b1 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -220,8 +220,10 @@ EXPORT_SYMBOL(setattr_copy);
  * the file open for write, as there can be no conflicting delegation in
  * that case.
  */
-int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **delegated_inode)
+int notify_change(const struct path *path, struct iattr * attr,
+		  struct inode **delegated_inode)
 {
+	struct dentry *dentry = path->dentry;
 	struct inode *inode = dentry->d_inode;
 	umode_t mode = inode->i_mode;
 	int error;
diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index 4cea5fbf695e..f11216f59a56 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -436,6 +436,7 @@ static int cachefiles_attr_changed(struct fscache_object *_object)
 	uint64_t ni_size;
 	loff_t oi_size;
 	int ret;
+	struct path path;
 
 	ni_size = _object->store_limit_l;
 
@@ -466,18 +467,19 @@ static int cachefiles_attr_changed(struct fscache_object *_object)
 	/* if there's an extension to a partial page at the end of the backing
 	 * file, we need to discard the partial page so that we pick up new
 	 * data after it */
+	path = (struct path){ .mnt = cache->mnt, .dentry = object->backer };
 	if (oi_size & ~PAGE_MASK && ni_size > oi_size) {
 		_debug("discard tail %llx", oi_size);
 		newattrs.ia_valid = ATTR_SIZE;
 		newattrs.ia_size = oi_size & PAGE_MASK;
-		ret = notify_change(object->backer, &newattrs, NULL);
+		ret = notify_change(&path, &newattrs, NULL);
 		if (ret < 0)
 			goto truncate_failed;
 	}
 
 	newattrs.ia_valid = ATTR_SIZE;
 	newattrs.ia_size = ni_size;
-	ret = notify_change(object->backer, &newattrs, NULL);
+	ret = notify_change(&path, &newattrs, NULL);
 
 truncate_failed:
 	inode_unlock(d_inode(object->backer));
diff --git a/fs/coredump.c b/fs/coredump.c
index f8296a82d01d..0334050e27e4 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -775,7 +775,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 			goto close_fail;
 		if (!(cprm.file->f_mode & FMODE_CAN_WRITE))
 			goto close_fail;
-		if (do_truncate(cprm.file->f_path.dentry, 0, 0, cprm.file))
+		if (do_truncate(&cprm.file->f_path, 0, 0, cprm.file))
 			goto close_fail;
 	}
 
@@ -879,7 +879,7 @@ void dump_truncate(struct coredump_params *cprm)
 	if (file->f_op->llseek && file->f_op->llseek != no_llseek) {
 		offset = file->f_op->llseek(file, 0, SEEK_CUR);
 		if (i_size_read(file->f_mapping->host) < offset)
-			do_truncate(file->f_path.dentry, offset, 0, file);
+			do_truncate(&file->f_path, offset, 0, file);
 	}
 }
 EXPORT_SYMBOL(dump_truncate);
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index e23752d9a79f..3bc67c478163 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -852,10 +852,11 @@ int ecryptfs_truncate(struct dentry *dentry, loff_t new_length)
 
 	rc = truncate_upper(dentry, &ia, &lower_ia);
 	if (!rc && lower_ia.ia_valid & ATTR_SIZE) {
-		struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
+		struct path *lower_path = ecryptfs_dentry_to_lower_path(dentry);
+		struct dentry *lower_dentry = lower_path->dentry;
 
 		inode_lock(d_inode(lower_dentry));
-		rc = notify_change(lower_dentry, &lower_ia, NULL);
+		rc = notify_change(lower_path, &lower_ia, NULL);
 		inode_unlock(d_inode(lower_dentry));
 	}
 	return rc;
@@ -883,6 +884,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
 {
 	int rc = 0;
 	struct dentry *lower_dentry;
+	struct path *lower_path;
 	struct iattr lower_ia;
 	struct inode *inode;
 	struct inode *lower_inode;
@@ -897,6 +899,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
 	inode = d_inode(dentry);
 	lower_inode = ecryptfs_inode_to_lower(inode);
 	lower_dentry = ecryptfs_dentry_to_lower(dentry);
+	lower_path = ecryptfs_dentry_to_lower_path(dentry);
 	mutex_lock(&crypt_stat->cs_mutex);
 	if (d_is_dir(dentry))
 		crypt_stat->flags &= ~(ECRYPTFS_ENCRYPTED);
@@ -959,7 +962,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
 		lower_ia.ia_valid &= ~ATTR_MODE;
 
 	inode_lock(d_inode(lower_dentry));
-	rc = notify_change(lower_dentry, &lower_ia, NULL);
+	rc = notify_change(lower_path, &lower_ia, NULL);
 	inode_unlock(d_inode(lower_dentry));
 out:
 	fsstack_copy_attr_all(inode, lower_inode);
diff --git a/fs/inode.c b/fs/inode.c
index 7d57068b6b7a..be14d3fcbee1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1821,7 +1821,7 @@ int dentry_needs_remove_privs(struct dentry *dentry)
 	return mask;
 }
 
-static int __remove_privs(struct dentry *dentry, int kill)
+static int __remove_privs(struct path *path, int kill)
 {
 	struct iattr newattrs;
 
@@ -1830,7 +1830,7 @@ static int __remove_privs(struct dentry *dentry, int kill)
 	 * Note we call this on write, so notify_change will not
 	 * encounter any conflicting delegations:
 	 */
-	return notify_change(dentry, &newattrs, NULL);
+	return notify_change(path, &newattrs, NULL);
 }
 
 /*
@@ -1839,6 +1839,7 @@ static int __remove_privs(struct dentry *dentry, int kill)
  */
 int file_remove_privs(struct file *file)
 {
+	struct path *path = &file->f_path;
 	struct dentry *dentry = file_dentry(file);
 	struct inode *inode = file_inode(file);
 	int kill;
@@ -1857,7 +1858,7 @@ int file_remove_privs(struct file *file)
 	if (kill < 0)
 		return kill;
 	if (kill)
-		error = __remove_privs(dentry, kill);
+		error = __remove_privs(path, kill);
 	if (!error)
 		inode_has_no_xattr(inode);
 
diff --git a/fs/namei.c b/fs/namei.c
index db6565c99825..531ac55c7e67 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3080,7 +3080,7 @@ static int handle_truncate(struct file *filp)
 	if (!error)
 		error = security_path_truncate(path);
 	if (!error) {
-		error = do_truncate(path->dentry, 0,
+		error = do_truncate(path, 0,
 				    ATTR_MTIME|ATTR_CTIME|ATTR_OPEN,
 				    filp);
 	}
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 0aa02eb18bd3..a51a69e0cc87 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -366,8 +366,8 @@ __be32
 nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
 	     int check_guard, time64_t guardtime)
 {
-	struct dentry	*dentry;
 	struct inode	*inode;
+	struct path	path;
 	int		accmode = NFSD_MAY_SATTR;
 	umode_t		ftype = 0;
 	__be32		err;
@@ -406,8 +406,11 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
 			goto out;
 	}
 
-	dentry = fhp->fh_dentry;
-	inode = d_inode(dentry);
+	path = (struct path) {
+		.mnt = fhp->fh_export->ex_path.mnt,
+		.dentry = fhp->fh_dentry,
+	};
+	inode = d_inode(path.dentry);
 
 	/* Ignore any mode updates on symlinks */
 	if (S_ISLNK(inode->i_mode))
@@ -448,7 +451,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
 			.ia_size	= iap->ia_size,
 		};
 
-		host_err = notify_change(dentry, &size_attr, NULL);
+		host_err = notify_change(&path, &size_attr, NULL);
 		if (host_err)
 			goto out_unlock;
 		iap->ia_valid &= ~ATTR_SIZE;
@@ -463,7 +466,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
 	}
 
 	iap->ia_valid |= ATTR_CTIME;
-	host_err = notify_change(dentry, iap, NULL);
+	host_err = notify_change(&path, iap, NULL);
 
 out_unlock:
 	fh_unlock(fhp);
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..db6758b9636a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -35,11 +35,12 @@
 
 #include "internal.h"
 
-int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
+int do_truncate(const struct path *path, loff_t length, unsigned int time_attrs,
 	struct file *filp)
 {
 	int ret;
 	struct iattr newattrs;
+	struct dentry *dentry = path->dentry;
 
 	/* Not pretty: "inode->i_size" shouldn't really be signed. But it is. */
 	if (length < 0)
@@ -61,7 +62,7 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 
 	inode_lock(dentry->d_inode);
 	/* Note any delegations or leases have already been broken: */
-	ret = notify_change(dentry, &newattrs, NULL);
+	ret = notify_change(path, &newattrs, NULL);
 	inode_unlock(dentry->d_inode);
 	return ret;
 }
@@ -107,7 +108,7 @@ long vfs_truncate(const struct path *path, loff_t length)
 	if (!error)
 		error = security_path_truncate(path);
 	if (!error)
-		error = do_truncate(path->dentry, length, 0, NULL);
+		error = do_truncate(path, length, 0, NULL);
 
 put_write_and_out:
 	put_write_access(inode);
@@ -155,7 +156,7 @@ COMPAT_SYSCALL_DEFINE2(truncate, const char __user *, path, compat_off_t, length
 long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
 {
 	struct inode *inode;
-	struct dentry *dentry;
+	struct path *path;
 	struct fd f;
 	int error;
 
@@ -171,8 +172,8 @@ long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
 	if (f.file->f_flags & O_LARGEFILE)
 		small = 0;
 
-	dentry = f.file->f_path.dentry;
-	inode = dentry->d_inode;
+	path = &f.file->f_path;
+	inode = path->dentry->d_inode;
 	error = -EINVAL;
 	if (!S_ISREG(inode->i_mode) || !(f.file->f_mode & FMODE_WRITE))
 		goto out_putf;
@@ -192,7 +193,7 @@ long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
 	if (!error)
 		error = security_path_truncate(&f.file->f_path);
 	if (!error)
-		error = do_truncate(dentry, length, ATTR_MTIME|ATTR_CTIME, f.file);
+		error = do_truncate(path, length, ATTR_MTIME|ATTR_CTIME, f.file);
 	sb_end_write(inode->i_sb);
 out_putf:
 	fdput(f);
@@ -558,7 +559,7 @@ static int chmod_common(const struct path *path, umode_t mode)
 		goto out_unlock;
 	newattrs.ia_mode = (mode & S_IALLUGO) | (inode->i_mode & ~S_IALLUGO);
 	newattrs.ia_valid = ATTR_MODE | ATTR_CTIME;
-	error = notify_change(path->dentry, &newattrs, &delegated_inode);
+	error = notify_change(path, &newattrs, &delegated_inode);
 out_unlock:
 	inode_unlock(inode);
 	if (delegated_inode) {
@@ -649,7 +650,7 @@ static int chown_common(const struct path *path, uid_t user, gid_t group)
 	inode_lock(inode);
 	error = security_path_chown(path, uid, gid);
 	if (!error)
-		error = notify_change(path->dentry, &newattrs, &delegated_inode);
+		error = notify_change(path, &newattrs, &delegated_inode);
 	inode_unlock(inode);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 9fc47c2e078d..05cfbd579690 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -214,17 +214,17 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
 	return error;
 }
 
-static int ovl_set_size(struct dentry *upperdentry, struct kstat *stat)
+static int ovl_set_size(struct path *upperpath, struct kstat *stat)
 {
 	struct iattr attr = {
 		.ia_valid = ATTR_SIZE,
 		.ia_size = stat->size,
 	};
 
-	return notify_change(upperdentry, &attr, NULL);
+	return notify_change(upperpath, &attr, NULL);
 }
 
-static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
+static int ovl_set_timestamps(struct path *upperpath, struct kstat *stat)
 {
 	struct iattr attr = {
 		.ia_valid =
@@ -233,10 +233,10 @@ static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
 		.ia_mtime = stat->mtime,
 	};
 
-	return notify_change(upperdentry, &attr, NULL);
+	return notify_change(upperpath, &attr, NULL);
 }
 
-int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
+int ovl_set_attr(struct path *upperpath, struct kstat *stat)
 {
 	int err = 0;
 
@@ -245,7 +245,7 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
 			.ia_valid = ATTR_MODE,
 			.ia_mode = stat->mode,
 		};
-		err = notify_change(upperdentry, &attr, NULL);
+		err = notify_change(upperpath, &attr, NULL);
 	}
 	if (!err) {
 		struct iattr attr = {
@@ -253,10 +253,10 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
 			.ia_uid = stat->uid,
 			.ia_gid = stat->gid,
 		};
-		err = notify_change(upperdentry, &attr, NULL);
+		err = notify_change(upperpath, &attr, NULL);
 	}
 	if (!err)
-		ovl_set_timestamps(upperdentry, stat);
+		ovl_set_timestamps(upperpath, stat);
 
 	return err;
 }
@@ -435,8 +435,13 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
 {
 	int err;
 	struct dentry *upper;
-	struct dentry *upperdir = ovl_dentry_upper(c->parent);
-	struct inode *udir = d_inode(upperdir);
+	struct dentry *upperdir;
+	struct path upperdirpath;
+	struct inode *udir;
+
+	ovl_path_upper(c->parent, &upperdirpath);
+	upperdir = upperdirpath.dentry;
+	udir = d_inode(upperdir);
 
 	/* Mark parent "impure" because it may now contain non-pure upper */
 	err = ovl_set_impure(c->parent, upperdir);
@@ -457,7 +462,7 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
 
 		if (!err) {
 			/* Restore timestamps on parent (best effort) */
-			ovl_set_timestamps(upperdir, &c->pstat);
+			ovl_set_timestamps(&upperdirpath, &c->pstat);
 			ovl_dentry_set_upper_alias(c->dentry);
 		}
 	}
@@ -473,15 +478,16 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
 static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
 {
 	int err;
+	struct path upperpath, temppath;
 
+	ovl_path_upper(c->dentry, &upperpath);
 	/*
 	 * Copy up data first and then xattrs. Writing data after
 	 * xattrs will remove security.capability xattr automatically.
 	 */
 	if (S_ISREG(c->stat.mode) && !c->metacopy) {
-		struct path upperpath, datapath;
+		struct path datapath;
 
-		ovl_path_upper(c->dentry, &upperpath);
 		if (WARN_ON(upperpath.dentry != NULL))
 			return -EIO;
 		upperpath.dentry = temp;
@@ -515,12 +521,13 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
 		if (err)
 			return err;
 	}
+	temppath = (struct path){ .mnt = upperpath.mnt, .dentry = temp };
 
 	inode_lock(temp->d_inode);
 	if (S_ISREG(c->stat.mode))
-		err = ovl_set_size(temp, &c->stat);
+		err = ovl_set_size(&temppath, &c->stat);
 	if (!err)
-		err = ovl_set_attr(temp, &c->stat);
+		err = ovl_set_attr(&temppath, &c->stat);
 	inode_unlock(temp->d_inode);
 
 	return err;
@@ -736,10 +743,11 @@ static int ovl_do_copy_up(struct ovl_copy_up_ctx *c)
 		err = ovl_set_nlink_upper(c->dentry);
 	} else {
 		struct inode *udir = d_inode(c->destdir);
+		struct path destpath = { .mnt = ofs->upper_mnt, .dentry = c->destdir };
 
 		/* Restore timestamps on parent (best effort) */
 		inode_lock(udir);
-		ovl_set_timestamps(c->destdir, &c->pstat);
+		ovl_set_timestamps(&destpath, &c->pstat);
 		inode_unlock(udir);
 
 		ovl_dentry_set_upper_alias(c->dentry);
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 8e57d5372b8f..cecd7e244829 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -374,7 +374,8 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry,
 		goto out_cleanup;
 
 	inode_lock(opaquedir->d_inode);
-	err = ovl_set_attr(opaquedir, &stat);
+	err = ovl_set_attr(&(struct path) { .mnt = upperpath.mnt,
+					    .dentry = opaquedir }, &stat);
 	inode_unlock(opaquedir->d_inode);
 	if (err)
 		goto out_cleanup;
@@ -435,10 +436,13 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
 	struct inode *udir = upperdir->d_inode;
 	struct dentry *upper;
 	struct dentry *newdentry;
+	struct path newpath;
 	int err;
 	struct posix_acl *acl, *default_acl;
 	bool hardlink = !!cattr->hardlink;
 
+	ovl_path_upper(dentry, &newpath);
+
 	if (WARN_ON(!workdir))
 		return -EROFS;
 
@@ -478,8 +482,10 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
 			.ia_valid = ATTR_MODE,
 			.ia_mode = cattr->mode,
 		};
+
+		newpath.dentry = newdentry;
 		inode_lock(newdentry->d_inode);
-		err = notify_change(newdentry, &attr, NULL);
+		err = notify_change(&newpath, &attr, NULL);
 		inode_unlock(newdentry->d_inode);
 		if (err)
 			goto out_cleanup;
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 79e8994e3bc1..f3d86b357a23 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -45,8 +45,10 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 		err = ovl_copy_up_with_data(dentry);
 	if (!err) {
 		struct inode *winode = NULL;
+		struct path upperpath;
 
-		upperdentry = ovl_dentry_upper(dentry);
+		ovl_path_upper(dentry, &upperpath);
+		upperdentry = upperpath.dentry;
 
 		if (attr->ia_valid & ATTR_SIZE) {
 			winode = d_inode(upperdentry);
@@ -60,7 +62,7 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 
 		inode_lock(upperdentry->d_inode);
 		old_cred = ovl_override_creds(dentry->d_sb);
-		err = notify_change(upperdentry, attr, NULL);
+		err = notify_change(&upperpath, attr, NULL);
 		revert_creds(old_cred);
 		if (!err)
 			ovl_copyattr(upperdentry->d_inode, dentry->d_inode);
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 3623d28aa4fa..ecd20f3d979b 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -459,7 +459,7 @@ int ovl_copy_up_with_data(struct dentry *dentry);
 int ovl_copy_up_flags(struct dentry *dentry, int flags);
 int ovl_maybe_copy_up(struct dentry *dentry, int flags);
 int ovl_copy_xattr(struct dentry *old, struct dentry *new);
-int ovl_set_attr(struct dentry *upper, struct kstat *stat);
+int ovl_set_attr(struct path *upperpath, struct kstat *stat);
 struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool is_upper);
 int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
 		   struct dentry *upper);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 319fe0d355b0..1cd87964b34f 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -633,6 +633,7 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
 			.ia_valid = ATTR_MODE,
 			.ia_mode = S_IFDIR | 0,
 		};
+		const struct path workpath = { .mnt = mnt, .dentry = work };
 
 		if (work->d_inode) {
 			err = -EEXIST;
@@ -676,7 +677,7 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
 
 		/* Clear any inherited mode bits */
 		inode_lock(work->d_inode);
-		err = notify_change(work, &attr, NULL);
+		err = notify_change(&workpath, &attr, NULL);
 		inode_unlock(work->d_inode);
 		if (err)
 			goto out_dput;
diff --git a/fs/utimes.c b/fs/utimes.c
index 1d17ce98cb80..8b5635d30de4 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -57,7 +57,7 @@ static int utimes_common(const struct path *path, struct timespec64 *times)
 	}
 retry_deleg:
 	inode_lock(inode);
-	error = notify_change(path->dentry, &newattrs, &delegated_inode);
+	error = notify_change(path, &newattrs, &delegated_inode);
 	inode_unlock(inode);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4dc62a697817..d0e7ca7bf72e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2537,8 +2537,8 @@ struct filename {
 static_assert(offsetof(struct filename, iname) % sizeof(long) == 0);
 
 extern long vfs_truncate(const struct path *, loff_t);
-extern int do_truncate(struct dentry *, loff_t start, unsigned int time_attrs,
-		       struct file *filp);
+extern int do_truncate(const struct path *p, loff_t start,
+		       unsigned int time_attrs, struct file *filp);
 extern int vfs_fallocate(struct file *file, int mode, loff_t offset,
 			loff_t len);
 extern long do_sys_open(int dfd, const char __user *filename, int flags,
@@ -2887,8 +2887,7 @@ static inline int bmap(struct inode *inode,  sector_t *block)
 	return -EINVAL;
 }
 #endif
-
-extern int notify_change(struct dentry *, struct iattr *, struct inode **);
+extern int notify_change(const struct path *, struct iattr *, struct inode **);
 extern int inode_permission(struct inode *, int);
 extern int generic_permission(struct inode *, int);
 extern int __check_sticky(struct inode *dir, struct inode *inode);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 2/3] fs: introduce uid/gid shifting bind mount
  2020-02-17 20:53 [PATCH v3 0/3] introduce a uid/gid shifting bind mount James Bottomley
  2020-02-17 20:53 ` [PATCH v3 1/3] fs: rethread notify_change to take a path instead of a dentry James Bottomley
@ 2020-02-17 20:53 ` James Bottomley
  2020-02-18  7:38   ` Amir Goldstein
  2020-02-18 22:33   ` Christoph Hellwig
  2020-02-17 20:53 ` [PATCH v3 3/3] fs: expose shifting bind mount to userspace James Bottomley
  2020-02-18  7:18 ` [PATCH v3 0/3] introduce a uid/gid shifting bind mount Amir Goldstein
  3 siblings, 2 replies; 16+ messages in thread
From: James Bottomley @ 2020-02-17 20:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: David Howells, Christian Brauner, Al Viro, Miklos Szeredi,
	Seth Forshee, linux-unionfs, Amir Goldstein,
	Stéphane Graber, Eric Biederman, Aleksa Sarai,
	Serge E . Hallyn, Tycho Andersen, containers

This implementation reverse shifts according to the user_ns belonging
to the mnt_ns.  So if the vfsmount has the newly introduced flag
MNT_SHIFT and the current user_ns is the same as the mount_ns->user_ns
then we shift back using the user_ns and an optional mnt_userns (which
belongs to the struct mount) before committing to the underlying
filesystem.

For example, if a user_ns is created where interior (fake root, uid 0)
is mapped to kernel uid 100000 then writes from interior root normally
go to the filesystem at the kernel uid.  However, if MNT_SHIFT is set,
they will be shifted back to write at uid 0, meaning we can bind mount
real image filesystems to user_ns protected faker root.

In essence there are several things which have to be done for this to
occur safely.  Firstly for all operations on the filesystem, new
credentials have to be installed where fsuid and fsgid are set to the
*interior* values. Next all inodes used from the filesystem have to
have i_uid and i_gid shifted back to the kernel values and attributes
set from user space have to have ia_uid and ia_gid shifted from the
kernel values to the interior values.  The capability checks have to
be done using ns_capable against the kernel values, but the inode
capability checks have to be done against the shifted ids.

Since creating a new credential is a reasonably expensive proposition
and we have to shift and unshift many times during path walking, a
cached copy of the shifted credential is saved to a newly created
place in the task structure.  This serves the dual purpose of allowing
us to use a pre-prepared copy of the shifted credentials and also
allows us to recognise whenever the shift is actually in effect (the
cached shifted credential pointer being equal to the current_cred()
pointer).

To get this all to work, we have a check for the vfsmount flag and the
user_ns gating a shifting of the credentials over all user space
entries to filesystem functions.  In theory the path has to be present
everywhere we do this, so we can check the vfsmount flags.  However,
for lower level functions we can cheat this path check of vfsmount
simply to check whether a shifted credential is in effect or not to
gate things like the inode permission check, which means the path
doesn't have to be threaded all the way through the permission
checking functions.  if the credential is shifted check passes, we can
also be sure that the current user_ns is the same as the mnt->user_ns,
so we can use it and thus have no need of the struct mount at the
point of the shift.

Although the shift can be effected simply by executing
do_reconfigure_mnt with MNT_SHIFT in the flags, this patch only
contains the shifting mechanisms.  The follow on patch wires up the
user visible API for turning the flag on.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

---

v3: added a bind mount base shift at the request of Serge Hallyn
---
 fs/attr.c             | 127 +++++++++++++++++++++++++++++++++++++++++---------
 fs/exec.c             |   3 +-
 fs/inode.c            |  10 ++--
 fs/internal.h         |   2 +
 fs/mount.h            |   1 +
 fs/namei.c            | 112 +++++++++++++++++++++++++++++++++++++-------
 fs/namespace.c        |   5 ++
 fs/open.c             |  25 +++++++++-
 fs/posix_acl.c        |   4 +-
 fs/stat.c             |  32 +++++++++++--
 include/linux/cred.h  |  12 +++++
 include/linux/mount.h |   4 +-
 include/linux/sched.h |   5 ++
 kernel/capability.c   |   9 +++-
 kernel/cred.c         |  20 ++++++++
 15 files changed, 317 insertions(+), 54 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 11201ab7e3b1..d7c5883a4b4c 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -18,14 +18,26 @@
 #include <linux/evm.h>
 #include <linux/ima.h>
 
+#include "internal.h"
+#include "mount.h"
+
 static bool chown_ok(const struct inode *inode, kuid_t uid)
 {
+	kuid_t i_uid = inode->i_uid;
+
+	if (cred_is_shifted()) {
+		struct mount *m = real_mount(current->mnt);
+
+		i_uid = KUIDT_INIT(from_kuid(m->mnt_userns, i_uid));
+		i_uid = make_kuid(current_user_ns(), __kuid_val(i_uid));
+	}
+
 	if (uid_eq(current_fsuid(), inode->i_uid) &&
-	    uid_eq(uid, inode->i_uid))
+	    uid_eq(uid, i_uid))
 		return true;
 	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
 		return true;
-	if (uid_eq(inode->i_uid, INVALID_UID) &&
+	if (uid_eq(i_uid, INVALID_UID) &&
 	    ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
 		return true;
 	return false;
@@ -33,17 +45,40 @@ static bool chown_ok(const struct inode *inode, kuid_t uid)
 
 static bool chgrp_ok(const struct inode *inode, kgid_t gid)
 {
+	kgid_t i_gid = inode->i_gid;
+	kuid_t i_uid = inode->i_uid;
+
+	if (cred_is_shifted()) {
+		struct mount *m = real_mount(current->mnt);
+		struct user_namespace *ns = current_user_ns();
+
+		i_uid = KUIDT_INIT(from_kuid(m->mnt_userns, i_uid));
+		i_uid = make_kuid(ns, __kuid_val(i_uid));
+		i_gid = KGIDT_INIT(from_kgid(m->mnt_userns, i_gid));
+		i_gid = make_kgid(ns, __kgid_val(i_gid));
+	}
 	if (uid_eq(current_fsuid(), inode->i_uid) &&
-	    (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
+	    (in_group_p(gid) || gid_eq(gid, i_gid)))
 		return true;
 	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
 		return true;
-	if (gid_eq(inode->i_gid, INVALID_GID) &&
+	if (gid_eq(i_gid, INVALID_GID) &&
 	    ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
 		return true;
 	return false;
 }
 
+int in_group_p_shifted(kgid_t grp)
+{
+	if (cred_is_shifted()) {
+		struct mount *m = real_mount(current->mnt);
+
+		grp = KGIDT_INIT(from_kgid(m->mnt_userns, grp));
+		grp = make_kgid(current_user_ns(), __kgid_val(grp));
+	}
+	return in_group_p(grp);
+}
+
 /**
  * setattr_prepare - check if attribute changes to a dentry are allowed
  * @dentry:	dentry to check
@@ -89,9 +124,10 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
 	if (ia_valid & ATTR_MODE) {
 		if (!inode_owner_or_capable(inode))
 			return -EPERM;
+
 		/* Also check the setgid bit! */
-		if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
-				inode->i_gid) &&
+		if (!in_group_p_shifted((ia_valid & ATTR_GID) ? attr->ia_gid :
+					inode->i_gid) &&
 		    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
 			attr->ia_mode &= ~S_ISGID;
 	}
@@ -192,7 +228,7 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
 	if (ia_valid & ATTR_MODE) {
 		umode_t mode = attr->ia_mode;
 
-		if (!in_group_p(inode->i_gid) &&
+		if (!in_group_p_shifted(inode->i_gid) &&
 		    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
 			mode &= ~S_ISGID;
 		inode->i_mode = mode;
@@ -200,6 +236,23 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
 }
 EXPORT_SYMBOL(setattr_copy);
 
+void cred_shift(kuid_t *uid, kgid_t *gid)
+{
+	if (cred_is_shifted()) {
+		struct user_namespace *ns = current_user_ns();
+		struct mount *m = real_mount(current->mnt);
+
+		if (uid) {
+			*uid = KUIDT_INIT(from_kuid(m->mnt_userns, *uid));
+			*uid = make_kuid(ns, __kuid_val(*uid));
+		}
+		if (gid) {
+			*gid = KGIDT_INIT(from_kgid(m->mnt_userns, *gid));
+			*gid = make_kgid(ns, __kgid_val(*gid));
+		}
+	}
+}
+
 /**
  * notify_change - modify attributes of a filesytem object
  * @dentry:	object affected
@@ -229,6 +282,9 @@ int notify_change(const struct path *path, struct iattr * attr,
 	int error;
 	struct timespec64 now;
 	unsigned int ia_valid = attr->ia_valid;
+	const struct cred *cred;
+	kuid_t i_uid = inode->i_uid;
+	kgid_t i_gid = inode->i_gid;
 
 	WARN_ON_ONCE(!inode_is_locked(inode));
 
@@ -237,18 +293,30 @@ int notify_change(const struct path *path, struct iattr * attr,
 			return -EPERM;
 	}
 
+	cred = change_userns_creds(path);
+	if (cred) {
+		struct mount *m = real_mount(path->mnt);
+
+		attr->ia_uid = KUIDT_INIT(from_kuid(m->mnt_ns->user_ns, attr->ia_uid));
+		attr->ia_uid = make_kuid(m->mnt_userns, __kuid_val(attr->ia_uid));
+		attr->ia_gid = KGIDT_INIT(from_kgid(m->mnt_ns->user_ns, attr->ia_gid));
+		attr->ia_gid = make_kgid(m->mnt_userns, __kgid_val(attr->ia_gid));
+	}
+
 	/*
 	 * If utimes(2) and friends are called with times == NULL (or both
 	 * times are UTIME_NOW), then we need to check for write permission
 	 */
 	if (ia_valid & ATTR_TOUCH) {
-		if (IS_IMMUTABLE(inode))
-			return -EPERM;
+		if (IS_IMMUTABLE(inode)) {
+			error = -EPERM;
+			goto err;
+		}
 
 		if (!inode_owner_or_capable(inode)) {
 			error = inode_permission(inode, MAY_WRITE);
 			if (error)
-				return error;
+				goto err;
 		}
 	}
 
@@ -274,7 +342,7 @@ int notify_change(const struct path *path, struct iattr * attr,
 	if (ia_valid & ATTR_KILL_PRIV) {
 		error = security_inode_need_killpriv(dentry);
 		if (error < 0)
-			return error;
+			goto err;
 		if (error == 0)
 			ia_valid = attr->ia_valid &= ~ATTR_KILL_PRIV;
 	}
@@ -305,34 +373,49 @@ int notify_change(const struct path *path, struct iattr * attr,
 			attr->ia_mode &= ~S_ISGID;
 		}
 	}
-	if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID)))
-		return 0;
+	if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID))) {
+		error = 0;
+		goto err;
+	}
 
 	/*
 	 * Verify that uid/gid changes are valid in the target
 	 * namespace of the superblock.
 	 */
+	error = -EOVERFLOW;
 	if (ia_valid & ATTR_UID &&
 	    !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
-		return -EOVERFLOW;
+		goto err;
+
 	if (ia_valid & ATTR_GID &&
 	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
-		return -EOVERFLOW;
+		goto err;
 
 	/* Don't allow modifications of files with invalid uids or
 	 * gids unless those uids & gids are being made valid.
 	 */
-	if (!(ia_valid & ATTR_UID) && !uid_valid(inode->i_uid))
-		return -EOVERFLOW;
-	if (!(ia_valid & ATTR_GID) && !gid_valid(inode->i_gid))
-		return -EOVERFLOW;
+	if (cred_is_shifted()) {
+		struct user_namespace *ns = current_user_ns();
+		struct mount *m = real_mount(current->mnt);
+
+		i_uid = KUIDT_INIT(from_kuid(m->mnt_userns, i_uid));
+		i_uid = make_kuid(ns, __kuid_val(i_uid));
+		i_gid = KGIDT_INIT(from_kgid(m->mnt_userns, i_gid));
+		i_gid = make_kgid(ns, __kgid_val(i_gid));
+	}
+
+	if (!(ia_valid & ATTR_UID) && !uid_valid(i_uid))
+		goto err;
+
+	if (!(ia_valid & ATTR_GID) && !gid_valid(i_gid))
+		goto err;
 
 	error = security_inode_setattr(dentry, attr);
 	if (error)
-		return error;
+		goto err;
 	error = try_break_deleg(inode, delegated_inode);
 	if (error)
-		return error;
+		goto err;
 
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
@@ -345,6 +428,8 @@ int notify_change(const struct path *path, struct iattr * attr,
 		evm_inode_post_setattr(dentry, ia_valid);
 	}
 
+ err:
+	revert_userns_creds(cred);
 	return error;
 }
 EXPORT_SYMBOL(notify_change);
diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..926bab39ed45 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1543,13 +1543,14 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
 
 	/* Be careful if suid/sgid is set */
 	inode_lock(inode);
-
 	/* reload atomically mode/uid/gid now that lock held */
 	mode = inode->i_mode;
 	uid = inode->i_uid;
 	gid = inode->i_gid;
 	inode_unlock(inode);
 
+	cred_shift(&uid, &gid);
+
 	/* We ignore suid/sgid if there are no mappings for them in the ns */
 	if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
 		 !kgid_has_mapping(bprm->cred->user_ns, gid))
diff --git a/fs/inode.c b/fs/inode.c
index be14d3fcbee1..ae75b6396786 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2064,7 +2064,7 @@ void inode_init_owner(struct inode *inode, const struct inode *dir,
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
 		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
-			 !in_group_p(inode->i_gid) &&
+			 !in_group_p_shifted(inode->i_gid) &&
 			 !capable_wrt_inode_uidgid(dir, CAP_FSETID))
 			mode &= ~S_ISGID;
 	} else
@@ -2083,12 +2083,16 @@ EXPORT_SYMBOL(inode_init_owner);
 bool inode_owner_or_capable(const struct inode *inode)
 {
 	struct user_namespace *ns;
+	kuid_t uid = inode->i_uid;
 
-	if (uid_eq(current_fsuid(), inode->i_uid))
+	if (uid_eq(current_fsuid(), uid))
 		return true;
 
 	ns = current_user_ns();
-	if (kuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER))
+
+	cred_shift(&uid, NULL);
+
+	if (kuid_has_mapping(ns, uid) && ns_capable(ns, CAP_FOWNER))
 		return true;
 	return false;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 80d89ddb9b28..d2adcdb3eb2e 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -73,6 +73,8 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 		  const char __user *newname);
 int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 	      const char __user *newname, int flags);
+const struct cred *change_userns_creds(const struct path *p);
+void revert_userns_creds(const struct cred *cred);
 
 /*
  * namespace.c
diff --git a/fs/mount.h b/fs/mount.h
index 711a4093e475..c3bfc6ced4c7 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -72,6 +72,7 @@ struct mount {
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	struct hlist_head mnt_pins;
 	struct hlist_head mnt_stuck_children;
+	struct user_namespace *mnt_userns; /* mapping for underlying mount uid/gid */
 } __randomize_layout;
 
 #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
diff --git a/fs/namei.c b/fs/namei.c
index 531ac55c7e67..369bd18c7330 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -124,6 +124,42 @@
 
 #define EMBEDDED_NAME_MAX	(PATH_MAX - offsetof(struct filename, iname))
 
+const struct cred *change_userns_creds(const struct path *p)
+{
+	struct mount *m = real_mount(p->mnt);
+
+	if ((p->mnt->mnt_flags & MNT_SHIFT) == 0)
+		return NULL;
+
+	if (current->nsproxy->mnt_ns->user_ns != m->mnt_ns->user_ns)
+		return NULL;
+
+	if (current->mnt != p->mnt) {
+		struct cred *cred;
+		struct user_namespace *user_ns = m->mnt_ns->user_ns;
+		kuid_t fsuid = current->cred->fsuid;
+		kgid_t fsgid = current->cred->fsgid;
+
+		if (current->mnt_cred)
+			put_cred(current->mnt_cred);
+		cred = prepare_creds();
+		fsuid = KUIDT_INIT(from_kuid(user_ns, fsuid));
+		fsgid = KGIDT_INIT(from_kgid(user_ns, fsgid));
+		cred->fsuid = make_kuid(m->mnt_userns, __kuid_val(fsuid));
+		cred->fsgid = make_kgid(m->mnt_userns, __kgid_val(fsgid));
+		current->mnt = p->mnt; /* no reference needed */
+		current->mnt_cred = cred;
+	}
+	return override_creds(current->mnt_cred);
+}
+
+void revert_userns_creds(const struct cred *cred)
+{
+	if (!cred)
+		return;
+	revert_creds(cred);
+}
+
 struct filename *
 getname_flags(const char __user *filename, int flags, int *empty)
 {
@@ -303,7 +339,7 @@ static int acl_permission_check(struct inode *inode, int mask)
 				return error;
 		}
 
-		if (in_group_p(inode->i_gid))
+		if (in_group_p_shifted(inode->i_gid))
 			mode >>= 3;
 	}
 
@@ -366,7 +402,6 @@ int generic_permission(struct inode *inode, int mask)
 	if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO))
 		if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
 			return 0;
-
 	return -EACCES;
 }
 EXPORT_SYMBOL(generic_permission);
@@ -1897,6 +1932,7 @@ static int walk_component(struct nameidata *nd, int flags)
 	struct inode *inode;
 	unsigned seq;
 	int err;
+	const struct cred *cred;
 	/*
 	 * "." and ".." are special - ".." especially so because it has
 	 * to be able to know about the current root directory and
@@ -1908,25 +1944,31 @@ static int walk_component(struct nameidata *nd, int flags)
 			put_link(nd);
 		return err;
 	}
+	cred = change_userns_creds(&nd->path);
 	err = lookup_fast(nd, &path, &inode, &seq);
 	if (unlikely(err <= 0)) {
 		if (err < 0)
-			return err;
+			goto out;
 		path.dentry = lookup_slow(&nd->last, nd->path.dentry,
 					  nd->flags);
-		if (IS_ERR(path.dentry))
-			return PTR_ERR(path.dentry);
+		if (IS_ERR(path.dentry)) {
+			err = PTR_ERR(path.dentry);
+			goto out;
+		}
 
 		path.mnt = nd->path.mnt;
 		err = follow_managed(&path, nd);
 		if (unlikely(err < 0))
-			return err;
+			goto out;
 
 		seq = 0;	/* we are already out of RCU mode */
 		inode = d_backing_inode(path.dentry);
 	}
 
-	return step_into(nd, &path, flags, inode, seq);
+	err = step_into(nd, &path, flags, inode, seq);
+ out:
+	revert_userns_creds(cred);
+	return err;
 }
 
 /*
@@ -2180,8 +2222,10 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 	for(;;) {
 		u64 hash_len;
 		int type;
+		const struct cred *cred = change_userns_creds(&nd->path);
 
 		err = may_lookup(nd);
+		revert_userns_creds(cred);
 		if (err)
 			return err;
 
@@ -2373,12 +2417,17 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 static const char *trailing_symlink(struct nameidata *nd)
 {
 	const char *s;
+	const struct cred *cred = change_userns_creds(&nd->path);
 	int error = may_follow_link(nd);
-	if (unlikely(error))
-		return ERR_PTR(error);
+	if (unlikely(error)) {
+		s = ERR_PTR(error);
+		goto out;
+	}
 	nd->flags |= LOOKUP_PARENT;
 	nd->stack[0].name = NULL;
 	s = get_link(nd);
+ out:
+	revert_userns_creds(cred);
 	return s ? s : "";
 }
 
@@ -3343,6 +3392,7 @@ static int do_last(struct nameidata *nd,
 	struct inode *inode;
 	struct path path;
 	int error;
+	const struct cred *cred = change_userns_creds(&nd->path);
 
 	nd->flags &= ~LOOKUP_PARENT;
 	nd->flags |= op->intent;
@@ -3350,7 +3400,7 @@ static int do_last(struct nameidata *nd,
 	if (nd->last_type != LAST_NORM) {
 		error = handle_dots(nd, nd->last_type);
 		if (unlikely(error))
-			return error;
+			goto err;
 		goto finish_open;
 	}
 
@@ -3363,7 +3413,7 @@ static int do_last(struct nameidata *nd,
 			goto finish_lookup;
 
 		if (error < 0)
-			return error;
+			goto err;
 
 		BUG_ON(nd->inode != dir->d_inode);
 		BUG_ON(nd->flags & LOOKUP_RCU);
@@ -3376,12 +3426,14 @@ static int do_last(struct nameidata *nd,
 		 */
 		error = complete_walk(nd);
 		if (error)
-			return error;
+			goto err;
 
 		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
 		/* trailing slashes? */
-		if (unlikely(nd->last.name[nd->last.len]))
-			return -EISDIR;
+		if (unlikely(nd->last.name[nd->last.len])) {
+			error = -EISDIR;
+			goto err;
+		}
 	}
 
 	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
@@ -3437,7 +3489,7 @@ static int do_last(struct nameidata *nd,
 
 	error = follow_managed(&path, nd);
 	if (unlikely(error < 0))
-		return error;
+		goto err;
 
 	/*
 	 * create/update audit record if it already exists.
@@ -3446,7 +3498,8 @@ static int do_last(struct nameidata *nd,
 
 	if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
 		path_to_nameidata(&path, nd);
-		return -EEXIST;
+		error = -EEXIST;
+		goto err;
 	}
 
 	seq = 0;	/* out of RCU mode, so the value doesn't matter */
@@ -3454,12 +3507,12 @@ static int do_last(struct nameidata *nd,
 finish_lookup:
 	error = step_into(nd, &path, 0, inode, seq);
 	if (unlikely(error))
-		return error;
+		goto err;
 finish_open:
 	/* Why this, you ask?  _Now_ we might have grown LOOKUP_JUMPED... */
 	error = complete_walk(nd);
 	if (error)
-		return error;
+		goto err;
 	audit_inode(nd->name, nd->path.dentry, 0);
 	if (open_flag & O_CREAT) {
 		error = -EISDIR;
@@ -3501,6 +3554,8 @@ static int do_last(struct nameidata *nd,
 	}
 	if (got_write)
 		mnt_drop_write(nd->path.mnt);
+ err:
+	revert_userns_creds(cred);
 	return error;
 }
 
@@ -3819,6 +3874,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
 	struct path path;
 	int error;
 	unsigned int lookup_flags = 0;
+	const struct cred *cred;
 
 	error = may_mknod(mode);
 	if (error)
@@ -3828,6 +3884,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
+	cred = change_userns_creds(&path);
 	if (!IS_POSIXACL(path.dentry->d_inode))
 		mode &= ~current_umask();
 	error = security_path_mknod(&path, dentry, mode, dev);
@@ -3849,6 +3906,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
 	}
 out:
 	done_path_create(&path, dentry);
+	revert_userns_creds(cred);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
@@ -3899,18 +3957,21 @@ long do_mkdirat(int dfd, const char __user *pathname, umode_t mode)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_DIRECTORY;
+	const struct cred *cred;
 
 retry:
 	dentry = user_path_create(dfd, pathname, &path, lookup_flags);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
+	cred = change_userns_creds(&path);
 	if (!IS_POSIXACL(path.dentry->d_inode))
 		mode &= ~current_umask();
 	error = security_path_mkdir(&path, dentry, mode);
 	if (!error)
 		error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
 	done_path_create(&path, dentry);
+	revert_userns_creds(cred);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
@@ -3977,12 +4038,14 @@ long do_rmdir(int dfd, const char __user *pathname)
 	struct qstr last;
 	int type;
 	unsigned int lookup_flags = 0;
+	const struct cred *cred;
 retry:
 	name = filename_parentat(dfd, getname(pathname), lookup_flags,
 				&path, &last, &type);
 	if (IS_ERR(name))
 		return PTR_ERR(name);
 
+	cred = change_userns_creds(&path);
 	switch (type) {
 	case LAST_DOTDOT:
 		error = -ENOTEMPTY;
@@ -4018,6 +4081,7 @@ long do_rmdir(int dfd, const char __user *pathname)
 	inode_unlock(path.dentry->d_inode);
 	mnt_drop_write(path.mnt);
 exit1:
+	revert_userns_creds(cred);
 	path_put(&path);
 	putname(name);
 	if (retry_estale(error, lookup_flags)) {
@@ -4107,11 +4171,13 @@ long do_unlinkat(int dfd, struct filename *name)
 	struct inode *inode = NULL;
 	struct inode *delegated_inode = NULL;
 	unsigned int lookup_flags = 0;
+	const struct cred *cred;
 retry:
 	name = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
 	if (IS_ERR(name))
 		return PTR_ERR(name);
 
+	cred = change_userns_creds(&path);
 	error = -EISDIR;
 	if (type != LAST_NORM)
 		goto exit1;
@@ -4149,6 +4215,7 @@ long do_unlinkat(int dfd, struct filename *name)
 	}
 	mnt_drop_write(path.mnt);
 exit1:
+	revert_userns_creds(cred);
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -4213,6 +4280,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 	struct dentry *dentry;
 	struct path path;
 	unsigned int lookup_flags = 0;
+	const struct cred *cred;
 
 	from = getname(oldname);
 	if (IS_ERR(from))
@@ -4223,6 +4291,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 	if (IS_ERR(dentry))
 		goto out_putname;
 
+	cred = change_userns_creds(&path);
 	error = security_path_symlink(&path, dentry, from->name);
 	if (!error)
 		error = vfs_symlink(path.dentry->d_inode, dentry, from->name);
@@ -4231,6 +4300,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
 	}
+	revert_userns_creds(cred);
 out_putname:
 	putname(from);
 	return error;
@@ -4344,6 +4414,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 	struct inode *delegated_inode = NULL;
 	int how = 0;
 	int error;
+	const struct cred *cred;
 
 	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0)
 		return -EINVAL;
@@ -4371,6 +4442,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 	if (IS_ERR(new_dentry))
 		goto out;
 
+	cred = change_userns_creds(&new_path);
 	error = -EXDEV;
 	if (old_path.mnt != new_path.mnt)
 		goto out_dput;
@@ -4382,6 +4454,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 		goto out_dput;
 	error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
 out_dput:
+	revert_userns_creds(cred);
 	done_path_create(&new_path, new_dentry);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
@@ -4601,6 +4674,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 	unsigned int lookup_flags = 0, target_flags = LOOKUP_RENAME_TARGET;
 	bool should_retry = false;
 	int error;
+	const struct cred *cred;
 
 	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
 		return -EINVAL;
@@ -4630,6 +4704,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 		goto exit1;
 	}
 
+	cred = change_userns_creds(&new_path);
 	error = -EXDEV;
 	if (old_path.mnt != new_path.mnt)
 		goto exit2;
@@ -4714,6 +4789,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 	}
 	mnt_drop_write(old_path.mnt);
 exit2:
+	revert_userns_creds(cred);
 	if (retry_estale(error, lookup_flags))
 		should_retry = true;
 	path_put(&new_path);
diff --git a/fs/namespace.c b/fs/namespace.c
index 69fb23ae3d8f..4720647588ab 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -200,6 +200,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 		mnt->mnt_writers = 0;
 #endif
 
+		mnt->mnt_userns = get_user_ns(&init_user_ns);
+
 		INIT_HLIST_NODE(&mnt->mnt_hash);
 		INIT_LIST_HEAD(&mnt->mnt_child);
 		INIT_LIST_HEAD(&mnt->mnt_mounts);
@@ -1044,6 +1046,8 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	mnt->mnt.mnt_root = dget(root);
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
+	put_user_ns(mnt->mnt_userns);
+	mnt->mnt_userns = get_user_ns(old->mnt_userns);
 	lock_mount_hash();
 	list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
 	unlock_mount_hash();
@@ -1102,6 +1106,7 @@ static void cleanup_mnt(struct mount *mnt)
 	dput(mnt->mnt.mnt_root);
 	deactivate_super(mnt->mnt.mnt_sb);
 	mnt_free_id(mnt);
+	put_user_ns(mnt->mnt_userns);
 	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
 }
 
diff --git a/fs/open.c b/fs/open.c
index db6758b9636a..d27b90dce64d 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -456,11 +456,13 @@ int ksys_chdir(const char __user *filename)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
+	const struct cred *cred;
 retry:
 	error = user_path_at(AT_FDCWD, filename, lookup_flags, &path);
 	if (error)
 		goto out;
 
+	cred = change_userns_creds(&path);
 	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_CHDIR);
 	if (error)
 		goto dput_and_out;
@@ -468,6 +470,7 @@ int ksys_chdir(const char __user *filename)
 	set_fs_pwd(current->fs, &path);
 
 dput_and_out:
+	revert_userns_creds(cred);
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -486,11 +489,13 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
 {
 	struct fd f = fdget_raw(fd);
 	int error;
+	const struct cred *cred;
 
 	error = -EBADF;
 	if (!f.file)
 		goto out;
 
+	cred = change_userns_creds(&f.file->f_path);
 	error = -ENOTDIR;
 	if (!d_can_lookup(f.file->f_path.dentry))
 		goto out_putf;
@@ -499,6 +504,7 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
 	if (!error)
 		set_fs_pwd(current->fs, &f.file->f_path);
 out_putf:
+	revert_userns_creds(cred);
 	fdput(f);
 out:
 	return error;
@@ -547,11 +553,13 @@ static int chmod_common(const struct path *path, umode_t mode)
 	struct inode *inode = path->dentry->d_inode;
 	struct inode *delegated_inode = NULL;
 	struct iattr newattrs;
+	const struct cred *cred;
 	int error;
 
+	cred = change_userns_creds(path);
 	error = mnt_want_write(path->mnt);
 	if (error)
-		return error;
+		goto out;
 retry_deleg:
 	inode_lock(inode);
 	error = security_path_chmod(path, mode);
@@ -568,6 +576,8 @@ static int chmod_common(const struct path *path, umode_t mode)
 			goto retry_deleg;
 	}
 	mnt_drop_write(path->mnt);
+ out:
+	revert_userns_creds(cred);
 	return error;
 }
 
@@ -666,6 +676,7 @@ int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group,
 	struct path path;
 	int error = -EINVAL;
 	int lookup_flags;
+	const struct cred *cred;
 
 	if ((flag & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
 		goto out;
@@ -677,12 +688,14 @@ int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group,
 	error = user_path_at(dfd, filename, lookup_flags, &path);
 	if (error)
 		goto out;
+	cred = change_userns_creds(&path);
 	error = mnt_want_write(path.mnt);
 	if (error)
 		goto out_release;
 	error = chown_common(&path, user, group);
 	mnt_drop_write(path.mnt);
 out_release:
+	revert_userns_creds(cred);
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -713,10 +726,12 @@ int ksys_fchown(unsigned int fd, uid_t user, gid_t group)
 {
 	struct fd f = fdget(fd);
 	int error = -EBADF;
+	const struct cred *cred;
 
 	if (!f.file)
 		goto out;
 
+	cred = change_userns_creds(&f.file->f_path);
 	error = mnt_want_write_file(f.file);
 	if (error)
 		goto out_fput;
@@ -724,6 +739,7 @@ int ksys_fchown(unsigned int fd, uid_t user, gid_t group)
 	error = chown_common(&f.file->f_path, user, group);
 	mnt_drop_write_file(f.file);
 out_fput:
+	revert_userns_creds(cred);
 	fdput(f);
 out:
 	return error;
@@ -911,8 +927,13 @@ EXPORT_SYMBOL(file_path);
  */
 int vfs_open(const struct path *path, struct file *file)
 {
+	int ret;
+	const struct cred *cred = change_userns_creds(path);
+
 	file->f_path = *path;
-	return do_dentry_open(file, d_backing_inode(path->dentry), NULL);
+	ret = do_dentry_open(file, d_backing_inode(path->dentry), NULL);
+	revert_userns_creds(cred);
+	return ret;
 }
 
 struct file *dentry_open(const struct path *path, int flags,
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 249672bf54fe..ff777110f3da 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -364,7 +364,7 @@ posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
                                         goto mask;
 				break;
                         case ACL_GROUP_OBJ:
-                                if (in_group_p(inode->i_gid)) {
+				if (in_group_p_shifted(inode->i_gid)) {
 					found = 1;
 					if ((pa->e_perm & want) == want)
 						goto mask;
@@ -655,7 +655,7 @@ int posix_acl_update_mode(struct inode *inode, umode_t *mode_p,
 		return error;
 	if (error == 0)
 		*acl = NULL;
-	if (!in_group_p(inode->i_gid) &&
+	if (!in_group_p_shifted(inode->i_gid) &&
 	    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
 		mode &= ~S_ISGID;
 	*mode_p = mode;
diff --git a/fs/stat.c b/fs/stat.c
index 030008796479..634b8d13ed51 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -22,6 +22,7 @@
 #include <asm/unistd.h>
 
 #include "internal.h"
+#include "mount.h"
 
 /**
  * generic_fillattr - Fill in the basic attributes from the inode struct
@@ -50,6 +51,23 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
 }
 EXPORT_SYMBOL(generic_fillattr);
 
+static void shift_check(struct vfsmount *mnt, struct kstat *stat)
+{
+	struct mount *m = real_mount(mnt);
+	struct user_namespace *user_ns = m->mnt_ns->user_ns;
+
+	if ((mnt->mnt_flags & MNT_SHIFT) == 0)
+		return;
+
+	if (current->nsproxy->mnt_ns->user_ns != m->mnt_ns->user_ns)
+		return;
+
+	stat->uid = KUIDT_INIT(from_kuid(m->mnt_userns, stat->uid));
+	stat->uid = make_kuid(user_ns, __kuid_val(stat->uid));
+	stat->gid = KGIDT_INIT(from_kgid(m->mnt_userns, stat->gid));
+	stat->gid = make_kgid(user_ns, __kgid_val(stat->gid));
+}
+
 /**
  * vfs_getattr_nosec - getattr without security checks
  * @path: file to get attributes from
@@ -67,6 +85,7 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 		      u32 request_mask, unsigned int query_flags)
 {
 	struct inode *inode = d_backing_inode(path->dentry);
+	int ret;
 
 	memset(stat, 0, sizeof(*stat));
 	stat->result_mask |= STATX_BASIC_STATS;
@@ -79,12 +98,17 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	ret = 0;
 	if (inode->i_op->getattr)
-		return inode->i_op->getattr(path, stat, request_mask,
-					    query_flags);
+		ret = inode->i_op->getattr(path, stat, request_mask,
+					   query_flags);
+	else
+		generic_fillattr(inode, stat);
 
-	generic_fillattr(inode, stat);
-	return 0;
+	if (!ret)
+		shift_check(path->mnt, stat);
+
+	return ret;
 }
 EXPORT_SYMBOL(vfs_getattr_nosec);
 
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..d29638617844 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -59,6 +59,7 @@ extern struct group_info *groups_alloc(int);
 extern void groups_free(struct group_info *);
 
 extern int in_group_p(kgid_t);
+extern int in_group_p_shifted(kgid_t);
 extern int in_egroup_p(kgid_t);
 extern int groups_search(const struct group_info *, kgid_t);
 
@@ -75,6 +76,10 @@ static inline int in_group_p(kgid_t grp)
 {
         return 1;
 }
+static inline int in_group_p_shifted(kgid_t grp)
+{
+	return 1;
+}
 static inline int in_egroup_p(kgid_t grp)
 {
         return 1;
@@ -422,4 +427,11 @@ do {						\
 	*(_fsgid) = __cred->fsgid;		\
 } while(0)
 
+static inline bool cred_is_shifted(void)
+{
+	return current_cred() == current->mnt_cred;
+}
+
+extern void cred_shift(kuid_t *kuid, kgid_t *kgid);
+
 #endif /* _LINUX_CRED_H */
diff --git a/include/linux/mount.h b/include/linux/mount.h
index bf8cc4108b8f..cdc5d981d594 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -46,7 +46,7 @@ struct fs_context;
 #define MNT_SHARED_MASK	(MNT_UNBINDABLE)
 #define MNT_USER_SETTABLE_MASK  (MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
 				 | MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
-				 | MNT_READONLY)
+				 | MNT_READONLY | MNT_SHIFT)
 #define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
 
 #define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
@@ -65,6 +65,8 @@ struct fs_context;
 #define MNT_MARKED		0x4000000
 #define MNT_UMOUNT		0x8000000
 
+#define MNT_SHIFT		0x10000000
+
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
 	struct super_block *mnt_sb;	/* pointer to superblock */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..b489c3964e5a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -19,6 +19,7 @@
 #include <linux/plist.h>
 #include <linux/hrtimer.h>
 #include <linux/seccomp.h>
+#include <linux/mount.h>
 #include <linux/nodemask.h>
 #include <linux/rcupdate.h>
 #include <linux/refcount.h>
@@ -882,6 +883,10 @@ struct task_struct {
 	/* Effective (overridable) subjective task credentials (COW): */
 	const struct cred __rcu		*cred;
 
+	/* cache for uid/gid shifted cred tied to mnt */
+	struct cred			*mnt_cred;
+	struct vfsmount			*mnt;
+
 #ifdef CONFIG_KEYS
 	/* Cached requested key. */
 	struct key			*cached_requested_key;
diff --git a/kernel/capability.c b/kernel/capability.c
index 1444f3954d75..a28955a38460 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -486,8 +486,13 @@ EXPORT_SYMBOL(file_ns_capable);
  */
 bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *inode)
 {
-	return kuid_has_mapping(ns, inode->i_uid) &&
-		kgid_has_mapping(ns, inode->i_gid);
+	kuid_t i_uid = inode->i_uid;
+	kgid_t i_gid = inode->i_gid;
+
+	cred_shift(&i_uid, &i_gid);
+
+	return kuid_has_mapping(ns, i_uid) &&
+		kgid_has_mapping(ns, i_gid);
 }
 
 /**
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985b1793..32a8eb5c91b3 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -167,6 +167,8 @@ void exit_creds(struct task_struct *tsk)
 	validate_creds(cred);
 	alter_cred_subscribers(cred, -1);
 	put_cred(cred);
+	if (tsk->mnt_cred)
+		put_cred(tsk->mnt_cred);
 
 	cred = (struct cred *) tsk->cred;
 	tsk->cred = NULL;
@@ -318,6 +320,17 @@ struct cred *prepare_exec_creds(void)
 	return new;
 }
 
+static void flush_mnt_cred(struct task_struct *t)
+{
+	if (t->mnt_cred == t->cred)
+		return;
+	if (t->mnt_cred)
+		put_cred(t->mnt_cred);
+	t->mnt_cred = NULL;
+	/* mnt is only used for comparison, so it has no reference */
+	t->mnt = NULL;
+}
+
 /*
  * Copy credentials for the new process created by fork()
  *
@@ -344,6 +357,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
 	    ) {
 		p->real_cred = get_cred(p->cred);
 		get_cred(p->cred);
+		p->mnt = NULL;
+		p->mnt_cred = NULL;
 		alter_cred_subscribers(p->cred, 2);
 		kdebug("share_creds(%p{%d,%d})",
 		       p->cred, atomic_read(&p->cred->usage),
@@ -383,6 +398,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
 
 	atomic_inc(&new->user->processes);
 	p->cred = p->real_cred = get_cred(new);
+	p->mnt = NULL;
+	p->mnt_cred = NULL;
 	alter_cred_subscribers(new, 2);
 	validate_creds(new);
 	return 0;
@@ -506,6 +523,7 @@ int commit_creds(struct cred *new)
 	/* release the old obj and subj refs both */
 	put_cred(old);
 	put_cred(old);
+	flush_mnt_cred(task);
 	return 0;
 }
 EXPORT_SYMBOL(commit_creds);
@@ -564,6 +582,7 @@ const struct cred *override_creds(const struct cred *new)
 	alter_cred_subscribers(new, 1);
 	rcu_assign_pointer(current->cred, new);
 	alter_cred_subscribers(old, -1);
+	flush_mnt_cred(current);
 
 	kdebug("override_creds() = %p{%d,%d}", old,
 	       atomic_read(&old->usage),
@@ -589,6 +608,7 @@ void revert_creds(const struct cred *old)
 
 	validate_creds(old);
 	validate_creds(override);
+	flush_mnt_cred(current);
 	alter_cred_subscribers(old, 1);
 	rcu_assign_pointer(current->cred, old);
 	alter_cred_subscribers(override, -1);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 3/3] fs: expose shifting bind mount to userspace
  2020-02-17 20:53 [PATCH v3 0/3] introduce a uid/gid shifting bind mount James Bottomley
  2020-02-17 20:53 ` [PATCH v3 1/3] fs: rethread notify_change to take a path instead of a dentry James Bottomley
  2020-02-17 20:53 ` [PATCH v3 2/3] fs: introduce uid/gid shifting bind mount James Bottomley
@ 2020-02-17 20:53 ` James Bottomley
  2020-02-18  7:18 ` [PATCH v3 0/3] introduce a uid/gid shifting bind mount Amir Goldstein
  3 siblings, 0 replies; 16+ messages in thread
From: James Bottomley @ 2020-02-17 20:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: David Howells, Christian Brauner, Al Viro, Miklos Szeredi,
	Seth Forshee, linux-unionfs, Amir Goldstein,
	Stéphane Graber, Eric Biederman, Aleksa Sarai,
	Serge E . Hallyn, Tycho Andersen, containers

This capability is exposed currently through the proposed new bind
API. To mark a mount for shifting, you add the allow-shift flag to the
properties, either by a reconfigure or a rebind.  Only real root on
the system can do this.  Once this is done, admin in a user namespace
(i.e. an unprivileged user) can take that mount point and bind it with
a shift in effect.

This patch has the ability to shift image uids built in.  The way you
do this is create a usernamespace where the fake root of the image is
seen by the interior uid as zero.  Then pass this namespace in using
the "ns" option of the bind.  One of the key design criteria is that
the mounted image never appears at exterior uid zero, so the image
shift isn't activated until the whole shift is applied.

The way an admin marks a mount is:

pathfd = open("/path/to/shift", O_PATH);
fd = configfd_open("bind", O_CLOEXEC);
configfd_action(fd, CONFIGFD_SET_FD, "pathfd", NULL, pathfd);
configfd_action(fd, CONFIGFD_SET_FLAG, "allow-shift", NULL, 0);
configfd_action(fd, CONFIGFD_SET_FLAG, "detached", NULL, 0);
/* optionally apply a shift to the base image */
nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not O_PATH */
configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
/* then create the detached object */
configfd_action(fd, CONFIGFD_CMD_CREATE, NULL, NULL, 0);
configfd_action(fd, CONFIGFD_GET_FD, "bindfd", &bindfd, O_CLOEXEC);

move_mount(bindfd, "", AT_FDCWD, "/path/to/allow", MOVE_MOUNT_F_EMPTY_PATH);

Technically /path/to/shift and /path/to/allow can be the same, which
basically installs a mnt at the path that allows onward traversal.

Then any mount namespace in a user namespace can do:

pathfd = open("/path/to/allow", O_PATH);
fd = configfd_open("bind", O_CLOEXEC);
configfd_action(fd, CONFIGFD_SET_FD, "pathfd", NULL, pathfd);
configfd_action(fd, CONFIGFD_SET_FLAG, "shift", NULL, 0);
configfd_action(fd, CONFIGFD_SET_FLAG, "detached", NULL, 0);
configfd_action(fd, CONFIGFD_CMD_CREATE, NULL, NULL, 0);
configfd_action(fd, CONFIGFD_GET_FD, "bindfd", &bindfd, O_CLOEXEC);

move_mount(bindfd, "", AT_FDCWD, "/path/to/mount", MOVE_MOUNT_F_EMPTY_PATH);

And /path/to/mount will have the uid/gid shifting bind mount installed.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

---

v3: added ns option to do base image shifting
---
 fs/bind.c           | 105 +++++++++++++++++++++++++++++++++++++++++++++++-----
 fs/mount.h          |   2 +
 fs/namespace.c      |   1 +
 fs/proc_namespace.c |   4 ++
 4 files changed, 103 insertions(+), 9 deletions(-)

diff --git a/fs/bind.c b/fs/bind.c
index c1dedef40169..0cb40d3d4289 100644
--- a/fs/bind.c
+++ b/fs/bind.c
@@ -6,23 +6,29 @@
  */
 
 #include <linux/configfd.h>
+#include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/mount.h>
 #include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+#include <linux/user_namespace.h>
 
 #include "internal.h"
 #include "mount.h"
 
 struct bind_data {
-	bool		ro:1;
-	bool		noexec:1;
-	bool		nosuid:1;
-	bool		nodev:1;
-	bool		detached:1;
-	bool		recursive:1;
-	struct file	*file;
-	struct file	*retfile;
+	bool			ro:1;
+	bool			noexec:1;
+	bool			nosuid:1;
+	bool			nodev:1;
+	bool			detached:1;
+	bool			recursive:1;
+	bool			shift:1;
+	bool			allow_shift:1;
+	struct file		*file;
+	struct file		*retfile;
+	struct user_namespace	*userns;
 };
 
 struct bind_data *to_bind_data(const struct configfd_context *cfc)
@@ -30,13 +36,54 @@ struct bind_data *to_bind_data(const struct configfd_context *cfc)
 	return cfc->data;
 }
 
+static int match_file(const void *p, struct file *file, unsigned int fd)
+{
+	if (p == file)
+		return fd + 1;
+	return 0;
+}
+
+static int bind_set_userns(const struct configfd_context *cfc,
+			   struct configfd_param *p)
+{
+	struct bind_data *bd = to_bind_data(cfc);
+	int fd;
+	struct file *file;
+	struct ns_common *ns;
+
+	if (!bd->allow_shift) {
+		plogger_err(&cfc->log, "ns may only be specified if allow_shift is also");
+		return -EINVAL;
+	}
+
+	/* huge fuss here because nsfs matches on fd not file */
+	fd = iterate_fd(current->files, 0, match_file, p->file);
+	if (fd == 0)
+		return -EINVAL;
+	fd--;
+	file = proc_ns_fget(fd);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ns = get_proc_ns(file_inode(file));
+	if (ns->ops->type != CLONE_NEWUSER)
+		return -EINVAL;
+
+	bd->userns = get_user_ns(container_of(ns, struct user_namespace, ns));
+
+	/* note we haven't consumed p->file so configfd will release it */
+	return 0;
+}
+
 static int bind_set_fd(const struct configfd_context *cfc,
 		       struct configfd_param *p)
 {
 	struct bind_data *bd = to_bind_data(cfc);
 	struct path *path;
 
-	if (strcmp(p->key, "pathfd") != 0)
+	if (strcmp(p->key, "ns") == 0)
+		return bind_set_userns(cfc, p);
+	else if (strcmp(p->key, "pathfd") != 0)
 		return -EINVAL;
 
 	path = &p->file->f_path;
@@ -66,6 +113,25 @@ static int bind_set_flag(const struct configfd_context *cfc,
 		bd->nodev = true;
 	} else if (strcmp(p->key, "noexec") == 0) {
 		bd->noexec = true;
+	} else if (strcmp(p->key, "shift") == 0) {
+		struct mount *m;
+
+		if (!bd->file) {
+			plogger_err(&cfc->log, "can't shift without setting pathfd");
+			return -EINVAL;
+		}
+		m = real_mount(bd->file->f_path.mnt);
+		if (!m->allow_shift) {
+			plogger_err(&cfc->log, "pathfd doesn't allow shifting");
+			return -EINVAL;
+		}
+		bd->shift = true;
+	} else if (strcmp(p->key, "allow-shift") == 0) {
+		if (!capable(CAP_SYS_ADMIN)) {
+			plogger_err(&cfc->log, "must be root to set allow-shift");
+			return -EPERM;
+		}
+		bd->allow_shift = true;
 	} else if (strcmp(p->key, "recursive") == 0 &&
 		   cfc->op == CONFIGFD_CMD_CREATE) {
 		bd->recursive = true;
@@ -126,6 +192,8 @@ static int bind_get_mnt_flags(struct bind_data *bd, int mnt_flags)
 		mnt_flags |= MNT_NODEV;
 	if (bd->noexec)
 		mnt_flags |= MNT_NOEXEC;
+	if (bd->shift)
+		mnt_flags |= MNT_SHIFT;
 
 	return mnt_flags;
 }
@@ -143,6 +211,13 @@ static int bind_reconfigure(const struct configfd_context *cfc)
 	mnt_flags = bd->file->f_path.mnt->mnt_flags & MNT_ATIME_MASK;
 	mnt_flags = bind_get_mnt_flags(bd, mnt_flags);
 
+	if (bd->allow_shift) {
+		struct mount *m = real_mount(bd->file->f_path.mnt);
+
+		/* FIXME: this should be set with the reconfigure locking */
+		m->allow_shift = true;
+	}
+
 	return do_reconfigure_mnt(&bd->file->f_path, mnt_flags);
 }
 
@@ -178,11 +253,21 @@ static int bind_create(const struct configfd_context *cfc)
 
 	if (bd->detached) {
 		int mnt_flags = f->f_path.mnt->mnt_flags & MNT_ATIME_MASK;
+		struct mount *m = real_mount(f->f_path.mnt);
 
 		mnt_flags = bind_get_mnt_flags(bd, mnt_flags);
 
 		/* since this is a detached copy, we can do without locking */
 		f->f_path.mnt->mnt_flags |= mnt_flags;
+
+		if (bd->allow_shift)
+			m->allow_shift = true;
+
+		if (bd->userns) {
+			put_user_ns(m->mnt_userns);
+			m->mnt_userns = bd->userns;
+			bd->userns = NULL; /* transfer the reference */
+		}
 	}
 
 	bd->retfile = f;
@@ -208,6 +293,8 @@ static void bind_free(const struct configfd_context *cfc)
 
 	if (bd->file)
 		fput(bd->file);
+	if (bd->userns)
+		put_user_ns(bd->userns);
 }
 
 static struct configfd_ops bind_type_ops = {
diff --git a/fs/mount.h b/fs/mount.h
index c3bfc6ced4c7..157e86cf9930 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -72,6 +72,8 @@ struct mount {
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	struct hlist_head mnt_pins;
 	struct hlist_head mnt_stuck_children;
+	/* shifting bind mount parameters */
+	bool allow_shift:1;
 	struct user_namespace *mnt_userns; /* mapping for underlying mount uid/gid */
 } __randomize_layout;
 
diff --git a/fs/namespace.c b/fs/namespace.c
index 4720647588ab..75e4df4ddf54 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1040,6 +1040,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags;
 	mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);
+	mnt->allow_shift = old->allow_shift;
 
 	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_sb = sb;
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 273ee82d8aa9..bdf8d23cf42e 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -70,14 +70,18 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_NOATIME, ",noatime" },
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
+		{ MNT_SHIFT, ",shift" },
 		{ 0, NULL }
 	};
 	const struct proc_fs_info *fs_infop;
+	struct mount *rm = real_mount(mnt);
 
 	for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
 		if (mnt->mnt_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
 	}
+	if (rm->allow_shift)
+		seq_puts(m, ",allow-shift");
 }
 
 static inline void mangle(struct seq_file *m, const char *s)
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] introduce a uid/gid shifting bind mount
  2020-02-17 20:53 [PATCH v3 0/3] introduce a uid/gid shifting bind mount James Bottomley
                   ` (2 preceding siblings ...)
  2020-02-17 20:53 ` [PATCH v3 3/3] fs: expose shifting bind mount to userspace James Bottomley
@ 2020-02-18  7:18 ` Amir Goldstein
  2020-02-18 16:11   ` James Bottomley
  3 siblings, 1 reply; 16+ messages in thread
From: Amir Goldstein @ 2020-02-18  7:18 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, David Howells, Christian Brauner, Al Viro,
	Miklos Szeredi, Seth Forshee, overlayfs, Stéphane Graber,
	Eric Biederman, Aleksa Sarai, Serge E . Hallyn, Tycho Andersen,
	Linux Containers

On Mon, Feb 17, 2020 at 10:56 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> The object of this series is to replace shiftfs with a proper uid/gid
> shifting bind mount instead of the shiftfs hack of introducing
> something that looks similar to an overlay filesystem to do it.
>
> The VFS still has the problem that in order to tell what vfsmount a
> dentry belongs to, struct path would have to be threaded everywhere
> struct dentry currently is.  However, this patch is structured only to
> require a rethreading of notify_change.  The rest of the knowledge
> that a shift is in operation is carried in the task structure by
> caching the unshifted credentials.
>
> Note that although it is currently dependent on the new configfd
> interface for bind mounts, only patch 3/3 relies on this, and the
> whole thing could be redone as a syscall or any other mechanism
> (depending on how people eventually want to fix the problem with the
> new fsconfig mechanism being unable to reconfigure bind mounts).
>
> The changes from v2 are I've added Amir's reviewed-by for the
> notify_change rethreading and I've implemented Serge's request for a
> base offset shift for the image.  It turned out to be much harder to
> implement a simple linear shift than simply to do it through a
> different userns, so that's how I've done it.  The userns you need to
> set up for the offset shifted image is one where the interior uid
> would see the shifted image as fake root.  I've introduced an
> additional "ns" config parameter, which must be specified when
> building the allow shift mount point (so it's done by the admin, not
> by the unprivileged user).  I've also taken care that the image
> shifted to zero (real root) is never visible in the filesystem.  Patch
> 3/3 explains how to use the additional "ns" parameter.
>
>

James,

To us common people who do not breath containers, your proposal seems
like a competing implementation to Christian's proposal [1]. If it were a
competing implementation, I think Christian's proposal would have won by
points for being less intrusive to VFS.

But it is not really a competing implementation, is it? Your proposals meet
two different, but very overlapping, set of requirements. IMHO, none of you
did a really good job of explaining that in the cover latter, let
alone, refer to
each others proposals (I am referring to your v3 posting of course).

IIUC, Christian's proposal deals with single shared image per
non-overlapping groups of containers. And it deals with this use case very
elegantly IMO. From your comments on Christian's post, it does not
seem that you oppose to his proposal, except that it does not meet the
requirements for all of your use cases.

IIUC, your proposal can deal with multiple shared images per overlapping
groups of containers and it adds an element of "auto-reverse-mapping",
which reduces the administration overhead of this to be nightmare
of orchestration.

It seems to me, that you should look into working your patch set on
top of fsid mapping and try to make use of it as much as possible.
And to make things a bit more clear to the rest of us, you should probably
market your feature as "auto back shifting mount" or something like that
and explain the added value of the feature on top of plain fsid mapping.

Thanks,
Amir.


[1] https://lore.kernel.org/linux-fsdevel/20200214183554.1133805-1-christian.brauner@ubuntu.com/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 2/3] fs: introduce uid/gid shifting bind mount
  2020-02-17 20:53 ` [PATCH v3 2/3] fs: introduce uid/gid shifting bind mount James Bottomley
@ 2020-02-18  7:38   ` Amir Goldstein
  2020-02-18 22:33   ` Christoph Hellwig
  1 sibling, 0 replies; 16+ messages in thread
From: Amir Goldstein @ 2020-02-18  7:38 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, David Howells, Christian Brauner, Al Viro,
	Miklos Szeredi, Seth Forshee, overlayfs, Stéphane Graber,
	Eric Biederman, Aleksa Sarai, Serge E . Hallyn, Tycho Andersen,
	Linux Containers

On Mon, Feb 17, 2020 at 10:58 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> This implementation reverse shifts according to the user_ns belonging
> to the mnt_ns.  So if the vfsmount has the newly introduced flag
> MNT_SHIFT and the current user_ns is the same as the mount_ns->user_ns
> then we shift back using the user_ns and an optional mnt_userns (which
> belongs to the struct mount) before committing to the underlying
> filesystem.
>
> For example, if a user_ns is created where interior (fake root, uid 0)
> is mapped to kernel uid 100000 then writes from interior root normally
> go to the filesystem at the kernel uid.  However, if MNT_SHIFT is set,
> they will be shifted back to write at uid 0, meaning we can bind mount
> real image filesystems to user_ns protected faker root.
>
> In essence there are several things which have to be done for this to
> occur safely.  Firstly for all operations on the filesystem, new
> credentials have to be installed where fsuid and fsgid are set to the
> *interior* values. Next all inodes used from the filesystem have to
> have i_uid and i_gid shifted back to the kernel values and attributes
> set from user space have to have ia_uid and ia_gid shifted from the
> kernel values to the interior values.  The capability checks have to
> be done using ns_capable against the kernel values, but the inode
> capability checks have to be done against the shifted ids.
>
> Since creating a new credential is a reasonably expensive proposition
> and we have to shift and unshift many times during path walking, a
> cached copy of the shifted credential is saved to a newly created
> place in the task structure.  This serves the dual purpose of allowing
> us to use a pre-prepared copy of the shifted credentials and also
> allows us to recognise whenever the shift is actually in effect (the
> cached shifted credential pointer being equal to the current_cred()
> pointer).
>
> To get this all to work, we have a check for the vfsmount flag and the
> user_ns gating a shifting of the credentials over all user space
> entries to filesystem functions.  In theory the path has to be present
> everywhere we do this, so we can check the vfsmount flags.  However,
> for lower level functions we can cheat this path check of vfsmount
> simply to check whether a shifted credential is in effect or not to
> gate things like the inode permission check, which means the path
> doesn't have to be threaded all the way through the permission
> checking functions.  if the credential is shifted check passes, we can
> also be sure that the current user_ns is the same as the mnt->user_ns,
> so we can use it and thus have no need of the struct mount at the
> point of the shift.
>
> Although the shift can be effected simply by executing
> do_reconfigure_mnt with MNT_SHIFT in the flags, this patch only
> contains the shifting mechanisms.  The follow on patch wires up the
> user visible API for turning the flag on.
>
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
>
> ---
[...]

> @@ -3828,6 +3884,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
>         if (IS_ERR(dentry))
>                 return PTR_ERR(dentry);
>
> +       cred = change_userns_creds(&path);
>         if (!IS_POSIXACL(path.dentry->d_inode))
>                 mode &= ~current_umask();
>         error = security_path_mknod(&path, dentry, mode, dev);
[...]

> +       cred = change_userns_creds(&path);
>         if (!IS_POSIXACL(path.dentry->d_inode))
>                 mode &= ~current_umask();
>         error = security_path_mkdir(&path, dentry, mode);
[...]

> +       cred = change_userns_creds(&path);
>         error = security_path_symlink(&path, dentry, from->name);

I see a pattern above.

Perhaps change_userns_creds() should be inside security_path_XXX hooks?
Perhaps auto-shifting bind mount should be implemented by an LSM?
After, all "gating" access to filesystem, is part of what LSMs do and
uid (or fsid)
shifting is a sort of "gating".
Heck, there should already be a way to attach a security context to a mount,
right? So you don't even need a new UAPI in order to configure the auto-shifting
LSM. And you could use standard security.* xattr for persistent configuration
of the auto-shifting filesystem sections, which is something that you wanted
to solve anyway, right?

Apologies if my suggestions are flawed with misunderstanding of the feature.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] introduce a uid/gid shifting bind mount
  2020-02-18  7:18 ` [PATCH v3 0/3] introduce a uid/gid shifting bind mount Amir Goldstein
@ 2020-02-18 16:11   ` James Bottomley
  2020-02-18 17:26     ` Christian Brauner
  0 siblings, 1 reply; 16+ messages in thread
From: James Bottomley @ 2020-02-18 16:11 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Linux Containers, overlayfs, David Howells,
	Seth Forshee, Al Viro, linux-fsdevel, Eric Biederman

On Tue, 2020-02-18 at 09:18 +0200, Amir Goldstein wrote:
> On Mon, Feb 17, 2020 at 10:56 PM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > 
> > The object of this series is to replace shiftfs with a proper
> > uid/gid shifting bind mount instead of the shiftfs hack of
> > introducing something that looks similar to an overlay filesystem
> > to do it.
> > 
> > The VFS still has the problem that in order to tell what vfsmount a
> > dentry belongs to, struct path would have to be threaded everywhere
> > struct dentry currently is.  However, this patch is structured only
> > to require a rethreading of notify_change.  The rest of the
> > knowledge that a shift is in operation is carried in the task
> > structure by caching the unshifted credentials.
> > 
> > Note that although it is currently dependent on the new configfd
> > interface for bind mounts, only patch 3/3 relies on this, and the
> > whole thing could be redone as a syscall or any other mechanism
> > (depending on how people eventually want to fix the problem with
> > the new fsconfig mechanism being unable to reconfigure bind
> > mounts).
> > 
> > The changes from v2 are I've added Amir's reviewed-by for the
> > notify_change rethreading and I've implemented Serge's request for
> > a base offset shift for the image.  It turned out to be much harder
> > to implement a simple linear shift than simply to do it through a
> > different userns, so that's how I've done it.  The userns you need
> > to set up for the offset shifted image is one where the interior
> > uid would see the shifted image as fake root.  I've introduced an
> > additional "ns" config parameter, which must be specified when
> > building the allow shift mount point (so it's done by the admin,
> > not by the unprivileged user).  I've also taken care that the image
> > shifted to zero (real root) is never visible in the
> > filesystem.  Patch 3/3 explains how to use the additional "ns"
> > parameter.
> > 
> > 
> 
> James,
> 
> To us common people who do not breath containers, your proposal seems
> like a competing implementation to Christian's proposal [1].

I think we have three things that swirl around this space and aren't
quite direct copies of each other's use cases but aren't entirely
disjoint either: the superblock user namespace, this and the user
namespace fsid mapping.

>  If it were a competing implementation, I think Christian's proposal
> would have won by points for being less intrusive to VFS.

Heh, that one's subjective.  I think the current fsid code is missing
quite a few use cases in the stat/attr/in_group_p cases.  I'm just
building the code now to run it through the shiftfs tests and see how
it fares.  I think once those cases are added, the VFS changes in fsid
will be the same as I have in patch 2/3 ... primarily because we all
have to shift the same thing at the same point.  If you include the
notify_change rethreading then, yes, you're correct, but that patch
does stand on its own and is consonant with a long term vfs goal of
using path instead of dentry.

> But it is not really a competing implementation, is it? Your
> proposals meet two different, but very overlapping, set of
> requirements. IMHO, none of you did a really good job of explaining
> that in the cover latter, let alone, refer to each others proposals
> (I am referring to your v3 posting of course).

Yes, I know, but the fsid one is only a few days old so I haven't had
time to absorb all of it yet.

> IIUC, Christian's proposal deals with single shared image per
> non-overlapping groups of containers. And it deals with this use case
> very elegantly IMO. From your comments on Christian's post, it does
> not seem that you oppose to his proposal, except that it does not
> meet the requirements for all of your use cases.

No, but I think it could.  It's one of these perennial problems of more
generic vs more specific to use case.  I'm a bit lost in really what we
need for containers.  In the original shiftfs I made it superblock
based precisely because that was the only way I could integrate
s_user_ns into it ... and I thought that was a good idea.

> IIUC, your proposal can deal with multiple shared images per
> overlapping groups of containers

That's right ... it does the shift based on path and user namespace. 
In theory it allows an image with shifted and unshifted pieces ...
however, I'm not sure there's even a use case for that granularity
because all my current image shifting use cases are all or nothing. 
The granularity is an accident of the bind mount implementation.

>  and it adds an element of "auto-reverse-mapping", which reduces the
> administration overhead of this to be nightmare of orchestration.

Right, but the same thing could be an option to the fsid proposal: the
default use could shift forward on kuid and back on the same map for
fsuid ... then it would do what shiftfs does.  Likewise, shiftfs could
pick up the shift from an existing namespace and thus look more like
what fsuid does.

> It seems to me, that you should look into working your patch set on
> top of fsid mapping and try to make use of it as much as possible.
> And to make things a bit more clear to the rest of us, you should
> probably market your feature as "auto back shifting mount" or
> something like that and explain the added value of the feature on top
> of plain fsid mapping

Well we both need the same shift points, so we could definitely both
work off a generic vfs encapsulation of "shift needed here".  Once
that's done it does become a question of use and activation.

I can't help feeling that now that we've been around the houses a few
times, s_user_ns is actually in the wrong place and it should be in the
mount struct not the superblock.  I get the impression we've got the
what we need to expose (the use cases) well established (at least in
our heads).  The big question your asking is implementation (the how)
and also whether there isn't a combination of the exposures that works
for everyone.  I think this might make a very good session at LSF/MM. 
The how piece is completely within the purview of kernel developers. 
The use case is more problematic because that does involve the
container orchestration community.  However, I think at LSF/MM if we
could get agreement on a unified implementation that covers the three
use cases we're in a much better position to have the container
orchestration conversation  because it's simply a case of tweaking the
activation mechanisms.  I'll propose it as a topic.

James


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] introduce a uid/gid shifting bind mount
  2020-02-18 16:11   ` James Bottomley
@ 2020-02-18 17:26     ` Christian Brauner
  2020-02-18 19:05       ` James Bottomley
  0 siblings, 1 reply; 16+ messages in thread
From: Christian Brauner @ 2020-02-18 17:26 UTC (permalink / raw)
  To: James Bottomley
  Cc: Amir Goldstein, Miklos Szeredi, Linux Containers, overlayfs,
	David Howells, Seth Forshee, Al Viro, linux-fsdevel,
	Eric Biederman

On Tue, Feb 18, 2020 at 08:11:00AM -0800, James Bottomley wrote:
> On Tue, 2020-02-18 at 09:18 +0200, Amir Goldstein wrote:
> > On Mon, Feb 17, 2020 at 10:56 PM James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> > > 
> > > The object of this series is to replace shiftfs with a proper
> > > uid/gid shifting bind mount instead of the shiftfs hack of
> > > introducing something that looks similar to an overlay filesystem
> > > to do it.
> > > 
> > > The VFS still has the problem that in order to tell what vfsmount a
> > > dentry belongs to, struct path would have to be threaded everywhere
> > > struct dentry currently is.  However, this patch is structured only
> > > to require a rethreading of notify_change.  The rest of the
> > > knowledge that a shift is in operation is carried in the task
> > > structure by caching the unshifted credentials.
> > > 
> > > Note that although it is currently dependent on the new configfd
> > > interface for bind mounts, only patch 3/3 relies on this, and the
> > > whole thing could be redone as a syscall or any other mechanism
> > > (depending on how people eventually want to fix the problem with
> > > the new fsconfig mechanism being unable to reconfigure bind
> > > mounts).
> > > 
> > > The changes from v2 are I've added Amir's reviewed-by for the
> > > notify_change rethreading and I've implemented Serge's request for
> > > a base offset shift for the image.  It turned out to be much harder
> > > to implement a simple linear shift than simply to do it through a
> > > different userns, so that's how I've done it.  The userns you need
> > > to set up for the offset shifted image is one where the interior
> > > uid would see the shifted image as fake root.  I've introduced an
> > > additional "ns" config parameter, which must be specified when
> > > building the allow shift mount point (so it's done by the admin,
> > > not by the unprivileged user).  I've also taken care that the image
> > > shifted to zero (real root) is never visible in the
> > > filesystem.  Patch 3/3 explains how to use the additional "ns"
> > > parameter.
> > > 
> > > 
> > 
> > James,
> > 
> > To us common people who do not breath containers, your proposal seems
> > like a competing implementation to Christian's proposal [1].
> 
> I think we have three things that swirl around this space and aren't
> quite direct copies of each other's use cases but aren't entirely
> disjoint either: the superblock user namespace, this and the user
> namespace fsid mapping.
> 
> >  If it were a competing implementation, I think Christian's proposal
> > would have won by points for being less intrusive to VFS.
> 
> Heh, that one's subjective.  I think the current fsid code is missing

I honestly don't think it is subjective. Leaving aside that the patch
here is more invasive to the vfs just in how it needs to alter it, you
currently require a whole new set of syscalls for this feature. The
syscalls themselves even introduce a whole new API and complicated
semantics in the kernel and it's completely unclear whether they will
make it or not. Even if not, this feature will be tied to the new mount
api making it way harder and a longer process to adopt for userspace
given that not even all bits and pieces userspace needs are currently in
any released kernel.

But way more important: what Amir got right is that your approach and
fsid mappings don't stand in each others way at all. Shiftfed
bind-mounts can be implemented completely independent of fsid mappings
after the fact on top of it.

Your example, does this:

nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not O_PATH */
configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);

as the ultimate step. Essentially marking a mountpoint as shifted
relative to that user namespace. Once fsid mappings are in all that you
need to do is replace your make_kuid()/from_kuid()/from_kuid_munged()
calls and so on in your patchset with
make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're done. So I
honestly don't currently see any need to block the patchsets on each
other. 

Christian

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] introduce a uid/gid shifting bind mount
  2020-02-18 17:26     ` Christian Brauner
@ 2020-02-18 19:05       ` James Bottomley
  2020-02-18 20:03         ` Christian Brauner
  0 siblings, 1 reply; 16+ messages in thread
From: James Bottomley @ 2020-02-18 19:05 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, Miklos Szeredi, Linux Containers, overlayfs,
	David Howells, Seth Forshee, Al Viro, linux-fsdevel,
	Eric Biederman

On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 08:11:00AM -0800, James Bottomley wrote:
> > On Tue, 2020-02-18 at 09:18 +0200, Amir Goldstein wrote:
> > > On Mon, Feb 17, 2020 at 10:56 PM James Bottomley
> > > <James.Bottomley@hansenpartnership.com> wrote:
> > > > 
> > > > The object of this series is to replace shiftfs with a proper
> > > > uid/gid shifting bind mount instead of the shiftfs hack of
> > > > introducing something that looks similar to an overlay
> > > > filesystem to do it.
> > > > 
> > > > The VFS still has the problem that in order to tell what
> > > > vfsmount a dentry belongs to, struct path would have to be
> > > > threaded everywhere struct dentry currently is.  However, this
> > > > patch is structured only to require a rethreading of
> > > > notify_change.  The rest of the knowledge that a shift is in
> > > > operation is carried in the task structure by caching the
> > > > unshifted credentials.
> > > > 
> > > > Note that although it is currently dependent on the new
> > > > configfd interface for bind mounts, only patch 3/3 relies on
> > > > this, and the whole thing could be redone as a syscall or any
> > > > other mechanism (depending on how people eventually want to fix
> > > > the problem with the new fsconfig mechanism being unable to
> > > > reconfigure bind mounts).
> > > > 
> > > > The changes from v2 are I've added Amir's reviewed-by for the
> > > > notify_change rethreading and I've implemented Serge's request
> > > > for a base offset shift for the image.  It turned out to be
> > > > much harder to implement a simple linear shift than simply to
> > > > do it through a different userns, so that's how I've done
> > > > it.  The userns you need to set up for the offset shifted image
> > > > is one where the interior uid would see the shifted image as
> > > > fake root.  I've introduced an additional "ns" config
> > > > parameter, which must be specified when building the allow
> > > > shift mount point (so it's done by the admin, not by the
> > > > unprivileged user).  I've also taken care that the image
> > > > shifted to zero (real root) is never visible in the
> > > > filesystem.  Patch 3/3 explains how to use the additional "ns"
> > > > parameter.
> > > > 
> > > > 
> > > 
> > > James,
> > > 
> > > To us common people who do not breath containers, your proposal
> > > seems like a competing implementation to Christian's proposal
> > > [1].
> > 
> > I think we have three things that swirl around this space and
> > aren't quite direct copies of each other's use cases but aren't
> > entirely disjoint either: the superblock user namespace, this and
> > the user namespace fsid mapping.
> > 
> > >  If it were a competing implementation, I think Christian's
> > > proposal would have won by points for being less intrusive to
> > > VFS.
> > 
> > Heh, that one's subjective.  I think the current fsid code is
> > missing
> 
> I honestly don't think it is subjective. Leaving aside that the patch
> here is more invasive to the vfs just in how it needs to alter it,
> you currently require a whole new set of syscalls for this feature. 

Well I really don't ... I'd like to replace about four existing
syscalls (fspick/fsopen/fsconfig/fsmount) with two more general ones
(configfd_open/configfd_action), but I kept the original four to
demonstrate the two new ones could subsume their work.  However, I
really think we should leave aside the activation mechanism discussion
and concentrate on the internal implementation.

> The syscalls themselves even introduce a whole new API and
> complicated semantics in the kernel and it's completely unclear
> whether they will make it or not.

As I said in the cover letter.  I'm entirely mechanism agnostic. 
Whatever solves our current rebind reconfigure problem will be usable
for shifting bind mounts.  I like the idea of using the same fsconfig
mechanism, but this patch isn't wedded to it, that's why 2/3 is
implementation and 3/3 is activation.  The activation patch can be
completely replaced without affecting the implementation mechanism.

>  Even if not, this feature will be tied to the new mount api making
> it way harder and a longer process to adopt for userspace given that
> not even all bits and pieces userspace needs are currently in
> any released kernel.

Well, given that this problem and various proposed solutions have been
around for years, I really don't think we're suddenly in any rush to
get it out of the door and into user hands.

> But way more important: what Amir got right is that your approach and
> fsid mappings don't stand in each others way at all. Shiftfed
> bind-mounts can be implemented completely independent of fsid
> mappings after the fact on top of it.
> 
> Your example, does this:
> 
> nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not O_PATH */
> configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
> 
> as the ultimate step. Essentially marking a mountpoint as shifted
> relative to that user namespace. Once fsid mappings are in all that
> you need to do is replace your
> make_kuid()/from_kuid()/from_kuid_munged() calls and so on in your
> patchset with make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and
> you're done. So I honestly don't currently see any need to block the
> patchsets on each other. 

Can I repeat: there's no rush to get upstream on this.  Let's pause to
get the kernel implementation (the thing we have to maintain) right.  I
realise we could each work around the other and get our implementations
bent around each other so they all work independently thus making our
disjoint user cabals happy but I don't think that would lead to the
best outcome for kernel maintainability.

I already think that history shows us that s_user_ns went upstream too
fast and the fact that unprivileged fuse has yet to make it (and the
reception the original shiftfs got) indicates there may be an
abstraction problem with s_user_ns (I have to confess here that while I
think I understand the unprivileged fuse issues for both the daemon and
the consumer, I don't have a clear insight into the f2fs use case ...
it's something androidy that no-one has fully explained).  So I think
the implementors of all three features need to come up with a generic
VFS shifting abstraction that covers all the use cases.  We also need
to decide which user_ns we care about: the one above us in current, the
one captured when the mnt_ns got unshared, the one captured in
s_user_ns when the mount was done or the new one I introduced which
captures the user_ns at struct mount creation/clone time ... I really
don't think we need them all so we should work out what do they all
mean and which should we care about and start ruthlessly eliminating
any one we believe we don't care about.

We're really like people building a castle by each working on our own
turret and my fear is the weight of all the turrets, even if it doesn't
cause the whole castle simply to collapse, is weakening our security
posture.

James


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] introduce a uid/gid shifting bind mount
  2020-02-18 19:05       ` James Bottomley
@ 2020-02-18 20:03         ` Christian Brauner
  2020-02-18 23:43           ` James Bottomley
  0 siblings, 1 reply; 16+ messages in thread
From: Christian Brauner @ 2020-02-18 20:03 UTC (permalink / raw)
  To: James Bottomley
  Cc: Amir Goldstein, Miklos Szeredi, Linux Containers, overlayfs,
	David Howells, Seth Forshee, Al Viro, linux-fsdevel,
	Eric Biederman, stgraber

On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote:
> On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote:
> > On Tue, Feb 18, 2020 at 08:11:00AM -0800, James Bottomley wrote:
> > > On Tue, 2020-02-18 at 09:18 +0200, Amir Goldstein wrote:
> > > > On Mon, Feb 17, 2020 at 10:56 PM James Bottomley
> > > > <James.Bottomley@hansenpartnership.com> wrote:
> > > > > 
> > > > > The object of this series is to replace shiftfs with a proper
> > > > > uid/gid shifting bind mount instead of the shiftfs hack of
> > > > > introducing something that looks similar to an overlay
> > > > > filesystem to do it.
> > > > > 
> > > > > The VFS still has the problem that in order to tell what
> > > > > vfsmount a dentry belongs to, struct path would have to be
> > > > > threaded everywhere struct dentry currently is.  However, this
> > > > > patch is structured only to require a rethreading of
> > > > > notify_change.  The rest of the knowledge that a shift is in
> > > > > operation is carried in the task structure by caching the
> > > > > unshifted credentials.
> > > > > 
> > > > > Note that although it is currently dependent on the new
> > > > > configfd interface for bind mounts, only patch 3/3 relies on
> > > > > this, and the whole thing could be redone as a syscall or any
> > > > > other mechanism (depending on how people eventually want to fix
> > > > > the problem with the new fsconfig mechanism being unable to
> > > > > reconfigure bind mounts).
> > > > > 
> > > > > The changes from v2 are I've added Amir's reviewed-by for the
> > > > > notify_change rethreading and I've implemented Serge's request
> > > > > for a base offset shift for the image.  It turned out to be
> > > > > much harder to implement a simple linear shift than simply to
> > > > > do it through a different userns, so that's how I've done
> > > > > it.  The userns you need to set up for the offset shifted image
> > > > > is one where the interior uid would see the shifted image as
> > > > > fake root.  I've introduced an additional "ns" config
> > > > > parameter, which must be specified when building the allow
> > > > > shift mount point (so it's done by the admin, not by the
> > > > > unprivileged user).  I've also taken care that the image
> > > > > shifted to zero (real root) is never visible in the
> > > > > filesystem.  Patch 3/3 explains how to use the additional "ns"
> > > > > parameter.
> > > > > 
> > > > > 
> > > > 
> > > > James,
> > > > 
> > > > To us common people who do not breath containers, your proposal
> > > > seems like a competing implementation to Christian's proposal
> > > > [1].
> > > 
> > > I think we have three things that swirl around this space and
> > > aren't quite direct copies of each other's use cases but aren't
> > > entirely disjoint either: the superblock user namespace, this and
> > > the user namespace fsid mapping.
> > > 
> > > >  If it were a competing implementation, I think Christian's
> > > > proposal would have won by points for being less intrusive to
> > > > VFS.
> > > 
> > > Heh, that one's subjective.  I think the current fsid code is
> > > missing
> > 
> > I honestly don't think it is subjective. Leaving aside that the patch
> > here is more invasive to the vfs just in how it needs to alter it,
> > you currently require a whole new set of syscalls for this feature. 
> 
> Well I really don't ... I'd like to replace about four existing
> syscalls (fspick/fsopen/fsconfig/fsmount) with two more general ones
> (configfd_open/configfd_action), but I kept the original four to
> demonstrate the two new ones could subsume their work.  However, I
> really think we should leave aside the activation mechanism discussion
> and concentrate on the internal implementation.
> 
> > The syscalls themselves even introduce a whole new API and
> > complicated semantics in the kernel and it's completely unclear
> > whether they will make it or not.
> 
> As I said in the cover letter.  I'm entirely mechanism agnostic. 
> Whatever solves our current rebind reconfigure problem will be usable
> for shifting bind mounts.  I like the idea of using the same fsconfig
> mechanism, but this patch isn't wedded to it, that's why 2/3 is
> implementation and 3/3 is activation.  The activation patch can be
> completely replaced without affecting the implementation mechanism.
> 
> >  Even if not, this feature will be tied to the new mount api making
> > it way harder and a longer process to adopt for userspace given that
> > not even all bits and pieces userspace needs are currently in
> > any released kernel.
> 
> Well, given that this problem and various proposed solutions have been
> around for years, I really don't think we're suddenly in any rush to
> get it out of the door and into user hands.
> 
> > But way more important: what Amir got right is that your approach and
> > fsid mappings don't stand in each others way at all. Shiftfed
> > bind-mounts can be implemented completely independent of fsid
> > mappings after the fact on top of it.
> > 
> > Your example, does this:
> > 
> > nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not O_PATH */
> > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
> > 
> > as the ultimate step. Essentially marking a mountpoint as shifted
> > relative to that user namespace. Once fsid mappings are in all that
> > you need to do is replace your
> > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in your
> > patchset with make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and
> > you're done. So I honestly don't currently see any need to block the
> > patchsets on each other. 
> 
> Can I repeat: there's no rush to get upstream on this.  Let's pause to
> get the kernel implementation (the thing we have to maintain) right.  I
> realise we could each work around the other and get our implementations
> bent around each other so they all work independently thus making our
> disjoint user cabals happy but I don't think that would lead to the
> best outcome for kernel maintainability.

We have had the discussion with all major stakeholders in a single room
on what we need at LPC 2019. We agreed on what we need and fsids are a
concrete proposal for an implementation that appears to solve all discussed
major use-cases in a simple and elegant manner, which can also be cleanly
extended to cover your approach later.
Imho, there is no need to have the same discussion again at an
invite-only event focussed on kernel developers where most of the major
stakeholders are unlikely to be able to participate. The patch proposals
are here on all relevant list where everyone can participate and we can discuss
them right here.
I have not yet heard a concrete technical reason why the patch proposal is
inadequate and I see no reason to stall this.

> 
> I already think that history shows us that s_user_ns went upstream too
> fast and the fact that unprivileged fuse has yet to make it (and the

We've established on the other patchset that fsid mappings in no way interfere
nor care about s_user_ns so I'm not going to go into this again here. But for
the record, unprivileged fuse mounts are supported since:

commit 4ad769f3c346ec3d458e255548dec26ca5284cf6
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Tue May 29 09:04:46 2018 -0500

    fuse: Allow fully unprivileged mounts

    Now that the fuse and the vfs work is complete.  Allow the fuse filesystem
    to be mounted by the root user in a user namespace.

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

Christian

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 2/3] fs: introduce uid/gid shifting bind mount
  2020-02-17 20:53 ` [PATCH v3 2/3] fs: introduce uid/gid shifting bind mount James Bottomley
  2020-02-18  7:38   ` Amir Goldstein
@ 2020-02-18 22:33   ` Christoph Hellwig
  2020-02-19  1:19     ` James Bottomley
  1 sibling, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2020-02-18 22:33 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, David Howells, Christian Brauner, Al Viro,
	Miklos Szeredi, Seth Forshee, linux-unionfs, Amir Goldstein,
	Stéphane Graber, Eric Biederman, Aleksa Sarai,
	Serge E . Hallyn, Tycho Andersen, containers

On Mon, Feb 17, 2020 at 12:53:06PM -0800, James Bottomley wrote:
> 
> v3: added a bind mount base shift at the request of Serge Hallyn
> ---
>  fs/attr.c             | 127 +++++++++++++++++++++++++++++++++++++++++---------
>  fs/exec.c             |   3 +-
>  fs/inode.c            |  10 ++--
>  fs/internal.h         |   2 +
>  fs/mount.h            |   1 +
>  fs/namei.c            | 112 +++++++++++++++++++++++++++++++++++++-------
>  fs/namespace.c        |   5 ++
>  fs/open.c             |  25 +++++++++-
>  fs/posix_acl.c        |   4 +-
>  fs/stat.c             |  32 +++++++++++--
>  include/linux/cred.h  |  12 +++++
>  include/linux/mount.h |   4 +-
>  include/linux/sched.h |   5 ++
>  kernel/capability.c   |   9 +++-
>  kernel/cred.c         |  20 ++++++++
>  15 files changed, 317 insertions(+), 54 deletions(-)
> 
> diff --git a/fs/attr.c b/fs/attr.c
> index 11201ab7e3b1..d7c5883a4b4c 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -18,14 +18,26 @@
>  #include <linux/evm.h>
>  #include <linux/ima.h>
>  
> +#include "internal.h"
> +#include "mount.h"
> +
>  static bool chown_ok(const struct inode *inode, kuid_t uid)
>  {
> +	kuid_t i_uid = inode->i_uid;
> +
> +	if (cred_is_shifted()) {
> +		struct mount *m = real_mount(current->mnt);
> +
> +		i_uid = KUIDT_INIT(from_kuid(m->mnt_userns, i_uid));
> +		i_uid = make_kuid(current_user_ns(), __kuid_val(i_uid));
> +	}
> +
>  	if (uid_eq(current_fsuid(), inode->i_uid) &&
> -	    uid_eq(uid, inode->i_uid))
> +	    uid_eq(uid, i_uid))
>  		return true;
>  	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>  		return true;
> -	if (uid_eq(inode->i_uid, INVALID_UID) &&
> +	if (uid_eq(i_uid, INVALID_UID) &&
>  	    ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
>  		return true;
>  	return false;
> @@ -33,17 +45,40 @@ static bool chown_ok(const struct inode *inode, kuid_t uid)
>  
>  static bool chgrp_ok(const struct inode *inode, kgid_t gid)
>  {
> +	kgid_t i_gid = inode->i_gid;
> +	kuid_t i_uid = inode->i_uid;
> +
> +	if (cred_is_shifted()) {
> +		struct mount *m = real_mount(current->mnt);
> +		struct user_namespace *ns = current_user_ns();
> +
> +		i_uid = KUIDT_INIT(from_kuid(m->mnt_userns, i_uid));
> +		i_uid = make_kuid(ns, __kuid_val(i_uid));
> +		i_gid = KGIDT_INIT(from_kgid(m->mnt_userns, i_gid));
> +		i_gid = make_kgid(ns, __kgid_val(i_gid));
> +	}
>  	if (uid_eq(current_fsuid(), inode->i_uid) &&
> -	    (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
> +	    (in_group_p(gid) || gid_eq(gid, i_gid)))
>  		return true;
>  	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>  		return true;
> -	if (gid_eq(inode->i_gid, INVALID_GID) &&
> +	if (gid_eq(i_gid, INVALID_GID) &&
>  	    ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
>  		return true;
>  	return false;
>  }
>  
> +int in_group_p_shifted(kgid_t grp)
> +{
> +	if (cred_is_shifted()) {
> +		struct mount *m = real_mount(current->mnt);
> +
> +		grp = KGIDT_INIT(from_kgid(m->mnt_userns, grp));
> +		grp = make_kgid(current_user_ns(), __kgid_val(grp));
> +	}
> +	return in_group_p(grp);
> +}
> +
>  /**
>   * setattr_prepare - check if attribute changes to a dentry are allowed
>   * @dentry:	dentry to check
> @@ -89,9 +124,10 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
>  	if (ia_valid & ATTR_MODE) {
>  		if (!inode_owner_or_capable(inode))
>  			return -EPERM;
> +
>  		/* Also check the setgid bit! */
> -		if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
> -				inode->i_gid) &&
> +		if (!in_group_p_shifted((ia_valid & ATTR_GID) ? attr->ia_gid :
> +					inode->i_gid) &&
>  		    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
>  			attr->ia_mode &= ~S_ISGID;
>  	}
> @@ -192,7 +228,7 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
>  	if (ia_valid & ATTR_MODE) {
>  		umode_t mode = attr->ia_mode;
>  
> -		if (!in_group_p(inode->i_gid) &&
> +		if (!in_group_p_shifted(inode->i_gid) &&
>  		    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
>  			mode &= ~S_ISGID;
>  		inode->i_mode = mode;
> @@ -200,6 +236,23 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
>  }
>  EXPORT_SYMBOL(setattr_copy);
>  
> +void cred_shift(kuid_t *uid, kgid_t *gid)
> +{
> +	if (cred_is_shifted()) {
> +		struct user_namespace *ns = current_user_ns();
> +		struct mount *m = real_mount(current->mnt);
> +
> +		if (uid) {
> +			*uid = KUIDT_INIT(from_kuid(m->mnt_userns, *uid));
> +			*uid = make_kuid(ns, __kuid_val(*uid));
> +		}
> +		if (gid) {
> +			*gid = KGIDT_INIT(from_kgid(m->mnt_userns, *gid));
> +			*gid = make_kgid(ns, __kgid_val(*gid));
> +		}
> +	}
> +}
> +
>  /**
>   * notify_change - modify attributes of a filesytem object
>   * @dentry:	object affected
> @@ -229,6 +282,9 @@ int notify_change(const struct path *path, struct iattr * attr,
>  	int error;
>  	struct timespec64 now;
>  	unsigned int ia_valid = attr->ia_valid;
> +	const struct cred *cred;
> +	kuid_t i_uid = inode->i_uid;
> +	kgid_t i_gid = inode->i_gid;
>  
>  	WARN_ON_ONCE(!inode_is_locked(inode));
>  
> @@ -237,18 +293,30 @@ int notify_change(const struct path *path, struct iattr * attr,
>  			return -EPERM;
>  	}
>  
> +	cred = change_userns_creds(path);
> +	if (cred) {
> +		struct mount *m = real_mount(path->mnt);
> +
> +		attr->ia_uid = KUIDT_INIT(from_kuid(m->mnt_ns->user_ns, attr->ia_uid));
> +		attr->ia_uid = make_kuid(m->mnt_userns, __kuid_val(attr->ia_uid));
> +		attr->ia_gid = KGIDT_INIT(from_kgid(m->mnt_ns->user_ns, attr->ia_gid));
> +		attr->ia_gid = make_kgid(m->mnt_userns, __kgid_val(attr->ia_gid));
> +	}
> +
>  	/*
>  	 * If utimes(2) and friends are called with times == NULL (or both
>  	 * times are UTIME_NOW), then we need to check for write permission
>  	 */
>  	if (ia_valid & ATTR_TOUCH) {
> -		if (IS_IMMUTABLE(inode))
> -			return -EPERM;
> +		if (IS_IMMUTABLE(inode)) {
> +			error = -EPERM;
> +			goto err;
> +		}
>  
>  		if (!inode_owner_or_capable(inode)) {
>  			error = inode_permission(inode, MAY_WRITE);
>  			if (error)
> -				return error;
> +				goto err;
>  		}
>  	}
>  
> @@ -274,7 +342,7 @@ int notify_change(const struct path *path, struct iattr * attr,
>  	if (ia_valid & ATTR_KILL_PRIV) {
>  		error = security_inode_need_killpriv(dentry);
>  		if (error < 0)
> -			return error;
> +			goto err;
>  		if (error == 0)
>  			ia_valid = attr->ia_valid &= ~ATTR_KILL_PRIV;
>  	}
> @@ -305,34 +373,49 @@ int notify_change(const struct path *path, struct iattr * attr,
>  			attr->ia_mode &= ~S_ISGID;
>  		}
>  	}
> -	if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID)))
> -		return 0;
> +	if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID))) {
> +		error = 0;
> +		goto err;
> +	}
>  
>  	/*
>  	 * Verify that uid/gid changes are valid in the target
>  	 * namespace of the superblock.
>  	 */
> +	error = -EOVERFLOW;
>  	if (ia_valid & ATTR_UID &&
>  	    !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
> -		return -EOVERFLOW;
> +		goto err;
> +
>  	if (ia_valid & ATTR_GID &&
>  	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
> -		return -EOVERFLOW;
> +		goto err;
>  
>  	/* Don't allow modifications of files with invalid uids or
>  	 * gids unless those uids & gids are being made valid.
>  	 */
> -	if (!(ia_valid & ATTR_UID) && !uid_valid(inode->i_uid))
> -		return -EOVERFLOW;
> -	if (!(ia_valid & ATTR_GID) && !gid_valid(inode->i_gid))
> -		return -EOVERFLOW;
> +	if (cred_is_shifted()) {
> +		struct user_namespace *ns = current_user_ns();
> +		struct mount *m = real_mount(current->mnt);
> +
> +		i_uid = KUIDT_INIT(from_kuid(m->mnt_userns, i_uid));
> +		i_uid = make_kuid(ns, __kuid_val(i_uid));
> +		i_gid = KGIDT_INIT(from_kgid(m->mnt_userns, i_gid));
> +		i_gid = make_kgid(ns, __kgid_val(i_gid));
> +	}
> +
> +	if (!(ia_valid & ATTR_UID) && !uid_valid(i_uid))
> +		goto err;
> +
> +	if (!(ia_valid & ATTR_GID) && !gid_valid(i_gid))
> +		goto err;
>  
>  	error = security_inode_setattr(dentry, attr);
>  	if (error)
> -		return error;
> +		goto err;
>  	error = try_break_deleg(inode, delegated_inode);
>  	if (error)
> -		return error;
> +		goto err;
>  
>  	if (inode->i_op->setattr)
>  		error = inode->i_op->setattr(dentry, attr);
> @@ -345,6 +428,8 @@ int notify_change(const struct path *path, struct iattr * attr,
>  		evm_inode_post_setattr(dentry, ia_valid);
>  	}
>  
> + err:
> +	revert_userns_creds(cred);
>  	return error;
>  }
>  EXPORT_SYMBOL(notify_change);
> diff --git a/fs/exec.c b/fs/exec.c
> index db17be51b112..926bab39ed45 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1543,13 +1543,14 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
>  
>  	/* Be careful if suid/sgid is set */
>  	inode_lock(inode);
> -
>  	/* reload atomically mode/uid/gid now that lock held */
>  	mode = inode->i_mode;
>  	uid = inode->i_uid;
>  	gid = inode->i_gid;
>  	inode_unlock(inode);
>  
> +	cred_shift(&uid, &gid);
> +
>  	/* We ignore suid/sgid if there are no mappings for them in the ns */
>  	if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
>  		 !kgid_has_mapping(bprm->cred->user_ns, gid))
> diff --git a/fs/inode.c b/fs/inode.c
> index be14d3fcbee1..ae75b6396786 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2064,7 +2064,7 @@ void inode_init_owner(struct inode *inode, const struct inode *dir,
>  		if (S_ISDIR(mode))
>  			mode |= S_ISGID;
>  		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
> -			 !in_group_p(inode->i_gid) &&
> +			 !in_group_p_shifted(inode->i_gid) &&
>  			 !capable_wrt_inode_uidgid(dir, CAP_FSETID))
>  			mode &= ~S_ISGID;
>  	} else
> @@ -2083,12 +2083,16 @@ EXPORT_SYMBOL(inode_init_owner);
>  bool inode_owner_or_capable(const struct inode *inode)
>  {
>  	struct user_namespace *ns;
> +	kuid_t uid = inode->i_uid;
>  
> -	if (uid_eq(current_fsuid(), inode->i_uid))
> +	if (uid_eq(current_fsuid(), uid))
>  		return true;
>  
>  	ns = current_user_ns();
> -	if (kuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER))
> +
> +	cred_shift(&uid, NULL);
> +
> +	if (kuid_has_mapping(ns, uid) && ns_capable(ns, CAP_FOWNER))
>  		return true;
>  	return false;
>  }
> diff --git a/fs/internal.h b/fs/internal.h
> index 80d89ddb9b28..d2adcdb3eb2e 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -73,6 +73,8 @@ long do_symlinkat(const char __user *oldname, int newdfd,
>  		  const char __user *newname);
>  int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>  	      const char __user *newname, int flags);
> +const struct cred *change_userns_creds(const struct path *p);
> +void revert_userns_creds(const struct cred *cred);
>  
>  /*
>   * namespace.c
> diff --git a/fs/mount.h b/fs/mount.h
> index 711a4093e475..c3bfc6ced4c7 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -72,6 +72,7 @@ struct mount {
>  	int mnt_expiry_mark;		/* true if marked for expiry */
>  	struct hlist_head mnt_pins;
>  	struct hlist_head mnt_stuck_children;
> +	struct user_namespace *mnt_userns; /* mapping for underlying mount uid/gid */
>  } __randomize_layout;
>  
>  #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
> diff --git a/fs/namei.c b/fs/namei.c
> index 531ac55c7e67..369bd18c7330 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -124,6 +124,42 @@
>  
>  #define EMBEDDED_NAME_MAX	(PATH_MAX - offsetof(struct filename, iname))
>  
> +const struct cred *change_userns_creds(const struct path *p)
> +{
> +	struct mount *m = real_mount(p->mnt);
> +
> +	if ((p->mnt->mnt_flags & MNT_SHIFT) == 0)
> +		return NULL;
> +
> +	if (current->nsproxy->mnt_ns->user_ns != m->mnt_ns->user_ns)
> +		return NULL;
> +
> +	if (current->mnt != p->mnt) {
> +		struct cred *cred;
> +		struct user_namespace *user_ns = m->mnt_ns->user_ns;
> +		kuid_t fsuid = current->cred->fsuid;
> +		kgid_t fsgid = current->cred->fsgid;
> +
> +		if (current->mnt_cred)
> +			put_cred(current->mnt_cred);
> +		cred = prepare_creds();
> +		fsuid = KUIDT_INIT(from_kuid(user_ns, fsuid));
> +		fsgid = KGIDT_INIT(from_kgid(user_ns, fsgid));
> +		cred->fsuid = make_kuid(m->mnt_userns, __kuid_val(fsuid));
> +		cred->fsgid = make_kgid(m->mnt_userns, __kgid_val(fsgid));
> +		current->mnt = p->mnt; /* no reference needed */
> +		current->mnt_cred = cred;
> +	}
> +	return override_creds(current->mnt_cred);
> +}
> +
> +void revert_userns_creds(const struct cred *cred)
> +{
> +	if (!cred)
> +		return;
> +	revert_creds(cred);
> +}
> +
>  struct filename *
>  getname_flags(const char __user *filename, int flags, int *empty)
>  {
> @@ -303,7 +339,7 @@ static int acl_permission_check(struct inode *inode, int mask)
>  				return error;
>  		}
>  
> -		if (in_group_p(inode->i_gid))
> +		if (in_group_p_shifted(inode->i_gid))
>  			mode >>= 3;
>  	}
>  
> @@ -366,7 +402,6 @@ int generic_permission(struct inode *inode, int mask)
>  	if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO))
>  		if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
>  			return 0;
> -
>  	return -EACCES;
>  }
>  EXPORT_SYMBOL(generic_permission);
> @@ -1897,6 +1932,7 @@ static int walk_component(struct nameidata *nd, int flags)
>  	struct inode *inode;
>  	unsigned seq;
>  	int err;
> +	const struct cred *cred;
>  	/*
>  	 * "." and ".." are special - ".." especially so because it has
>  	 * to be able to know about the current root directory and
> @@ -1908,25 +1944,31 @@ static int walk_component(struct nameidata *nd, int flags)
>  			put_link(nd);
>  		return err;
>  	}
> +	cred = change_userns_creds(&nd->path);
>  	err = lookup_fast(nd, &path, &inode, &seq);
>  	if (unlikely(err <= 0)) {
>  		if (err < 0)
> -			return err;
> +			goto out;
>  		path.dentry = lookup_slow(&nd->last, nd->path.dentry,
>  					  nd->flags);
> -		if (IS_ERR(path.dentry))
> -			return PTR_ERR(path.dentry);
> +		if (IS_ERR(path.dentry)) {
> +			err = PTR_ERR(path.dentry);
> +			goto out;
> +		}
>  
>  		path.mnt = nd->path.mnt;
>  		err = follow_managed(&path, nd);
>  		if (unlikely(err < 0))
> -			return err;
> +			goto out;
>  
>  		seq = 0;	/* we are already out of RCU mode */
>  		inode = d_backing_inode(path.dentry);
>  	}
>  
> -	return step_into(nd, &path, flags, inode, seq);
> +	err = step_into(nd, &path, flags, inode, seq);
> + out:
> +	revert_userns_creds(cred);
> +	return err;
>  }
>  
>  /*
> @@ -2180,8 +2222,10 @@ static int link_path_walk(const char *name, struct nameidata *nd)
>  	for(;;) {
>  		u64 hash_len;
>  		int type;
> +		const struct cred *cred = change_userns_creds(&nd->path);
>  
>  		err = may_lookup(nd);
> +		revert_userns_creds(cred);
>  		if (err)
>  			return err;
>  
> @@ -2373,12 +2417,17 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>  static const char *trailing_symlink(struct nameidata *nd)
>  {
>  	const char *s;
> +	const struct cred *cred = change_userns_creds(&nd->path);
>  	int error = may_follow_link(nd);
> -	if (unlikely(error))
> -		return ERR_PTR(error);
> +	if (unlikely(error)) {
> +		s = ERR_PTR(error);
> +		goto out;
> +	}
>  	nd->flags |= LOOKUP_PARENT;
>  	nd->stack[0].name = NULL;
>  	s = get_link(nd);
> + out:
> +	revert_userns_creds(cred);
>  	return s ? s : "";
>  }
>  
> @@ -3343,6 +3392,7 @@ static int do_last(struct nameidata *nd,
>  	struct inode *inode;
>  	struct path path;
>  	int error;
> +	const struct cred *cred = change_userns_creds(&nd->path);
>  
>  	nd->flags &= ~LOOKUP_PARENT;
>  	nd->flags |= op->intent;
> @@ -3350,7 +3400,7 @@ static int do_last(struct nameidata *nd,
>  	if (nd->last_type != LAST_NORM) {
>  		error = handle_dots(nd, nd->last_type);
>  		if (unlikely(error))
> -			return error;
> +			goto err;
>  		goto finish_open;
>  	}
>  
> @@ -3363,7 +3413,7 @@ static int do_last(struct nameidata *nd,
>  			goto finish_lookup;
>  
>  		if (error < 0)
> -			return error;
> +			goto err;
>  
>  		BUG_ON(nd->inode != dir->d_inode);
>  		BUG_ON(nd->flags & LOOKUP_RCU);
> @@ -3376,12 +3426,14 @@ static int do_last(struct nameidata *nd,
>  		 */
>  		error = complete_walk(nd);
>  		if (error)
> -			return error;
> +			goto err;
>  
>  		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
>  		/* trailing slashes? */
> -		if (unlikely(nd->last.name[nd->last.len]))
> -			return -EISDIR;
> +		if (unlikely(nd->last.name[nd->last.len])) {
> +			error = -EISDIR;
> +			goto err;
> +		}
>  	}
>  
>  	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
> @@ -3437,7 +3489,7 @@ static int do_last(struct nameidata *nd,
>  
>  	error = follow_managed(&path, nd);
>  	if (unlikely(error < 0))
> -		return error;
> +		goto err;
>  
>  	/*
>  	 * create/update audit record if it already exists.
> @@ -3446,7 +3498,8 @@ static int do_last(struct nameidata *nd,
>  
>  	if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
>  		path_to_nameidata(&path, nd);
> -		return -EEXIST;
> +		error = -EEXIST;
> +		goto err;
>  	}
>  
>  	seq = 0;	/* out of RCU mode, so the value doesn't matter */
> @@ -3454,12 +3507,12 @@ static int do_last(struct nameidata *nd,
>  finish_lookup:
>  	error = step_into(nd, &path, 0, inode, seq);
>  	if (unlikely(error))
> -		return error;
> +		goto err;
>  finish_open:
>  	/* Why this, you ask?  _Now_ we might have grown LOOKUP_JUMPED... */
>  	error = complete_walk(nd);
>  	if (error)
> -		return error;
> +		goto err;
>  	audit_inode(nd->name, nd->path.dentry, 0);
>  	if (open_flag & O_CREAT) {
>  		error = -EISDIR;
> @@ -3501,6 +3554,8 @@ static int do_last(struct nameidata *nd,
>  	}
>  	if (got_write)
>  		mnt_drop_write(nd->path.mnt);
> + err:
> +	revert_userns_creds(cred);
>  	return error;
>  }
>  
> @@ -3819,6 +3874,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
>  	struct path path;
>  	int error;
>  	unsigned int lookup_flags = 0;
> +	const struct cred *cred;
>  
>  	error = may_mknod(mode);
>  	if (error)
> @@ -3828,6 +3884,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
>  	if (IS_ERR(dentry))
>  		return PTR_ERR(dentry);
>  
> +	cred = change_userns_creds(&path);
>  	if (!IS_POSIXACL(path.dentry->d_inode))
>  		mode &= ~current_umask();
>  	error = security_path_mknod(&path, dentry, mode, dev);
> @@ -3849,6 +3906,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
>  	}
>  out:
>  	done_path_create(&path, dentry);
> +	revert_userns_creds(cred);
>  	if (retry_estale(error, lookup_flags)) {
>  		lookup_flags |= LOOKUP_REVAL;
>  		goto retry;
> @@ -3899,18 +3957,21 @@ long do_mkdirat(int dfd, const char __user *pathname, umode_t mode)
>  	struct path path;
>  	int error;
>  	unsigned int lookup_flags = LOOKUP_DIRECTORY;
> +	const struct cred *cred;
>  
>  retry:
>  	dentry = user_path_create(dfd, pathname, &path, lookup_flags);
>  	if (IS_ERR(dentry))
>  		return PTR_ERR(dentry);
>  
> +	cred = change_userns_creds(&path);
>  	if (!IS_POSIXACL(path.dentry->d_inode))
>  		mode &= ~current_umask();
>  	error = security_path_mkdir(&path, dentry, mode);
>  	if (!error)
>  		error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
>  	done_path_create(&path, dentry);
> +	revert_userns_creds(cred);
>  	if (retry_estale(error, lookup_flags)) {
>  		lookup_flags |= LOOKUP_REVAL;
>  		goto retry;
> @@ -3977,12 +4038,14 @@ long do_rmdir(int dfd, const char __user *pathname)
>  	struct qstr last;
>  	int type;
>  	unsigned int lookup_flags = 0;
> +	const struct cred *cred;
>  retry:
>  	name = filename_parentat(dfd, getname(pathname), lookup_flags,
>  				&path, &last, &type);
>  	if (IS_ERR(name))
>  		return PTR_ERR(name);
>  
> +	cred = change_userns_creds(&path);
>  	switch (type) {
>  	case LAST_DOTDOT:
>  		error = -ENOTEMPTY;
> @@ -4018,6 +4081,7 @@ long do_rmdir(int dfd, const char __user *pathname)
>  	inode_unlock(path.dentry->d_inode);
>  	mnt_drop_write(path.mnt);
>  exit1:
> +	revert_userns_creds(cred);
>  	path_put(&path);
>  	putname(name);
>  	if (retry_estale(error, lookup_flags)) {
> @@ -4107,11 +4171,13 @@ long do_unlinkat(int dfd, struct filename *name)
>  	struct inode *inode = NULL;
>  	struct inode *delegated_inode = NULL;
>  	unsigned int lookup_flags = 0;
> +	const struct cred *cred;
>  retry:
>  	name = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
>  	if (IS_ERR(name))
>  		return PTR_ERR(name);
>  
> +	cred = change_userns_creds(&path);
>  	error = -EISDIR;
>  	if (type != LAST_NORM)
>  		goto exit1;
> @@ -4149,6 +4215,7 @@ long do_unlinkat(int dfd, struct filename *name)
>  	}
>  	mnt_drop_write(path.mnt);
>  exit1:
> +	revert_userns_creds(cred);
>  	path_put(&path);
>  	if (retry_estale(error, lookup_flags)) {
>  		lookup_flags |= LOOKUP_REVAL;
> @@ -4213,6 +4280,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
>  	struct dentry *dentry;
>  	struct path path;
>  	unsigned int lookup_flags = 0;
> +	const struct cred *cred;
>  
>  	from = getname(oldname);
>  	if (IS_ERR(from))
> @@ -4223,6 +4291,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
>  	if (IS_ERR(dentry))
>  		goto out_putname;
>  
> +	cred = change_userns_creds(&path);
>  	error = security_path_symlink(&path, dentry, from->name);
>  	if (!error)
>  		error = vfs_symlink(path.dentry->d_inode, dentry, from->name);
> @@ -4231,6 +4300,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
>  		lookup_flags |= LOOKUP_REVAL;
>  		goto retry;
>  	}
> +	revert_userns_creds(cred);
>  out_putname:
>  	putname(from);
>  	return error;
> @@ -4344,6 +4414,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>  	struct inode *delegated_inode = NULL;
>  	int how = 0;
>  	int error;
> +	const struct cred *cred;
>  
>  	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0)
>  		return -EINVAL;
> @@ -4371,6 +4442,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>  	if (IS_ERR(new_dentry))
>  		goto out;
>  
> +	cred = change_userns_creds(&new_path);
>  	error = -EXDEV;
>  	if (old_path.mnt != new_path.mnt)
>  		goto out_dput;
> @@ -4382,6 +4454,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>  		goto out_dput;
>  	error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
>  out_dput:
> +	revert_userns_creds(cred);
>  	done_path_create(&new_path, new_dentry);
>  	if (delegated_inode) {
>  		error = break_deleg_wait(&delegated_inode);
> @@ -4601,6 +4674,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
>  	unsigned int lookup_flags = 0, target_flags = LOOKUP_RENAME_TARGET;
>  	bool should_retry = false;
>  	int error;
> +	const struct cred *cred;
>  
>  	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
>  		return -EINVAL;
> @@ -4630,6 +4704,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
>  		goto exit1;
>  	}
>  
> +	cred = change_userns_creds(&new_path);
>  	error = -EXDEV;
>  	if (old_path.mnt != new_path.mnt)
>  		goto exit2;
> @@ -4714,6 +4789,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
>  	}
>  	mnt_drop_write(old_path.mnt);
>  exit2:
> +	revert_userns_creds(cred);
>  	if (retry_estale(error, lookup_flags))
>  		should_retry = true;
>  	path_put(&new_path);
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 69fb23ae3d8f..4720647588ab 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -200,6 +200,8 @@ static struct mount *alloc_vfsmnt(const char *name)
>  		mnt->mnt_writers = 0;
>  #endif
>  
> +		mnt->mnt_userns = get_user_ns(&init_user_ns);
> +
>  		INIT_HLIST_NODE(&mnt->mnt_hash);
>  		INIT_LIST_HEAD(&mnt->mnt_child);
>  		INIT_LIST_HEAD(&mnt->mnt_mounts);
> @@ -1044,6 +1046,8 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
>  	mnt->mnt.mnt_root = dget(root);
>  	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
>  	mnt->mnt_parent = mnt;
> +	put_user_ns(mnt->mnt_userns);
> +	mnt->mnt_userns = get_user_ns(old->mnt_userns);
>  	lock_mount_hash();
>  	list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
>  	unlock_mount_hash();
> @@ -1102,6 +1106,7 @@ static void cleanup_mnt(struct mount *mnt)
>  	dput(mnt->mnt.mnt_root);
>  	deactivate_super(mnt->mnt.mnt_sb);
>  	mnt_free_id(mnt);
> +	put_user_ns(mnt->mnt_userns);
>  	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
>  }
>  
> diff --git a/fs/open.c b/fs/open.c
> index db6758b9636a..d27b90dce64d 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -456,11 +456,13 @@ int ksys_chdir(const char __user *filename)
>  	struct path path;
>  	int error;
>  	unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
> +	const struct cred *cred;
>  retry:
>  	error = user_path_at(AT_FDCWD, filename, lookup_flags, &path);
>  	if (error)
>  		goto out;
>  
> +	cred = change_userns_creds(&path);
>  	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_CHDIR);
>  	if (error)
>  		goto dput_and_out;
> @@ -468,6 +470,7 @@ int ksys_chdir(const char __user *filename)
>  	set_fs_pwd(current->fs, &path);
>  
>  dput_and_out:
> +	revert_userns_creds(cred);
>  	path_put(&path);
>  	if (retry_estale(error, lookup_flags)) {
>  		lookup_flags |= LOOKUP_REVAL;
> @@ -486,11 +489,13 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
>  {
>  	struct fd f = fdget_raw(fd);
>  	int error;
> +	const struct cred *cred;
>  
>  	error = -EBADF;
>  	if (!f.file)
>  		goto out;
>  
> +	cred = change_userns_creds(&f.file->f_path);
>  	error = -ENOTDIR;
>  	if (!d_can_lookup(f.file->f_path.dentry))
>  		goto out_putf;
> @@ -499,6 +504,7 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
>  	if (!error)
>  		set_fs_pwd(current->fs, &f.file->f_path);
>  out_putf:
> +	revert_userns_creds(cred);
>  	fdput(f);
>  out:
>  	return error;
> @@ -547,11 +553,13 @@ static int chmod_common(const struct path *path, umode_t mode)
>  	struct inode *inode = path->dentry->d_inode;
>  	struct inode *delegated_inode = NULL;
>  	struct iattr newattrs;
> +	const struct cred *cred;
>  	int error;
>  
> +	cred = change_userns_creds(path);
>  	error = mnt_want_write(path->mnt);
>  	if (error)
> -		return error;
> +		goto out;
>  retry_deleg:
>  	inode_lock(inode);
>  	error = security_path_chmod(path, mode);
> @@ -568,6 +576,8 @@ static int chmod_common(const struct path *path, umode_t mode)
>  			goto retry_deleg;
>  	}
>  	mnt_drop_write(path->mnt);
> + out:
> +	revert_userns_creds(cred);
>  	return error;
>  }
>  
> @@ -666,6 +676,7 @@ int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group,
>  	struct path path;
>  	int error = -EINVAL;
>  	int lookup_flags;
> +	const struct cred *cred;
>  
>  	if ((flag & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
>  		goto out;
> @@ -677,12 +688,14 @@ int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group,
>  	error = user_path_at(dfd, filename, lookup_flags, &path);
>  	if (error)
>  		goto out;
> +	cred = change_userns_creds(&path);
>  	error = mnt_want_write(path.mnt);
>  	if (error)
>  		goto out_release;
>  	error = chown_common(&path, user, group);
>  	mnt_drop_write(path.mnt);
>  out_release:
> +	revert_userns_creds(cred);
>  	path_put(&path);
>  	if (retry_estale(error, lookup_flags)) {
>  		lookup_flags |= LOOKUP_REVAL;
> @@ -713,10 +726,12 @@ int ksys_fchown(unsigned int fd, uid_t user, gid_t group)
>  {
>  	struct fd f = fdget(fd);
>  	int error = -EBADF;
> +	const struct cred *cred;
>  
>  	if (!f.file)
>  		goto out;
>  
> +	cred = change_userns_creds(&f.file->f_path);
>  	error = mnt_want_write_file(f.file);
>  	if (error)
>  		goto out_fput;
> @@ -724,6 +739,7 @@ int ksys_fchown(unsigned int fd, uid_t user, gid_t group)
>  	error = chown_common(&f.file->f_path, user, group);
>  	mnt_drop_write_file(f.file);
>  out_fput:
> +	revert_userns_creds(cred);
>  	fdput(f);
>  out:
>  	return error;
> @@ -911,8 +927,13 @@ EXPORT_SYMBOL(file_path);
>   */
>  int vfs_open(const struct path *path, struct file *file)
>  {
> +	int ret;
> +	const struct cred *cred = change_userns_creds(path);
> +
>  	file->f_path = *path;
> -	return do_dentry_open(file, d_backing_inode(path->dentry), NULL);
> +	ret = do_dentry_open(file, d_backing_inode(path->dentry), NULL);
> +	revert_userns_creds(cred);
> +	return ret;
>  }
>  
>  struct file *dentry_open(const struct path *path, int flags,
> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> index 249672bf54fe..ff777110f3da 100644
> --- a/fs/posix_acl.c
> +++ b/fs/posix_acl.c
> @@ -364,7 +364,7 @@ posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
>                                          goto mask;
>  				break;
>                          case ACL_GROUP_OBJ:
> -                                if (in_group_p(inode->i_gid)) {
> +				if (in_group_p_shifted(inode->i_gid)) {
>  					found = 1;
>  					if ((pa->e_perm & want) == want)
>  						goto mask;
> @@ -655,7 +655,7 @@ int posix_acl_update_mode(struct inode *inode, umode_t *mode_p,
>  		return error;
>  	if (error == 0)
>  		*acl = NULL;
> -	if (!in_group_p(inode->i_gid) &&
> +	if (!in_group_p_shifted(inode->i_gid) &&
>  	    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
>  		mode &= ~S_ISGID;
>  	*mode_p = mode;
> diff --git a/fs/stat.c b/fs/stat.c
> index 030008796479..634b8d13ed51 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -22,6 +22,7 @@
>  #include <asm/unistd.h>
>  
>  #include "internal.h"
> +#include "mount.h"
>  
>  /**
>   * generic_fillattr - Fill in the basic attributes from the inode struct
> @@ -50,6 +51,23 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
>  }
>  EXPORT_SYMBOL(generic_fillattr);
>  
> +static void shift_check(struct vfsmount *mnt, struct kstat *stat)
> +{
> +	struct mount *m = real_mount(mnt);
> +	struct user_namespace *user_ns = m->mnt_ns->user_ns;
> +
> +	if ((mnt->mnt_flags & MNT_SHIFT) == 0)
> +		return;
> +
> +	if (current->nsproxy->mnt_ns->user_ns != m->mnt_ns->user_ns)
> +		return;
> +
> +	stat->uid = KUIDT_INIT(from_kuid(m->mnt_userns, stat->uid));
> +	stat->uid = make_kuid(user_ns, __kuid_val(stat->uid));
> +	stat->gid = KGIDT_INIT(from_kgid(m->mnt_userns, stat->gid));
> +	stat->gid = make_kgid(user_ns, __kgid_val(stat->gid));
> +}
> +
>  /**
>   * vfs_getattr_nosec - getattr without security checks
>   * @path: file to get attributes from
> @@ -67,6 +85,7 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
>  		      u32 request_mask, unsigned int query_flags)
>  {
>  	struct inode *inode = d_backing_inode(path->dentry);
> +	int ret;
>  
>  	memset(stat, 0, sizeof(*stat));
>  	stat->result_mask |= STATX_BASIC_STATS;
> @@ -79,12 +98,17 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
>  	if (IS_AUTOMOUNT(inode))
>  		stat->attributes |= STATX_ATTR_AUTOMOUNT;
>  
> +	ret = 0;
>  	if (inode->i_op->getattr)
> -		return inode->i_op->getattr(path, stat, request_mask,
> -					    query_flags);
> +		ret = inode->i_op->getattr(path, stat, request_mask,
> +					   query_flags);
> +	else
> +		generic_fillattr(inode, stat);
>  
> -	generic_fillattr(inode, stat);
> -	return 0;
> +	if (!ret)
> +		shift_check(path->mnt, stat);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL(vfs_getattr_nosec);
>  
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index 18639c069263..d29638617844 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -59,6 +59,7 @@ extern struct group_info *groups_alloc(int);
>  extern void groups_free(struct group_info *);
>  
>  extern int in_group_p(kgid_t);
> +extern int in_group_p_shifted(kgid_t);

How do I know when to use in_group_p_shifted vs in_group_p?
What about the various other fs callers?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] introduce a uid/gid shifting bind mount
  2020-02-18 20:03         ` Christian Brauner
@ 2020-02-18 23:43           ` James Bottomley
  2020-02-19 13:26             ` Christian Brauner
  2020-02-19 16:01             ` Stéphane Graber
  0 siblings, 2 replies; 16+ messages in thread
From: James Bottomley @ 2020-02-18 23:43 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, Miklos Szeredi, Linux Containers, overlayfs,
	David Howells, Seth Forshee, Al Viro, linux-fsdevel,
	Eric Biederman, stgraber

On Tue, 2020-02-18 at 21:03 +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote:
> > On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote:
[...]
> > > But way more important: what Amir got right is that your approach
> > > and fsid mappings don't stand in each others way at all. Shiftfed
> > > bind-mounts can be implemented completely independent of fsid
> > > mappings after the fact on top of it.
> > > 
> > > Your example, does this:
> > > 
> > > nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not O_PATH
> > > */
> > > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
> > > 
> > > as the ultimate step. Essentially marking a mountpoint as shifted
> > > relative to that user namespace. Once fsid mappings are in all
> > > that you need to do is replace your
> > > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in
> > > your patchset with
> > > make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're done.
> > > So I honestly don't currently see any need to block the patchsets
> > > on each other. 
> > 
> > Can I repeat: there's no rush to get upstream on this.  Let's pause
> > to get the kernel implementation (the thing we have to maintain)
> > right.  I realise we could each work around the other and get our
> > implementations bent around each other so they all work
> > independently thus making our disjoint user cabals happy but I
> > don't think that would lead to the best outcome for kernel
> > maintainability.
> 
> We have had the discussion with all major stakeholders in a single
> room on what we need at LPC 2019.

Well, you didn't invite me, so I think "stakeholders" means people we
selected because we like their use case.  More importantly:
"stakeholders" traditionally means not only people who want to consume
the feature but also people who have to maintain it ... how many VFS
stakeholders were present?

>  We agreed on what we need and fsids are a concrete proposal for an
> implementation that appears to solve all discussed major use-cases in
> a simple and elegant manner, which can also be cleanly extended to
> cover your approach later.  Imho, there is no need to have the same
> discussion again at an invite-only event focussed on kernel
> developers where most of the major stakeholders are unlikely to be
> able to participate. The patch proposals are here on all relevant
> list where everyone can participate and we can discuss them right
> here. I have not yet heard a concrete technical reason why the patch
> proposal is inadequate and I see no reason to stall this.

You cut the actual justification I gave: tacking together ad hoc
solutions for particular interests has already lead to a proliferation
of similar but not quite user_ns captures spreading through the vfs.  I
didn't say don't do it this way ... all I said was let's get clear what
we are doing and lets put together a shifting infrastructure that's
clean, easy to understand and reason about in security terms and which
can be used to implement all our use cases ... including s_user_ns. 
And when we've done this, lets eject any of the ad hoc stuff we find we
don't need to make the whole thing simpler.

> > I already think that history shows us that s_user_ns went upstream
> > too fast and the fact that unprivileged fuse has yet to make it
> > (and the
> 
> We've established on the other patchset that fsid mappings in no way
> interfere nor care about s_user_ns so I'm not going to go into this
> again here. But for the record, unprivileged fuse mounts are
> supported since:

I know, but I'm taking the opposite view: not caring about the other
uses and working around them has lead to the ad hoc userns creep we see
today and I think we need to roll it back to a consistent and easy to
reason about implementation.

> commit 4ad769f3c346ec3d458e255548dec26ca5284cf6
> Author: Eric W. Biederman <ebiederm@xmission.com>
> Date:   Tue May 29 09:04:46 2018 -0500
> 
>     fuse: Allow fully unprivileged mounts

I know the patch is there ... I just haven't found any users yet, so I
think there's still something else missing.   This is really Seth's
baby so I was hoping he'd have ideas about what.

James


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 2/3] fs: introduce uid/gid shifting bind mount
  2020-02-18 22:33   ` Christoph Hellwig
@ 2020-02-19  1:19     ` James Bottomley
  0 siblings, 0 replies; 16+ messages in thread
From: James Bottomley @ 2020-02-19  1:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Miklos Szeredi, containers, linux-unionfs, David Howells,
	Seth Forshee, Al Viro, linux-fsdevel, Eric Biederman

On Tue, 2020-02-18 at 14:33 -0800, Christoph Hellwig wrote:
> On Mon, Feb 17, 2020 at 12:53:06PM -0800, James Bottomley wrote:

[...]
> > diff --git a/include/linux/cred.h b/include/linux/cred.h
> > index 18639c069263..d29638617844 100644
> > --- a/include/linux/cred.h
> > +++ b/include/linux/cred.h
> > @@ -59,6 +59,7 @@ extern struct group_info *groups_alloc(int);
> >  extern void groups_free(struct group_info *);
> >  
> >  extern int in_group_p(kgid_t);
> > +extern int in_group_p_shifted(kgid_t);
> 
> How do I know when to use in_group_p_shifted vs in_group_p?
> What about the various other fs callers?

So this is one I wondered about too.  The problem is that the shifted
credential (the one representing the fsuid/fsgid the filesystem will
see) still has cred->group_info representing the kuid/kgid which are
unshifted from the filesystem perspective.  The solution was to use
in_group_p_shifted when you're comparing a filesystem view fsgid and
use in_group_p when you're comparing a kernel kgid.

However, I'm now thinking that's way too complex and what should happen
is that I should shift every member of cred->group_info so that all
searches happen on the fs view, meaning the fs always uses in_group_p
like it does today and only the corner cases that compare a kgid need
shifting.

James



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] introduce a uid/gid shifting bind mount
  2020-02-18 23:43           ` James Bottomley
@ 2020-02-19 13:26             ` Christian Brauner
  2020-02-19 22:26               ` James Bottomley
  2020-02-19 16:01             ` Stéphane Graber
  1 sibling, 1 reply; 16+ messages in thread
From: Christian Brauner @ 2020-02-19 13:26 UTC (permalink / raw)
  To: James Bottomley
  Cc: Amir Goldstein, Miklos Szeredi, Linux Containers, overlayfs,
	David Howells, Seth Forshee, Al Viro, linux-fsdevel,
	Eric Biederman, stgraber

On Tue, Feb 18, 2020 at 03:43:18PM -0800, James Bottomley wrote:
> On Tue, 2020-02-18 at 21:03 +0100, Christian Brauner wrote:
> > On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote:
> > > On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote:
> [...]
> > > > But way more important: what Amir got right is that your approach
> > > > and fsid mappings don't stand in each others way at all. Shiftfed
> > > > bind-mounts can be implemented completely independent of fsid
> > > > mappings after the fact on top of it.
> > > > 
> > > > Your example, does this:
> > > > 
> > > > nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not O_PATH
> > > > */
> > > > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
> > > > 
> > > > as the ultimate step. Essentially marking a mountpoint as shifted
> > > > relative to that user namespace. Once fsid mappings are in all
> > > > that you need to do is replace your
> > > > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in
> > > > your patchset with
> > > > make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're done.
> > > > So I honestly don't currently see any need to block the patchsets
> > > > on each other. 
> > > 
> > > Can I repeat: there's no rush to get upstream on this.  Let's pause
> > > to get the kernel implementation (the thing we have to maintain)
> > > right.  I realise we could each work around the other and get our
> > > implementations bent around each other so they all work
> > > independently thus making our disjoint user cabals happy but I
> > > don't think that would lead to the best outcome for kernel
> > > maintainability.
> > 
> > We have had the discussion with all major stakeholders in a single
> > room on what we need at LPC 2019.
> 
> Well, you didn't invite me, so I think "stakeholders" means people we

I'm confused as you were in the room with everyone. It's even on the
schedule under your name:
https://linuxplumbersconf.org/event/4/contributions/474/

> selected because we like their use case.  More importantly:
> "stakeholders" traditionally means not only people who want to consume
> the feature but also people who have to maintain it ... how many VFS
> stakeholders were present?

Again, I'm confused you were in the room with at least David and Eric.

In any case, the patches are on the relevant lists. They are actively
being discussed and visible to everyone and we'll have time for proper
review which is already happening.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] introduce a uid/gid shifting bind mount
  2020-02-18 23:43           ` James Bottomley
  2020-02-19 13:26             ` Christian Brauner
@ 2020-02-19 16:01             ` Stéphane Graber
  1 sibling, 0 replies; 16+ messages in thread
From: Stéphane Graber @ 2020-02-19 16:01 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christian Brauner, Amir Goldstein, Miklos Szeredi,
	Linux Containers, overlayfs, David Howells, Seth Forshee,
	Al Viro, linux-fsdevel, Eric Biederman

On Tue, Feb 18, 2020 at 6:43 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> On Tue, 2020-02-18 at 21:03 +0100, Christian Brauner wrote:
> > On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote:
> > > On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote:
> [...]
> > > > But way more important: what Amir got right is that your approach
> > > > and fsid mappings don't stand in each others way at all. Shiftfed
> > > > bind-mounts can be implemented completely independent of fsid
> > > > mappings after the fact on top of it.
> > > >
> > > > Your example, does this:
> > > >
> > > > nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not O_PATH
> > > > */
> > > > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
> > > >
> > > > as the ultimate step. Essentially marking a mountpoint as shifted
> > > > relative to that user namespace. Once fsid mappings are in all
> > > > that you need to do is replace your
> > > > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in
> > > > your patchset with
> > > > make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're done.
> > > > So I honestly don't currently see any need to block the patchsets
> > > > on each other.
> > >
> > > Can I repeat: there's no rush to get upstream on this.  Let's pause
> > > to get the kernel implementation (the thing we have to maintain)
> > > right.  I realise we could each work around the other and get our
> > > implementations bent around each other so they all work
> > > independently thus making our disjoint user cabals happy but I
> > > don't think that would lead to the best outcome for kernel
> > > maintainability.
> >
> > We have had the discussion with all major stakeholders in a single
> > room on what we need at LPC 2019.
>
> Well, you didn't invite me, so I think "stakeholders" means people we
> selected because we like their use case.  More importantly:
> "stakeholders" traditionally means not only people who want to consume
> the feature but also people who have to maintain it ... how many VFS
> stakeholders were present?
>
> >  We agreed on what we need and fsids are a concrete proposal for an
> > implementation that appears to solve all discussed major use-cases in
> > a simple and elegant manner, which can also be cleanly extended to
> > cover your approach later.  Imho, there is no need to have the same
> > discussion again at an invite-only event focussed on kernel
> > developers where most of the major stakeholders are unlikely to be
> > able to participate. The patch proposals are here on all relevant
> > list where everyone can participate and we can discuss them right
> > here. I have not yet heard a concrete technical reason why the patch
> > proposal is inadequate and I see no reason to stall this.
>
> You cut the actual justification I gave: tacking together ad hoc
> solutions for particular interests has already lead to a proliferation
> of similar but not quite user_ns captures spreading through the vfs.  I
> didn't say don't do it this way ... all I said was let's get clear what
> we are doing and lets put together a shifting infrastructure that's
> clean, easy to understand and reason about in security terms and which
> can be used to implement all our use cases ... including s_user_ns.
> And when we've done this, lets eject any of the ad hoc stuff we find we
> don't need to make the whole thing simpler.
>
> > > I already think that history shows us that s_user_ns went upstream
> > > too fast and the fact that unprivileged fuse has yet to make it
> > > (and the
> >
> > We've established on the other patchset that fsid mappings in no way
> > interfere nor care about s_user_ns so I'm not going to go into this
> > again here. But for the record, unprivileged fuse mounts are
> > supported since:
>
> I know, but I'm taking the opposite view: not caring about the other
> uses and working around them has lead to the ad hoc userns creep we see
> today and I think we need to roll it back to a consistent and easy to
> reason about implementation.
>
> > commit 4ad769f3c346ec3d458e255548dec26ca5284cf6
> > Author: Eric W. Biederman <ebiederm@xmission.com>
> > Date:   Tue May 29 09:04:46 2018 -0500
> >
> >     fuse: Allow fully unprivileged mounts
>
> I know the patch is there ... I just haven't found any users yet, so I
> think there's still something else missing.   This is really Seth's
> baby so I was hoping he'd have ideas about what.

I'm confused by that part, we have hundreds of thousands of users of
this feature
and it's been backported to older kernels on a number of platforms due
to its usefulness.

It's what's used for installing/running snap packages inside Ubuntu containers,
it's also quite widely used inside Terminal/Crostini on Chromebooks,
performs transparent unprivileged mounts on Travis-CI and is also used by
some of the rootless runtimes to allow for mounting overlayfs, squashfs or
even ext* inside unprivileged containers.

It's certainly not something that's super user visible, it also
doesn't need any particular
support for it to work, just use fuse as usual inside an unprivileged container.
So the lack of noise about it doesn't necessarily indicate that it's not used,
it may just indicate that it's been working as intended :)

> James
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] introduce a uid/gid shifting bind mount
  2020-02-19 13:26             ` Christian Brauner
@ 2020-02-19 22:26               ` James Bottomley
  0 siblings, 0 replies; 16+ messages in thread
From: James Bottomley @ 2020-02-19 22:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, Miklos Szeredi, Linux Containers, overlayfs,
	David Howells, Seth Forshee, Al Viro, linux-fsdevel,
	Eric Biederman, stgraber

On Wed, 2020-02-19 at 14:26 +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 03:43:18PM -0800, James Bottomley wrote:
> > On Tue, 2020-02-18 at 21:03 +0100, Christian Brauner wrote:
> > > On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote:
> > > > On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote:
> > 
> > [...]
> > > > > But way more important: what Amir got right is that your
> > > > > approach and fsid mappings don't stand in each others way at
> > > > > all. Shiftfed bind-mounts can be implemented completely
> > > > > independent of fsid mappings after the fact on top of it.
> > > > > 
> > > > > Your example, does this:
> > > > > 
> > > > > nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not
> > > > > O_PATH
> > > > > */
> > > > > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
> > > > > 
> > > > > as the ultimate step. Essentially marking a mountpoint as
> > > > > shifted relative to that user namespace. Once fsid mappings
> > > > > are in all that you need to do is replace your
> > > > > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in
> > > > > your patchset with
> > > > > make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're
> > > > > done. So I honestly don't currently see any need to block the
> > > > > patchsets on each other. 
> > > > 
> > > > Can I repeat: there's no rush to get upstream on this.  Let's
> > > > pause to get the kernel implementation (the thing we have to
> > > > maintain) right.  I realise we could each work around the other
> > > > and get our implementations bent around each other so they all
> > > > work independently thus making our disjoint user cabals happy
> > > > but I don't think that would lead to the best outcome for
> > > > kernel maintainability.
> > > 
> > > We have had the discussion with all major stakeholders in a
> > > single room on what we need at LPC 2019.
> > 
> > Well, you didn't invite me, so I think "stakeholders" means people
> > we
> 
> I'm confused as you were in the room with everyone. It's even on the
> schedule under your name:
> https://linuxplumbersconf.org/event/4/contributions/474/
> 
> > selected because we like their use case.  More importantly:
> > "stakeholders" traditionally means not only people who want to
> > consume the feature but also people who have to maintain it ... how
> > many VFS stakeholders were present?
> 
> Again, I'm confused you were in the room with at least David and
> Eric.

OK, I think we both got different things out of this, but I've now put
the videos for this on the LPC site so others can judge for themselves.
 The one for this session is here:

https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2h12m52s

Although it does make more sense if you watch the Dave Howells session
on the new mount API first:

https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h45m54s

There is lots of discussion of the shared image use case, but nowhere
is there any discussion of separating uid from fsuid in the user
namespace, unless I missed it again in my second viewing?

> In any case, the patches are on the relevant lists. They are actively
> being discussed and visible to everyone and we'll have time for
> proper review which is already happening.

The big issue, though, is that while you've produced a patch set that
covers your use case (shared unprivileged images) it doesn't cover mine
(privileged images).  However, the patch set I produced does cover both
use cases.  Safely handling privileged images is a much bigger security
problem than unprivileged ones, which is why I have more VFS code than
you do.  While I could, in theory, remove your use case from my patch,
doing so would really only reduce it by 38 lines (this is the diffstat
going from v2 to v3).  However, I still need all the code in the v2
patch to handle the backshifts needed for safe privileged images and I
pretty much don't use any of the fsid code you add because I need an
unprivileged fsid and uid, so the shifts are always identical for both
... which is the default before your patch.

This is what I mean about us getting the VFS implementation correct:  I
think I can sweep up s_user_ns into my patch (unproven because I've yet
to write the patch), effectively making it use the mnt_userns I
introduced and eliminating the superblock additions and, for 38
additional VFS lines, I also pick up your use case ... although there
may be subtleties I've missed.

I also think the way I coded it is much easier to use for an
orchestration system.  Basically you create a user_ns for the unpacker,
rendering it unprivileged, and it unpacks your images into the
filesystem so they also become unprivileged and globally shifted.  Then
whenever you use these images for a container with a separate uid
shift, you pass the unpacker userns with the "ns" argument and bind
mount the root into the container's userns: the container gets fake
root and any writes by fake root are shifted correctly into the
unprivileged image.  Everything just works and there's no need to
bother with fsid's.

However, what's still not completely done by any of us is convincing
the VFS community that we've got the correct and maintainable way of
adding all this to their VFS layer.

James


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-02-19 22:26 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-17 20:53 [PATCH v3 0/3] introduce a uid/gid shifting bind mount James Bottomley
2020-02-17 20:53 ` [PATCH v3 1/3] fs: rethread notify_change to take a path instead of a dentry James Bottomley
2020-02-17 20:53 ` [PATCH v3 2/3] fs: introduce uid/gid shifting bind mount James Bottomley
2020-02-18  7:38   ` Amir Goldstein
2020-02-18 22:33   ` Christoph Hellwig
2020-02-19  1:19     ` James Bottomley
2020-02-17 20:53 ` [PATCH v3 3/3] fs: expose shifting bind mount to userspace James Bottomley
2020-02-18  7:18 ` [PATCH v3 0/3] introduce a uid/gid shifting bind mount Amir Goldstein
2020-02-18 16:11   ` James Bottomley
2020-02-18 17:26     ` Christian Brauner
2020-02-18 19:05       ` James Bottomley
2020-02-18 20:03         ` Christian Brauner
2020-02-18 23:43           ` James Bottomley
2020-02-19 13:26             ` Christian Brauner
2020-02-19 22:26               ` James Bottomley
2020-02-19 16:01             ` Stéphane Graber

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).