linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/3] introduce a uid/gid shifting bind mount
@ 2020-01-04 20:39 James Bottomley
  2020-01-04 20:39 ` [PATCH v2 1/3] fs: rethread notify_change to take a path instead of a dentry James Bottomley
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: James Bottomley @ 2020-01-04 20:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: David Howells, Christian Brauner, Al Viro, Miklos Szeredi,
	Seth Forshee, linux-unionfs, Amir Goldstein,
	Stéphane Graber, Eric Biederman, Aleksa Sarai, containers

The object of this series is to replace shiftfs with a proper uid/gid
shifting bind mount instead of the shiftfs hack of introducing
something that looks similar to an overlay filesystem to do it.

The VFS still has the problem that in order to tell what vfsmount a
dentry belongs to, struct path would have to be threaded everywhere
struct dentry currently is.  However, this patch is structured only to
require a rethreading of notify_change.  The rest of the knowledge
that a shift is in operation is carried in the task structure by
caching the unshifted credentials.

The only real change from v1 is that the notify_change patch is
updated to fix the issues pointed out by Amir Goldstein.  And I've
combined the precursor patch to rethread notify_changes into the
series.

James

---

James Bottomley (3):
  fs: rethread notify_change to take a path instead of a dentry
  fs: introduce uid/gid shifting bind mount
  fs: expose shifting bind mount to userspace

 drivers/base/devtmpfs.c   |   8 +++-
 fs/attr.c                 |  91 ++++++++++++++++++++++++++++----------
 fs/bind.c                 |  35 +++++++++++++++
 fs/cachefiles/interface.c |   6 ++-
 fs/coredump.c             |   4 +-
 fs/ecryptfs/inode.c       |   9 ++--
 fs/exec.c                 |   7 ++-
 fs/inode.c                |  16 ++++---
 fs/internal.h             |   2 +
 fs/mount.h                |   2 +
 fs/namei.c                | 110 ++++++++++++++++++++++++++++++++++++++--------
 fs/namespace.c            |   1 +
 fs/nfsd/vfs.c             |  13 +++---
 fs/open.c                 |  44 ++++++++++++++-----
 fs/overlayfs/copy_up.c    |  40 ++++++++++-------
 fs/overlayfs/dir.c        |  10 ++++-
 fs/overlayfs/inode.c      |   6 ++-
 fs/overlayfs/overlayfs.h  |   2 +-
 fs/overlayfs/super.c      |   3 +-
 fs/posix_acl.c            |   4 +-
 fs/proc_namespace.c       |   4 ++
 fs/stat.c                 |  31 +++++++++++--
 fs/utimes.c               |   2 +-
 include/linux/cred.h      |  10 +++++
 include/linux/fs.h        |   6 +--
 include/linux/mount.h     |   4 +-
 include/linux/sched.h     |   5 +++
 kernel/capability.c       |  14 +++++-
 kernel/cred.c             |  20 +++++++++
 kernel/groups.c           |   7 +++
 30 files changed, 408 insertions(+), 108 deletions(-)

-- 
2.16.4


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/3] fs: rethread notify_change to take a path instead of a dentry
  2020-01-04 20:39 [PATCH v2 0/3] introduce a uid/gid shifting bind mount James Bottomley
@ 2020-01-04 20:39 ` James Bottomley
  2020-01-04 21:52   ` Amir Goldstein
  2020-01-04 20:39 ` [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount James Bottomley
  2020-01-04 20:39 ` [PATCH v2 3/3] fs: expose shifting bind mount to userspace James Bottomley
  2 siblings, 1 reply; 15+ messages in thread
From: James Bottomley @ 2020-01-04 20:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: David Howells, Christian Brauner, Al Viro, Miklos Szeredi,
	Seth Forshee, linux-unionfs, Amir Goldstein,
	Stéphane Graber, Eric Biederman, Aleksa Sarai, containers

In order to prepare for implementing shiftfs as a property changing
bind mount, the path (which contains the vfsmount) must be threaded
through everywhere we are going to do either a permission check or an
attribute get/set so that we can arrange for the credentials for the
operation to be based on the bind mount properties rather than those
of current.

---

v2: fix issues found by Amir Goldstein

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 drivers/base/devtmpfs.c   |  8 ++++++--
 fs/attr.c                 |  4 +++-
 fs/cachefiles/interface.c |  6 ++++--
 fs/coredump.c             |  4 ++--
 fs/ecryptfs/inode.c       |  9 ++++++---
 fs/inode.c                |  7 ++++---
 fs/namei.c                |  2 +-
 fs/nfsd/vfs.c             | 13 ++++++++-----
 fs/open.c                 | 19 ++++++++++---------
 fs/overlayfs/copy_up.c    | 40 ++++++++++++++++++++++++----------------
 fs/overlayfs/dir.c        | 10 ++++++++--
 fs/overlayfs/inode.c      |  6 ++++--
 fs/overlayfs/overlayfs.h  |  2 +-
 fs/overlayfs/super.c      |  3 ++-
 fs/utimes.c               |  2 +-
 include/linux/fs.h        |  6 +++---
 16 files changed, 87 insertions(+), 54 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 6cdbf1531238..eb9f072f5deb 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -224,13 +224,17 @@ static int handle_create(const char *nodename, umode_t mode, kuid_t uid,
 	err = vfs_mknod(d_inode(path.dentry), dentry, mode, dev->devt);
 	if (!err) {
 		struct iattr newattrs;
+		struct path newpath = {
+			.mnt = path.mnt,
+			.dentry = dentry,
+		};
 
 		newattrs.ia_mode = mode;
 		newattrs.ia_uid = uid;
 		newattrs.ia_gid = gid;
 		newattrs.ia_valid = ATTR_MODE|ATTR_UID|ATTR_GID;
 		inode_lock(d_inode(dentry));
-		notify_change(dentry, &newattrs, NULL);
+		notify_change(&newpath, &newattrs, NULL);
 		inode_unlock(d_inode(dentry));
 
 		/* mark as kernel-created inode */
@@ -337,7 +341,7 @@ static int handle_remove(const char *nodename, struct device *dev)
 			newattrs.ia_valid =
 				ATTR_UID|ATTR_GID|ATTR_MODE;
 			inode_lock(d_inode(dentry));
-			notify_change(dentry, &newattrs, NULL);
+			notify_change(&p, &newattrs, NULL);
 			inode_unlock(d_inode(dentry));
 			err = vfs_unlink(d_inode(parent.dentry), dentry, NULL);
 			if (!err || err == -ENOENT)
diff --git a/fs/attr.c b/fs/attr.c
index df28035aa23e..370b18807f05 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -226,8 +226,10 @@ EXPORT_SYMBOL(setattr_copy);
  * the file open for write, as there can be no conflicting delegation in
  * that case.
  */
-int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **delegated_inode)
+int notify_change(const struct path *path, struct iattr * attr,
+		  struct inode **delegated_inode)
 {
+	struct dentry *dentry = path->dentry;
 	struct inode *inode = dentry->d_inode;
 	umode_t mode = inode->i_mode;
 	int error;
diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index 4cea5fbf695e..f11216f59a56 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -436,6 +436,7 @@ static int cachefiles_attr_changed(struct fscache_object *_object)
 	uint64_t ni_size;
 	loff_t oi_size;
 	int ret;
+	struct path path;
 
 	ni_size = _object->store_limit_l;
 
@@ -466,18 +467,19 @@ static int cachefiles_attr_changed(struct fscache_object *_object)
 	/* if there's an extension to a partial page at the end of the backing
 	 * file, we need to discard the partial page so that we pick up new
 	 * data after it */
+	path = (struct path){ .mnt = cache->mnt, .dentry = object->backer };
 	if (oi_size & ~PAGE_MASK && ni_size > oi_size) {
 		_debug("discard tail %llx", oi_size);
 		newattrs.ia_valid = ATTR_SIZE;
 		newattrs.ia_size = oi_size & PAGE_MASK;
-		ret = notify_change(object->backer, &newattrs, NULL);
+		ret = notify_change(&path, &newattrs, NULL);
 		if (ret < 0)
 			goto truncate_failed;
 	}
 
 	newattrs.ia_valid = ATTR_SIZE;
 	newattrs.ia_size = ni_size;
-	ret = notify_change(object->backer, &newattrs, NULL);
+	ret = notify_change(&path, &newattrs, NULL);
 
 truncate_failed:
 	inode_unlock(d_inode(object->backer));
diff --git a/fs/coredump.c b/fs/coredump.c
index b1ea7dfbd149..69899bfb025a 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -775,7 +775,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 			goto close_fail;
 		if (!(cprm.file->f_mode & FMODE_CAN_WRITE))
 			goto close_fail;
-		if (do_truncate(cprm.file->f_path.dentry, 0, 0, cprm.file))
+		if (do_truncate(&cprm.file->f_path, 0, 0, cprm.file))
 			goto close_fail;
 	}
 
@@ -879,7 +879,7 @@ void dump_truncate(struct coredump_params *cprm)
 	if (file->f_op->llseek && file->f_op->llseek != no_llseek) {
 		offset = file->f_op->llseek(file, 0, SEEK_CUR);
 		if (i_size_read(file->f_mapping->host) < offset)
-			do_truncate(file->f_path.dentry, offset, 0, file);
+			do_truncate(&file->f_path, offset, 0, file);
 	}
 }
 EXPORT_SYMBOL(dump_truncate);
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index e23752d9a79f..3bc67c478163 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -852,10 +852,11 @@ int ecryptfs_truncate(struct dentry *dentry, loff_t new_length)
 
 	rc = truncate_upper(dentry, &ia, &lower_ia);
 	if (!rc && lower_ia.ia_valid & ATTR_SIZE) {
-		struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
+		struct path *lower_path = ecryptfs_dentry_to_lower_path(dentry);
+		struct dentry *lower_dentry = lower_path->dentry;
 
 		inode_lock(d_inode(lower_dentry));
-		rc = notify_change(lower_dentry, &lower_ia, NULL);
+		rc = notify_change(lower_path, &lower_ia, NULL);
 		inode_unlock(d_inode(lower_dentry));
 	}
 	return rc;
@@ -883,6 +884,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
 {
 	int rc = 0;
 	struct dentry *lower_dentry;
+	struct path *lower_path;
 	struct iattr lower_ia;
 	struct inode *inode;
 	struct inode *lower_inode;
@@ -897,6 +899,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
 	inode = d_inode(dentry);
 	lower_inode = ecryptfs_inode_to_lower(inode);
 	lower_dentry = ecryptfs_dentry_to_lower(dentry);
+	lower_path = ecryptfs_dentry_to_lower_path(dentry);
 	mutex_lock(&crypt_stat->cs_mutex);
 	if (d_is_dir(dentry))
 		crypt_stat->flags &= ~(ECRYPTFS_ENCRYPTED);
@@ -959,7 +962,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
 		lower_ia.ia_valid &= ~ATTR_MODE;
 
 	inode_lock(d_inode(lower_dentry));
-	rc = notify_change(lower_dentry, &lower_ia, NULL);
+	rc = notify_change(lower_path, &lower_ia, NULL);
 	inode_unlock(d_inode(lower_dentry));
 out:
 	fsstack_copy_attr_all(inode, lower_inode);
diff --git a/fs/inode.c b/fs/inode.c
index 96d62d97694e..18ff3081bda0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1817,7 +1817,7 @@ int dentry_needs_remove_privs(struct dentry *dentry)
 	return mask;
 }
 
-static int __remove_privs(struct dentry *dentry, int kill)
+static int __remove_privs(struct path *path, int kill)
 {
 	struct iattr newattrs;
 
@@ -1826,7 +1826,7 @@ static int __remove_privs(struct dentry *dentry, int kill)
 	 * Note we call this on write, so notify_change will not
 	 * encounter any conflicting delegations:
 	 */
-	return notify_change(dentry, &newattrs, NULL);
+	return notify_change(path, &newattrs, NULL);
 }
 
 /*
@@ -1835,6 +1835,7 @@ static int __remove_privs(struct dentry *dentry, int kill)
  */
 int file_remove_privs(struct file *file)
 {
+	struct path *path = &file->f_path;
 	struct dentry *dentry = file_dentry(file);
 	struct inode *inode = file_inode(file);
 	int kill;
@@ -1853,7 +1854,7 @@ int file_remove_privs(struct file *file)
 	if (kill < 0)
 		return kill;
 	if (kill)
-		error = __remove_privs(dentry, kill);
+		error = __remove_privs(path, kill);
 	if (!error)
 		inode_has_no_xattr(inode);
 
diff --git a/fs/namei.c b/fs/namei.c
index d6c91d1e88cb..7bb4b1dcf3cc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3012,7 +3012,7 @@ static int handle_truncate(struct file *filp)
 	if (!error)
 		error = security_path_truncate(path);
 	if (!error) {
-		error = do_truncate(path->dentry, 0,
+		error = do_truncate(path, 0,
 				    ATTR_MTIME|ATTR_CTIME|ATTR_OPEN,
 				    filp);
 	}
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index c0dc491537a6..dc990cc8f549 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -360,8 +360,8 @@ __be32
 nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
 	     int check_guard, time_t guardtime)
 {
-	struct dentry	*dentry;
 	struct inode	*inode;
+	struct path	path;
 	int		accmode = NFSD_MAY_SATTR;
 	umode_t		ftype = 0;
 	__be32		err;
@@ -400,8 +400,11 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
 			goto out;
 	}
 
-	dentry = fhp->fh_dentry;
-	inode = d_inode(dentry);
+	path = (struct path) {
+		.mnt = fhp->fh_export->ex_path.mnt,
+		.dentry = fhp->fh_dentry,
+	};
+	inode = d_inode(path.dentry);
 
 	/* Ignore any mode updates on symlinks */
 	if (S_ISLNK(inode->i_mode))
@@ -442,7 +445,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
 			.ia_size	= iap->ia_size,
 		};
 
-		host_err = notify_change(dentry, &size_attr, NULL);
+		host_err = notify_change(&path, &size_attr, NULL);
 		if (host_err)
 			goto out_unlock;
 		iap->ia_valid &= ~ATTR_SIZE;
@@ -457,7 +460,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
 	}
 
 	iap->ia_valid |= ATTR_CTIME;
-	host_err = notify_change(dentry, iap, NULL);
+	host_err = notify_change(&path, iap, NULL);
 
 out_unlock:
 	fh_unlock(fhp);
diff --git a/fs/open.c b/fs/open.c
index b62f5c0923a8..033e2112fbda 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -35,11 +35,12 @@
 
 #include "internal.h"
 
-int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
+int do_truncate(const struct path *path, loff_t length, unsigned int time_attrs,
 	struct file *filp)
 {
 	int ret;
 	struct iattr newattrs;
+	struct dentry *dentry = path->dentry;
 
 	/* Not pretty: "inode->i_size" shouldn't really be signed. But it is. */
 	if (length < 0)
@@ -61,7 +62,7 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 
 	inode_lock(dentry->d_inode);
 	/* Note any delegations or leases have already been broken: */
-	ret = notify_change(dentry, &newattrs, NULL);
+	ret = notify_change(path, &newattrs, NULL);
 	inode_unlock(dentry->d_inode);
 	return ret;
 }
@@ -107,7 +108,7 @@ long vfs_truncate(const struct path *path, loff_t length)
 	if (!error)
 		error = security_path_truncate(path);
 	if (!error)
-		error = do_truncate(path->dentry, length, 0, NULL);
+		error = do_truncate(path, length, 0, NULL);
 
 put_write_and_out:
 	put_write_access(inode);
@@ -155,7 +156,7 @@ COMPAT_SYSCALL_DEFINE2(truncate, const char __user *, path, compat_off_t, length
 long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
 {
 	struct inode *inode;
-	struct dentry *dentry;
+	struct path *path;
 	struct fd f;
 	int error;
 
@@ -171,8 +172,8 @@ long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
 	if (f.file->f_flags & O_LARGEFILE)
 		small = 0;
 
-	dentry = f.file->f_path.dentry;
-	inode = dentry->d_inode;
+	path = &f.file->f_path;
+	inode = path->dentry->d_inode;
 	error = -EINVAL;
 	if (!S_ISREG(inode->i_mode) || !(f.file->f_mode & FMODE_WRITE))
 		goto out_putf;
@@ -192,7 +193,7 @@ long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
 	if (!error)
 		error = security_path_truncate(&f.file->f_path);
 	if (!error)
-		error = do_truncate(dentry, length, ATTR_MTIME|ATTR_CTIME, f.file);
+		error = do_truncate(path, length, ATTR_MTIME|ATTR_CTIME, f.file);
 	sb_end_write(inode->i_sb);
 out_putf:
 	fdput(f);
@@ -558,7 +559,7 @@ static int chmod_common(const struct path *path, umode_t mode)
 		goto out_unlock;
 	newattrs.ia_mode = (mode & S_IALLUGO) | (inode->i_mode & ~S_IALLUGO);
 	newattrs.ia_valid = ATTR_MODE | ATTR_CTIME;
-	error = notify_change(path->dentry, &newattrs, &delegated_inode);
+	error = notify_change(path, &newattrs, &delegated_inode);
 out_unlock:
 	inode_unlock(inode);
 	if (delegated_inode) {
@@ -649,7 +650,7 @@ static int chown_common(const struct path *path, uid_t user, gid_t group)
 	inode_lock(inode);
 	error = security_path_chown(path, uid, gid);
 	if (!error)
-		error = notify_change(path->dentry, &newattrs, &delegated_inode);
+		error = notify_change(path, &newattrs, &delegated_inode);
 	inode_unlock(inode);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 6220642fe113..b16231c9dd11 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -177,17 +177,17 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
 	return error;
 }
 
-static int ovl_set_size(struct dentry *upperdentry, struct kstat *stat)
+static int ovl_set_size(struct path *upperpath, struct kstat *stat)
 {
 	struct iattr attr = {
 		.ia_valid = ATTR_SIZE,
 		.ia_size = stat->size,
 	};
 
-	return notify_change(upperdentry, &attr, NULL);
+	return notify_change(upperpath, &attr, NULL);
 }
 
-static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
+static int ovl_set_timestamps(struct path *upperpath, struct kstat *stat)
 {
 	struct iattr attr = {
 		.ia_valid =
@@ -196,10 +196,10 @@ static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
 		.ia_mtime = stat->mtime,
 	};
 
-	return notify_change(upperdentry, &attr, NULL);
+	return notify_change(upperpath, &attr, NULL);
 }
 
-int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
+int ovl_set_attr(struct path *upperpath, struct kstat *stat)
 {
 	int err = 0;
 
@@ -208,7 +208,7 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
 			.ia_valid = ATTR_MODE,
 			.ia_mode = stat->mode,
 		};
-		err = notify_change(upperdentry, &attr, NULL);
+		err = notify_change(upperpath, &attr, NULL);
 	}
 	if (!err) {
 		struct iattr attr = {
@@ -216,10 +216,10 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
 			.ia_uid = stat->uid,
 			.ia_gid = stat->gid,
 		};
-		err = notify_change(upperdentry, &attr, NULL);
+		err = notify_change(upperpath, &attr, NULL);
 	}
 	if (!err)
-		ovl_set_timestamps(upperdentry, stat);
+		ovl_set_timestamps(upperpath, stat);
 
 	return err;
 }
@@ -398,8 +398,13 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
 {
 	int err;
 	struct dentry *upper;
-	struct dentry *upperdir = ovl_dentry_upper(c->parent);
-	struct inode *udir = d_inode(upperdir);
+	struct dentry *upperdir;
+	struct path upperdirpath;
+	struct inode *udir;
+
+	ovl_path_upper(c->parent, &upperdirpath);
+	upperdir = upperdirpath.dentry;
+	udir = d_inode(upperdir);
 
 	/* Mark parent "impure" because it may now contain non-pure upper */
 	err = ovl_set_impure(c->parent, upperdir);
@@ -420,7 +425,7 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
 
 		if (!err) {
 			/* Restore timestamps on parent (best effort) */
-			ovl_set_timestamps(upperdir, &c->pstat);
+			ovl_set_timestamps(&upperdirpath, &c->pstat);
 			ovl_dentry_set_upper_alias(c->dentry);
 		}
 	}
@@ -436,15 +441,16 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
 static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
 {
 	int err;
+	struct path upperpath, temppath;
 
+	ovl_path_upper(c->dentry, &upperpath);
 	/*
 	 * Copy up data first and then xattrs. Writing data after
 	 * xattrs will remove security.capability xattr automatically.
 	 */
 	if (S_ISREG(c->stat.mode) && !c->metacopy) {
-		struct path upperpath, datapath;
+		struct path datapath;
 
-		ovl_path_upper(c->dentry, &upperpath);
 		if (WARN_ON(upperpath.dentry != NULL))
 			return -EIO;
 		upperpath.dentry = temp;
@@ -478,12 +484,13 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
 		if (err)
 			return err;
 	}
+	temppath = (struct path){ .mnt = upperpath.mnt, .dentry = temp };
 
 	inode_lock(temp->d_inode);
 	if (c->metacopy)
-		err = ovl_set_size(temp, &c->stat);
+		err = ovl_set_size(&temppath, &c->stat);
 	if (!err)
-		err = ovl_set_attr(temp, &c->stat);
+		err = ovl_set_attr(&temppath, &c->stat);
 	inode_unlock(temp->d_inode);
 
 	return err;
@@ -699,10 +706,11 @@ static int ovl_do_copy_up(struct ovl_copy_up_ctx *c)
 		err = ovl_set_nlink_upper(c->dentry);
 	} else {
 		struct inode *udir = d_inode(c->destdir);
+		struct path destpath = { .mnt = ofs->upper_mnt, .dentry = c->destdir };
 
 		/* Restore timestamps on parent (best effort) */
 		inode_lock(udir);
-		ovl_set_timestamps(c->destdir, &c->pstat);
+		ovl_set_timestamps(&destpath, &c->pstat);
 		inode_unlock(udir);
 
 		ovl_dentry_set_upper_alias(c->dentry);
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 29abdb1d3b5c..6729fb6e15a9 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -374,7 +374,8 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry,
 		goto out_cleanup;
 
 	inode_lock(opaquedir->d_inode);
-	err = ovl_set_attr(opaquedir, &stat);
+	err = ovl_set_attr(&(struct path) { .mnt = upperpath.mnt,
+					    .dentry = opaquedir }, &stat);
 	inode_unlock(opaquedir->d_inode);
 	if (err)
 		goto out_cleanup;
@@ -435,10 +436,13 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
 	struct inode *udir = upperdir->d_inode;
 	struct dentry *upper;
 	struct dentry *newdentry;
+	struct path newpath;
 	int err;
 	struct posix_acl *acl, *default_acl;
 	bool hardlink = !!cattr->hardlink;
 
+	ovl_path_upper(dentry, &newpath);
+
 	if (WARN_ON(!workdir))
 		return -EROFS;
 
@@ -478,8 +482,10 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
 			.ia_valid = ATTR_MODE,
 			.ia_mode = cattr->mode,
 		};
+
+		newpath.dentry = newdentry;
 		inode_lock(newdentry->d_inode);
-		err = notify_change(newdentry, &attr, NULL);
+		err = notify_change(&newpath, &attr, NULL);
 		inode_unlock(newdentry->d_inode);
 		if (err)
 			goto out_cleanup;
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index b045cf1826fc..da39c3b40669 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -45,8 +45,10 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 		err = ovl_copy_up_with_data(dentry);
 	if (!err) {
 		struct inode *winode = NULL;
+		struct path upperpath;
 
-		upperdentry = ovl_dentry_upper(dentry);
+		ovl_path_upper(dentry, &upperpath);
+		upperdentry = upperpath.dentry;
 
 		if (attr->ia_valid & ATTR_SIZE) {
 			winode = d_inode(upperdentry);
@@ -60,7 +62,7 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 
 		inode_lock(upperdentry->d_inode);
 		old_cred = ovl_override_creds(dentry->d_sb);
-		err = notify_change(upperdentry, attr, NULL);
+		err = notify_change(&upperpath, attr, NULL);
 		revert_creds(old_cred);
 		if (!err)
 			ovl_copyattr(upperdentry->d_inode, dentry->d_inode);
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index f283b1d69a9e..24537d13076d 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -445,7 +445,7 @@ int ovl_copy_up_with_data(struct dentry *dentry);
 int ovl_copy_up_flags(struct dentry *dentry, int flags);
 int ovl_maybe_copy_up(struct dentry *dentry, int flags);
 int ovl_copy_xattr(struct dentry *old, struct dentry *new);
-int ovl_set_attr(struct dentry *upper, struct kstat *stat);
+int ovl_set_attr(struct path *upperpath, struct kstat *stat);
 struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool is_upper);
 int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
 		   struct dentry *upper);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 7621ff176d15..82c1da52831b 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -632,6 +632,7 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
 			.ia_valid = ATTR_MODE,
 			.ia_mode = S_IFDIR | 0,
 		};
+		const struct path workpath = { .mnt = mnt, .dentry = work };
 
 		if (work->d_inode) {
 			err = -EEXIST;
@@ -675,7 +676,7 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
 
 		/* Clear any inherited mode bits */
 		inode_lock(work->d_inode);
-		err = notify_change(work, &attr, NULL);
+		err = notify_change(&workpath, &attr, NULL);
 		inode_unlock(work->d_inode);
 		if (err)
 			goto out_dput;
diff --git a/fs/utimes.c b/fs/utimes.c
index c952b6b3d8a0..9b9e78c914af 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -57,7 +57,7 @@ static int utimes_common(const struct path *path, struct timespec64 *times)
 	}
 retry_deleg:
 	inode_lock(inode);
-	error = notify_change(path->dentry, &newattrs, &delegated_inode);
+	error = notify_change(path, &newattrs, &delegated_inode);
 	inode_unlock(inode);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 70eb6255680d..3b3a1a25e244 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2526,8 +2526,8 @@ struct filename {
 static_assert(offsetof(struct filename, iname) % sizeof(long) == 0);
 
 extern long vfs_truncate(const struct path *, loff_t);
-extern int do_truncate(struct dentry *, loff_t start, unsigned int time_attrs,
-		       struct file *filp);
+extern int do_truncate(const struct path *p, loff_t start,
+		       unsigned int time_attrs, struct file *filp);
 extern int vfs_fallocate(struct file *file, int mode, loff_t offset,
 			loff_t len);
 extern long do_sys_open(int dfd, const char __user *filename, int flags,
@@ -2870,7 +2870,7 @@ extern void emergency_remount(void);
 #ifdef CONFIG_BLOCK
 extern sector_t bmap(struct inode *, sector_t);
 #endif
-extern int notify_change(struct dentry *, struct iattr *, struct inode **);
+extern int notify_change(const struct path *, struct iattr *, struct inode **);
 extern int inode_permission(struct inode *, int);
 extern int generic_permission(struct inode *, int);
 extern int __check_sticky(struct inode *dir, struct inode *inode);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount
  2020-01-04 20:39 [PATCH v2 0/3] introduce a uid/gid shifting bind mount James Bottomley
  2020-01-04 20:39 ` [PATCH v2 1/3] fs: rethread notify_change to take a path instead of a dentry James Bottomley
@ 2020-01-04 20:39 ` James Bottomley
  2020-01-04 23:09   ` Amir Goldstein
  2020-01-13  3:41   ` Serge E. Hallyn
  2020-01-04 20:39 ` [PATCH v2 3/3] fs: expose shifting bind mount to userspace James Bottomley
  2 siblings, 2 replies; 15+ messages in thread
From: James Bottomley @ 2020-01-04 20:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: David Howells, Christian Brauner, Al Viro, Miklos Szeredi,
	Seth Forshee, linux-unionfs, Amir Goldstein,
	Stéphane Graber, Eric Biederman, Aleksa Sarai, containers

This implementation reverse shifts according to the user_ns belonging
to the mnt_ns.  So if the vfsmount has the newly introduced flag
MNT_SHIFT and the current user_ns is the same as the mount_ns->user_ns
then we shift back using the user_ns before committing to the
underlying filesystem.

For example, if a user_ns is created where interior (fake root, uid 0)
is mapped to kernel uid 100000 then writes from interior root normally
go to the filesystem at the kernel uid.  However, if MNT_SHIFT is set,
they will be shifted back to write at uid 0, meaning we can bind mount
real image filesystems to user_ns protected faker root.

In essence there are several things which have to be done for this to
occur safely.  Firstly for all operations on the filesystem, new
credentials have to be installed where fsuid and fsgid are set to the
*interior* values. Next all inodes used from the filesystem have to
have i_uid and i_gid shifted back to the kernel values and attributes
set from user space have to have ia_uid and ia_gid shifted from the
kernel values to the interior values.  The capability checks have to
be done using ns_capable against the kernel values, but the inode
capability checks have to be done against the shifted ids.

Since creating a new credential is a reasonably expensive proposition
and we have to shift and unshift many times during path walking, a
cached copy of the shifted credential is saved to a newly created
place in the task structure.  This serves the dual purpose of allowing
us to use a pre-prepared copy of the shifted credentials and also
allows us to recognise whenever the shift is actually in effect (the
cached shifted credential pointer being equal to the current_cred()
pointer).

To get this all to work, we have a check for the vfsmount flag and the
user_ns gating a shifting of the credentials over all user space
entries to filesystem functions.  In theory the path has to be present
everywhere we do this, so we can check the vfsmount flags.  However,
for lower level functions we can cheat this path check of vfsmount
simply to check whether a shifted credential is in effect or not to
gate things like the inode permission check, which means the path
doesn't have to be threaded all the way through the permission
checking functions.  if the credential is shifted check passes, we can
also be sure that the current user_ns is the same as the mnt->user_ns,
so we can use it and thus have no need of the struct mount at the
point of the shift.

Although the shift can be effected simply by executing
do_reconfigure_mnt with MNT_SHIFT in the flags, this patch only
contains the shifting mechanisms.  The follow on patch wires up the
user visible API for turning the flag on.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

---

Note this patch depends on the rethreading notify_change to use a path
instead of a dentry patch
---
 fs/attr.c             |  87 ++++++++++++++++++++++++++++++----------
 fs/exec.c             |   7 +++-
 fs/inode.c            |   9 +++--
 fs/internal.h         |   2 +
 fs/namei.c            | 108 +++++++++++++++++++++++++++++++++++++++++---------
 fs/open.c             |  25 +++++++++++-
 fs/posix_acl.c        |   4 +-
 fs/stat.c             |  31 +++++++++++++--
 include/linux/cred.h  |  10 +++++
 include/linux/mount.h |   4 +-
 include/linux/sched.h |   5 +++
 kernel/capability.c   |  14 ++++++-
 kernel/cred.c         |  20 ++++++++++
 kernel/groups.c       |   7 ++++
 14 files changed, 279 insertions(+), 54 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 370b18807f05..3efb2dc67896 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -18,14 +18,22 @@
 #include <linux/evm.h>
 #include <linux/ima.h>
 
+#include "internal.h"
+#include "mount.h"
+
 static bool chown_ok(const struct inode *inode, kuid_t uid)
 {
+	kuid_t i_uid = inode->i_uid;
+
+	if (cred_is_shifted())
+		i_uid = make_kuid(current_user_ns(), __kuid_val(i_uid));
+
 	if (uid_eq(current_fsuid(), inode->i_uid) &&
-	    uid_eq(uid, inode->i_uid))
+	    uid_eq(uid, i_uid))
 		return true;
 	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
 		return true;
-	if (uid_eq(inode->i_uid, INVALID_UID) &&
+	if (uid_eq(i_uid, INVALID_UID) &&
 	    ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
 		return true;
 	return false;
@@ -33,12 +41,21 @@ static bool chown_ok(const struct inode *inode, kuid_t uid)
 
 static bool chgrp_ok(const struct inode *inode, kgid_t gid)
 {
+	kgid_t i_gid = inode->i_gid;
+	kuid_t i_uid = inode->i_uid;
+
+	if (cred_is_shifted()) {
+		struct user_namespace *ns = current_user_ns();
+
+		i_uid = make_kuid(ns, __kuid_val(i_uid));
+		i_gid = make_kgid(ns, __kgid_val(i_gid));
+	}
 	if (uid_eq(current_fsuid(), inode->i_uid) &&
-	    (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
+	    (in_group_p(gid) || gid_eq(gid, i_gid)))
 		return true;
 	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
 		return true;
-	if (gid_eq(inode->i_gid, INVALID_GID) &&
+	if (gid_eq(i_gid, INVALID_GID) &&
 	    ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
 		return true;
 	return false;
@@ -89,9 +106,10 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
 	if (ia_valid & ATTR_MODE) {
 		if (!inode_owner_or_capable(inode))
 			return -EPERM;
+
 		/* Also check the setgid bit! */
-		if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
-				inode->i_gid) &&
+		if (!in_group_p_shifted((ia_valid & ATTR_GID) ? attr->ia_gid :
+					inode->i_gid) &&
 		    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
 			attr->ia_mode &= ~S_ISGID;
 	}
@@ -198,7 +216,7 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
 	if (ia_valid & ATTR_MODE) {
 		umode_t mode = attr->ia_mode;
 
-		if (!in_group_p(inode->i_gid) &&
+		if (!in_group_p_shifted(inode->i_gid) &&
 		    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
 			mode &= ~S_ISGID;
 		inode->i_mode = mode;
@@ -235,6 +253,9 @@ int notify_change(const struct path *path, struct iattr * attr,
 	int error;
 	struct timespec64 now;
 	unsigned int ia_valid = attr->ia_valid;
+	const struct cred *cred;
+	kuid_t i_uid = inode->i_uid;
+	kgid_t i_gid = inode->i_gid;
 
 	WARN_ON_ONCE(!inode_is_locked(inode));
 
@@ -243,18 +264,28 @@ int notify_change(const struct path *path, struct iattr * attr,
 			return -EPERM;
 	}
 
+	cred = change_userns_creds(path);
+	if (cred) {
+		struct mount *m = real_mount(path->mnt);
+
+		attr->ia_uid = KUIDT_INIT(from_kuid(m->mnt_ns->user_ns, attr->ia_uid));
+		attr->ia_gid = KGIDT_INIT(from_kgid(m->mnt_ns->user_ns, attr->ia_gid));
+	}
+
 	/*
 	 * If utimes(2) and friends are called with times == NULL (or both
 	 * times are UTIME_NOW), then we need to check for write permission
 	 */
 	if (ia_valid & ATTR_TOUCH) {
-		if (IS_IMMUTABLE(inode))
-			return -EPERM;
+		if (IS_IMMUTABLE(inode)) {
+			error = -EPERM;
+			goto err;
+		}
 
 		if (!inode_owner_or_capable(inode)) {
 			error = inode_permission(inode, MAY_WRITE);
 			if (error)
-				return error;
+				goto err;
 		}
 	}
 
@@ -275,7 +306,7 @@ int notify_change(const struct path *path, struct iattr * attr,
 	if (ia_valid & ATTR_KILL_PRIV) {
 		error = security_inode_need_killpriv(dentry);
 		if (error < 0)
-			return error;
+			goto err;
 		if (error == 0)
 			ia_valid = attr->ia_valid &= ~ATTR_KILL_PRIV;
 	}
@@ -306,34 +337,46 @@ int notify_change(const struct path *path, struct iattr * attr,
 			attr->ia_mode &= ~S_ISGID;
 		}
 	}
-	if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID)))
-		return 0;
+	if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID))) {
+		error = 0;
+		goto err;
+	}
 
 	/*
 	 * Verify that uid/gid changes are valid in the target
 	 * namespace of the superblock.
 	 */
+	error = -EOVERFLOW;
 	if (ia_valid & ATTR_UID &&
 	    !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
-		return -EOVERFLOW;
+		goto err;
+
 	if (ia_valid & ATTR_GID &&
 	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
-		return -EOVERFLOW;
+		goto err;
 
 	/* Don't allow modifications of files with invalid uids or
 	 * gids unless those uids & gids are being made valid.
 	 */
-	if (!(ia_valid & ATTR_UID) && !uid_valid(inode->i_uid))
-		return -EOVERFLOW;
-	if (!(ia_valid & ATTR_GID) && !gid_valid(inode->i_gid))
-		return -EOVERFLOW;
+	if (cred_is_shifted()) {
+		struct user_namespace *ns = current_user_ns();
+
+		i_uid = make_kuid(ns, __kuid_val(i_uid));
+		i_gid = make_kgid(ns, __kgid_val(i_gid));
+	}
+
+	if (!(ia_valid & ATTR_UID) && !uid_valid(i_uid))
+		goto err;
+
+	if (!(ia_valid & ATTR_GID) && !gid_valid(i_gid))
+		goto err;
 
 	error = security_inode_setattr(dentry, attr);
 	if (error)
-		return error;
+		goto err;
 	error = try_break_deleg(inode, delegated_inode);
 	if (error)
-		return error;
+		goto err;
 
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
@@ -346,6 +389,8 @@ int notify_change(const struct path *path, struct iattr * attr,
 		evm_inode_post_setattr(dentry, ia_valid);
 	}
 
+ err:
+	revert_userns_creds(cred);
 	return error;
 }
 EXPORT_SYMBOL(notify_change);
diff --git a/fs/exec.c b/fs/exec.c
index 74d88dab98dd..4baf91391689 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1539,13 +1539,18 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
 
 	/* Be careful if suid/sgid is set */
 	inode_lock(inode);
-
 	/* reload atomically mode/uid/gid now that lock held */
 	mode = inode->i_mode;
 	uid = inode->i_uid;
 	gid = inode->i_gid;
 	inode_unlock(inode);
 
+	if (cred_is_shifted()) {
+		struct user_namespace *ns = current_user_ns();
+
+		uid = make_kuid(ns, __kuid_val(uid));
+		gid = make_kgid(ns, __kgid_val(gid));
+	}
 	/* We ignore suid/sgid if there are no mappings for them in the ns */
 	if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
 		 !kgid_has_mapping(bprm->cred->user_ns, gid))
diff --git a/fs/inode.c b/fs/inode.c
index 18ff3081bda0..f5f7f7cbd374 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2060,7 +2060,7 @@ void inode_init_owner(struct inode *inode, const struct inode *dir,
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
 		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
-			 !in_group_p(inode->i_gid) &&
+			 !in_group_p_shifted(inode->i_gid) &&
 			 !capable_wrt_inode_uidgid(dir, CAP_FSETID))
 			mode &= ~S_ISGID;
 	} else
@@ -2079,12 +2079,15 @@ EXPORT_SYMBOL(inode_init_owner);
 bool inode_owner_or_capable(const struct inode *inode)
 {
 	struct user_namespace *ns;
+	kuid_t uid = inode->i_uid;
 
-	if (uid_eq(current_fsuid(), inode->i_uid))
+	if (uid_eq(current_fsuid(), uid))
 		return true;
 
 	ns = current_user_ns();
-	if (kuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER))
+	if (cred_is_shifted())
+		uid = make_kuid(ns, __kuid_val(uid));
+	if (kuid_has_mapping(ns, uid) && ns_capable(ns, CAP_FOWNER))
 		return true;
 	return false;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 9cbf6097c77f..47ac2f295f70 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -73,6 +73,8 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 		  const char __user *newname);
 int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 	      const char __user *newname, int flags);
+const struct cred *change_userns_creds(const struct path *p);
+void revert_userns_creds(const struct cred *cred);
 
 /*
  * namespace.c
diff --git a/fs/namei.c b/fs/namei.c
index 7bb4b1dcf3cc..0f36f21e6964 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -124,6 +124,38 @@
 
 #define EMBEDDED_NAME_MAX	(PATH_MAX - offsetof(struct filename, iname))
 
+const struct cred *change_userns_creds(const struct path *p)
+{
+	struct mount *m = real_mount(p->mnt);
+
+	if ((p->mnt->mnt_flags & MNT_SHIFT) == 0)
+		return NULL;
+
+	if (current->nsproxy->mnt_ns->user_ns != m->mnt_ns->user_ns)
+		return NULL;
+
+	if (current->mnt != p->mnt) {
+		struct cred *cred;
+		struct user_namespace *user_ns = m->mnt_ns->user_ns;
+
+		if (current->mnt_cred)
+			put_cred(current->mnt_cred);
+		cred = prepare_creds();
+		cred->fsuid = KUIDT_INIT(from_kuid(user_ns, current->cred->fsuid));
+		cred->fsgid = KGIDT_INIT(from_kgid(user_ns, current->cred->fsgid));
+		current->mnt = p->mnt; /* no reference needed */
+		current->mnt_cred = cred;
+	}
+	return override_creds(current->mnt_cred);
+}
+
+void revert_userns_creds(const struct cred *cred)
+{
+	if (!cred)
+		return;
+	revert_creds(cred);
+}
+
 struct filename *
 getname_flags(const char __user *filename, int flags, int *empty)
 {
@@ -303,7 +335,7 @@ static int acl_permission_check(struct inode *inode, int mask)
 				return error;
 		}
 
-		if (in_group_p(inode->i_gid))
+		if (in_group_p_shifted(inode->i_gid))
 			mode >>= 3;
 	}
 
@@ -366,7 +398,6 @@ int generic_permission(struct inode *inode, int mask)
 	if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO))
 		if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
 			return 0;
-
 	return -EACCES;
 }
 EXPORT_SYMBOL(generic_permission);
@@ -1784,6 +1815,7 @@ static int walk_component(struct nameidata *nd, int flags)
 	struct inode *inode;
 	unsigned seq;
 	int err;
+	const struct cred *cred;
 	/*
 	 * "." and ".." are special - ".." especially so because it has
 	 * to be able to know about the current root directory and
@@ -1795,25 +1827,31 @@ static int walk_component(struct nameidata *nd, int flags)
 			put_link(nd);
 		return err;
 	}
+	cred = change_userns_creds(&nd->path);
 	err = lookup_fast(nd, &path, &inode, &seq);
 	if (unlikely(err <= 0)) {
 		if (err < 0)
-			return err;
+			goto out;
 		path.dentry = lookup_slow(&nd->last, nd->path.dentry,
 					  nd->flags);
-		if (IS_ERR(path.dentry))
-			return PTR_ERR(path.dentry);
+		if (IS_ERR(path.dentry)) {
+			err = PTR_ERR(path.dentry);
+			goto out;
+		}
 
 		path.mnt = nd->path.mnt;
 		err = follow_managed(&path, nd);
 		if (unlikely(err < 0))
-			return err;
+			goto out;
 
 		seq = 0;	/* we are already out of RCU mode */
 		inode = d_backing_inode(path.dentry);
 	}
 
-	return step_into(nd, &path, flags, inode, seq);
+	err = step_into(nd, &path, flags, inode, seq);
+ out:
+	revert_userns_creds(cred);
+	return err;
 }
 
 /*
@@ -2067,8 +2105,10 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 	for(;;) {
 		u64 hash_len;
 		int type;
+		const struct cred *cred = change_userns_creds(&nd->path);
 
 		err = may_lookup(nd);
+		revert_userns_creds(cred);
 		if (err)
 			return err;
 
@@ -2242,12 +2282,17 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 static const char *trailing_symlink(struct nameidata *nd)
 {
 	const char *s;
+	const struct cred *cred = change_userns_creds(&nd->path);
 	int error = may_follow_link(nd);
-	if (unlikely(error))
-		return ERR_PTR(error);
+	if (unlikely(error)) {
+		s = ERR_PTR(error);
+		goto out;
+	}
 	nd->flags |= LOOKUP_PARENT;
 	nd->stack[0].name = NULL;
 	s = get_link(nd);
+ out:
+	revert_userns_creds(cred);
 	return s ? s : "";
 }
 
@@ -3273,6 +3318,7 @@ static int do_last(struct nameidata *nd,
 	struct inode *inode;
 	struct path path;
 	int error;
+	const struct cred *cred = change_userns_creds(&nd->path);
 
 	nd->flags &= ~LOOKUP_PARENT;
 	nd->flags |= op->intent;
@@ -3280,7 +3326,7 @@ static int do_last(struct nameidata *nd,
 	if (nd->last_type != LAST_NORM) {
 		error = handle_dots(nd, nd->last_type);
 		if (unlikely(error))
-			return error;
+			goto err;
 		goto finish_open;
 	}
 
@@ -3293,7 +3339,7 @@ static int do_last(struct nameidata *nd,
 			goto finish_lookup;
 
 		if (error < 0)
-			return error;
+			goto err;
 
 		BUG_ON(nd->inode != dir->d_inode);
 		BUG_ON(nd->flags & LOOKUP_RCU);
@@ -3306,12 +3352,14 @@ static int do_last(struct nameidata *nd,
 		 */
 		error = complete_walk(nd);
 		if (error)
-			return error;
+			goto err;
 
 		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
 		/* trailing slashes? */
-		if (unlikely(nd->last.name[nd->last.len]))
-			return -EISDIR;
+		if (unlikely(nd->last.name[nd->last.len])) {
+			error = -EISDIR;
+			goto err;
+		}
 	}
 
 	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
@@ -3367,7 +3415,7 @@ static int do_last(struct nameidata *nd,
 
 	error = follow_managed(&path, nd);
 	if (unlikely(error < 0))
-		return error;
+		goto err;
 
 	/*
 	 * create/update audit record if it already exists.
@@ -3376,7 +3424,8 @@ static int do_last(struct nameidata *nd,
 
 	if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
 		path_to_nameidata(&path, nd);
-		return -EEXIST;
+		error = -EEXIST;
+		goto err;
 	}
 
 	seq = 0;	/* out of RCU mode, so the value doesn't matter */
@@ -3384,12 +3433,12 @@ static int do_last(struct nameidata *nd,
 finish_lookup:
 	error = step_into(nd, &path, 0, inode, seq);
 	if (unlikely(error))
-		return error;
+		goto err;
 finish_open:
 	/* Why this, you ask?  _Now_ we might have grown LOOKUP_JUMPED... */
 	error = complete_walk(nd);
 	if (error)
-		return error;
+		goto err;
 	audit_inode(nd->name, nd->path.dentry, 0);
 	if (open_flag & O_CREAT) {
 		error = -EISDIR;
@@ -3431,6 +3480,8 @@ static int do_last(struct nameidata *nd,
 	}
 	if (got_write)
 		mnt_drop_write(nd->path.mnt);
+ err:
+	revert_userns_creds(cred);
 	return error;
 }
 
@@ -3749,6 +3800,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
 	struct path path;
 	int error;
 	unsigned int lookup_flags = 0;
+	const struct cred *cred;
 
 	error = may_mknod(mode);
 	if (error)
@@ -3758,6 +3810,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
+	cred = change_userns_creds(&path);
 	if (!IS_POSIXACL(path.dentry->d_inode))
 		mode &= ~current_umask();
 	error = security_path_mknod(&path, dentry, mode, dev);
@@ -3779,6 +3832,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
 	}
 out:
 	done_path_create(&path, dentry);
+	revert_userns_creds(cred);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
@@ -3829,18 +3883,21 @@ long do_mkdirat(int dfd, const char __user *pathname, umode_t mode)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_DIRECTORY;
+	const struct cred *cred;
 
 retry:
 	dentry = user_path_create(dfd, pathname, &path, lookup_flags);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
+	cred = change_userns_creds(&path);
 	if (!IS_POSIXACL(path.dentry->d_inode))
 		mode &= ~current_umask();
 	error = security_path_mkdir(&path, dentry, mode);
 	if (!error)
 		error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
 	done_path_create(&path, dentry);
+	revert_userns_creds(cred);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
@@ -3907,12 +3964,14 @@ long do_rmdir(int dfd, const char __user *pathname)
 	struct qstr last;
 	int type;
 	unsigned int lookup_flags = 0;
+	const struct cred *cred;
 retry:
 	name = filename_parentat(dfd, getname(pathname), lookup_flags,
 				&path, &last, &type);
 	if (IS_ERR(name))
 		return PTR_ERR(name);
 
+	cred = change_userns_creds(&path);
 	switch (type) {
 	case LAST_DOTDOT:
 		error = -ENOTEMPTY;
@@ -3948,6 +4007,7 @@ long do_rmdir(int dfd, const char __user *pathname)
 	inode_unlock(path.dentry->d_inode);
 	mnt_drop_write(path.mnt);
 exit1:
+	revert_userns_creds(cred);
 	path_put(&path);
 	putname(name);
 	if (retry_estale(error, lookup_flags)) {
@@ -4037,11 +4097,13 @@ long do_unlinkat(int dfd, struct filename *name)
 	struct inode *inode = NULL;
 	struct inode *delegated_inode = NULL;
 	unsigned int lookup_flags = 0;
+	const struct cred *cred;
 retry:
 	name = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
 	if (IS_ERR(name))
 		return PTR_ERR(name);
 
+	cred = change_userns_creds(&path);
 	error = -EISDIR;
 	if (type != LAST_NORM)
 		goto exit1;
@@ -4079,6 +4141,7 @@ long do_unlinkat(int dfd, struct filename *name)
 	}
 	mnt_drop_write(path.mnt);
 exit1:
+	revert_userns_creds(cred);
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -4143,6 +4206,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 	struct dentry *dentry;
 	struct path path;
 	unsigned int lookup_flags = 0;
+	const struct cred *cred;
 
 	from = getname(oldname);
 	if (IS_ERR(from))
@@ -4153,6 +4217,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 	if (IS_ERR(dentry))
 		goto out_putname;
 
+	cred = change_userns_creds(&path);
 	error = security_path_symlink(&path, dentry, from->name);
 	if (!error)
 		error = vfs_symlink(path.dentry->d_inode, dentry, from->name);
@@ -4161,6 +4226,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
 	}
+	revert_userns_creds(cred);
 out_putname:
 	putname(from);
 	return error;
@@ -4274,6 +4340,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 	struct inode *delegated_inode = NULL;
 	int how = 0;
 	int error;
+	const struct cred *cred;
 
 	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0)
 		return -EINVAL;
@@ -4301,6 +4368,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 	if (IS_ERR(new_dentry))
 		goto out;
 
+	cred = change_userns_creds(&new_path);
 	error = -EXDEV;
 	if (old_path.mnt != new_path.mnt)
 		goto out_dput;
@@ -4312,6 +4380,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 		goto out_dput;
 	error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
 out_dput:
+	revert_userns_creds(cred);
 	done_path_create(&new_path, new_dentry);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
@@ -4531,6 +4600,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 	unsigned int lookup_flags = 0, target_flags = LOOKUP_RENAME_TARGET;
 	bool should_retry = false;
 	int error;
+	const struct cred *cred;
 
 	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
 		return -EINVAL;
@@ -4560,6 +4630,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 		goto exit1;
 	}
 
+	cred = change_userns_creds(&new_path);
 	error = -EXDEV;
 	if (old_path.mnt != new_path.mnt)
 		goto exit2;
@@ -4644,6 +4715,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 	}
 	mnt_drop_write(old_path.mnt);
 exit2:
+	revert_userns_creds(cred);
 	if (retry_estale(error, lookup_flags))
 		should_retry = true;
 	path_put(&new_path);
diff --git a/fs/open.c b/fs/open.c
index 033e2112fbda..7cad2b723925 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -456,11 +456,13 @@ int ksys_chdir(const char __user *filename)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
+	const struct cred *cred;
 retry:
 	error = user_path_at(AT_FDCWD, filename, lookup_flags, &path);
 	if (error)
 		goto out;
 
+	cred = change_userns_creds(&path);
 	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_CHDIR);
 	if (error)
 		goto dput_and_out;
@@ -468,6 +470,7 @@ int ksys_chdir(const char __user *filename)
 	set_fs_pwd(current->fs, &path);
 
 dput_and_out:
+	revert_userns_creds(cred);
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -486,11 +489,13 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
 {
 	struct fd f = fdget_raw(fd);
 	int error;
+	const struct cred *cred;
 
 	error = -EBADF;
 	if (!f.file)
 		goto out;
 
+	cred = change_userns_creds(&f.file->f_path);
 	error = -ENOTDIR;
 	if (!d_can_lookup(f.file->f_path.dentry))
 		goto out_putf;
@@ -499,6 +504,7 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
 	if (!error)
 		set_fs_pwd(current->fs, &f.file->f_path);
 out_putf:
+	revert_userns_creds(cred);
 	fdput(f);
 out:
 	return error;
@@ -547,11 +553,13 @@ static int chmod_common(const struct path *path, umode_t mode)
 	struct inode *inode = path->dentry->d_inode;
 	struct inode *delegated_inode = NULL;
 	struct iattr newattrs;
+	const struct cred *cred;
 	int error;
 
+	cred = change_userns_creds(path);
 	error = mnt_want_write(path->mnt);
 	if (error)
-		return error;
+		goto out;
 retry_deleg:
 	inode_lock(inode);
 	error = security_path_chmod(path, mode);
@@ -568,6 +576,8 @@ static int chmod_common(const struct path *path, umode_t mode)
 			goto retry_deleg;
 	}
 	mnt_drop_write(path->mnt);
+ out:
+	revert_userns_creds(cred);
 	return error;
 }
 
@@ -666,6 +676,7 @@ int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group,
 	struct path path;
 	int error = -EINVAL;
 	int lookup_flags;
+	const struct cred *cred;
 
 	if ((flag & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
 		goto out;
@@ -677,12 +688,14 @@ int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group,
 	error = user_path_at(dfd, filename, lookup_flags, &path);
 	if (error)
 		goto out;
+	cred = change_userns_creds(&path);
 	error = mnt_want_write(path.mnt);
 	if (error)
 		goto out_release;
 	error = chown_common(&path, user, group);
 	mnt_drop_write(path.mnt);
 out_release:
+	revert_userns_creds(cred);
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -713,10 +726,12 @@ int ksys_fchown(unsigned int fd, uid_t user, gid_t group)
 {
 	struct fd f = fdget(fd);
 	int error = -EBADF;
+	const struct cred *cred;
 
 	if (!f.file)
 		goto out;
 
+	cred = change_userns_creds(&f.file->f_path);
 	error = mnt_want_write_file(f.file);
 	if (error)
 		goto out_fput;
@@ -724,6 +739,7 @@ int ksys_fchown(unsigned int fd, uid_t user, gid_t group)
 	error = chown_common(&f.file->f_path, user, group);
 	mnt_drop_write_file(f.file);
 out_fput:
+	revert_userns_creds(cred);
 	fdput(f);
 out:
 	return error;
@@ -911,8 +927,13 @@ EXPORT_SYMBOL(file_path);
  */
 int vfs_open(const struct path *path, struct file *file)
 {
+	int ret;
+	const struct cred *cred = change_userns_creds(path);
+
 	file->f_path = *path;
-	return do_dentry_open(file, d_backing_inode(path->dentry), NULL);
+	ret = do_dentry_open(file, d_backing_inode(path->dentry), NULL);
+	revert_userns_creds(cred);
+	return ret;
 }
 
 struct file *dentry_open(const struct path *path, int flags,
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 84ad1c90d535..b5aa36261964 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -364,7 +364,7 @@ posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
                                         goto mask;
 				break;
                         case ACL_GROUP_OBJ:
-                                if (in_group_p(inode->i_gid)) {
+				if (in_group_p_shifted(inode->i_gid)) {
 					found = 1;
 					if ((pa->e_perm & want) == want)
 						goto mask;
@@ -652,7 +652,7 @@ int posix_acl_update_mode(struct inode *inode, umode_t *mode_p,
 		return error;
 	if (error == 0)
 		*acl = NULL;
-	if (!in_group_p(inode->i_gid) &&
+	if (!in_group_p_shifted(inode->i_gid) &&
 	    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
 		mode &= ~S_ISGID;
 	*mode_p = mode;
diff --git a/fs/stat.c b/fs/stat.c
index c38e4c2e1221..0018b168d7a7 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -21,6 +21,8 @@
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
 
+#include "mount.h"
+
 /**
  * generic_fillattr - Fill in the basic attributes from the inode struct
  * @inode: Inode to use as the source
@@ -48,6 +50,21 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
 }
 EXPORT_SYMBOL(generic_fillattr);
 
+static void shift_check(struct vfsmount *mnt, struct kstat *stat)
+{
+	struct mount *m = real_mount(mnt);
+	struct user_namespace *user_ns = m->mnt_ns->user_ns;
+
+	if ((mnt->mnt_flags & MNT_SHIFT) == 0)
+		return;
+
+	if (current->nsproxy->mnt_ns->user_ns != m->mnt_ns->user_ns)
+		return;
+
+	stat->uid = make_kuid(user_ns, __kuid_val(stat->uid));
+	stat->gid = make_kgid(user_ns, __kgid_val(stat->gid));
+}
+
 /**
  * vfs_getattr_nosec - getattr without security checks
  * @path: file to get attributes from
@@ -65,6 +82,7 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 		      u32 request_mask, unsigned int query_flags)
 {
 	struct inode *inode = d_backing_inode(path->dentry);
+	int ret;
 
 	memset(stat, 0, sizeof(*stat));
 	stat->result_mask |= STATX_BASIC_STATS;
@@ -77,12 +95,17 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	ret = 0;
 	if (inode->i_op->getattr)
-		return inode->i_op->getattr(path, stat, request_mask,
-					    query_flags);
+		ret = inode->i_op->getattr(path, stat, request_mask,
+					   query_flags);
+	else
+		generic_fillattr(inode, stat);
 
-	generic_fillattr(inode, stat);
-	return 0;
+	if (!ret)
+		shift_check(path->mnt, stat);
+
+	return ret;
 }
 EXPORT_SYMBOL(vfs_getattr_nosec);
 
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..8a5f2c9b613a 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -59,6 +59,7 @@ extern struct group_info *groups_alloc(int);
 extern void groups_free(struct group_info *);
 
 extern int in_group_p(kgid_t);
+extern int in_group_p_shifted(kgid_t);
 extern int in_egroup_p(kgid_t);
 extern int groups_search(const struct group_info *, kgid_t);
 
@@ -75,6 +76,10 @@ static inline int in_group_p(kgid_t grp)
 {
         return 1;
 }
+static inline int in_group_p_shifted(kgid_t grp)
+{
+	return 1;
+}
 static inline int in_egroup_p(kgid_t grp)
 {
         return 1;
@@ -422,4 +427,9 @@ do {						\
 	*(_fsgid) = __cred->fsgid;		\
 } while(0)
 
+static inline bool cred_is_shifted(void)
+{
+	return current_cred() == current->mnt_cred;
+}
+
 #endif /* _LINUX_CRED_H */
diff --git a/include/linux/mount.h b/include/linux/mount.h
index bf8cc4108b8f..cdc5d981d594 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -46,7 +46,7 @@ struct fs_context;
 #define MNT_SHARED_MASK	(MNT_UNBINDABLE)
 #define MNT_USER_SETTABLE_MASK  (MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
 				 | MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
-				 | MNT_READONLY)
+				 | MNT_READONLY | MNT_SHIFT)
 #define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
 
 #define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
@@ -65,6 +65,8 @@ struct fs_context;
 #define MNT_MARKED		0x4000000
 #define MNT_UMOUNT		0x8000000
 
+#define MNT_SHIFT		0x10000000
+
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
 	struct super_block *mnt_sb;	/* pointer to superblock */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 467d26046416..d376dc7bcf76 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -19,6 +19,7 @@
 #include <linux/plist.h>
 #include <linux/hrtimer.h>
 #include <linux/seccomp.h>
+#include <linux/mount.h>
 #include <linux/nodemask.h>
 #include <linux/rcupdate.h>
 #include <linux/refcount.h>
@@ -882,6 +883,10 @@ struct task_struct {
 	/* Effective (overridable) subjective task credentials (COW): */
 	const struct cred __rcu		*cred;
 
+	/* cache for uid/gid shifted cred tied to mnt */
+	struct cred			*mnt_cred;
+	struct vfsmount			*mnt;
+
 #ifdef CONFIG_KEYS
 	/* Cached requested key. */
 	struct key			*cached_requested_key;
diff --git a/kernel/capability.c b/kernel/capability.c
index 1444f3954d75..3273e85a644c 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -486,8 +486,18 @@ EXPORT_SYMBOL(file_ns_capable);
  */
 bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *inode)
 {
-	return kuid_has_mapping(ns, inode->i_uid) &&
-		kgid_has_mapping(ns, inode->i_gid);
+	kuid_t i_uid = inode->i_uid;
+	kgid_t i_gid = inode->i_gid;
+
+	if (cred_is_shifted()) {
+		struct user_namespace *cns = current_user_ns();
+
+		i_uid = make_kuid(cns, __kuid_val(i_uid));
+		i_gid = make_kgid(cns, __kgid_val(i_gid));
+	}
+
+	return kuid_has_mapping(ns, i_uid) &&
+		kgid_has_mapping(ns, i_gid);
 }
 
 /**
diff --git a/kernel/cred.c b/kernel/cred.c
index c0a4c12d38b2..bbe0e2e64081 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -167,6 +167,8 @@ void exit_creds(struct task_struct *tsk)
 	validate_creds(cred);
 	alter_cred_subscribers(cred, -1);
 	put_cred(cred);
+	if (tsk->mnt_cred)
+		put_cred(tsk->mnt_cred);
 
 	cred = (struct cred *) tsk->cred;
 	tsk->cred = NULL;
@@ -318,6 +320,17 @@ struct cred *prepare_exec_creds(void)
 	return new;
 }
 
+static void flush_mnt_cred(struct task_struct *t)
+{
+	if (t->mnt_cred == t->cred)
+		return;
+	if (t->mnt_cred)
+		put_cred(t->mnt_cred);
+	t->mnt_cred = NULL;
+	/* mnt is only used for comparison, so it has no reference */
+	t->mnt = NULL;
+}
+
 /*
  * Copy credentials for the new process created by fork()
  *
@@ -344,6 +357,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
 	    ) {
 		p->real_cred = get_cred(p->cred);
 		get_cred(p->cred);
+		p->mnt = NULL;
+		p->mnt_cred = NULL;
 		alter_cred_subscribers(p->cred, 2);
 		kdebug("share_creds(%p{%d,%d})",
 		       p->cred, atomic_read(&p->cred->usage),
@@ -383,6 +398,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
 
 	atomic_inc(&new->user->processes);
 	p->cred = p->real_cred = get_cred(new);
+	p->mnt = NULL;
+	p->mnt_cred = NULL;
 	alter_cred_subscribers(new, 2);
 	validate_creds(new);
 	return 0;
@@ -506,6 +523,7 @@ int commit_creds(struct cred *new)
 	/* release the old obj and subj refs both */
 	put_cred(old);
 	put_cred(old);
+	flush_mnt_cred(task);
 	return 0;
 }
 EXPORT_SYMBOL(commit_creds);
@@ -564,6 +582,7 @@ const struct cred *override_creds(const struct cred *new)
 	alter_cred_subscribers(new, 1);
 	rcu_assign_pointer(current->cred, new);
 	alter_cred_subscribers(old, -1);
+	flush_mnt_cred(current);
 
 	kdebug("override_creds() = %p{%d,%d}", old,
 	       atomic_read(&old->usage),
@@ -589,6 +608,7 @@ void revert_creds(const struct cred *old)
 
 	validate_creds(old);
 	validate_creds(override);
+	flush_mnt_cred(current);
 	alter_cred_subscribers(old, 1);
 	rcu_assign_pointer(current->cred, old);
 	alter_cred_subscribers(override, -1);
diff --git a/kernel/groups.c b/kernel/groups.c
index daae2f2dc6d4..772b49a367b0 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -228,6 +228,13 @@ int in_group_p(kgid_t grp)
 
 EXPORT_SYMBOL(in_group_p);
 
+int in_group_p_shifted(kgid_t grp)
+{
+	if (cred_is_shifted())
+		grp = make_kgid(current_user_ns(), __kgid_val(grp));
+	return in_group_p(grp);
+}
+
 int in_egroup_p(kgid_t grp)
 {
 	const struct cred *cred = current_cred();
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 3/3] fs: expose shifting bind mount to userspace
  2020-01-04 20:39 [PATCH v2 0/3] introduce a uid/gid shifting bind mount James Bottomley
  2020-01-04 20:39 ` [PATCH v2 1/3] fs: rethread notify_change to take a path instead of a dentry James Bottomley
  2020-01-04 20:39 ` [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount James Bottomley
@ 2020-01-04 20:39 ` James Bottomley
  2 siblings, 0 replies; 15+ messages in thread
From: James Bottomley @ 2020-01-04 20:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: David Howells, Christian Brauner, Al Viro, Miklos Szeredi,
	Seth Forshee, linux-unionfs, Amir Goldstein,
	Stéphane Graber, Eric Biederman, Aleksa Sarai, containers

This capability is exposed currently through the proposed new bind
API. To mark a mount for shifting, you add the allow-shift flag to the
properties, either by a reconfigure or a rebind.  Only real root on
the system can do this.  Once this is done, admin in a user namespace
(i.e. an unprivileged user) can take that mount point and bind it with
a shift in effect.

The way an admin marks a mount is:

pathfd = open("/path/to/shift", O_PATH);
fd = configfd_open("bind", O_CLOEXEC);
configfd_action(fd, CONFIGFD_SET_FD, "pathfd", NULL, pathfd);
configfd_action(fd, CONFIGFD_SET_FLAG, "allow-shift", NULL, 0);
configfd_action(fd, CONFIGFD_SET_FLAG, "detached", NULL, 0);
configfd_action(fd, CONFIGFD_CMD_CREATE, NULL, NULL, 0);
configfd_action(fd, CONFIGFD_GET_FD, "bindfd", &bindfd, O_CLOEXEC);

move_mount(bindfd, "", AT_FDCWD, "/path/to/allow", MOVE_MOUNT_F_EMPTY_PATH);

Technically /path/to/shift and /path/to/allow can be the same, which
basically installs a mnt at the path that allows onward traversal.

Then any mount namespace in a user namespace can do:

pathfd = open("/path/to/allow", O_PATH);
fd = configfd_open("bind", O_CLOEXEC);
configfd_action(fd, CONFIGFD_SET_FD, "pathfd", NULL, pathfd);
configfd_action(fd, CONFIGFD_SET_FLAG, "shift", NULL, 0);
configfd_action(fd, CONFIGFD_SET_FLAG, "detached", NULL, 0);
configfd_action(fd, CONFIGFD_CMD_CREATE, NULL, NULL, 0);
configfd_action(fd, CONFIGFD_GET_FD, "bindfd", &bindfd, O_CLOEXEC);

move_mount(bindfd, "", AT_FDCWD, "/path/to/mount", MOVE_MOUNT_F_EMPTY_PATH);

And /path/to/mount will have the uid/gid shifting bind mount installed.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 fs/bind.c           | 35 +++++++++++++++++++++++++++++++++++
 fs/mount.h          |  2 ++
 fs/namespace.c      |  1 +
 fs/proc_namespace.c |  4 ++++
 4 files changed, 42 insertions(+)

diff --git a/fs/bind.c b/fs/bind.c
index eea4e6cd5108..6b4668041248 100644
--- a/fs/bind.c
+++ b/fs/bind.c
@@ -21,6 +21,8 @@ struct bind_data {
 	bool		nodev:1;
 	bool		detached:1;
 	bool		recursive:1;
+	bool		shift:1;
+	bool		allow_shift:1;
 	struct file	*file;
 	struct file	*retfile;
 };
@@ -66,6 +68,25 @@ static int bind_set_flag(const struct configfd_context *cfc,
 		bd->nodev = true;
 	} else if (strcmp(p->key, "noexec") == 0) {
 		bd->noexec = true;
+	} else if (strcmp(p->key, "shift") == 0) {
+		struct mount *m;
+
+		if (!bd->file) {
+			logger_err(cfc->log, "can't shift without setting pathfd");
+			return -EINVAL;
+		}
+		m = real_mount(bd->file->f_path.mnt);
+		if (!m->allow_shift) {
+			logger_err(cfc->log, "pathfd doesn't allow shifting");
+			return -EINVAL;
+		}
+		bd->shift = true;
+	} else if (strcmp(p->key, "allow-shift") == 0) {
+		if (!capable(CAP_SYS_ADMIN)) {
+			logger_err(cfc->log, "must be root to set allow-shift");
+			return -EPERM;
+		}
+		bd->allow_shift = true;
 	} else if (strcmp(p->key, "recursive") == 0 &&
 		   cfc->op == CONFIGFD_CMD_CREATE) {
 		bd->recursive = true;
@@ -126,6 +147,8 @@ static int bind_get_mnt_flags(struct bind_data *bd, int mnt_flags)
 		mnt_flags |= MNT_NODEV;
 	if (bd->noexec)
 		mnt_flags |= MNT_NOEXEC;
+	if (bd->shift)
+		mnt_flags |= MNT_SHIFT;
 
 	return mnt_flags;
 }
@@ -143,6 +166,13 @@ static int bind_reconfigure(const struct configfd_context *cfc)
 	mnt_flags = bd->file->f_path.mnt->mnt_flags & MNT_ATIME_MASK;
 	mnt_flags = bind_get_mnt_flags(bd, mnt_flags);
 
+	if (bd->allow_shift) {
+		struct mount *m = real_mount(bd->file->f_path.mnt);
+
+		/* FIXME: this should be set with the reconfigure locking */
+		m->allow_shift = true;
+	}
+
 	return do_reconfigure_mnt(&bd->file->f_path, mnt_flags);
 }
 
@@ -183,6 +213,11 @@ static int bind_create(const struct configfd_context *cfc)
 
 		/* since this is a detached copy, we can do without locking */
 		f->f_path.mnt->mnt_flags |= mnt_flags;
+		if (bd->allow_shift) {
+			struct mount *m = real_mount(f->f_path.mnt);
+
+			m->allow_shift = true;
+		}
 	}
 
 	bd->retfile = f;
diff --git a/fs/mount.h b/fs/mount.h
index 711a4093e475..1da13decf93b 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -72,6 +72,8 @@ struct mount {
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	struct hlist_head mnt_pins;
 	struct hlist_head mnt_stuck_children;
+	/* shifting bind mount parameters */
+	bool allow_shift:1;
 } __randomize_layout;
 
 #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
diff --git a/fs/namespace.c b/fs/namespace.c
index 3ae0124e9783..8266e9540595 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1038,6 +1038,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags;
 	mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);
+	mnt->allow_shift = old->allow_shift;
 
 	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_sb = sb;
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 273ee82d8aa9..bdf8d23cf42e 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -70,14 +70,18 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_NOATIME, ",noatime" },
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
+		{ MNT_SHIFT, ",shift" },
 		{ 0, NULL }
 	};
 	const struct proc_fs_info *fs_infop;
+	struct mount *rm = real_mount(mnt);
 
 	for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
 		if (mnt->mnt_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
 	}
+	if (rm->allow_shift)
+		seq_puts(m, ",allow-shift");
 }
 
 static inline void mangle(struct seq_file *m, const char *s)
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 1/3] fs: rethread notify_change to take a path instead of a dentry
  2020-01-04 20:39 ` [PATCH v2 1/3] fs: rethread notify_change to take a path instead of a dentry James Bottomley
@ 2020-01-04 21:52   ` Amir Goldstein
  0 siblings, 0 replies; 15+ messages in thread
From: Amir Goldstein @ 2020-01-04 21:52 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, David Howells, Christian Brauner, Al Viro,
	Miklos Szeredi, Seth Forshee, overlayfs, Stéphane Graber,
	Eric Biederman, Aleksa Sarai, Linux Containers

On Sat, Jan 4, 2020 at 10:40 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> In order to prepare for implementing shiftfs as a property changing
> bind mount, the path (which contains the vfsmount) must be threaded
> through everywhere we are going to do either a permission check or an
> attribute get/set so that we can arrange for the credentials for the
> operation to be based on the bind mount properties rather than those
> of current.
>
> ---
>
> v2: fix issues found by Amir Goldstein

Looks fine to me.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

>
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
> ---
>  drivers/base/devtmpfs.c   |  8 ++++++--
>  fs/attr.c                 |  4 +++-
>  fs/cachefiles/interface.c |  6 ++++--
>  fs/coredump.c             |  4 ++--
>  fs/ecryptfs/inode.c       |  9 ++++++---
>  fs/inode.c                |  7 ++++---
>  fs/namei.c                |  2 +-
>  fs/nfsd/vfs.c             | 13 ++++++++-----
>  fs/open.c                 | 19 ++++++++++---------
>  fs/overlayfs/copy_up.c    | 40 ++++++++++++++++++++++++----------------
>  fs/overlayfs/dir.c        | 10 ++++++++--
>  fs/overlayfs/inode.c      |  6 ++++--
>  fs/overlayfs/overlayfs.h  |  2 +-
>  fs/overlayfs/super.c      |  3 ++-
>  fs/utimes.c               |  2 +-
>  include/linux/fs.h        |  6 +++---
>  16 files changed, 87 insertions(+), 54 deletions(-)
>
> diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
> index 6cdbf1531238..eb9f072f5deb 100644
> --- a/drivers/base/devtmpfs.c
> +++ b/drivers/base/devtmpfs.c
> @@ -224,13 +224,17 @@ static int handle_create(const char *nodename, umode_t mode, kuid_t uid,
>         err = vfs_mknod(d_inode(path.dentry), dentry, mode, dev->devt);
>         if (!err) {
>                 struct iattr newattrs;
> +               struct path newpath = {
> +                       .mnt = path.mnt,
> +                       .dentry = dentry,
> +               };
>
>                 newattrs.ia_mode = mode;
>                 newattrs.ia_uid = uid;
>                 newattrs.ia_gid = gid;
>                 newattrs.ia_valid = ATTR_MODE|ATTR_UID|ATTR_GID;
>                 inode_lock(d_inode(dentry));
> -               notify_change(dentry, &newattrs, NULL);
> +               notify_change(&newpath, &newattrs, NULL);
>                 inode_unlock(d_inode(dentry));
>
>                 /* mark as kernel-created inode */
> @@ -337,7 +341,7 @@ static int handle_remove(const char *nodename, struct device *dev)
>                         newattrs.ia_valid =
>                                 ATTR_UID|ATTR_GID|ATTR_MODE;
>                         inode_lock(d_inode(dentry));
> -                       notify_change(dentry, &newattrs, NULL);
> +                       notify_change(&p, &newattrs, NULL);
>                         inode_unlock(d_inode(dentry));
>                         err = vfs_unlink(d_inode(parent.dentry), dentry, NULL);
>                         if (!err || err == -ENOENT)
> diff --git a/fs/attr.c b/fs/attr.c
> index df28035aa23e..370b18807f05 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -226,8 +226,10 @@ EXPORT_SYMBOL(setattr_copy);
>   * the file open for write, as there can be no conflicting delegation in
>   * that case.
>   */
> -int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **delegated_inode)
> +int notify_change(const struct path *path, struct iattr * attr,
> +                 struct inode **delegated_inode)
>  {
> +       struct dentry *dentry = path->dentry;
>         struct inode *inode = dentry->d_inode;
>         umode_t mode = inode->i_mode;
>         int error;
> diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
> index 4cea5fbf695e..f11216f59a56 100644
> --- a/fs/cachefiles/interface.c
> +++ b/fs/cachefiles/interface.c
> @@ -436,6 +436,7 @@ static int cachefiles_attr_changed(struct fscache_object *_object)
>         uint64_t ni_size;
>         loff_t oi_size;
>         int ret;
> +       struct path path;
>
>         ni_size = _object->store_limit_l;
>
> @@ -466,18 +467,19 @@ static int cachefiles_attr_changed(struct fscache_object *_object)
>         /* if there's an extension to a partial page at the end of the backing
>          * file, we need to discard the partial page so that we pick up new
>          * data after it */
> +       path = (struct path){ .mnt = cache->mnt, .dentry = object->backer };
>         if (oi_size & ~PAGE_MASK && ni_size > oi_size) {
>                 _debug("discard tail %llx", oi_size);
>                 newattrs.ia_valid = ATTR_SIZE;
>                 newattrs.ia_size = oi_size & PAGE_MASK;
> -               ret = notify_change(object->backer, &newattrs, NULL);
> +               ret = notify_change(&path, &newattrs, NULL);
>                 if (ret < 0)
>                         goto truncate_failed;
>         }
>
>         newattrs.ia_valid = ATTR_SIZE;
>         newattrs.ia_size = ni_size;
> -       ret = notify_change(object->backer, &newattrs, NULL);
> +       ret = notify_change(&path, &newattrs, NULL);
>
>  truncate_failed:
>         inode_unlock(d_inode(object->backer));
> diff --git a/fs/coredump.c b/fs/coredump.c
> index b1ea7dfbd149..69899bfb025a 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -775,7 +775,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                         goto close_fail;
>                 if (!(cprm.file->f_mode & FMODE_CAN_WRITE))
>                         goto close_fail;
> -               if (do_truncate(cprm.file->f_path.dentry, 0, 0, cprm.file))
> +               if (do_truncate(&cprm.file->f_path, 0, 0, cprm.file))
>                         goto close_fail;
>         }
>
> @@ -879,7 +879,7 @@ void dump_truncate(struct coredump_params *cprm)
>         if (file->f_op->llseek && file->f_op->llseek != no_llseek) {
>                 offset = file->f_op->llseek(file, 0, SEEK_CUR);
>                 if (i_size_read(file->f_mapping->host) < offset)
> -                       do_truncate(file->f_path.dentry, offset, 0, file);
> +                       do_truncate(&file->f_path, offset, 0, file);
>         }
>  }
>  EXPORT_SYMBOL(dump_truncate);
> diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
> index e23752d9a79f..3bc67c478163 100644
> --- a/fs/ecryptfs/inode.c
> +++ b/fs/ecryptfs/inode.c
> @@ -852,10 +852,11 @@ int ecryptfs_truncate(struct dentry *dentry, loff_t new_length)
>
>         rc = truncate_upper(dentry, &ia, &lower_ia);
>         if (!rc && lower_ia.ia_valid & ATTR_SIZE) {
> -               struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
> +               struct path *lower_path = ecryptfs_dentry_to_lower_path(dentry);
> +               struct dentry *lower_dentry = lower_path->dentry;
>
>                 inode_lock(d_inode(lower_dentry));
> -               rc = notify_change(lower_dentry, &lower_ia, NULL);
> +               rc = notify_change(lower_path, &lower_ia, NULL);
>                 inode_unlock(d_inode(lower_dentry));
>         }
>         return rc;
> @@ -883,6 +884,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
>  {
>         int rc = 0;
>         struct dentry *lower_dentry;
> +       struct path *lower_path;
>         struct iattr lower_ia;
>         struct inode *inode;
>         struct inode *lower_inode;
> @@ -897,6 +899,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
>         inode = d_inode(dentry);
>         lower_inode = ecryptfs_inode_to_lower(inode);
>         lower_dentry = ecryptfs_dentry_to_lower(dentry);
> +       lower_path = ecryptfs_dentry_to_lower_path(dentry);
>         mutex_lock(&crypt_stat->cs_mutex);
>         if (d_is_dir(dentry))
>                 crypt_stat->flags &= ~(ECRYPTFS_ENCRYPTED);
> @@ -959,7 +962,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
>                 lower_ia.ia_valid &= ~ATTR_MODE;
>
>         inode_lock(d_inode(lower_dentry));
> -       rc = notify_change(lower_dentry, &lower_ia, NULL);
> +       rc = notify_change(lower_path, &lower_ia, NULL);
>         inode_unlock(d_inode(lower_dentry));
>  out:
>         fsstack_copy_attr_all(inode, lower_inode);
> diff --git a/fs/inode.c b/fs/inode.c
> index 96d62d97694e..18ff3081bda0 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1817,7 +1817,7 @@ int dentry_needs_remove_privs(struct dentry *dentry)
>         return mask;
>  }
>
> -static int __remove_privs(struct dentry *dentry, int kill)
> +static int __remove_privs(struct path *path, int kill)
>  {
>         struct iattr newattrs;
>
> @@ -1826,7 +1826,7 @@ static int __remove_privs(struct dentry *dentry, int kill)
>          * Note we call this on write, so notify_change will not
>          * encounter any conflicting delegations:
>          */
> -       return notify_change(dentry, &newattrs, NULL);
> +       return notify_change(path, &newattrs, NULL);
>  }
>
>  /*
> @@ -1835,6 +1835,7 @@ static int __remove_privs(struct dentry *dentry, int kill)
>   */
>  int file_remove_privs(struct file *file)
>  {
> +       struct path *path = &file->f_path;
>         struct dentry *dentry = file_dentry(file);
>         struct inode *inode = file_inode(file);
>         int kill;
> @@ -1853,7 +1854,7 @@ int file_remove_privs(struct file *file)
>         if (kill < 0)
>                 return kill;
>         if (kill)
> -               error = __remove_privs(dentry, kill);
> +               error = __remove_privs(path, kill);
>         if (!error)
>                 inode_has_no_xattr(inode);
>
> diff --git a/fs/namei.c b/fs/namei.c
> index d6c91d1e88cb..7bb4b1dcf3cc 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3012,7 +3012,7 @@ static int handle_truncate(struct file *filp)
>         if (!error)
>                 error = security_path_truncate(path);
>         if (!error) {
> -               error = do_truncate(path->dentry, 0,
> +               error = do_truncate(path, 0,
>                                     ATTR_MTIME|ATTR_CTIME|ATTR_OPEN,
>                                     filp);
>         }
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index c0dc491537a6..dc990cc8f549 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -360,8 +360,8 @@ __be32
>  nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
>              int check_guard, time_t guardtime)
>  {
> -       struct dentry   *dentry;
>         struct inode    *inode;
> +       struct path     path;
>         int             accmode = NFSD_MAY_SATTR;
>         umode_t         ftype = 0;
>         __be32          err;
> @@ -400,8 +400,11 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
>                         goto out;
>         }
>
> -       dentry = fhp->fh_dentry;
> -       inode = d_inode(dentry);
> +       path = (struct path) {
> +               .mnt = fhp->fh_export->ex_path.mnt,
> +               .dentry = fhp->fh_dentry,
> +       };
> +       inode = d_inode(path.dentry);
>
>         /* Ignore any mode updates on symlinks */
>         if (S_ISLNK(inode->i_mode))
> @@ -442,7 +445,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
>                         .ia_size        = iap->ia_size,
>                 };
>
> -               host_err = notify_change(dentry, &size_attr, NULL);
> +               host_err = notify_change(&path, &size_attr, NULL);
>                 if (host_err)
>                         goto out_unlock;
>                 iap->ia_valid &= ~ATTR_SIZE;
> @@ -457,7 +460,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
>         }
>
>         iap->ia_valid |= ATTR_CTIME;
> -       host_err = notify_change(dentry, iap, NULL);
> +       host_err = notify_change(&path, iap, NULL);
>
>  out_unlock:
>         fh_unlock(fhp);
> diff --git a/fs/open.c b/fs/open.c
> index b62f5c0923a8..033e2112fbda 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -35,11 +35,12 @@
>
>  #include "internal.h"
>
> -int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
> +int do_truncate(const struct path *path, loff_t length, unsigned int time_attrs,
>         struct file *filp)
>  {
>         int ret;
>         struct iattr newattrs;
> +       struct dentry *dentry = path->dentry;
>
>         /* Not pretty: "inode->i_size" shouldn't really be signed. But it is. */
>         if (length < 0)
> @@ -61,7 +62,7 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
>
>         inode_lock(dentry->d_inode);
>         /* Note any delegations or leases have already been broken: */
> -       ret = notify_change(dentry, &newattrs, NULL);
> +       ret = notify_change(path, &newattrs, NULL);
>         inode_unlock(dentry->d_inode);
>         return ret;
>  }
> @@ -107,7 +108,7 @@ long vfs_truncate(const struct path *path, loff_t length)
>         if (!error)
>                 error = security_path_truncate(path);
>         if (!error)
> -               error = do_truncate(path->dentry, length, 0, NULL);
> +               error = do_truncate(path, length, 0, NULL);
>
>  put_write_and_out:
>         put_write_access(inode);
> @@ -155,7 +156,7 @@ COMPAT_SYSCALL_DEFINE2(truncate, const char __user *, path, compat_off_t, length
>  long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
>  {
>         struct inode *inode;
> -       struct dentry *dentry;
> +       struct path *path;
>         struct fd f;
>         int error;
>
> @@ -171,8 +172,8 @@ long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
>         if (f.file->f_flags & O_LARGEFILE)
>                 small = 0;
>
> -       dentry = f.file->f_path.dentry;
> -       inode = dentry->d_inode;
> +       path = &f.file->f_path;
> +       inode = path->dentry->d_inode;
>         error = -EINVAL;
>         if (!S_ISREG(inode->i_mode) || !(f.file->f_mode & FMODE_WRITE))
>                 goto out_putf;
> @@ -192,7 +193,7 @@ long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
>         if (!error)
>                 error = security_path_truncate(&f.file->f_path);
>         if (!error)
> -               error = do_truncate(dentry, length, ATTR_MTIME|ATTR_CTIME, f.file);
> +               error = do_truncate(path, length, ATTR_MTIME|ATTR_CTIME, f.file);
>         sb_end_write(inode->i_sb);
>  out_putf:
>         fdput(f);
> @@ -558,7 +559,7 @@ static int chmod_common(const struct path *path, umode_t mode)
>                 goto out_unlock;
>         newattrs.ia_mode = (mode & S_IALLUGO) | (inode->i_mode & ~S_IALLUGO);
>         newattrs.ia_valid = ATTR_MODE | ATTR_CTIME;
> -       error = notify_change(path->dentry, &newattrs, &delegated_inode);
> +       error = notify_change(path, &newattrs, &delegated_inode);
>  out_unlock:
>         inode_unlock(inode);
>         if (delegated_inode) {
> @@ -649,7 +650,7 @@ static int chown_common(const struct path *path, uid_t user, gid_t group)
>         inode_lock(inode);
>         error = security_path_chown(path, uid, gid);
>         if (!error)
> -               error = notify_change(path->dentry, &newattrs, &delegated_inode);
> +               error = notify_change(path, &newattrs, &delegated_inode);
>         inode_unlock(inode);
>         if (delegated_inode) {
>                 error = break_deleg_wait(&delegated_inode);
> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> index 6220642fe113..b16231c9dd11 100644
> --- a/fs/overlayfs/copy_up.c
> +++ b/fs/overlayfs/copy_up.c
> @@ -177,17 +177,17 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
>         return error;
>  }
>
> -static int ovl_set_size(struct dentry *upperdentry, struct kstat *stat)
> +static int ovl_set_size(struct path *upperpath, struct kstat *stat)
>  {
>         struct iattr attr = {
>                 .ia_valid = ATTR_SIZE,
>                 .ia_size = stat->size,
>         };
>
> -       return notify_change(upperdentry, &attr, NULL);
> +       return notify_change(upperpath, &attr, NULL);
>  }
>
> -static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
> +static int ovl_set_timestamps(struct path *upperpath, struct kstat *stat)
>  {
>         struct iattr attr = {
>                 .ia_valid =
> @@ -196,10 +196,10 @@ static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
>                 .ia_mtime = stat->mtime,
>         };
>
> -       return notify_change(upperdentry, &attr, NULL);
> +       return notify_change(upperpath, &attr, NULL);
>  }
>
> -int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
> +int ovl_set_attr(struct path *upperpath, struct kstat *stat)
>  {
>         int err = 0;
>
> @@ -208,7 +208,7 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
>                         .ia_valid = ATTR_MODE,
>                         .ia_mode = stat->mode,
>                 };
> -               err = notify_change(upperdentry, &attr, NULL);
> +               err = notify_change(upperpath, &attr, NULL);
>         }
>         if (!err) {
>                 struct iattr attr = {
> @@ -216,10 +216,10 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
>                         .ia_uid = stat->uid,
>                         .ia_gid = stat->gid,
>                 };
> -               err = notify_change(upperdentry, &attr, NULL);
> +               err = notify_change(upperpath, &attr, NULL);
>         }
>         if (!err)
> -               ovl_set_timestamps(upperdentry, stat);
> +               ovl_set_timestamps(upperpath, stat);
>
>         return err;
>  }
> @@ -398,8 +398,13 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
>  {
>         int err;
>         struct dentry *upper;
> -       struct dentry *upperdir = ovl_dentry_upper(c->parent);
> -       struct inode *udir = d_inode(upperdir);
> +       struct dentry *upperdir;
> +       struct path upperdirpath;
> +       struct inode *udir;
> +
> +       ovl_path_upper(c->parent, &upperdirpath);
> +       upperdir = upperdirpath.dentry;
> +       udir = d_inode(upperdir);
>
>         /* Mark parent "impure" because it may now contain non-pure upper */
>         err = ovl_set_impure(c->parent, upperdir);
> @@ -420,7 +425,7 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
>
>                 if (!err) {
>                         /* Restore timestamps on parent (best effort) */
> -                       ovl_set_timestamps(upperdir, &c->pstat);
> +                       ovl_set_timestamps(&upperdirpath, &c->pstat);
>                         ovl_dentry_set_upper_alias(c->dentry);
>                 }
>         }
> @@ -436,15 +441,16 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
>  static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
>  {
>         int err;
> +       struct path upperpath, temppath;
>
> +       ovl_path_upper(c->dentry, &upperpath);
>         /*
>          * Copy up data first and then xattrs. Writing data after
>          * xattrs will remove security.capability xattr automatically.
>          */
>         if (S_ISREG(c->stat.mode) && !c->metacopy) {
> -               struct path upperpath, datapath;
> +               struct path datapath;
>
> -               ovl_path_upper(c->dentry, &upperpath);
>                 if (WARN_ON(upperpath.dentry != NULL))
>                         return -EIO;
>                 upperpath.dentry = temp;
> @@ -478,12 +484,13 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
>                 if (err)
>                         return err;
>         }
> +       temppath = (struct path){ .mnt = upperpath.mnt, .dentry = temp };
>
>         inode_lock(temp->d_inode);
>         if (c->metacopy)
> -               err = ovl_set_size(temp, &c->stat);
> +               err = ovl_set_size(&temppath, &c->stat);
>         if (!err)
> -               err = ovl_set_attr(temp, &c->stat);
> +               err = ovl_set_attr(&temppath, &c->stat);
>         inode_unlock(temp->d_inode);
>
>         return err;
> @@ -699,10 +706,11 @@ static int ovl_do_copy_up(struct ovl_copy_up_ctx *c)
>                 err = ovl_set_nlink_upper(c->dentry);
>         } else {
>                 struct inode *udir = d_inode(c->destdir);
> +               struct path destpath = { .mnt = ofs->upper_mnt, .dentry = c->destdir };
>
>                 /* Restore timestamps on parent (best effort) */
>                 inode_lock(udir);
> -               ovl_set_timestamps(c->destdir, &c->pstat);
> +               ovl_set_timestamps(&destpath, &c->pstat);
>                 inode_unlock(udir);
>
>                 ovl_dentry_set_upper_alias(c->dentry);
> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> index 29abdb1d3b5c..6729fb6e15a9 100644
> --- a/fs/overlayfs/dir.c
> +++ b/fs/overlayfs/dir.c
> @@ -374,7 +374,8 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry,
>                 goto out_cleanup;
>
>         inode_lock(opaquedir->d_inode);
> -       err = ovl_set_attr(opaquedir, &stat);
> +       err = ovl_set_attr(&(struct path) { .mnt = upperpath.mnt,
> +                                           .dentry = opaquedir }, &stat);
>         inode_unlock(opaquedir->d_inode);
>         if (err)
>                 goto out_cleanup;
> @@ -435,10 +436,13 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
>         struct inode *udir = upperdir->d_inode;
>         struct dentry *upper;
>         struct dentry *newdentry;
> +       struct path newpath;
>         int err;
>         struct posix_acl *acl, *default_acl;
>         bool hardlink = !!cattr->hardlink;
>
> +       ovl_path_upper(dentry, &newpath);
> +
>         if (WARN_ON(!workdir))
>                 return -EROFS;
>
> @@ -478,8 +482,10 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
>                         .ia_valid = ATTR_MODE,
>                         .ia_mode = cattr->mode,
>                 };
> +
> +               newpath.dentry = newdentry;
>                 inode_lock(newdentry->d_inode);
> -               err = notify_change(newdentry, &attr, NULL);
> +               err = notify_change(&newpath, &attr, NULL);
>                 inode_unlock(newdentry->d_inode);
>                 if (err)
>                         goto out_cleanup;
> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> index b045cf1826fc..da39c3b40669 100644
> --- a/fs/overlayfs/inode.c
> +++ b/fs/overlayfs/inode.c
> @@ -45,8 +45,10 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
>                 err = ovl_copy_up_with_data(dentry);
>         if (!err) {
>                 struct inode *winode = NULL;
> +               struct path upperpath;
>
> -               upperdentry = ovl_dentry_upper(dentry);
> +               ovl_path_upper(dentry, &upperpath);
> +               upperdentry = upperpath.dentry;
>
>                 if (attr->ia_valid & ATTR_SIZE) {
>                         winode = d_inode(upperdentry);
> @@ -60,7 +62,7 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
>
>                 inode_lock(upperdentry->d_inode);
>                 old_cred = ovl_override_creds(dentry->d_sb);
> -               err = notify_change(upperdentry, attr, NULL);
> +               err = notify_change(&upperpath, attr, NULL);
>                 revert_creds(old_cred);
>                 if (!err)
>                         ovl_copyattr(upperdentry->d_inode, dentry->d_inode);
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index f283b1d69a9e..24537d13076d 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -445,7 +445,7 @@ int ovl_copy_up_with_data(struct dentry *dentry);
>  int ovl_copy_up_flags(struct dentry *dentry, int flags);
>  int ovl_maybe_copy_up(struct dentry *dentry, int flags);
>  int ovl_copy_xattr(struct dentry *old, struct dentry *new);
> -int ovl_set_attr(struct dentry *upper, struct kstat *stat);
> +int ovl_set_attr(struct path *upperpath, struct kstat *stat);
>  struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool is_upper);
>  int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
>                    struct dentry *upper);
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 7621ff176d15..82c1da52831b 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -632,6 +632,7 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
>                         .ia_valid = ATTR_MODE,
>                         .ia_mode = S_IFDIR | 0,
>                 };
> +               const struct path workpath = { .mnt = mnt, .dentry = work };
>
>                 if (work->d_inode) {
>                         err = -EEXIST;
> @@ -675,7 +676,7 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
>
>                 /* Clear any inherited mode bits */
>                 inode_lock(work->d_inode);
> -               err = notify_change(work, &attr, NULL);
> +               err = notify_change(&workpath, &attr, NULL);
>                 inode_unlock(work->d_inode);
>                 if (err)
>                         goto out_dput;
> diff --git a/fs/utimes.c b/fs/utimes.c
> index c952b6b3d8a0..9b9e78c914af 100644
> --- a/fs/utimes.c
> +++ b/fs/utimes.c
> @@ -57,7 +57,7 @@ static int utimes_common(const struct path *path, struct timespec64 *times)
>         }
>  retry_deleg:
>         inode_lock(inode);
> -       error = notify_change(path->dentry, &newattrs, &delegated_inode);
> +       error = notify_change(path, &newattrs, &delegated_inode);
>         inode_unlock(inode);
>         if (delegated_inode) {
>                 error = break_deleg_wait(&delegated_inode);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 70eb6255680d..3b3a1a25e244 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2526,8 +2526,8 @@ struct filename {
>  static_assert(offsetof(struct filename, iname) % sizeof(long) == 0);
>
>  extern long vfs_truncate(const struct path *, loff_t);
> -extern int do_truncate(struct dentry *, loff_t start, unsigned int time_attrs,
> -                      struct file *filp);
> +extern int do_truncate(const struct path *p, loff_t start,
> +                      unsigned int time_attrs, struct file *filp);
>  extern int vfs_fallocate(struct file *file, int mode, loff_t offset,
>                         loff_t len);
>  extern long do_sys_open(int dfd, const char __user *filename, int flags,
> @@ -2870,7 +2870,7 @@ extern void emergency_remount(void);
>  #ifdef CONFIG_BLOCK
>  extern sector_t bmap(struct inode *, sector_t);
>  #endif
> -extern int notify_change(struct dentry *, struct iattr *, struct inode **);
> +extern int notify_change(const struct path *, struct iattr *, struct inode **);
>  extern int inode_permission(struct inode *, int);
>  extern int generic_permission(struct inode *, int);
>  extern int __check_sticky(struct inode *dir, struct inode *inode);
> --
> 2.16.4
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount
  2020-01-04 20:39 ` [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount James Bottomley
@ 2020-01-04 23:09   ` Amir Goldstein
  2020-01-05 17:44     ` James Bottomley
  2020-01-13  3:41   ` Serge E. Hallyn
  1 sibling, 1 reply; 15+ messages in thread
From: Amir Goldstein @ 2020-01-04 23:09 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, David Howells, Christian Brauner, Al Viro,
	Miklos Szeredi, Seth Forshee, overlayfs, Stéphane Graber,
	Eric Biederman, Aleksa Sarai, Linux Containers

On Sat, Jan 4, 2020 at 10:41 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> This implementation reverse shifts according to the user_ns belonging
> to the mnt_ns.  So if the vfsmount has the newly introduced flag
> MNT_SHIFT and the current user_ns is the same as the mount_ns->user_ns
> then we shift back using the user_ns before committing to the
> underlying filesystem.
>
> For example, if a user_ns is created where interior (fake root, uid 0)
> is mapped to kernel uid 100000 then writes from interior root normally
> go to the filesystem at the kernel uid.  However, if MNT_SHIFT is set,
> they will be shifted back to write at uid 0, meaning we can bind mount
> real image filesystems to user_ns protected faker root.
>
> In essence there are several things which have to be done for this to
> occur safely.  Firstly for all operations on the filesystem, new
> credentials have to be installed where fsuid and fsgid are set to the
> *interior* values.

Must we really install new creds?
Maybe we just need to set/clear a SHIFTED flag on current creds?

i.e. instead of change_userns_creds(path)/revert_userns_creds()
how about start_shifted_creds(mnt)/end_shifted_creds().

and then cred_is_shifted() only checks the flag and no need for
all the cached creds mechanism.

current_fsuid()/current_fsgid() will take care of the shifting based on
the creds flag.

Also, you should consider placing a call to start_shifted/end_shifted
inside __mnt_want_write()/__mnt_drop_write().
This should automatically cover all writable fs ops  - including some that
you missed (setxattr).

Taking this a step further, perhaps it would make sense to wrap all readonly
fs ops with mnt_want_read()/mnt_drop_read() flavors.
Note that inode level already has a similar i_readcount access counter.

This could be used, for example, to provide a facility that is stronger than
MNT_DETACH, and weaker than filesystem "shutdown" ioctl, for blocking
new file opens (with openat()) on a mounted filesystem.

The point is, you add gating to vfs that is generic and not for single use
case (i.e. cred shifting).

Apologies in advance if  some of these ideas are ill advised.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount
  2020-01-04 23:09   ` Amir Goldstein
@ 2020-01-05 17:44     ` James Bottomley
  0 siblings, 0 replies; 15+ messages in thread
From: James Bottomley @ 2020-01-05 17:44 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, David Howells, Christian Brauner, Al Viro,
	Miklos Szeredi, Seth Forshee, overlayfs, Stéphane Graber,
	Eric Biederman, Aleksa Sarai, Linux Containers

On Sun, 2020-01-05 at 01:09 +0200, Amir Goldstein wrote:
> On Sat, Jan 4, 2020 at 10:41 PM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > 
> > This implementation reverse shifts according to the user_ns
> > belonging to the mnt_ns.  So if the vfsmount has the newly
> > introduced flag MNT_SHIFT and the current user_ns is the same as
> > the mount_ns->user_ns then we shift back using the user_ns before
> > committing to the underlying filesystem.
> > 
> > For example, if a user_ns is created where interior (fake root, uid
> > 0) is mapped to kernel uid 100000 then writes from interior root
> > normally go to the filesystem at the kernel uid.  However, if
> > MNT_SHIFT is set, they will be shifted back to write at uid 0,
> > meaning we can bind mount real image filesystems to user_ns
> > protected faker root.
> > 
> > In essence there are several things which have to be done for this
> > to occur safely.  Firstly for all operations on the filesystem, new
> > credentials have to be installed where fsuid and fsgid are set to
> > the *interior* values.
> 
> Must we really install new creds?

Well, the reason for doing it is that for uid/gid changes that's the
way everything else does it, so principle of least surprise.

However, there are two other cases where this doesn't work:

   1. inode uid/gid changes, which are compared against the real uid/gid 
   2. execution, where we need to bring the filesystem uid/gid to the
      exterior representation of the interior values for unprivileged
      execution to work.
   3. the capable_wrt_inode checks where we're usually checking if the
      inode uid/gid has an interior mapping, but since for shiftfs we're
      assuming inode uid/gid are interior we need to see if anything maps
      to them.
   4. the in_group_p checks to see whether we're in a group that's
      capable.

1. could be fixed by shifting uid/gid ... I just didn't think this was
a good idea.  2,3. can't be fixed because the direction of the check
needs to be reversed and 4 has to be done separately because group_info
 is a pointer to something that lives outside the credential.  Taking
the pointer, creating a new one and shifting every group is possible, I
just also wasn't sure if it was a wise thing to do.

> Maybe we just need to set/clear a SHIFTED flag on current creds?
> 
> i.e. instead of change_userns_creds(path)/revert_userns_creds()
> how about start_shifted_creds(mnt)/end_shifted_creds().
> 
> and then cred_is_shifted() only checks the flag and no need for
> all the cached creds mechanism.
> 
> current_fsuid()/current_fsgid() will take care of the shifting based
> on the creds flag.

So it is true, if current_fsuid/fsgid did the mapping, it would be a
fifth case above, but we'd have to be sure no-one ever used the bare
current_cred()->fsuid.  Auditing filesystems, it looks like there's
only one current case of this in namei.c:may_follow_link(), so I think
it could work ... but it's still a danger for other places, like
security module checks and things.

> Also, you should consider placing a call to start_shifted/end_shifted
> inside __mnt_want_write()/__mnt_drop_write().
> This should automatically cover all writable fs ops  - including some
> that you missed (setxattr).

xattr handling wasn't really missed, I left it out because it was the
controversial case last time.  Should the interior root be able to set
xattrs?  I think the argument was tending towards the yes except
security. prefix ones last time so perhaps it is safe to reintroduce.

> Taking this a step further, perhaps it would make sense to wrap all
> readonly fs ops with mnt_want_read()/mnt_drop_read() flavors.
> Note that inode level already has a similar i_readcount access
> counter.

Unfortunately, read and write aren't the only operations where we need
a shift, there's also lookup (which doesn't require read or write). 
Now we could also go with mnt_want_lookup/mnt_drop_lookup or simply
keep the existing shift coding on the lookup path.

> This could be used, for example, to provide a facility that is
> stronger than MNT_DETACH, and weaker than filesystem "shutdown"
> ioctl, for blocking new file opens (with openat()) on a mounted
> filesystem.
> 
> The point is, you add gating to vfs that is generic and not for
> single use case (i.e. cred shifting).
> 
> Apologies in advance if  some of these ideas are ill advised.

Of the two ideas, I think using a generic gate point, if we can find
it, is a definite winner because it programmatically identifies the
shift points.  I'm less enthused about moving the shift into
current_fsuid/fsgid because of the potential for stuff to go wrong and
because it's counter to how everything else is currently done, but I'll
let the filesystem experts weigh in on this one.  The good news is that
the two ideas aren't dependent on each other so either can be
implemented without the other.

James


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount
  2020-01-04 20:39 ` [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount James Bottomley
  2020-01-04 23:09   ` Amir Goldstein
@ 2020-01-13  3:41   ` Serge E. Hallyn
  2020-01-15 18:19     ` James Bottomley
  1 sibling, 1 reply; 15+ messages in thread
From: Serge E. Hallyn @ 2020-01-13  3:41 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, Miklos Szeredi, containers, linux-unionfs,
	David Howells, Seth Forshee, Al Viro, Eric Biederman, serge

On Sat, Jan 04, 2020 at 12:39:45PM -0800, James Bottomley wrote:
> This implementation reverse shifts according to the user_ns belonging
> to the mnt_ns.  So if the vfsmount has the newly introduced flag
> MNT_SHIFT and the current user_ns is the same as the mount_ns->user_ns
> then we shift back using the user_ns before committing to the
> underlying filesystem.
> 
> For example, if a user_ns is created where interior (fake root, uid 0)
> is mapped to kernel uid 100000 then writes from interior root normally
> go to the filesystem at the kernel uid.  However, if MNT_SHIFT is set,
> they will be shifted back to write at uid 0, meaning we can bind mount
> real image filesystems to user_ns protected faker root.

Thanks, James, I definately would like to see shifting in the VFS
api.

I have a few practical concerns about this implementation, but my biggest
concern is more fundemental:  this again by design leaves littered about
the filesystem uid-0 owned files which were written by an untrusted user.

I would feel much better if you institutionalized having the origin
shifted.  For instance, take a squashfs for a container fs, shift it
so that fsuid 0 == hostuid 100000.  Mount that, with a marker saying
how it is shifted, then set 'shiftable'.  Now use that as a base for
allowing an unpriv user to shift.  If that user has subuid 200000 as
container uid 0, then its root will write files as uid 100000 in the
fs.  This isn't perfect, but I think something along these lines would
be far safer.

Two namespaces with different uid maps can share the filesystem as though
they both had the same uidmap.  (This currently is to me the most
interesting use case for shifing bind mounts)

If the user wants to tar up the result, they can do do in their own
namespace, resulting in uid 0 shown as uid 0.  If host root wants to
do so, they can umount it everywhere and use something like fuidshift.
Or, they can also create a namespace to do the shifting to uid 0 in tar.

My more practical concerns include: (1) once a userns has set a shiftable
bind mount to shift, if it then creates a new child userns, that ns
will not see (iiuc) see the fs as shifted.  (2) there seems to be no
good reason to stick to caching the cred for only one mnt, versus
having a per-userns hashtable of creds for shifted mnts.  Was that just
a temporary shortcut or did you intend it to stay that way?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount
  2020-01-13  3:41   ` Serge E. Hallyn
@ 2020-01-15 18:19     ` James Bottomley
  2020-01-16  6:44       ` Serge E. Hallyn
  0 siblings, 1 reply; 15+ messages in thread
From: James Bottomley @ 2020-01-15 18:19 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: linux-fsdevel, Miklos Szeredi, containers, linux-unionfs,
	David Howells, Seth Forshee, Al Viro, Eric Biederman

On Sun, 2020-01-12 at 21:41 -0600, Serge E. Hallyn wrote:
> On Sat, Jan 04, 2020 at 12:39:45PM -0800, James Bottomley wrote:
> > This implementation reverse shifts according to the user_ns
> > belonging to the mnt_ns.  So if the vfsmount has the newly
> > introduced flag MNT_SHIFT and the current user_ns is the same as
> > the mount_ns->user_ns then we shift back using the user_ns before
> > committing to the underlying filesystem.
> > 
> > For example, if a user_ns is created where interior (fake root, uid
> > 0) is mapped to kernel uid 100000 then writes from interior root
> > normally go to the filesystem at the kernel uid.  However, if
> > MNT_SHIFT is set, they will be shifted back to write at uid 0,
> > meaning we can bind mount real image filesystems to user_ns
> > protected faker root.
> 
> Thanks, James, I definately would like to see shifting in the VFS
> api.
> 
> I have a few practical concerns about this implementation, but my
> biggest concern is more fundemental:  this again by design leaves
> littered about the filesystem uid-0 owned files which were written by
> an untrusted user.

Well, I think that's a consequence of my use case: using unmodified
container images with the user namespace.  We're starting to do IMA/EVM
signatures in our images, so shifted UID images aren't an option for us
.  Therefore I have to figure out a way of allowing an untrusted user
to write safely at UID zero.  For me that safety comes from strictly
corralling where they can write and making sure the container
orchestration system sets it up correctly.

> I would feel much better if you institutionalized having the origin
> shifted.  For instance, take a squashfs for a container fs, shift it
> so that fsuid 0 == hostuid 100000.  Mount that, with a marker saying
> how it is shifted, then set 'shiftable'.  Now use that as a base for
> allowing an unpriv user to shift.  If that user has subuid 200000 as
> container uid 0, then its root will write files as uid 100000 in the
> fs.  This isn't perfect, but I think something along these lines
> would be far safer.

OK, so I fully agree that if you're not doing integrity in the
container, then this is an option for you and whatever API gets
upstreamed should cope with that case.

So to push on the API a bit, what do you want?  The reverse along the
user_ns one I implemented is easy: a single flag tells you to map back
or not.  However, the implementation is phrased in terms of shifted
credentials, so as long as we know how to map, it can work for both our
use cases.  I think in plumbers you expressed interest in simply
passing the map to the mount rather than doing it via a user_ns; is
that still the case?

> Two namespaces with different uid maps can share the filesystem as
> though they both had the same uidmap.  (This currently is to me the
> most interesting use case for shifing bind mounts)
> 
> If the user wants to tar up the result, they can do do in their own
> namespace, resulting in uid 0 shown as uid 0.  If host root wants to
> do so, they can umount it everywhere and use something like
> fuidshift. Or, they can also create a namespace to do the shifting to
> uid 0 in tar.
> 
> My more practical concerns include: (1) once a userns has set a
> shiftable bind mount to shift, if it then creates a new child userns,
> that ns will not see (iiuc) see the fs as shifted.

Actually, it will.  The shift is a property of the vfsmount (or
underlying struct mount), so if you create a new user_ns then a
mount_ns, the new mount_ns inherits the shift flag and you go back
along your new user_ns.  If you don't create a new mount_ns, you still
get the shift flag, but you're still being shifted along the parent
user_ns, not your new one.

>   (2) there seems to be no good reason to stick to caching the cred
> for only one mnt, versus having a per-userns hashtable of creds for
> shifted mnts.  Was that just a temporary shortcut or did you intend
> it to stay that way?

Well, it was a demo of mechanism, but remember this cache is per thread
of execution.  Is there really going to be a use case where one thread
of execution traverses multiple user_ns's and so would need multiple
shifted credentials?  The current implementation could be backed by a
hashtable, but I really think it would only make sense if the creds
could be global, but they're not they're strictly thread specific
entities.

James



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount
  2020-01-15 18:19     ` James Bottomley
@ 2020-01-16  6:44       ` Serge E. Hallyn
  2020-01-16 16:29         ` James Bottomley
  0 siblings, 1 reply; 15+ messages in thread
From: Serge E. Hallyn @ 2020-01-16  6:44 UTC (permalink / raw)
  To: James Bottomley
  Cc: Serge E. Hallyn, linux-fsdevel, Miklos Szeredi, containers,
	linux-unionfs, David Howells, Seth Forshee, Al Viro,
	Eric Biederman

On Wed, Jan 15, 2020 at 10:19:20AM -0800, James Bottomley wrote:
> On Sun, 2020-01-12 at 21:41 -0600, Serge E. Hallyn wrote:
> > On Sat, Jan 04, 2020 at 12:39:45PM -0800, James Bottomley wrote:
> > > This implementation reverse shifts according to the user_ns
> > > belonging to the mnt_ns.  So if the vfsmount has the newly
> > > introduced flag MNT_SHIFT and the current user_ns is the same as
> > > the mount_ns->user_ns then we shift back using the user_ns before
> > > committing to the underlying filesystem.
> > > 
> > > For example, if a user_ns is created where interior (fake root, uid
> > > 0) is mapped to kernel uid 100000 then writes from interior root
> > > normally go to the filesystem at the kernel uid.  However, if
> > > MNT_SHIFT is set, they will be shifted back to write at uid 0,
> > > meaning we can bind mount real image filesystems to user_ns
> > > protected faker root.
> > 
> > Thanks, James, I definately would like to see shifting in the VFS
> > api.
> > 
> > I have a few practical concerns about this implementation, but my
> > biggest concern is more fundemental:  this again by design leaves
> > littered about the filesystem uid-0 owned files which were written by
> > an untrusted user.
> 
> Well, I think that's a consequence of my use case: using unmodified
> container images with the user namespace.  We're starting to do IMA/EVM
> signatures in our images, so shifted UID images aren't an option for us
> .  Therefore I have to figure out a way of allowing an untrusted user
> to write safely at UID zero.  For me that safety comes from strictly
> corralling where they can write and making sure the container
> orchestration system sets it up correctly.

Isn't that a matter of convention?  You could ship, store, and measure
the files already shifted.  An OCI annotation could show the offset,
say 100000.

Now if any admin runs across this device noone will be tricked by the root
owned files.

Mount could conceivably look like:

	mount --bind --origin-uid 100000 --shift /proc/50/ns/user /src /dest

(the --shift idea coming from Tycho).  I'd prefer --origin to be another
user namespace fd, which I suppose some tool could easily set up, for
instance:

	pid1=`setup-userns-fd -m b:0:100000:65536`
	pid2=$(prepare a container userns)
	mount --bind --shift-origin=/proc/$pid1/ns/user \
		--shift-target=/proc/$pid2/ns/user /src /dest

You could presumably always skip the shift-origin to achieve what you're
doing now.

> > I would feel much better if you institutionalized having the origin
> > shifted.  For instance, take a squashfs for a container fs, shift it
> > so that fsuid 0 == hostuid 100000.  Mount that, with a marker saying
> > how it is shifted, then set 'shiftable'.  Now use that as a base for
> > allowing an unpriv user to shift.  If that user has subuid 200000 as
> > container uid 0, then its root will write files as uid 100000 in the
> > fs.  This isn't perfect, but I think something along these lines
> > would be far safer.
> 
> OK, so I fully agree that if you're not doing integrity in the
> container, then this is an option for you and whatever API gets
> upstreamed should cope with that case.
> 
> So to push on the API a bit, what do you want?  The reverse along the
> user_ns one I implemented is easy: a single flag tells you to map back
> or not.  However, the implementation is phrased in terms of shifted
> credentials, so as long as we know how to map, it can work for both our
> use cases.  I think in plumbers you expressed interest in simply
> passing the map to the mount rather than doing it via a user_ns; is
> that still the case?

Oh I think I'm fine either way - I can always create a user_ns to match
the map I want.

-serge

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount
  2020-01-16  6:44       ` Serge E. Hallyn
@ 2020-01-16 16:29         ` James Bottomley
  2020-01-17 15:44           ` Serge E. Hallyn
  0 siblings, 1 reply; 15+ messages in thread
From: James Bottomley @ 2020-01-16 16:29 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Eric Biederman, Miklos Szeredi, containers, linux-unionfs,
	David Howells, Seth Forshee, Al Viro, linux-fsdevel

On Thu, 2020-01-16 at 00:44 -0600, Serge E. Hallyn wrote:
> On Wed, Jan 15, 2020 at 10:19:20AM -0800, James Bottomley wrote:
> > On Sun, 2020-01-12 at 21:41 -0600, Serge E. Hallyn wrote:
> > > On Sat, Jan 04, 2020 at 12:39:45PM -0800, James Bottomley wrote:
> > > > This implementation reverse shifts according to the user_ns
> > > > belonging to the mnt_ns.  So if the vfsmount has the newly
> > > > introduced flag MNT_SHIFT and the current user_ns is the same
> > > > as the mount_ns->user_ns then we shift back using the user_ns
> > > > before committing to the underlying filesystem.
> > > > 
> > > > For example, if a user_ns is created where interior (fake root,
> > > > uid 0) is mapped to kernel uid 100000 then writes from interior
> > > > root normally go to the filesystem at the kernel uid.  However,
> > > > if MNT_SHIFT is set, they will be shifted back to write at uid
> > > > 0, meaning we can bind mount real image filesystems to user_ns
> > > > protected faker root.
> > > 
> > > Thanks, James, I definately would like to see shifting in the VFS
> > > api.
> > > 
> > > I have a few practical concerns about this implementation, but my
> > > biggest concern is more fundemental:  this again by design leaves
> > > littered about the filesystem uid-0 owned files which were
> > > written by an untrusted user.
> > 
> > Well, I think that's a consequence of my use case: using unmodified
> > container images with the user namespace.  We're starting to do
> > IMA/EVM signatures in our images, so shifted UID images aren't an
> > option for us.  Therefore I have to figure out a way of allowing an
> > untrusted user to write safely at UID zero.  For me that safety
> > comes from strictly corralling where they can write and making sure
> > the container orchestration system sets it up correctly.
> 
> Isn't that a matter of convention?  You could ship, store, and
> measure the files already shifted.  An OCI annotation could show the
> offset, say 100000.

We could, but it's the wrong way to look at it to tell a customer that
if they want us to run the image safely they have to modify it at the
build stage.  As a cloud service provider I want to make the statement
that I can run any customer image safely as long as it was built to
whatever standards the registry supports.  That has to include
integrity protected images.  And I have to be able to attest to a
customer that I'm running their image as part of the customer integrity
verification.

> Now if any admin runs across this device noone will be tricked by the
> root owned files.

Perhaps you could go into what tricks you think will happen?  This is
clearly the thread model of using unmodified images you have which
might be different from the one I have.  My mitigation is basically
that as long as no tenant or unprivileged user can get at the unshifted
image, we're fine.

> Mount could conceivably look like:
> 
> 	mount --bind --origin-uid 100000 --shift /proc/50/ns/user /src
> /dest
> 
> (the --shift idea coming from Tycho).

Just so we're clear --origin-uid <uid> means map back along the --shift 
user_ns but add this <uid> to whatever interior id the shift produces? 
I think that's fairly easy to parametrise and store in the bind mount,
yes.

>   I'd prefer --origin to be another user namespace fd, which I
> suppose some tool could easily set up, for instance:
> 
> 	pid1=`setup-userns-fd -m b:0:100000:65536`
> 	pid2=$(prepare a container userns)
> 	mount --bind --shift-origin=/proc/$pid1/ns/user \
> 		--shift-target=/proc/$pid2/ns/user /src /dest
> 
> You could presumably always skip the shift-origin to achieve what
> you're doing now.

Yes, if you're happy to have --shift-origin <uid> default to 0

I have to ask in the above, what is the point of the pid1 user_ns?  Do
you ever use pid1 for anything else? It looks like you were merely
creating it for the object of having it passed into the bind.  If
there's never any use for the --shift-origin <ns_fd> then I think I
agree that a bare number is a better abstraction.  Or are you thinking
we'll have use cases where a simple numeric addition won't serve and
our only user mechanism for complex parametrisation of the shift map is
a user_ns?

The other slight problem is that now the bind mount does need to
understand complex arguments, which it definitely doesn't today.  I'm
happy with extending fsconfig to bind, so it can do complex arguments
like this, but it sounds like others are dubious so doing the above
also depends on agreeing whatever extension we do to bind.

I suppose bind reconfigure could be yet another system call in the
open_tree/move_mount pantheon, which would also solve the remount with
different bind parameters problem with the new API.

The other thing I worry about is that is separating the shift_user_ns
from the mount_ns->user_ns a potential security hole?  For the
unprivileged operation of this, I like the idea of enforcing them to be
the same so the tenant can only shift back along a user_ns they're
operating in.  The problem being that the kernel has no way of
validating that the passed in <ns_fd> is within the subuid/subgid range
of the unprivileged user, so we're trusting that the user can't get
access to the ns_fd of a user_ns outside that range.

> > > I would feel much better if you institutionalized having the
> > > origin shifted.  For instance, take a squashfs for a container
> > > fs, shift it so that fsuid 0 == hostuid 100000.  Mount that, with
> > > a marker saying how it is shifted, then set 'shiftable'.  Now use
> > > that as a base for allowing an unpriv user to shift.  If that
> > > user has subuid 200000 as container uid 0, then its root will
> > > write files as uid 100000 in the fs.  This isn't perfect, but I
> > > think something along these lines would be far safer.
> > 
> > OK, so I fully agree that if you're not doing integrity in the
> > container, then this is an option for you and whatever API gets
> > upstreamed should cope with that case.
> > 
> > So to push on the API a bit, what do you want?  The reverse along
> > the user_ns one I implemented is easy: a single flag tells you to
> > map back or not.  However, the implementation is phrased in terms
> > of shifted credentials, so as long as we know how to map, it can
> > work for both our use cases.  I think in plumbers you expressed
> > interest in simply passing the map to the mount rather than doing
> > it via a user_ns; is that still the case?
> 
> Oh I think I'm fine either way - I can always create a user_ns to
> match the map I want.

I think it comes down to whether there's an actual use for the user_ns
you create.  It seems a bit wasteful merely to create a user_ns for the
purpose of passing something that can also be simply parametrised if
there's no further use for that user_ns.

James


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount
  2020-01-16 16:29         ` James Bottomley
@ 2020-01-17 15:44           ` Serge E. Hallyn
  2020-01-17 16:25             ` James Bottomley
  0 siblings, 1 reply; 15+ messages in thread
From: Serge E. Hallyn @ 2020-01-17 15:44 UTC (permalink / raw)
  To: James Bottomley
  Cc: Serge E. Hallyn, Eric Biederman, Miklos Szeredi, containers,
	linux-unionfs, David Howells, Seth Forshee, Al Viro,
	linux-fsdevel

On Thu, Jan 16, 2020 at 08:29:33AM -0800, James Bottomley wrote:
> On Thu, 2020-01-16 at 00:44 -0600, Serge E. Hallyn wrote:
> > On Wed, Jan 15, 2020 at 10:19:20AM -0800, James Bottomley wrote:
> > > On Sun, 2020-01-12 at 21:41 -0600, Serge E. Hallyn wrote:
> > > > On Sat, Jan 04, 2020 at 12:39:45PM -0800, James Bottomley wrote:
> > > > > This implementation reverse shifts according to the user_ns
> > > > > belonging to the mnt_ns.  So if the vfsmount has the newly
> > > > > introduced flag MNT_SHIFT and the current user_ns is the same
> > > > > as the mount_ns->user_ns then we shift back using the user_ns
> > > > > before committing to the underlying filesystem.
> > > > > 
> > > > > For example, if a user_ns is created where interior (fake root,
> > > > > uid 0) is mapped to kernel uid 100000 then writes from interior
> > > > > root normally go to the filesystem at the kernel uid.  However,
> > > > > if MNT_SHIFT is set, they will be shifted back to write at uid
> > > > > 0, meaning we can bind mount real image filesystems to user_ns
> > > > > protected faker root.
> > > > 
> > > > Thanks, James, I definately would like to see shifting in the VFS
> > > > api.
> > > > 
> > > > I have a few practical concerns about this implementation, but my
> > > > biggest concern is more fundemental:  this again by design leaves
> > > > littered about the filesystem uid-0 owned files which were
> > > > written by an untrusted user.
> > > 
> > > Well, I think that's a consequence of my use case: using unmodified
> > > container images with the user namespace.  We're starting to do
> > > IMA/EVM signatures in our images, so shifted UID images aren't an
> > > option for us.  Therefore I have to figure out a way of allowing an
> > > untrusted user to write safely at UID zero.  For me that safety
> > > comes from strictly corralling where they can write and making sure
> > > the container orchestration system sets it up correctly.
> > 
> > Isn't that a matter of convention?  You could ship, store, and
> > measure the files already shifted.  An OCI annotation could show the
> > offset, say 100000.
> 
> We could, but it's the wrong way to look at it to tell a customer that
> if they want us to run the image safely they have to modify it at the
> build stage.  As a cloud service provider I want to make the statement
> that I can run any customer image safely as long as it was built to
> whatever standards the registry supports.  That has to include
> integrity protected images.  And I have to be able to attest to a

And does the customer measure the files, or do you?

> customer that I'm running their image as part of the customer integrity
> verification.

Makes sense.  And in your environment, you can easily partition off a
place (or an otherwise unused namespace) in which to mount these images.
So using a null mapping for the 'origin' would make sense there.

But in cases where what you want is a single directory shared by several
containers with disjoint uid mappings, where this is the only directory
they share - be it for logs, or data, etc, and be it by infrastructure
containers in the course of running a cluster or a set of students
manipulating shared data with their otherwise completely unprivileged
containers - we can make the shared directory a lot less of a minefield.

> > Now if any admin runs across this device noone will be tricked by the
> > root owned files.
> 
> Perhaps you could go into what tricks you think will happen?  This is

I don't like to use my own underactive imagination to decide what an
attacker - or accidental fool - might be likely to do.  But simply
writing a setuid-root shell script called 'ls' will probably hit
*someone* who against all advice has . at the front of their path.

(Don't look at me like that - it's 2020 and we still have flashy
respectable-looking websites encouraging people to wget | sudo /bin/sh)

> clearly the thread model of using unmodified images you have which
> might be different from the one I have.  My mitigation is basically
> that as long as no tenant or unprivileged user can get at the unshifted
> image, we're fine.

Are you sure?  What if $package accidentally ships a broken cronjob
that tries to run ./bin/sh -c "logger $(date)" ?

> > Mount could conceivably look like:
> > 
> > 	mount --bind --origin-uid 100000 --shift /proc/50/ns/user /src
> > /dest
> > 
> > (the --shift idea coming from Tycho).
> 
> Just so we're clear --origin-uid <uid> means map back along the --shift 
> user_ns but add this <uid> to whatever interior id the shift produces? 

If by interior id you mean the kuid, then yes :)

> I think that's fairly easy to parametrise and store in the bind mount,
> yes.
> 
> >   I'd prefer --origin to be another user namespace fd, which I
> > suppose some tool could easily set up, for instance:
> > 
> > 	pid1=`setup-userns-fd -m b:0:100000:65536`
> > 	pid2=$(prepare a container userns)
> > 	mount --bind --shift-origin=/proc/$pid1/ns/user \
> > 		--shift-target=/proc/$pid2/ns/user /src /dest
> > 
> > You could presumably always skip the shift-origin to achieve what
> > you're doing now.
> 
> Yes, if you're happy to have --shift-origin <uid> default to 0

Yeah I think that's fine.  I'd expect any distro which tries to configure
this for easy consumption to allocate a 65k subuid range for 'images',
and set a default shift-origin under /etc which 'mount' would consult,
or something like that.  The kernel almost certainly would default to 0.

> I have to ask in the above, what is the point of the pid1 user_ns?  Do
> you ever use pid1 for anything else?

Probably not.

> It looks like you were merely
> creating it for the object of having it passed into the bind.  If
> there's never any use for the --shift-origin <ns_fd> then I think I
> agree that a bare number is a better abstraction.  Or are you thinking
> we'll have use cases where a simple numeric addition won't serve and
> our only user mechanism for complex parametrisation of the shift map is
> a user_ns?

I don't think so.  People can have some pretty convoluted uid mappings
right now, but presumably the images we are talking about would be
the result of an rsync or tar *in* such a namespace.  Though again, limited
imagination and all that.  There *may* be very good use cases for a more
complicated mapping.

> The other slight problem is that now the bind mount does need to
> understand complex arguments, which it definitely doesn't today.  I'm
> happy with extending fsconfig to bind, so it can do complex arguments
> like this, but it sounds like others are dubious so doing the above
> also depends on agreeing whatever extension we do to bind.
> 
> I suppose bind reconfigure could be yet another system call in the
> open_tree/move_mount pantheon, which would also solve the remount with
> different bind parameters problem with the new API.
> 
> The other thing I worry about is that is separating the shift_user_ns
> from the mount_ns->user_ns a potential security hole?  For the
> unprivileged operation of this, I like the idea of enforcing them to be
> the same so the tenant can only shift back along a user_ns they're
> operating in.  The problem being that the kernel has no way of
> validating that the passed in <ns_fd> is within the subuid/subgid range
> of the unprivileged user, so we're trusting that the user can't get
> access to the ns_fd of a user_ns outside that range.

I guess I figured we would have privileged task in the owning namespace
(presumably init_user_ns) mark a bind mount as shiftable (maybe
specifying who is allowed to bind mount it using the mapped root uid,
analogous to how the namespaced file capabilities are identified) and
then the ns_fd of the task doing the "mount --bind --shift" (which is
privileged inside the ns_fd userns) would be used, unmodified (or even
modified, since whatever uid args the task would pass would have to be
valid inside the mounting userns)

So something like:

1. On the host:

   mount --bind --mark-shiftable-by 200000 --origin-uid 100000 /data/group1

2. In the container which has its root mapped to host uid 200000

   mount --bind --shift /data/group1 /data/group1

> > > > I would feel much better if you institutionalized having the
> > > > origin shifted.  For instance, take a squashfs for a container
> > > > fs, shift it so that fsuid 0 == hostuid 100000.  Mount that, with
> > > > a marker saying how it is shifted, then set 'shiftable'.  Now use
> > > > that as a base for allowing an unpriv user to shift.  If that
> > > > user has subuid 200000 as container uid 0, then its root will
> > > > write files as uid 100000 in the fs.  This isn't perfect, but I
> > > > think something along these lines would be far safer.
> > > 
> > > OK, so I fully agree that if you're not doing integrity in the
> > > container, then this is an option for you and whatever API gets
> > > upstreamed should cope with that case.
> > > 
> > > So to push on the API a bit, what do you want?  The reverse along
> > > the user_ns one I implemented is easy: a single flag tells you to
> > > map back or not.  However, the implementation is phrased in terms
> > > of shifted credentials, so as long as we know how to map, it can
> > > work for both our use cases.  I think in plumbers you expressed
> > > interest in simply passing the map to the mount rather than doing
> > > it via a user_ns; is that still the case?
> > 
> > Oh I think I'm fine either way - I can always create a user_ns to
> > match the map I want.
> 
> I think it comes down to whether there's an actual use for the user_ns
> you create.  It seems a bit wasteful merely to create a user_ns for the
> purpose of passing something that can also be simply parametrised if
> there's no further use for that user_ns.

Oh - I consider the detail of whether we pass a userid or userns nsfd as
more of an implementation detail which we can hash out after the more
general shift-mount api is decided upon.  Anyway, passing nsfds just has
a cool factor :)

-serge

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount
  2020-01-17 15:44           ` Serge E. Hallyn
@ 2020-01-17 16:25             ` James Bottomley
  2020-01-17 21:19               ` Tycho Andersen
  0 siblings, 1 reply; 15+ messages in thread
From: James Bottomley @ 2020-01-17 16:25 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Eric Biederman, Miklos Szeredi, containers, linux-unionfs,
	David Howells, Seth Forshee, Al Viro, linux-fsdevel

On Fri, 2020-01-17 at 09:44 -0600, Serge E. Hallyn wrote:
> On Thu, Jan 16, 2020 at 08:29:33AM -0800, James Bottomley wrote:
> > On Thu, 2020-01-16 at 00:44 -0600, Serge E. Hallyn wrote:
> > > On Wed, Jan 15, 2020 at 10:19:20AM -0800, James Bottomley wrote:
> > > > On Sun, 2020-01-12 at 21:41 -0600, Serge E. Hallyn wrote:
> > > > > On Sat, Jan 04, 2020 at 12:39:45PM -0800, James Bottomley
> > > > > wrote:
> > > > > > This implementation reverse shifts according to the user_ns
> > > > > > belonging to the mnt_ns.  So if the vfsmount has the newly
> > > > > > introduced flag MNT_SHIFT and the current user_ns is the
> > > > > > same as the mount_ns->user_ns then we shift back using the
> > > > > > user_ns before committing to the underlying filesystem.
> > > > > > 
> > > > > > For example, if a user_ns is created where interior (fake
> > > > > > root,uid 0) is mapped to kernel uid 100000 then writes from
> > > > > > interior root normally go to the filesystem at the kernel
> > > > > > uid.  However, if MNT_SHIFT is set, they will be shifted
> > > > > > back to write at uid 0, meaning we can bind mount real
> > > > > > image filesystems to user_ns protected faker root.
> > > > > 
> > > > > Thanks, James, I definately would like to see shifting in the
> > > > > VFS api.
> > > > > 
> > > > > I have a few practical concerns about this implementation,
> > > > > but my biggest concern is more fundemental:  this again by
> > > > > design leaves littered about the filesystem uid-0 owned files
> > > > > which were written by an untrusted user.
> > > > 
> > > > Well, I think that's a consequence of my use case: using
> > > > unmodified container images with the user namespace.  We're
> > > > starting to do IMA/EVM signatures in our images, so shifted UID
> > > > images aren't an option for us.  Therefore I have to figure out
> > > > a way of allowing an untrusted user to write safely at UID
> > > > zero.  For me that safety comes from strictly corralling where
> > > > they can write and making sure the container orchestration
> > > > system sets it up correctly.
> > > 
> > > Isn't that a matter of convention?  You could ship, store, and
> > > measure the files already shifted.  An OCI annotation could show
> > > the offset, say 100000.
> > 
> > We could, but it's the wrong way to look at it to tell a customer
> > that if they want us to run the image safely they have to modify it
> > at the build stage.  As a cloud service provider I want to make the
> > statement that I can run any customer image safely as long as it
> > was built to whatever standards the registry supports.  That has to
> > include integrity protected images.  And I have to be able to
> > attest to a 
>  
> And does the customer measure the files, or do you?

Well, the customer signs the image, which is why we can't alter it. 
The idea would be that the CSP does the measurement and the provision
of the running dashboard but that the customer could also do it
themselves if they didn't trust the CSP ... or just wanted to verify
the CSP integrity tool was working correctly.

> > customer that I'm running their image as part of the customer
> > integrity verification.
> 
> Makes sense.  And in your environment, you can easily partition off a
> place (or an otherwise unused namespace) in which to mount these
> images. So using a null mapping for the 'origin' would make sense
> there.
> 
> But in cases where what you want is a single directory shared by
> several containers with disjoint uid mappings, where this is the only
> directory they share - be it for logs, or data, etc, and be it by
> infrastructure containers in the course of running a cluster or a set
> of students manipulating shared data with their otherwise completely
> unprivileged containers - we can make the shared directory a lot less
> of a minefield.

Sure ... I didn't say you didn't have a use case.   I was just pointing
out that I have a write at uid 0 one.  I'm happy to implement a scheme
that covers both.

> > > Now if any admin runs across this device noone will be tricked by
> > > the root owned files.
> > 
> > Perhaps you could go into what tricks you think will happen?  This
> > is
> 
> I don't like to use my own underactive imagination to decide what an
> attacker - or accidental fool - might be likely to do.  But simply
> writing a setuid-root shell script called 'ls' will probably hit
> *someone* who against all advice has . at the front of their path.

The suid/sgid is already on my list of potential threats.  It's
mitigated by never letting the tenant get access to where the unshifted
image is unpacked.

> (Don't look at me like that - it's 2020 and we still have flashy
> respectable-looking websites encouraging people to wget | sudo
> /bin/sh)

Person with root access being stupid is definitely a problem, but it's
not really in my threat model.

> > clearly the thread model of using unmodified images you have which
> > might be different from the one I have.  My mitigation is basically
> > that as long as no tenant or unprivileged user can get at the
> > unshifted image, we're fine.
> 
> Are you sure?  What if $package accidentally ships a broken cronjob
> that tries to run ./bin/sh -c "logger $(date)" ?

You mean sends a message to the systemd log socket?  That's usually
intercepted to go into the per-container Kafka receiver (or whatever
else the CSP uses for logging).

> > > Mount could conceivably look like:
> > > 
> > > 	mount --bind --origin-uid 100000 --shift /proc/50/ns/user /src
> > > /dest
> > > 
> > > (the --shift idea coming from Tycho).
> > 
> > Just so we're clear --origin-uid <uid> means map back along the --
> > shift user_ns but add this <uid> to whatever interior id the shift
> > produces? 
> 
> If by interior id you mean the kuid, then yes :)

Oh, no ... we need to get the terminogy straight.  The kuid, as in the
uid the kernel sees, is what I think of as the exterior uid.  The uid
the container tenant thinks they see is the interior one.

So let's say the user_ns maps interior 0 to exterior 20,000 but that
the image begins at 100,000.  You have an --origin-uid of 100000 in the
above, I believe.  How we get there is the user as sudo root writes a
file "f".  "f" has interior owner 0.  However, the exterior owner,
which is what usually gets written, as 20,000.  Doing the shift I'd
take the 20,000; shift back along the tenant user_ns to get 0 and then
add the 100,000 offset to end up writing to the image at 100,000

That would mean a bunch of different user_ns could be set up all
shifting back to the same 100,000 and thus share the image.

> > I think that's fairly easy to parametrise and store in the bind
> > mount, yes.
> > 
> > >   I'd prefer --origin to be another user namespace fd, which I
> > > suppose some tool could easily set up, for instance:
> > > 
> > > 	pid1=`setup-userns-fd -m b:0:100000:65536`
> > > 	pid2=$(prepare a container userns)
> > > 	mount --bind --shift-origin=/proc/$pid1/ns/user \
> > > 		--shift-target=/proc/$pid2/ns/user /src /dest
> > > 
> > > You could presumably always skip the shift-origin to achieve what
> > > you're doing now.
> > 
> > Yes, if you're happy to have --shift-origin <uid> default to 0
> 
> Yeah I think that's fine.  I'd expect any distro which tries to
> configure this for easy consumption to allocate a 65k subuid range
> for 'images', and set a default shift-origin under /etc which 'mount'
> would consult, or something like that.  The kernel almost certainly
> would default to 0.
> 
> > I have to ask in the above, what is the point of the pid1
> > user_ns?  Do you ever use pid1 for anything else?
> 
> Probably not.
> 
> > It looks like you were merely creating it for the object of having
> > it passed into the bind.  If there's never any use for the --shift-
> > origin <ns_fd> then I think I agree that a bare number is a better
> > abstraction.  Or are you thinking we'll have use cases where a
> > simple numeric addition won't serve and our only user mechanism for
> > complex parametrisation of the shift map is a user_ns?
> 
> I don't think so.  People can have some pretty convoluted uid
> mappings right now, but presumably the images we are talking about
> would be the result of an rsync or tar *in* such a namespace.  Though
> again, limited imagination and all that.  There *may* be very good
> use cases for a more complicated mapping.

Would it be OK to implement the simple now and add the complex later if
it ever materializes.  Provided we can agree on a way of passing extra
arguments to bind, we have the freedom to add stuff later.

> > The other slight problem is that now the bind mount does need to
> > understand complex arguments, which it definitely doesn't
> > today.  I'm happy with extending fsconfig to bind, so it can do
> > complex arguments like this, but it sounds like others are dubious
> > so doing the above also depends on agreeing whatever extension we
> > do to bind.
> > 
> > I suppose bind reconfigure could be yet another system call in the
> > open_tree/move_mount pantheon, which would also solve the remount
> > with different bind parameters problem with the new API.
> > 
> > The other thing I worry about is that is separating the
> > shift_user_ns from the mount_ns->user_ns a potential security
> > hole?  For the unprivileged operation of this, I like the idea of
> > enforcing them to be the same so the tenant can only shift back
> > along a user_ns they're operating in.  The problem being that the
> > kernel has no way of validating that the passed in <ns_fd> is
> > within the subuid/subgid range of the unprivileged user, so we're
> > trusting that the user can't get access to the ns_fd of a user_ns
> > outside that range.
> 
> I guess I figured we would have privileged task in the owning
> namespace (presumably init_user_ns) mark a bind mount as shiftable 

Yes, that's what I've got today in the prototype.  It mirrors the
original shiftfs mechanism.  However, I have also heard people say they
want a permanent mark, like an xattr for this.

> (maybe specifying who is allowed to bind mount it using the mapped
> root uid, analogous to how the namespaced file capabilities are
> identified) and then the ns_fd of the task doing the "mount --bind --
> shift" (which is privileged inside the ns_fd userns) would be used,
> unmodified (or even modified, since whatever uid args the task would
> pass would have to be valid inside the mounting userns)
> 
> So something like:
> 
> 1. On the host:
> 
>    mount --bind --mark-shiftable-by 200000 --origin-uid 100000
> /data/group1
> 
> 2. In the container which has its root mapped to host uid 200000
> 
>    mount --bind --shift /data/group1 /data/group1

We can certainly do that, but it does mean one mark (i.e. one mount
point, so /data/group<n>) per user_ns at different uids.  A simpler
alternative may be to do the mark as

mount --bind --mark-shiftable --origin-uid 100000 /data/group

And regulate access to /data/group by the usual filesystem ACL.  Then
anyone who can get at /data/group can do

mount --bind --shift /data/group1 /my/place/for/the/image

I tend to think that ACL security is sufficient, since it's what
everyone is used to but I don't object to having the additional origin
check as well.

> > > > > I would feel much better if you institutionalized having the
> > > > > origin shifted.  For instance, take a squashfs for a
> > > > > container fs, shift it so that fsuid 0 == hostuid
> > > > > 100000.  Mount that, with a marker saying how it is shifted,
> > > > > then set 'shiftable'.  Now use that as a base for allowing an
> > > > > unpriv user to shift.  If that user has subuid 200000 as
> > > > > container uid 0, then its root will write files as uid 100000
> > > > > in the fs.  This isn't perfect, but I think something along
> > > > > these lines would be far safer.
> > > > 
> > > > OK, so I fully agree that if you're not doing integrity in the
> > > > container, then this is an option for you and whatever API gets
> > > > upstreamed should cope with that case.
> > > > 
> > > > So to push on the API a bit, what do you want?  The reverse
> > > > along the user_ns one I implemented is easy: a single flag
> > > > tells you to map back or not.  However, the implementation is
> > > > phrased in terms of shifted credentials, so as long as we know
> > > > how to map, it can work for both our use cases.  I think in
> > > > plumbers you expressed interest in simply passing the map to
> > > > the mount rather than doing it via a user_ns; is that still the
> > > > case?
> > > 
> > > Oh I think I'm fine either way - I can always create a user_ns to
> > > match the map I want.
> > 
> > I think it comes down to whether there's an actual use for the
> > user_ns you create.  It seems a bit wasteful merely to create a
> > user_ns for the purpose of passing something that can also be
> > simply parametrised if there's no further use for that user_ns.
> 
> Oh - I consider the detail of whether we pass a userid or userns nsfd
> as more of an implementation detail which we can hash out after the
> more general shift-mount api is decided upon.  Anyway, passing nsfds
> just has a cool factor :)

Well, yes, won't aruge on the cool factor-ness.

James

> -serge
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount
  2020-01-17 16:25             ` James Bottomley
@ 2020-01-17 21:19               ` Tycho Andersen
  2020-01-17 22:52                 ` James Bottomley
  0 siblings, 1 reply; 15+ messages in thread
From: Tycho Andersen @ 2020-01-17 21:19 UTC (permalink / raw)
  To: James Bottomley
  Cc: Serge E. Hallyn, Miklos Szeredi, containers, linux-unionfs,
	David Howells, Seth Forshee, Eric Biederman, linux-fsdevel,
	Al Viro

On Fri, Jan 17, 2020 at 08:25:42AM -0800, James Bottomley wrote:
> On Fri, 2020-01-17 at 09:44 -0600, Serge E. Hallyn wrote:
> > On Thu, Jan 16, 2020 at 08:29:33AM -0800, James Bottomley wrote:
> > I guess I figured we would have privileged task in the owning
> > namespace (presumably init_user_ns) mark a bind mount as shiftable 
> 
> Yes, that's what I've got today in the prototype.  It mirrors the
> original shiftfs mechanism.  However, I have also heard people say they
> want a permanent mark, like an xattr for this.

Please, no. mount() failures are already hard to reason about, I would
rather not add another temporary (or worse, permanent) non-obvious
failure mode.

What if we make shifted bind mounts always readonly? That will force
people to use an overlay (or something else) on top, but they probably
want to do that anyway so they can avoid tainting the original
container image with writes.

> > Oh - I consider the detail of whether we pass a userid or userns nsfd
> > as more of an implementation detail which we can hash out after the
> > more general shift-mount api is decided upon.  Anyway, passing nsfds
> > just has a cool factor :)
> 
> Well, yes, won't aruge on the cool factor-ness.

It's not just the cool factor: if you're doing this, it's presumably
because you want to use it with a container in a user namespace.
Specifying the same parameters twice leaves room for error, causing
CVEs and more work.

Tycho

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount
  2020-01-17 21:19               ` Tycho Andersen
@ 2020-01-17 22:52                 ` James Bottomley
  0 siblings, 0 replies; 15+ messages in thread
From: James Bottomley @ 2020-01-17 22:52 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Serge E. Hallyn, Miklos Szeredi, containers, linux-unionfs,
	David Howells, Seth Forshee, Eric Biederman, linux-fsdevel,
	Al Viro

On Fri, 2020-01-17 at 13:19 -0800, Tycho Andersen wrote:
> On Fri, Jan 17, 2020 at 08:25:42AM -0800, James Bottomley wrote:
> > On Fri, 2020-01-17 at 09:44 -0600, Serge E. Hallyn wrote:
> > > On Thu, Jan 16, 2020 at 08:29:33AM -0800, James Bottomley wrote:
> > > I guess I figured we would have privileged task in the owning
> > > namespace (presumably init_user_ns) mark a bind mount as
> > > shiftable 
> > 
> > Yes, that's what I've got today in the prototype.  It mirrors the
> > original shiftfs mechanism.  However, I have also heard people say
> > they want a permanent mark, like an xattr for this.
> 
> Please, no. mount() failures are already hard to reason about, I
> would rather not add another temporary (or worse, permanent) non-
> obvious failure mode.

I'm not particularly bothered either way ... although using xattrs
always seems to end up biting us for nesting, so I wasn't wildly
enthusiastic about it.

> What if we make shifted bind mounts always readonly? That will force
> people to use an overlay (or something else) on top, but they
> probably want to do that anyway so they can avoid tainting the
> original container image with writes.

That really causes problems for the mutable (non-docker) container use
case which is pretty much the way I always use containers.  Who wants
to bother with overlayfs when their image is expected to mutate: it's
just a huge hassle.

> > > Oh - I consider the detail of whether we pass a userid or userns
> > > nsfd as more of an implementation detail which we can hash out
> > > after the more general shift-mount api is decided upon.  Anyway,
> > > passing nsfds just has a cool factor :)
> > 
> > Well, yes, won't aruge on the cool factor-ness.
> 
> It's not just the cool factor: if you're doing this, it's presumably
> because you want to use it with a container in a user namespace.
> Specifying the same parameters twice leaves room for error, causing
> CVEs and more work.

It depends.  For the offset, we agreed there's no extant user_ns, so
you have to create one specifically.  That leads to a more error prone
setup with no actual checking benefit.

For the shift_ns, it depends whether you want one mount point per
tenant, in which case the tenant user_ns might be a useful check, or
one mount point with an ACL in which case you just backshift along the
binding tenant user_ns.

James


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-01-17 22:52 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-04 20:39 [PATCH v2 0/3] introduce a uid/gid shifting bind mount James Bottomley
2020-01-04 20:39 ` [PATCH v2 1/3] fs: rethread notify_change to take a path instead of a dentry James Bottomley
2020-01-04 21:52   ` Amir Goldstein
2020-01-04 20:39 ` [PATCH v2 2/3] fs: introduce uid/gid shifting bind mount James Bottomley
2020-01-04 23:09   ` Amir Goldstein
2020-01-05 17:44     ` James Bottomley
2020-01-13  3:41   ` Serge E. Hallyn
2020-01-15 18:19     ` James Bottomley
2020-01-16  6:44       ` Serge E. Hallyn
2020-01-16 16:29         ` James Bottomley
2020-01-17 15:44           ` Serge E. Hallyn
2020-01-17 16:25             ` James Bottomley
2020-01-17 21:19               ` Tycho Andersen
2020-01-17 22:52                 ` James Bottomley
2020-01-04 20:39 ` [PATCH v2 3/3] fs: expose shifting bind mount to userspace James Bottomley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).