* [RFC 0/1] shiftfs: uid/gid shifting filesystem (s_user_ns version)
@ 2017-02-04 19:18 James Bottomley
  2017-02-04 19:19 ` [RFC 1/1] shiftfs: uid/gid shifting bind mount James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-04 19:18 UTC (permalink / raw)
  To: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

This is a rewrite of the original shiftfs code to make use of super
block user namespaces.  I've also removed the mappings passed in as
mount options in favour of using the mappings in s_user_ns.  The upshot
is that it probably needs retesting for all the bugs people found,
since there's a lot of new code, and the use case has changed.  Now, to
use it, you have to mark the filesystems you want to be mountable
inside a user namespace as root:

mount -t shiftfs -o mark <origin> <mark location>

The origin should be inaccessible to the unprivileged user, and the
access to the <mark location> can be controlled by the usual filesystem
permissions.  Once this is done, any user who can get access to the
<mark location> can do (as the local user namespace root):

mount -t shiftfs <mark location> <somewhere in my local mount ns>

And they will be able to write at their user namespace shifts, while
the interior view of the uid/gid is what appears on the <origin>.

Using s_user_ns actually simplified a lot of the code, because our
credential shifting now simply becomes: use the <origin> s_user_ns and
the shifted uid/gid.  The updated d_real() code from overlayfs is also
used, so shiftfs no longer needs its own file operations.

---
[original blurb]

My use case for this is that I run a lot of unprivileged architectural
emulation containers on my system using user namespaces.  Details here:

http://blog.hansenpartnership.com/unprivileged-build-containers/

They're mostly for building non-x86 stuff (like aarch64 and arm secure
boot and mips images).  For builds, I have all the environments in my
home directory with downshifted uids; however, sometimes I need to use
them to administer real images that run on systems, meaning the uids
are the usual privileged ones not the downshifted ones.  The only
current choice I have is to start the emulation as root so the uid/gids
match.  The reason for this filesystem is to use my standard
unprivileged containers to maintain these images.  The way I do this is
crack the image with a loop and then shift the uids before bringing up
the container.  I usually loop mount into /var/tmp/images/, so it's
owned by real root there:

jarvis:~ # ls -l /var/tmp/images/mips|head -4
total 0
drwxr-xr-x 1 root root 8192 May 12 08:33 bin
drwxr-xr-x 1 root root    6 May 12 08:33 boot
drwxr-xr-x 1 root root  167 May 12 08:33 dev

And I usually run my build containers with a uid_map of 

         0     100000       1000
      1000       1000          1
     65534     101000          1

(maps 0-999 shifted up by 100000, maps nobody [65534] to 101000 and
keeps my uid [1000] fixed so I can mount my home directory into the
namespace) and something similar with gid_map. So I shift mount the
mips image with

mount -t shiftfs -o uidmap=0:100000:1000,uidmap=65534:101000:1,gidmap=0:100000:100,gidmap=101:100101:899,gidmap=65533:101000:2 /var/tmp/images/mips /home/jejb/containers/mips

and I now see it as

jejb@jarvis:~> ls -l containers/mips|head -4
total 0
drwxr-xr-x 1 100000 100000 8192 May 12 08:33 bin/
drwxr-xr-x 1 100000 100000    6 May 12 08:33 boot/
drwxr-xr-x 1 100000 100000  167 May 12 08:33 dev/

This looks like my usual unprivileged build roots, and I can now use an
unprivileged container to enter and administer the image.
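
As a rough illustration of how the three uid_map lines above translate
ids, here is a small userspace sketch.  It is not code from the patch:
the names id_extent, ns_to_host, host_to_ns and build_uid_map are made
up for this example, and the kernel does its own lookup behind
from_kuid()/make_kuid().  It only assumes the usual uid_map line format
of <first id inside> <first id outside> <count>.

```c
#include <stddef.h>

/* One line of a uid_map: <first id inside> <first id outside> <count> */
struct id_extent {
	unsigned int inside;
	unsigned int outside;
	unsigned int count;
};

/* Returned when no extent covers the id (hypothetical sentinel) */
#define ID_UNMAPPED ((unsigned int)-1)

/* The build container uid_map quoted above */
static const struct id_extent build_uid_map[] = {
	{     0, 100000, 1000 },	/* 0-999 shifted up by 100000 */
	{  1000,   1000,    1 },	/* my own uid, kept fixed */
	{ 65534, 101000,    1 },	/* nobody */
};

#define BUILD_UID_MAP_LEN (sizeof(build_uid_map) / sizeof(build_uid_map[0]))

/* Translate an id as seen inside the namespace to the id outside it */
static unsigned int ns_to_host(const struct id_extent *map, size_t n,
			       unsigned int id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (id >= map[i].inside && id - map[i].inside < map[i].count)
			return map[i].outside + (id - map[i].inside);
	}
	return ID_UNMAPPED;
}

/* Translate an id outside the namespace to the id seen inside it */
static unsigned int host_to_ns(const struct id_extent *map, size_t n,
			       unsigned int id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (id >= map[i].outside && id - map[i].outside < map[i].count)
			return map[i].inside + (id - map[i].outside);
	}
	return ID_UNMAPPED;
}
```

With this map, ns_to_host() sends 0 to 100000 and 65534 to 101000,
leaves 1000 alone, and reports 1001 as unmapped; host_to_ns() reports
real uid 0 as unmapped, which is why files owned by real root are not
owned by root as seen from inside the container.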

It seems like a lot of container systems need to do something similar
when they try to provide unprivileged access to standard images.
At the moment, the security mechanism only allows root in the
host to use this, but it's not impossible to come up with a scheme for
marking trees that can safely be shift mounted by unprivileged user
namespaces.

James

---

James Bottomley (1):
  shiftfs: uid/gid shifting bind mount RFC

 fs/Kconfig                 |   8 +
 fs/Makefile                |   1 +
 fs/shiftfs.c               | 728 +++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/magic.h |   2 +
 4 files changed, 739 insertions(+)
 create mode 100644 fs/shiftfs.c

-- 
2.6.6


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-04 19:18 [RFC 0/1] shiftfs: uid/gid shifting filesystem (s_user_ns version) James Bottomley
@ 2017-02-04 19:19 ` James Bottomley
  2017-02-05  7:51   ` Amir Goldstein
                     ` (4 more replies)
  0 siblings, 5 replies; 82+ messages in thread
From: James Bottomley @ 2017-02-04 19:19 UTC (permalink / raw)
  To: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

This allows any subtree to be uid/gid shifted and bound elsewhere.  It
does this by operating similarly to overlayfs.  Its primary use is for
shifting the underlying uids of filesystems used to support
unprivileged (uid shifted) containers.  The usual use case here is
that the container is operating with a uid shifted unprivileged root
but sometimes needs to make use of or work with a filesystem image
that has root at real uid 0.

The mechanism is to allow any subordinate mount namespace to mount a
shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only allowing
it to mount marked subtrees (using the -o mark option as root).  Once
mounted, the subtree is mapped via the super block user namespace so
that the interior ids of the mounting user namespace are the ids
written to the filesystem.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

---

v1: - based on original shiftfs with uid mappings now done via s_user_ns
---
 fs/Kconfig                 |   8 +
 fs/Makefile                |   1 +
 fs/shiftfs.c               | 728 +++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/magic.h |   2 +
 4 files changed, 739 insertions(+)
 create mode 100644 fs/shiftfs.c

diff --git a/fs/Kconfig b/fs/Kconfig
index c2a377c..b6adac0 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -104,6 +104,14 @@ source "fs/autofs4/Kconfig"
 source "fs/fuse/Kconfig"
 source "fs/overlayfs/Kconfig"
 
+config SHIFT_FS
+	tristate "UID/GID shifting overlay filesystem for containers"
+	help
+	  This filesystem can overlay any mounted filesystem and shift
+	  the uid/gid the files appear at.  The idea is that
+	  unprivileged containers can use this to mount root volumes
+	  using this technique.
+
 menu "Caches"
 
 source "fs/fscache/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9..2aa3ad4 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -128,3 +128,4 @@ obj-y				+= exofs/ # Multiple modules
 obj-$(CONFIG_CEPH_FS)		+= ceph/
 obj-$(CONFIG_PSTORE)		+= pstore/
 obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
+obj-$(CONFIG_SHIFT_FS)		+= shiftfs.o
diff --git a/fs/shiftfs.c b/fs/shiftfs.c
new file mode 100644
index 0000000..a4a1f98
--- /dev/null
+++ b/fs/shiftfs.c
@@ -0,0 +1,728 @@
+#include <linux/cred.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/magic.h>
+#include <linux/parser.h>
+#include <linux/seq_file.h>
+#include <linux/statfs.h>
+#include <linux/slab.h>
+#include <linux/user_namespace.h>
+#include <linux/uidgid.h>
+#include <linux/xattr.h>
+
+struct shiftfs_super_info {
+	struct vfsmount *mnt;
+	struct user_namespace *userns;
+	bool mark;
+};
+
+static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
+				       struct dentry *dentry);
+
+enum {
+	OPT_MARK,
+	OPT_LAST,
+};
+
+/* global filesystem options */
+static const match_table_t tokens = {
+	{ OPT_MARK, "mark" },
+	{ OPT_LAST, NULL }
+};
+
+static const struct cred *shiftfs_get_up_creds(struct super_block *sb)
+{
+	struct shiftfs_super_info *ssi = sb->s_fs_info;
+	struct cred *cred = prepare_creds();
+
+	if (!cred)
+		return NULL;
+
+	cred->fsuid = KUIDT_INIT(from_kuid(sb->s_user_ns, cred->fsuid));
+	cred->fsgid = KGIDT_INIT(from_kgid(sb->s_user_ns, cred->fsgid));
+	cred->user_ns = ssi->userns;
+
+	return cred;
+}
+
+static const struct cred *shiftfs_new_creds(const struct cred **newcred,
+					    struct super_block *sb)
+{
+	const struct cred *cred = shiftfs_get_up_creds(sb);
+
+	*newcred = cred;
+
+	if (cred)
+		cred = override_creds(cred);
+	else
+		printk(KERN_ERR "shiftfs: Credential override failed: no memory\n");
+
+	return cred;
+}
+
+static void shiftfs_old_creds(const struct cred *oldcred,
+			      const struct cred **newcred)
+{
+	if (!*newcred)
+		return;
+
+	revert_creds(oldcred);
+	put_cred(*newcred);
+}
+
+static int shiftfs_parse_options(struct shiftfs_super_info *ssi, char *options)
+{
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+
+	ssi->mark = false;
+
+	while ((p = strsep(&options, ",")) != NULL) {
+		int token;
+
+		if (!*p)
+			continue;
+
+		token = match_token(p, tokens, args);
+		switch (token) {
+		case OPT_MARK:
+			ssi->mark = true;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static void shiftfs_d_release(struct dentry *dentry)
+{
+	struct dentry *real = dentry->d_fsdata;
+
+	dput(real);
+}
+
+static struct dentry *shiftfs_d_real(struct dentry *dentry,
+				     const struct inode *inode,
+				     unsigned int flags)
+{
+	struct dentry *real = dentry->d_fsdata;
+
+	if (unlikely(real->d_flags & DCACHE_OP_REAL))
+		return real->d_op->d_real(real, real->d_inode, flags);
+
+	return real;
+}
+
+static const struct dentry_operations shiftfs_dentry_ops = {
+	.d_release	= shiftfs_d_release,
+	.d_real		= shiftfs_d_real,
+};
+
+static int shiftfs_readlink(struct dentry *dentry, char __user *data,
+			    int flags)
+{
+	struct dentry *real = dentry->d_fsdata;
+	const struct inode_operations *iop = real->d_inode->i_op;
+
+	if (iop->readlink)
+		return iop->readlink(real, data, flags);
+
+	return -EINVAL;
+}
+
+static const char *shiftfs_get_link(struct dentry *dentry, struct inode *inode,
+				    struct delayed_call *done)
+{
+	if (dentry) {
+		struct dentry *real = dentry->d_fsdata;
+		struct inode *reali = real->d_inode;
+		const struct inode_operations *iop = reali->i_op;
+		const char *res = ERR_PTR(-EPERM);
+
+		if (iop->get_link)
+			res = iop->get_link(real, reali, done);
+
+		return res;
+	} else {
+		/* RCU lookup not supported */
+		return ERR_PTR(-ECHILD);
+	}
+}
+
+static int shiftfs_setxattr(struct dentry *dentry, struct inode *inode,
+			    const char *name, const void *value,
+			    size_t size, int flags)
+{
+	struct dentry *real = dentry->d_fsdata;
+	int err = -EOPNOTSUPP;
+	const struct cred *oldcred, *newcred;
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+	err = vfs_setxattr(real, name, value, size, flags);
+	shiftfs_old_creds(oldcred, &newcred);
+
+	return err;
+}
+
+static int shiftfs_xattr_get(const struct xattr_handler *handler,
+			     struct dentry *dentry, struct inode *inode,
+			     const char *name, void *value, size_t size)
+{
+	struct dentry *real = dentry->d_fsdata;
+	int err;
+	const struct cred *oldcred, *newcred;
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+	err = vfs_getxattr(real, name, value, size);
+	shiftfs_old_creds(oldcred, &newcred);
+
+	return err;
+}
+
+static ssize_t shiftfs_listxattr(struct dentry *dentry, char *list,
+				 size_t size)
+{
+	struct dentry *real = dentry->d_fsdata;
+	int err;
+	const struct cred *oldcred, *newcred;
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+	err = vfs_listxattr(real, list, size);
+	shiftfs_old_creds(oldcred, &newcred);
+
+	return err;
+}
+
+static int shiftfs_removexattr(struct dentry *dentry, const char *name)
+{
+	struct dentry *real = dentry->d_fsdata;
+	int err;
+	const struct cred *oldcred, *newcred;
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+	err = vfs_removexattr(real, name);
+	shiftfs_old_creds(oldcred, &newcred);
+
+	return err;
+}
+
+static int shiftfs_xattr_set(const struct xattr_handler *handler,
+			     struct dentry *dentry, struct inode *inode,
+			     const char *name, const void *value, size_t size,
+			     int flags)
+{
+	if (!value)
+		return shiftfs_removexattr(dentry, name);
+	return shiftfs_setxattr(dentry, inode, name, value, size, flags);
+}
+
+static void shiftfs_fill_inode(struct inode *inode, struct dentry *dentry)
+{
+	struct inode *reali;
+
+	if (!dentry)
+		return;
+
+	reali = dentry->d_inode;
+
+	if (!reali->i_op->get_link)
+		inode->i_opflags |= IOP_NOFOLLOW;
+
+	inode->i_mapping = reali->i_mapping;
+	inode->i_private = dentry;
+
+	i_uid_write(inode, __kuid_val(reali->i_uid));
+	i_gid_write(inode, __kgid_val(reali->i_gid));
+}
+
+static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
+			       umode_t mode, const char *symlink,
+			       struct dentry *hardlink, bool excl)
+{
+	struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
+	struct inode *reali = real->d_inode, *newi;
+	const struct inode_operations *iop = reali->i_op;
+	int err;
+	const struct cred *oldcred, *newcred;
+	bool op_ok = false;
+
+	if (hardlink) {
+		op_ok = iop->link;
+	} else {
+		switch (mode & S_IFMT) {
+		case S_IFDIR:
+			op_ok = iop->mkdir;
+			break;
+		case S_IFREG:
+			op_ok = iop->create;
+			break;
+		case S_IFLNK:
+			op_ok = iop->symlink;
+		}
+	}
+	if (!op_ok)
+		return -EINVAL;
+
+
+	newi = shiftfs_new_inode(dentry->d_sb, mode, NULL);
+	if (!newi)
+		return -ENOMEM;
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+
+	inode_lock_nested(reali, I_MUTEX_PARENT);
+
+	err = -EINVAL;		/* shut gcc up about uninit var */
+	if (hardlink) {
+		struct dentry *realhardlink = hardlink->d_fsdata;
+
+		err = vfs_link(realhardlink, reali, new, NULL);
+	} else {
+		switch (mode & S_IFMT) {
+		case S_IFDIR:
+			err = vfs_mkdir(reali, new, mode);
+			break;
+		case S_IFREG:
+			err = vfs_create(reali, new, mode, excl);
+			break;
+		case S_IFLNK:
+			err = vfs_symlink(reali, new, symlink);
+		}
+	}
+
+	shiftfs_old_creds(oldcred, &newcred);
+
+	if (err)
+		goto out_dput;
+
+	shiftfs_fill_inode(newi, new);
+
+	d_instantiate(dentry, newi);
+
+	new = NULL;
+	newi = NULL;
+
+ out_dput:
+	dput(new);
+	iput(newi);
+	inode_unlock(reali);
+
+	return err;
+}
+
+static int shiftfs_create(struct inode *dir, struct dentry *dentry,
+			  umode_t mode,  bool excl)
+{
+	mode |= S_IFREG;
+
+	return shiftfs_make_object(dir, dentry, mode, NULL, NULL, excl);
+}
+
+static int shiftfs_mkdir(struct inode *dir, struct dentry *dentry,
+			 umode_t mode)
+{
+	mode |= S_IFDIR;
+
+	return shiftfs_make_object(dir, dentry, mode, NULL, NULL, false);
+}
+
+static int shiftfs_link(struct dentry *hardlink, struct inode *dir,
+			struct dentry *dentry)
+{
+	return shiftfs_make_object(dir, dentry, 0, NULL, hardlink, false);
+}
+
+static int shiftfs_symlink(struct inode *dir, struct dentry *dentry,
+			   const char *symlink)
+{
+	return shiftfs_make_object(dir, dentry, S_IFLNK, symlink, NULL, false);
+}
+
+static int shiftfs_rm(struct inode *dir, struct dentry *dentry, bool rmdir)
+{
+	struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
+	struct inode *reali = real->d_inode;
+	int err;
+	const struct cred *oldcred, *newcred;
+
+	inode_lock_nested(reali, I_MUTEX_PARENT);
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+
+	if (rmdir)
+		err = vfs_rmdir(reali, new);
+	else
+		err = vfs_unlink(reali, new, NULL);
+
+	shiftfs_old_creds(oldcred, &newcred);
+	inode_unlock(reali);
+
+	return err;
+}
+
+static int shiftfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	return shiftfs_rm(dir, dentry, false);
+}
+
+static int shiftfs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	return shiftfs_rm(dir, dentry, true);
+}
+
+static int shiftfs_rename(struct inode *olddir, struct dentry *old,
+			  struct inode *newdir, struct dentry *new,
+			  unsigned int flags)
+{
+	struct dentry *rodd = olddir->i_private, *rndd = newdir->i_private,
+		*realold = old->d_fsdata,
+		*realnew = new->d_fsdata, *trap;
+	struct inode *realolddir = rodd->d_inode, *realnewdir = rndd->d_inode;
+	int err = -EINVAL;
+	const struct cred *oldcred, *newcred;
+
+	trap = lock_rename(rndd, rodd);
+
+	if (trap == realold || trap == realnew)
+		goto out_unlock;
+
+	oldcred = shiftfs_new_creds(&newcred, old->d_sb);
+
+	err = vfs_rename(realolddir, realold, realnewdir,
+			 realnew, NULL, flags);
+
+	shiftfs_old_creds(oldcred, &newcred);
+
+ out_unlock:
+	unlock_rename(rndd, rodd);
+
+	return err;
+}
+
+static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
+				     unsigned int flags)
+{
+	struct dentry *real = dir->i_private, *new;
+	struct inode *reali = real->d_inode, *newi;
+	const struct cred *oldcred, *newcred;
+
+	inode_lock(reali);
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+	new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
+	shiftfs_old_creds(oldcred, &newcred);
+	inode_unlock(reali);
+
+	if (IS_ERR(new))
+		return new;
+
+	dentry->d_fsdata = new;
+
+	if (!new->d_inode)
+		return NULL;
+
+	newi = shiftfs_new_inode(dentry->d_sb, new->d_inode->i_mode, new);
+	if (!newi) {
+		dput(new);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	d_splice_alias(newi, dentry);
+
+	return NULL;
+}
+
+static int shiftfs_permission(struct inode *inode, int mask)
+{
+	struct dentry *real = inode->i_private;
+	struct inode *reali = real->d_inode;
+	const struct inode_operations *iop = reali->i_op;
+	int err;
+	const struct cred *oldcred, *newcred;
+
+	if (mask & MAY_NOT_BLOCK)
+		return -ECHILD;
+
+	oldcred = shiftfs_new_creds(&newcred, inode->i_sb);
+	if (iop->permission)
+		err = iop->permission(reali, mask);
+	else
+		err = generic_permission(reali, mask);
+	shiftfs_old_creds(oldcred, &newcred);
+
+	return err;
+}
+
+static int shiftfs_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	struct dentry *real = dentry->d_fsdata;
+	struct inode *reali = real->d_inode;
+	const struct inode_operations *iop = reali->i_op;
+	struct iattr newattr = *attr;
+	const struct cred *oldcred, *newcred;
+	struct super_block *sb = dentry->d_sb;
+	int err;
+
+	newattr.ia_uid = KUIDT_INIT(from_kuid(sb->s_user_ns, attr->ia_uid));
+	newattr.ia_gid = KGIDT_INIT(from_kgid(sb->s_user_ns, attr->ia_gid));
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+	inode_lock(reali);
+	if (iop->setattr)
+		err = iop->setattr(real, &newattr);
+	else
+		err = simple_setattr(real, &newattr);
+	inode_unlock(reali);
+	shiftfs_old_creds(oldcred, &newcred);
+
+	if (err)
+		return err;
+
+	/* all OK, reflect the change on our inode */
+	setattr_copy(d_inode(dentry), attr);
+	return 0;
+}
+
+static int shiftfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
+			   struct kstat *stat)
+{
+	struct inode *inode = dentry->d_inode;
+	struct dentry *real = inode->i_private;
+	struct inode *reali = real->d_inode;
+	const struct inode_operations *iop = reali->i_op;
+	int err = 0;
+
+	mnt = dentry->d_sb->s_fs_info;
+
+	if (iop->getattr)
+		err = iop->getattr(mnt, real, stat);
+	else
+		generic_fillattr(reali, stat);
+
+	if (err)
+		return err;
+
+	stat->uid = inode->i_uid;
+	stat->gid = inode->i_gid;
+	return 0;
+}
+
+static const struct inode_operations shiftfs_inode_ops = {
+	.lookup		= shiftfs_lookup,
+	.getattr	= shiftfs_getattr,
+	.setattr	= shiftfs_setattr,
+	.permission	= shiftfs_permission,
+	.mkdir		= shiftfs_mkdir,
+	.symlink	= shiftfs_symlink,
+	.get_link	= shiftfs_get_link,
+	.readlink	= shiftfs_readlink,
+	.unlink		= shiftfs_unlink,
+	.rmdir		= shiftfs_rmdir,
+	.rename		= shiftfs_rename,
+	.link		= shiftfs_link,
+	.create		= shiftfs_create,
+	.mknod		= NULL,	/* no special files currently */
+	.listxattr	= shiftfs_listxattr,
+};
+
+static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
+				       struct dentry *dentry)
+{
+	struct inode *inode;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return NULL;
+
+	mode &= S_IFMT;
+
+	inode->i_ino = get_next_ino();
+	inode->i_mode = mode;
+	inode->i_flags |= S_NOATIME | S_NOCMTIME;
+
+	inode->i_op = &shiftfs_inode_ops;
+
+	shiftfs_fill_inode(inode, dentry);
+
+	return inode;
+}
+
+static int shiftfs_show_options(struct seq_file *m, struct dentry *dentry)
+{
+	struct super_block *sb = dentry->d_sb;
+	struct shiftfs_super_info *ssi = sb->s_fs_info;
+
+	if (ssi->mark)
+		seq_show_option(m, "mark", NULL);
+
+	return 0;
+}
+
+static int shiftfs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+	struct super_block *sb = dentry->d_sb;
+	struct shiftfs_super_info *ssi = sb->s_fs_info;
+	struct dentry *root = sb->s_root;
+	struct dentry *realroot = root->d_fsdata;
+	struct path realpath = { .mnt = ssi->mnt, .dentry = realroot };
+	int err;
+
+	err = vfs_statfs(&realpath, buf);
+	if (err)
+		return err;
+
+	buf->f_type = sb->s_magic;
+
+	return 0;
+}
+
+static void shiftfs_put_super(struct super_block *sb)
+{
+	struct shiftfs_super_info *ssi = sb->s_fs_info;
+
+	mntput(ssi->mnt);
+	put_user_ns(ssi->userns);
+	kfree(ssi);
+}
+
+static const struct xattr_handler shiftfs_xattr_handler = {
+	.prefix = "",
+	.get    = shiftfs_xattr_get,
+	.set    = shiftfs_xattr_set,
+};
+
+const struct xattr_handler *shiftfs_xattr_handlers[] = {
+	&shiftfs_xattr_handler,
+	NULL
+};
+
+static const struct super_operations shiftfs_super_ops = {
+	.put_super	= shiftfs_put_super,
+	.show_options	= shiftfs_show_options,
+	.statfs		= shiftfs_statfs,
+};
+
+struct shiftfs_data {
+	void *data;
+	const char *path;
+};
+
+static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
+			      int silent)
+{
+	struct shiftfs_data *data = raw_data;
+	char *name = kstrdup(data->path, GFP_KERNEL);
+	int err = -ENOMEM;
+	struct shiftfs_super_info *ssi = NULL;
+	struct path path;
+	struct dentry *dentry;
+
+	if (!name)
+		goto out;
+
+	ssi = kzalloc(sizeof(*ssi), GFP_KERNEL);
+	if (!ssi)
+		goto out;
+
+	err = -EPERM;
+	err = shiftfs_parse_options(ssi, data->data);
+	if (err)
+		goto out;
+
+	/* to mark a mount point, must be real root */
+	if (ssi->mark && !capable(CAP_SYS_ADMIN))
+		goto out;
+
+	/* else to mount a mark, must be userns admin */
+	if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+		goto out;
+
+	err = kern_path(name, LOOKUP_FOLLOW, &path);
+	if (err)
+		goto out;
+
+	err = -EPERM;
+	if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
+		err = -ENOTDIR;
+		goto out_put;
+	}
+	if (ssi->mark) {
+		/*
+		 * this part is visible unshifted, so make sure no
+		 * executables that could be used to give suid
+		 * privileges
+		 */
+		sb->s_iflags = SB_I_NOEXEC;
+		ssi->mnt = path.mnt;
+		dentry = path.dentry;
+	} else {
+		struct shiftfs_super_info *mp_ssi;
+
+		/*
+		 * this leg executes if we're admin capable in
+		 * the namespace, so be very careful
+		 */
+		if (path.dentry->d_sb->s_magic != SHIFTFS_MAGIC)
+			goto out_put;
+		mp_ssi = path.dentry->d_sb->s_fs_info;
+		if (!mp_ssi->mark)
+			goto out_put;
+		ssi->mnt = mntget(mp_ssi->mnt);
+		dentry = dget(path.dentry->d_fsdata);
+		path_put(&path);
+	}
+	ssi->userns = get_user_ns(dentry->d_sb->s_user_ns);
+	sb->s_fs_info = ssi;
+	sb->s_magic = SHIFTFS_MAGIC;
+	sb->s_op = &shiftfs_super_ops;
+	sb->s_xattr = shiftfs_xattr_handlers;
+	sb->s_d_op = &shiftfs_dentry_ops;
+	sb->s_root = d_make_root(shiftfs_new_inode(sb, S_IFDIR, dentry));
+	sb->s_root->d_fsdata = dentry;
+
+	return 0;
+
+ out_put:
+	path_put(&path);
+ out:
+	kfree(name);
+	kfree(ssi);
+	return err;
+}
+
+static struct dentry *shiftfs_mount(struct file_system_type *fs_type,
+				    int flags, const char *dev_name, void *data)
+{
+	struct shiftfs_data d = { data, dev_name };
+
+	return mount_nodev(fs_type, flags, &d, shiftfs_fill_super);
+}
+
+static struct file_system_type shiftfs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "shiftfs",
+	.mount		= shiftfs_mount,
+	.kill_sb	= kill_anon_super,
+	.fs_flags	= FS_USERNS_MOUNT,
+};
+
+static int __init shiftfs_init(void)
+{
+	return register_filesystem(&shiftfs_type);
+}
+
+static void __exit shiftfs_exit(void)
+{
+	unregister_filesystem(&shiftfs_type);
+}
+
+MODULE_ALIAS_FS("shiftfs");
+MODULE_AUTHOR("James Bottomley");
+MODULE_DESCRIPTION("uid/gid shifting bind filesystem");
+MODULE_LICENSE("GPL v2");
+module_init(shiftfs_init)
+module_exit(shiftfs_exit)
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index e230af2..a2fdb01 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -85,4 +85,6 @@
 #define BALLOON_KVM_MAGIC	0x13661366
 #define ZSMALLOC_MAGIC		0x58295829
 
+#define SHIFTFS_MAGIC		0x6a656a62
+
 #endif /* __LINUX_MAGIC_H__ */
-- 
2.6.6


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-04 19:19 ` [RFC 1/1] shiftfs: uid/gid shifting bind mount James Bottomley
@ 2017-02-05  7:51   ` Amir Goldstein
  2017-02-06  1:18     ` James Bottomley
  2017-02-06  3:25   ` J. R. Okajima
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 82+ messages in thread
From: Amir Goldstein @ 2017-02-05  7:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, LSM List, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Sat, Feb 4, 2017 at 9:19 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> This allows any subtree to be uid/gid shifted and bound elsewhere.  It
> does this by operating similarly to overlayfs.  Its primary use is for
> shifting the underlying uids of filesystems used to support
> unprivileged (uid shifted) containers.  The usual use case here is
> that the container is operating with a uid shifted unprivileged root
> but sometimes needs to make use of or work with a filesystem image
> that has root at real uid 0.
>
> The mechanism is to allow any subordinate mount namespace to mount a
> shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only allowing
> it to mount marked subtrees (using the -o mark option as root).  Once
> mounted, the subtree is mapped via the super block user namespace so
> that the interior ids of the mounting user namespace are the ids
> written to the filesystem.
>
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
>

James,

Allow me to point out some problems in this patch and offer a slightly different
approach.

First of all, the subject says "uid/gid shifting bind mount", but it's not
really a bind mount. What it is is a stackable mount, and 2 levels of
stack no less.
So one thing that is missing is increasing sb->s_stack_depth, and that
also means that shiftfs cannot be used to recursively shift uids in a child
userns, if that was ever the intention.

The other problem is that by forking overlayfs functionality, shiftfs is going
to miss out on overlayfs bug fixes for cases where the user's credentials
differ from the mounter's credentials, like fd3220d ("ovl: update S_ISGID when
setting posix ACLs").
I am not sure that this specific case is relevant to shiftfs, but
there could be others.

So how about, instead of forking a new container-specialized stackable fs,
merging the needed functionality into the overlayfs code?
I think overlayfs container users may also benefit from shiftfs
functionality, no?
In any case, overlayfs has considerable mileage as an fs for containers,
so many issues related to running with different userns may have already been
addressed.

Overlayfs already stores the mounter's credentials and uses them to perform
most of the operations on upper.

I know it wasn't the original purpose of overlayfs to run as a single layer, but
there is nothing really preventing it from doing that. In fact, I am doing just
that with my snapshot mount patches, see:
https://github.com/amir73il/linux/commit/acc6c25eab03c176c9ef736544fab3fba663765d#diff-2b85a3c5bea4263d08a2bdff639192c3
I registered a new fs type ("snapshot"), which reuses most of the existing
overlayfs operations.
With this patch it is possible to mount an overlay with only an upper layer, so
all the operations are pass-through except for the credentials, e.g.:

mount -t snapshot -o upper=<origin> shiftfs_test <mark location>

If you think this concept is workable, then the functionality of mounting
overlayfs with only an upper layer should be integrated into plain overlayfs,
and shiftfs could be a very thin variant of an overlayfs mount using
shiftfs_fs_type, just for the sake of having FS_USERNS_MOUNT, e.g.:

+ /*
+  * XXX: reusing ovl_mount()/ovl_fill_super(), but could also just reuse
+  * ovl_dentry_operations/ovl_super_operations/ovl_xattr_handlers/ovl_new_inode()
+  */
+static struct file_system_type shiftfs_type = {
+       .owner          = THIS_MODULE,
+       .name           = "shiftfs",
+       .mount          = ovl_mount,
+       .kill_sb        = kill_anon_super,
+       .fs_flags       = FS_USERNS_MOUNT,
+};
+MODULE_ALIAS_FS("shiftfs");
+MODULE_ALIAS("shiftfs");
+#define IS_SHIFTFS_SB(sb) ((sb)->s_type == &shiftfs_type)

And instead of verifying that shiftfs is mounted inside container over shiftfs,
verify that it is mounted over an overlayfs noexec mount e.g.:

+       if (IS_SHIFTFS_SB(sb)) {
+               /*
+                * this leg executes if we're admin capable in
+                * the namespace, so be very careful
+                */
+               if (path.dentry->d_sb->s_magic != OVERLAYFS_MAGIC ||
+                   !(path.dentry->d_sb->s_iflags & SB_I_NOEXEC))
+                       goto out_put;

From a user's manual POV:

in host:
mount -t overlay -o noexec,upper=<origin> container_visible <mark location>

in container:
mount -t shiftfs -o upper=<mark location> container_writable <somewhere in my local mount ns>

Thoughts?


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-05  7:51   ` Amir Goldstein
@ 2017-02-06  1:18     ` James Bottomley
  2017-02-06  6:59       ` Amir Goldstein
  2017-02-14 23:03       ` Vivek Goyal
  0 siblings, 2 replies; 82+ messages in thread
From: James Bottomley @ 2017-02-06  1:18 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, LSM List, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Sun, 2017-02-05 at 09:51 +0200, Amir Goldstein wrote:
> On Sat, Feb 4, 2017 at 9:19 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > This allows any subtree to be uid/gid shifted and bound elsewhere. 
> >  It does this by operating similarly to overlayfs.  Its primary use
> > is for shifting the underlying uids of filesystems used to support
> > unprivileged (uid shifted) containers.  The usual use case here is
> > that the container is operating with a uid shifted unprivileged
> > root but sometimes needs to make use of or work with a filesystem 
> > image that has root at real uid 0.
> > 
> > The mechanism is to allow any subordinate mount namespace to mount 
> > a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only
> > allowing it to mount marked subtrees (using the -o mark option as 
> > root).   Once mounted, the subtree is mapped via the super block 
> > user namespace so that the interior ids of the mounting user 
> > namespace are the ids written to the filesystem.
> > 
> > Signed-off-by: James Bottomley <
> > James.Bottomley@HansenPartnership.com>
> > 
> 
> James,
> 
> Allow me to point out some problems in this patch and offer a 
> slightly different approach.
> 
> First of all, the subject says "uid/gid shifting bind mount", but 
> it's not really a bind mount. What it is is a stackable mount and 2 
> levels of stack no less.

The reason for the description is to have it behave exactly like a bind
mount.  You can assert that a bind mount is, in fact, a stacked mount,
but we don't currently.  I'm also not sure where you get your 2 levels
from?

>  So one thing that is missing is increasing of sb->s_stack_depth and
> that also means that shiftfs cannot be used to recursively shift uids
> in child userns if that was ever the intention.

I can't think of a use case that would ever need that, but perhaps
other container people can.

> The other problem is that by forking overlayfs functionality,

So this wouldn't really be the right way to look at it: shiftfs shares
no code with overlayfs at all, so is definitely not a fork.  The only
piece of functionality it has which is similar to overlayfs is the way
it does lookups via a new dentry cache.  However, that functionality is
not unique to overlayfs and if you look, you'll see that
shiftfs_lookup() actually has far more in common with
ecryptfs_lookup().

>  shiftfs is going to miss out on overlayfs bug fixes related to user 
> credentials differ from mounter credentials, like fd3220d ("ovl: 
> update S_ISGID when setting posix ACLs"). I am not sure that this 
> specific case is relevant to shiftfs, but there could be other.

OK, so shiftfs doesn't have this bug and the reason why is
illustrative: basically shiftfs does three things

   1. lookups via a uid/gid shifted dentry cache
   2. shifted credential inode operations permission checks on the
      underlying filesystem
   3. location marking for unprivileged mount

I think we've already seen that 1. isn't from overlayfs but the
functionality could be added to overlayfs, I suppose.  The big problem
is 2.  The overlayfs code emulates the permission checks, which makes
it rather complex (this is where you get your bugs like the above
from).  I did actually look at adding 2. to overlayfs on the theory
that a single layer overlay might be closest to what this is, but
eventually concluded I'd have to take the special cases and add a whole
lot more to them ... it really would increase the maintenance burden
substantially and make the code an unreadable rat's nest.

When you think about it this way, it becomes obvious that the clean
separation is if shiftfs functionality is layered on top of overlayfs
and when you do that, doing it as its own filesystem is more logical.

> So how about, instead of forking a new containers specialized 
> stackable fs, that the needed functionality be merged into overlayfs 
> code? I think overlayfs container users may also benefit from shiftfs
> functionality, no?

I think I covered the why not merge the code above.  As to the
functionality, since Docker already has a graph driver, the graph
driver can do the shifting on top of the overlays.

>  In any case, overlayfs has considerable millage used as fs for
> containers, so many issues related to running with different userns
> may have already been addressed.

Overlayfs is s_user_ns blind so it's highly unlikely to have seen any
issues with the user namespaces, let alone addressed them.  This will
also be compounded by the fact that its primary user: docker, has
rather a weak use of the user namespace currently.

The other thing is the use case: Most immutable infrastructure
container systems create the overlays in the host and then bind them
into the container.  This binding is an additional mount operation.
Now they could mount from an overlay as an overlay but it's adding
complexity because the container itself cannot control the overlay
(it's a host provided thing) so it is definitely cleaner to make the
second mount a different filesystem (i.e. shiftfs) where the nature of
the overlay is hidden from the container.

> Overlayfs already stores the mounter's credentials and uses them to 
> perform most of the operations on upper.

OK, that's case 2. again.  So I think you may be labouring under the
misapprehension that shiftfs and overlayfs do the same thing with
override credentials?  They don't: overlayfs emulates the permission
lookups and then overrides based on *historical* admin credentials to
force what it's already decided on the underlying filesystems.  Shiftfs
overrides the *current* credentials with a uid/gid and namespace shift
and then runs the permission checks.  Thus if I wanted to add what
shiftfs does to overlayfs, I'd have to add another load of overriding
based on current credentials in the currently unoverridden emulated
permission checks.  I think you can see that simply running the real
permission checks on the underlying filesystem with overridden
credentials is much simpler.
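
To make the "shift the current credentials, then run the real permission
check" idea concrete, here is a rough userspace sketch in C.  It is a toy
model under stated assumptions, not the shiftfs code: the extent layout
imitates a /proc/<pid>/uid_map line, and every toy_* name is hypothetical.
The permission check is reduced to "the owner may write".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One extent of a uid map, in the style of a /proc/<pid>/uid_map line. */
struct toy_uid_extent {
	unsigned int inside_first;  /* first id as seen inside the namespace */
	unsigned int outside_first; /* first id as seen from the host */
	unsigned int count;         /* number of ids covered by the extent */
};

#define TOY_INVALID_UID ((unsigned int)-1)

/* Translate a host-side uid to the id seen inside the namespace. */
unsigned int toy_to_interior(const struct toy_uid_extent *map,
			     unsigned int host_uid)
{
	assert(map != NULL);
	if (host_uid < map->outside_first ||
	    host_uid - map->outside_first >= map->count)
		return TOY_INVALID_UID; /* no mapping for this caller */
	return map->inside_first + (host_uid - map->outside_first);
}

/*
 * Owner-only write check against the underlying file, run with the
 * shifted id rather than with the caller's host-side id.
 */
bool toy_may_write(const struct toy_uid_extent *map,
		   unsigned int caller_host_uid,
		   unsigned int file_owner_uid)
{
	unsigned int shifted = toy_to_interior(map, caller_host_uid);

	return shifted != TOY_INVALID_UID && shifted == file_owner_uid;
}
```

With a map of { 0, 100000, 65536 }, a container root running as host uid
100000 is shifted to 0 and passes the check on a file owned by real uid 0,
while an unmapped host uid such as 1000 does not.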

> I know it wasn't the original purpose of overlayfs to run as a single 
> layer, but there is nothing really preventing from doing that. In 
> fact, I am doing just that with my snapshot mount patches, see:
> https://github.com/amir73il/linux/commit/acc6c25eab03c176c9ef736544fa
> b3fba663765d#diff-2b85a3c5bea4263d08a2bdff639192c3
> I registered a new fs type ("snapshot"), which reuses most of the 
> existing overlayfs operations. With this patch it is possible to 
> mount an overlay with only upper layer, so all the operations are
> pass through except for the credentials, e.g.:
> 
> mount -t snapshot -o upper=<origin> shiftfs_test <mark location>

OK, so since you don't need to special case the permission checks, I
can see why this might work for you because you don't need to modify
overlayfs to do this.  Since I can't consume the overlay code as is, it
doesn't work for me because I'd have to add lots of special case code
to it.

James

> If you think this concept is workable, then the functionality of
> mounting overlayfs with only upper should be integrated into plain 
> overlayfs and shiftfs could be a very thin variant of overlayfs mount 
> using shitfs_fs_type, just for the sake of having FS_USERNS_MOUNT,
> e.g:
> 
> + /*
> +  * XXX: reusing ovl_mount()/ovl_fill_super(), but could also just
> reuse
> +  *
> ovl_dentry_operations/ovl_super_operations/ovl_xattr_handlers/ovl_new
> _inode()
> +  */
> +static struct file_system_type shiftfs_type = {
> +       .owner          = THIS_MODULE,
> +       .name           = "shiftfs",
> +       .mount          = ovl_mount,
> +       .kill_sb        = kill_anon_super,
> +       .fs_flags       = FS_USERNS_MOUNT,
> +};
> +MODULE_ALIAS_FS("shiftfs");
> +MODULE_ALIAS("shiftfs");
> +#define IS_SHIFTFS_SB(sb) ((sb)->s_type == &shiftfs_type)
> 
> And instead of verifying that shiftfs is mounted inside container
> over shiftfs,
> verify that it is mounted over an overlayfs noexec mount e.g.:
> 
> +       if (IS_SHIFTFS_SB(sb)) {
> +               /*
> +                * this leg executes if we're admin capable in
> +                * the namespace, so be very careful
> +                */
> +               if (path.dentry->d_sb->s_magic != OVERLAYFS_MAGIC ||
> !(path.dentry->d_sb->s_iflags & SB_I_NOEXEC))
> +                       goto out_put;
> 
> From users manual POV:
> 
> in host:
> mount -t overlay -o noexec,upper=<origin> container_visible <mark
> location>
> 
> in container:
> mount -t shiftfs -o upper=<mark location> container_writable
> <somewhere in my local mount ns>
> 
> Thought?
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-04 19:19 ` [RFC 1/1] shiftfs: uid/gid shifting bind mount James Bottomley
  2017-02-05  7:51   ` Amir Goldstein
@ 2017-02-06  3:25   ` J. R. Okajima
  2017-02-06  6:38     ` Amir Goldstein
  2017-02-06  6:46     ` James Bottomley
  2017-02-07  9:19   ` Christoph Hellwig
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 82+ messages in thread
From: J. R. Okajima @ 2017-02-06  3:25 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

James Bottomley:
> This allows any subtree to be uid/gid shifted and bound elsewhere.  It
	:::

Interesting.
But I am afraid that the inconsistency problem of the inode numbers will
happen.

shiftfs_new_inode() uses get_next_ino() which means
- 1st time: inodeA is created and cached, inumA is assigned
- after using inodeA, it will be discarded from the cache
- 2nd time: inodeA is looked-up again, and another inode number (inumB)
  is assigned.

This inconsistency is not a problem for "pure virtual" filesystems such
as procfs and sysfs, but shiftfs is not purely virtual in that sense: it
will be used as a wrapper (or "binder", meaning bind-mount) of an ordinary
filesystem.
From the user's perspective, the symptoms of this problem will be
- find -inum doesn't work
- git-status doesn't work, because it keeps st_dev and st_ino and
  compares them against the current files.
Of course these will only show up when the target dir is huge and/or
system memory is low. As long as the inode cache is large enough to hold
all necessary inodes, the problem won't happen.
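
The sequence above can be sketched as a small userspace model in C.  This
is only an illustration under assumptions: the toy_* names are
hypothetical, and the counter merely stands in for an allocator in the
spirit of get_next_ino(); it is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one wrapper-fs inode that may or may not be cached. */
struct toy_inode {
	unsigned long ino; /* number reported to userspace */
	bool cached;       /* still present in the inode cache? */
};

/* Global counter: every allocation hands out a new number. */
static unsigned long toy_ino_counter = 1;

unsigned long toy_next_ino(void)
{
	return toy_ino_counter++;
}

/* Look the inode up; a cache miss allocates a fresh inode number. */
unsigned long toy_lookup(struct toy_inode *inode)
{
	assert(inode != NULL);
	if (!inode->cached) {
		inode->ino = toy_next_ino();
		inode->cached = true;
	}
	return inode->ino;
}

/* Model memory pressure discarding the inode from the cache. */
void toy_evict(struct toy_inode *inode)
{
	assert(inode != NULL);
	inode->cached = false;
}
```

Calling toy_lookup() twice in a row returns the same number, but a
toy_evict() in between makes the second lookup return a different one,
which is the inumA/inumB situation described above.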

If shiftfs supports exporting via NFS in the future, the
consistency of inum will be important too.


J. R. Okajima

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-06  3:25   ` J. R. Okajima
@ 2017-02-06  6:38     ` Amir Goldstein
  2017-02-06 16:29       ` James Bottomley
  2017-02-06  6:46     ` James Bottomley
  1 sibling, 1 reply; 82+ messages in thread
From: Amir Goldstein @ 2017-02-06  6:38 UTC (permalink / raw)
  To: J. R. Okajima
  Cc: James Bottomley, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Mon, Feb 6, 2017 at 5:25 AM, J. R. Okajima <hooanon05g@gmail.com> wrote:
> James Bottomley:
>> This allows any subtree to be uid/gid shifted and bound elsewhere.  It
>         :::
>
> Interesting.
> But I am afraid that the inconsistency problem of the inode numbers will
> happen.
>

Yet another example of a problem that overlayfs is already in the process
of solving (it is fixed for stat of a merged directory inode).
In fact, for the case of a single-layer overlay (as well as shiftfs) the
solution is trivial:
preserve the underlying inode's st_ino/d_ino and use the overlayed fs st_dev.
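
A userspace sketch of that idea in C, as I read it: take the attributes
from the underlying file, leave st_ino alone, and substitute the stacked
filesystem's own st_dev.  The function name and the device number are
hypothetical, and a real implementation would live in the filesystem's
getattr path rather than call stat().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical device number standing in for the stacked fs's st_dev. */
#define TOY_STACKED_DEV ((dev_t)0x2a)

/*
 * Toy getattr for a single-layer stacked fs.  Returns 0 on success and
 * -1 on error, like stat().
 */
int toy_stacked_stat(const char *lower_path, struct stat *out)
{
	assert(lower_path != NULL && out != NULL);
	if (stat(lower_path, out) != 0)
		return -1;
	/* st_ino stays whatever the underlying filesystem reported. */
	out->st_dev = TOY_STACKED_DEV;
	return 0;
}
```

Because st_ino comes straight from the underlying inode, it stays the same
across cache evictions for as long as the underlying filesystem keeps it
stable.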

> shiftfs_new_inode() uses get_next_ino() which means
> - 1st time: inodeA is created and cached, inumA is assigned
> - after using inodeA, it will be discarded from the cache
> - 2nd time: inodeA is looked-up again, and another inode number (inumB)
>   is assgined.
>
> This inconsistency will not be a problem for the "pure virtual" fs such
> as procfs and sysfs. But your shiftfs is not pure as them. Shiftfs will
> be used as a wrapper (or "binder" which means bind-mount) of an orginary
> filesystem.
> The symptom of this problem from users perspective will be
> - find -inum doesn't work
> - git-status doesn't work, which keeps st_dev and st_ino and compares
>   the current files.
> Of course they will be limited to when the target dir is huge and/or
> system memory is low. As long as the inode cache is large enough to hold
> all necessary inodes, the problem won't happen.
>
> If shiftfs will supports exporting via NFS in the future, the
> consistency of inum will be important too.
>
>
> J. R. Okajima

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-06  3:25   ` J. R. Okajima
  2017-02-06  6:38     ` Amir Goldstein
@ 2017-02-06  6:46     ` James Bottomley
  2017-02-06 14:50       ` Theodore Ts'o
  2017-02-06 16:24       ` J. R. Okajima
  1 sibling, 2 replies; 82+ messages in thread
From: James Bottomley @ 2017-02-06  6:46 UTC (permalink / raw)
  To: J. R. Okajima
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Mon, 2017-02-06 at 12:25 +0900, J. R. Okajima wrote:
> James Bottomley:
> > This allows any subtree to be uid/gid shifted and bound elsewhere. 
> >  It
> 	:::
> 
> Interesting.
> But I am afraid that the inconsistency problem of the inode numbers 
> will happen.
> 
> shiftfs_new_inode() uses get_next_ino() which means
> - 1st time: inodeA is created and cached, inumA is assigned
> - after using inodeA, it will be discarded from the cache
> - 2nd time: inodeA is looked-up again, and another inode number 
> (inumB)   is assgined.

Yes, I know the problem.  However, I believe most current linux
filesystems no longer guarantee stable, for the lifetime of the file,
inode numbers.  The usual docker container root is overlayfs, which,
similarly doesn't support stable inode numbers.  I see the odd
complaint about docker with overlayfs having unstable inode numbers,
but none seems to have any serious repercussions.

[...]
> If shiftfs will supports exporting via NFS in the future, the
> consistency of inum will be important too.

If it's a problem, then it's fixable with s_export_op, but I was mostly
thinking that because it's not a problem for overlayfs based
containers, it wouldn't be one for shiftfs based ones, which is why I
didn't implement it.

James

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-06  1:18     ` James Bottomley
@ 2017-02-06  6:59       ` Amir Goldstein
  2017-02-06 14:41         ` James Bottomley
  2017-02-14 23:03       ` Vivek Goyal
  1 sibling, 1 reply; 82+ messages in thread
From: Amir Goldstein @ 2017-02-06  6:59 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, LSM List, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Mon, Feb 6, 2017 at 3:18 AM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Sun, 2017-02-05 at 09:51 +0200, Amir Goldstein wrote:
>> On Sat, Feb 4, 2017 at 9:19 PM, James Bottomley
>> <James.Bottomley@hansenpartnership.com> wrote:
>> > This allows any subtree to be uid/gid shifted and bound elsewhere.
>> >  It does this by operating simlarly to overlayfs.  Its primary use
>> > is for shifting the underlying uids of filesystems used to support
>> > unpriviliged (uid shifted) containers.  The usual use case here is
>> > that the container is operating with an uid shifted unprivileged
>> > root but sometimes needs to make use of or work with a filesystem
>> > image that has root at real uid 0.
>> >
>> > The mechanism is to allow any subordinate mount namespace to mount
>> > a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only
>> > allowing it to mount marked subtrees (using the -o mark option as
>> > root).   Once mounted, the subtree is mapped via the super block
>> > user namespace so that the interior ids of the mounting user
>> > namespace are the ids written to the filesystem.
>> >
>> > Signed-off-by: James Bottomley <
>> > James.Bottomley@HansenPartnership.com>
>> >
>>
>> James,
>>
>> Allow me to point out some problems in this patch and offer a
>> slightly different approach.
>>
>> First of all, the subject says "uid/gid shifting bind mount", but
>> it's not really a bind mount. What it is is a stackable mount and 2
>> levels of stack no less.
>
> The reason for the description is to have it behave exactly like a bind
> mount.  You can assert that a bind mount is, in fact, a stacked mount,
> but we don't currently.  I'm also not sure where you get your 2 levels
> from?
>

A bind mount does not incur recursion into VFS code, a stacked fs does.
And there is a programmable limit of stack depth of 2, which stacked
fs need to comply with.
Your proposed setup has 2 stacked fs, the mark shiftfs by admin
and the uid shiftfs by container user. Or maybe I misunderstood.
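
For readers unfamiliar with the depth accounting, a toy model in C of what
a stacked fs is expected to do at mount time might look like the following.
The names are hypothetical, the limit of 2 is taken from the statement
above, and the real check operates on sb->s_stack_depth inside the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Limit of 2, as stated above; hypothetical name for this sketch. */
#define TOY_MAX_STACK_DEPTH 2

/* Toy superblock carrying only the stacking depth. */
struct toy_sb {
	int stack_depth; /* 0 for a regular, non-stacked filesystem */
};

/*
 * Stack 'upper' on top of 'lower'.  Returns 0 on success, or -1 if the
 * resulting depth would exceed the limit (upper is left untouched).
 */
int toy_stack_on(struct toy_sb *upper, const struct toy_sb *lower)
{
	int depth;

	assert(upper != NULL && lower != NULL);
	depth = lower->stack_depth + 1;
	if (depth > TOY_MAX_STACK_DEPTH)
		return -1;
	upper->stack_depth = depth;
	return 0;
}
```

In this model a third stacked layer is refused, which is why a filesystem
that never increments its depth can silently exceed the limit.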


>>  So one thing that is missing is increasing of sb->s_stack_depth and
>> that also means that shiftfs cannot be used to recursively shift uids
>> in child userns if that was ever the intention.
>
> I can't think of a use case that would ever need that, but perhaps
> other container people can.
>
>> The other problem is that by forking overlayfs functionality,
>
> So this wouldn't really be the right way to look at it: shiftfs shares
> no code with overlayfs at all, so is definitely not a fork.  The only
> piece of functionality it has which is similar to overlayfs is the way
> it does lookups via a new dentry cache.  However, that functionality is
> not unique to overlayfs and if you look, you'll see that
> shiftfs_lookup() actually has far more in common with
> ecryptfs_lookup().

That's a good point. All stackable file systems may share similar problems
and solutions (e.g. consistent st_ino/st_dev). Perhaps it calls for shared
library code or more generic VFS code.
At the moment ecryptfs is not seeing much development, so everything
happens in overlayfs. If there is going to be more than 1 actively developed
stackable fs, we need to see about that.

>
>>  shiftfs is going to miss out on overlayfs bug fixes related to user
>> credentials differ from mounter credentials, like fd3220d ("ovl:
>> update S_ISGID when setting posix ACLs"). I am not sure that this
>> specific case is relevant to shiftfs, but there could be other.
>
> OK, so shiftfs doesn't have this bug and the reason why is
> illustrative: basically shiftfs does three things
>
>    1. lookups via a uid/gid shifted dentry cache
>    2. shifted credential inode operations permission checks on the
>       underlying filesystem
>    3. location marking for unprivileged mount
>
> I think we've already seen that 1. isn't from overlayfs but the
> functionality could be added to overlayfs, I suppose.  The big problem
> is 2.  The overlayfs code emulates the permission checks, which makes
> it rather complex (this is where you get your bugs like the above
> from).  I did actually look at adding 2. to overlayfs on the theory
> that a single layer overlay might be closest to what this is, but
> eventually concluded I'd have to take the special cases and add a whole
> lot more to them ... it really would increase the maintenance burden
> substantially and make the code an unreadable rats nest.
>

The use cases for uid shifting are still overwhelming for me.
I take your word for it that it's going to be a maintenance burden
to add this functionality to overlayfs.

> When you think about it this way, it becomes obvious that the clean
> separation is if shiftfs functionality is layered on top of overlayfs
> and when you do that, doing it as its own filesystem is more logical.
>

Yes, I agree with that statement. This is in line with the solution I outlined
at the end of my previous email, where single layer overlayfs is used
for the host "mark" mount, although I wonder if the same cannot be
achieved with a bind mount?

in host:
mount -t overlay -o noexec,upper=<origin> container_visible <mark location>

in container:
mount -t shiftfs -o <mark location> <somewhere in my local mount ns>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-06  6:59       ` Amir Goldstein
@ 2017-02-06 14:41         ` James Bottomley
  0 siblings, 0 replies; 82+ messages in thread
From: James Bottomley @ 2017-02-06 14:41 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, LSM List, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Mon, 2017-02-06 at 08:59 +0200, Amir Goldstein wrote:
> On Mon, Feb 6, 2017 at 3:18 AM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > On Sun, 2017-02-05 at 09:51 +0200, Amir Goldstein wrote:
> > > On Sat, Feb 4, 2017 at 9:19 PM, James Bottomley
> > > <James.Bottomley@hansenpartnership.com> wrote:
> > > > This allows any subtree to be uid/gid shifted and bound 
> > > > elsewhere.  It does this by operating simlarly to overlayfs. 
> > > >  Its primary use is for shifting the underlying uids of 
> > > > filesystems used to support unpriviliged (uid shifted) 
> > > > containers.  The usual use case here is that the container is 
> > > > operating with an uid shifted unprivileged root but sometimes 
> > > > needs to make use of or work with a filesystem image that has
> > > > root at real uid 0.
> > > > 
> > > > The mechanism is to allow any subordinate mount namespace to 
> > > > mount a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but 
> > > > only allowing it to mount marked subtrees (using the -o mark 
> > > > option as root).   Once mounted, the subtree is mapped via the 
> > > > super block user namespace so that the interior ids of the 
> > > > mounting user namespace are the ids written to the filesystem.
> > > > 
> > > > Signed-off-by: James Bottomley <
> > > > James.Bottomley@HansenPartnership.com>
> > > > 
> > > 
> > > James,
> > > 
> > > Allow me to point out some problems in this patch and offer a
> > > slightly different approach.
> > > 
> > > First of all, the subject says "uid/gid shifting bind mount", but
> > > it's not really a bind mount. What it is is a stackable mount and 
> > > 2 levels of stack no less.
> > 
> > The reason for the description is to have it behave exactly like a 
> > bind mount.  You can assert that a bind mount is, in fact, a 
> > stacked mount, but we don't currently.  I'm also not sure where you 
> > get your 2 levels from?
> > 
> 
> A bind mount does not incur recursion into VFS code, a stacked fs 
> does. And there is a programmable limit of stack depth of 2, which 
> stacked fs need to comply with. Your proposed setup has 2 stacked fs, 
> the mark shitfs by admin and the uid shitfs by container user. Or
> maybe I misunderstood.

Oh, right, actually, it wouldn't be 2 because once the unprivileged
mount uses the marked filesystem, what it uses is the mnt and dentry
from the underlying filesystem (what you would have got from a path
lookup on it).

That said, it does perform recursive calls to the underlying filesystem
unlike a true bind mount, so I can add the depth easily enough.

> > >  So one thing that is missing is increasing of sb->s_stack_depth 
> > > and that also means that shiftfs cannot be used to recursively 
> > > shift uids in child userns if that was ever the intention.
> > 
> > I can't think of a use case that would ever need that, but perhaps
> > other container people can.
> > 
> > > The other problem is that by forking overlayfs functionality,
> > 
> > So this wouldn't really be the right way to look at it: shiftfs 
> > shares no code with overlayfs at all, so is definitely not a fork. 
> >  The only piece of functionality it has which is similar to 
> > overlayfs is the way it does lookups via a new dentry cache. 
> >  However, that functionality is not unique to overlayfs and if you 
> > look, you'll see that shiftfs_lookup() actually has far more in 
> > common with ecryptfs_lookup().
> 
> That's a good point. All stackable file systems may share similar 
> problems and solutions (e.g. consistent st_ino/st_dev). Perhaps it 
> calls for shared library code or more generic VFS code. At the moment 
> ecryptfs is not seeing much development, so everything happens in 
> overlayfs. If there is going to be more than 1 actively developed
> stackable fs, we need to see about that.

I believe we already do ... if you look at the lookup functions of each
of them, you see the only common thing is encapsulated in a variant of
the lookup_one_len() functions.  After that, even simple things like
our negative dentry handling differs.

> > >  shiftfs is going to miss out on overlayfs bug fixes related to 
> > > user credentials differ from mounter credentials, like fd3220d 
> > > ("ovl: update S_ISGID when setting posix ACLs"). I am not sure 
> > > that this specific case is relevant to shiftfs, but there could
> > > be other.
> > 
> > OK, so shiftfs doesn't have this bug and the reason why is
> > illustrative: basically shiftfs does three things
> > 
> >    1. lookups via a uid/gid shifted dentry cache
> >    2. shifted credential inode operations permission checks on the
> >       underlying filesystem
> >    3. location marking for unprivileged mount
> > 
> > I think we've already seen that 1. isn't from overlayfs but the
> > functionality could be added to overlayfs, I suppose.  The big 
> > problem is 2.  The overlayfs code emulates the permission checks, 
> > which makes it rather complex (this is where you get your bugs like 
> > the above from).  I did actually look at adding 2. to overlayfs on 
> > the theory that a single layer overlay might be closest to what 
> > this is, but eventually concluded I'd have to take the special 
> > cases and add a whole lot more to them ... it really would increase 
> > the maintenance burden substantially and make the code an
> > unreadable rats nest.
> > 
> 
> The use cases for uid shifting are still overwelming for me.
> I take your word for it that its going to be a maintanace burdon
> to add this functionality to overlayfs.
> 
> > When you think about it this way, it becomes obvious that the clean
> > separation is if shiftfs functionality is layered on top of 
> > overlayfs and when you do that, doing it as its own filesystem is 
> > more logical.
> > 
> 
> Yes, I agree with that statement. This is inline with the solution I 
> outlined at the end of my previous email, where single layer 
> overlayfs is used for the host "mark" mount, although I wonder if the 
> same cannot be achieved with a bind mount?

I understand, but once I can't consume overlayfs to construct it, the
idea of trying to use it becomes a negative not a positive.

We could achieve the same thing using bind mounts, if the vfsmount
structure carried a private field, but it doesn't.  I think given the
prevalence of this structure throughout the mount tree, that's a
deliberate decision to keep it thin.

> in host:
> mount -t overlay -o noexec,upper=<origin> container_visible <mark
> location>
> 
> in container:
> mount -t shiftfs -o <mark location> <somewhere in my local mount ns>

So I'm not sure it's a more widespread problem: mount --bind is usable
inside an unprivileged container, which means you can bridge filesystem
subtrees even when you are only local container admin.  The problem is
mounting other filesystem types.  Marking a type safe for mounting is
done by the FS_USERNS_MOUNT flag but it means for things like shiftfs
that you do have to restrict the source location, but for most
filesystem types, that source will be a device, so they will need other
checking than a mount mark.

James

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-06  6:46     ` James Bottomley
@ 2017-02-06 14:50       ` Theodore Ts'o
  2017-02-06 15:18         ` James Bottomley
  2017-02-06 16:24       ` J. R. Okajima
  1 sibling, 1 reply; 82+ messages in thread
From: Theodore Ts'o @ 2017-02-06 14:50 UTC (permalink / raw)
  To: James Bottomley
  Cc: J. R. Okajima, Djalal Harouni, Chris Mason, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Sun, Feb 05, 2017 at 10:46:23PM -0800, James Bottomley wrote:
> Yes, I know the problem.  However, I believe most current linux
> filesystems no longer guarantee stable, for the lifetime of the file,
> inode numbers.  The usual docker container root is overlayfs, which,
> similarly doesn't support stable inode numbers.  I see the odd
> complaint about docker with overlayfs having unstable inode numbers,
> but none seems to have any serious repercussions.

Um, no.  Most current linux file systems *do* guarantee stable inode
numbers.  For one thing, NFS would break horribly if you didn't have
stable inode numbers.  Never mind applications which depend on POSIX
semantics.  And you wouldn't be able to save games in rogue or
nethack, either.  :-)

Overlayfs may not, currently, but it's considered a bug.

	      	   	      	       - Ted

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-06 14:50       ` Theodore Ts'o
@ 2017-02-06 15:18         ` James Bottomley
  2017-02-06 15:38           ` lkml
  2017-02-06 21:52           ` J. Bruce Fields
  0 siblings, 2 replies; 82+ messages in thread
From: James Bottomley @ 2017-02-06 15:18 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: J. R. Okajima, Djalal Harouni, Chris Mason, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Mon, 2017-02-06 at 09:50 -0500, Theodore Ts'o wrote:
> On Sun, Feb 05, 2017 at 10:46:23PM -0800, James Bottomley wrote:
> > Yes, I know the problem.  However, I believe most current linux
> > filesystems no longer guarantee stable, for the lifetime of the 
> > file, inode numbers.  The usual docker container root is overlayfs,
> > which, similarly doesn't support stable inode numbers.  I see the 
> > odd complaint about docker with overlayfs having unstable inode
> > numbers, but none seems to have any serious repercussions.
> 
> Um, no.  Most current linux file systems *do* guarantee stable inode
> numbers.  For one thing, NFS would break horribly if you didn't have
> stable inode numbers.  Never mind applications which depend on POSIX
> semantics.  And you wouldn't be able to save games in rogue or
> nethack, either.  :-)

I believe that's why we have the superblock export operations to
manufacture unique filehandles in the absence of inode number
stability.  The generic one uses inode numbers, but it doesn't have to.
I thought reiserfs (if we can go back that far) was the first
generally used filesystem that didn't guarantee stable inode numbers,
so we have a lot of historical precedent.

Thanks to reiserfs, I thought we also iterated to weak stability
guarantees for inode numbers which mean no inconsistencies in
applications that use inode numbers for caching?  It's still not POSIX,
but I thought it was good enough for most use cases.

> Overlayfs may not, currently, but it's considered a bug.

James

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-06 15:18         ` James Bottomley
@ 2017-02-06 15:38           ` lkml
  2017-02-06 17:32             ` James Bottomley
  2017-02-06 21:52           ` J. Bruce Fields
  1 sibling, 1 reply; 82+ messages in thread
From: lkml @ 2017-02-06 15:38 UTC (permalink / raw)
  To: James Bottomley
  Cc: Theodore Ts'o, J. R. Okajima, Djalal Harouni, Chris Mason,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Mon, Feb 06, 2017 at 07:18:16AM -0800, James Bottomley wrote:
> On Mon, 2017-02-06 at 09:50 -0500, Theodore Ts'o wrote:
> > On Sun, Feb 05, 2017 at 10:46:23PM -0800, James Bottomley wrote:
> > > Yes, I know the problem.  However, I believe most current linux
> > > filesystems no longer guarantee stable, for the lifetime of the 
> > > file, inode numbers.  The usual docker container root is overlayfs,
> > > which, similarly doesn't support stable inode numbers.  I see the 
> > > odd complaint about docker with overlayfs having unstable inode
> > > numbers, but none seems to have any serious repercussions.
> > 
> > Um, no.  Most current linux file systems *do* guarantee stable inode
> > numbers.  For one thing, NFS would break horribly if you didn't have
> > stable inode numbers.  Never mind applications which depend on POSIX
> > semantics.  And you wouldn't be able to save games in rogue or
> > nethack, either.  :-)
> 
> I believe that's why we have the superblock export operations to
> manufacture unique filehandles in the absence of inode number
> stability.  The generic one uses inode numbers, but it doesn't have to.
>  I thought reiserfs (if we can go back that far) was the first
> generally used filesystem that didn't guarantee stable inode numbers,
> so we have a lot of historical precedence.
> 
> Thanks to reiserfs, I thought we also iterated to weak stability
> guarantees for inode numbers which mean no inconsistencies in
> applications that use inode numbers for caching?  It's still not POSIX,
> but I thought it was good enough for most use cases.
> 

Even plain tar extraction is sensitive to directory inode stability:
http://git.savannah.gnu.org/cgit/tar.git/tree/src/extract.c?h=release_1_29#n867

This caused errors on overlayfs if the extraction churned through enough
of the dentry cache to evict the relevant directory (can be forced to
reproduce reliably via drop_caches).

Regards,
Vito Caputo

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-06  6:46     ` James Bottomley
  2017-02-06 14:50       ` Theodore Ts'o
@ 2017-02-06 16:24       ` J. R. Okajima
  2017-02-21  0:48         ` James Bottomley
  1 sibling, 1 reply; 82+ messages in thread
From: J. R. Okajima @ 2017-02-06 16:24 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

James Bottomley:
> Yes, I know the problem.  However, I believe most current linux
> filesystems no longer guarantee stable, for the lifetime of the file,
> inode numbers.  The usual docker container root is overlayfs, which,
> similarly doesn't support stable inode numbers.  I see the odd
> complaint about docker with overlayfs having unstable inode numbers,
> but none seems to have any serious repercussions.

I think it is serious.
Reusing the backend fs' inum, as Amir wrote, is a good approach.
Based on this, I'd suggest you also support hardlinks.

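backend_dentry = lookup_one_len();
backend_inode = backend_dentry->d_inode;
shiftfs_inode = NULL;
/* hardlinked: reuse a shiftfs inode already created for this inum */
if (backend_inode->i_nlink != 1)
	shiftfs_inode = ilookup();
if (!shiftfs_inode) {
	shiftfs_inode = new_inode();
	shiftfs_inode->i_ino = backend_inode->i_ino;
}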


J. R. Okajima

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-06  6:38     ` Amir Goldstein
@ 2017-02-06 16:29       ` James Bottomley
  0 siblings, 0 replies; 82+ messages in thread
From: James Bottomley @ 2017-02-06 16:29 UTC (permalink / raw)
  To: Amir Goldstein, J. R. Okajima
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, LSM List, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Mon, 2017-02-06 at 08:38 +0200, Amir Goldstein wrote:
> On Mon, Feb 6, 2017 at 5:25 AM, J. R. Okajima <hooanon05g@gmail.com>
> wrote:
> > James Bottomley:
> > > This allows any subtree to be uid/gid shifted and bound
> > > elsewhere.  It
> >         :::
> > 
> > Interesting.
> > But I am afraid that the inconsistency problem of the inode numbers 
> > will happen.
> > 
> 
> Yet another example that overlayfs already is in the process of 
> solving (it is fixed for stat of merged directory inode).
> In fact, for the case of single layer overlay (as well as shiftfs) 
> the solution is trivial - preserve underlying inode st_ino/d_ino and 
> use the overlayed fs st_dev.

Not sure I follow what st_ino is; do you mean s_root->d_inode->i_ino?
Or did you mean s_dev (which is more traditional)?

The problem with this is there's no way to ensure global uniqueness in
a mapping that goes (ino, ino) -> (ino) (or (s_dev, ino) -> (ino)) and
I believe global uniqueness is more important because the i_ino is used
in the hashed lookups.  Secondly you're not guaranteed that s_root
->d_inode->i_ino is unique ... historically a lot of filesystems use a
well known inode number as the root, that's why filehandles
traditionally used something representing the device and the inode
number (we also have s_dev uniqueness problems for tmpfs which is used
in some overlays).

We can certainly construct a filehandle using an export operations
override that is unique and can be used to lookup the underlying object
(based on the underlying device and inode).

James

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-06 15:38           ` lkml
@ 2017-02-06 17:32             ` James Bottomley
  0 siblings, 0 replies; 82+ messages in thread
From: James Bottomley @ 2017-02-06 17:32 UTC (permalink / raw)
  To: lkml
  Cc: Theodore Ts'o, J. R. Okajima, Djalal Harouni, Chris Mason,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Mon, 2017-02-06 at 09:38 -0600, lkml@pengaru.com wrote:
> On Mon, Feb 06, 2017 at 07:18:16AM -0800, James Bottomley wrote:
> > On Mon, 2017-02-06 at 09:50 -0500, Theodore Ts'o wrote:
> > > On Sun, Feb 05, 2017 at 10:46:23PM -0800, James Bottomley wrote:
> > > > Yes, I know the problem.  However, I believe most current linux
> > > > filesystems no longer guarantee stable, for the lifetime of the
> > > > file, inode numbers.  The usual docker container root is
> > > > overlayfs,
> > > > which, similarly doesn't support stable inode numbers.  I see
> > > > the 
> > > > odd complaint about docker with overlayfs having unstable inode
> > > > numbers, but none seems to have any serious repercussions.
> > > 
> > > Um, no.  Most current linux file systems *do* guarantee stable
> > > inode
> > > numbers.  For one thing, NFS would break horribly if you didn't
> > > have
> > > stable inode numbers.  Never mind applications which depend on
> > > POSIX
> > > semantics.  And you wouldn't be able to save games in rogue or
> > > nethack, either.  :-)
> > 
> > I believe that's why we have the superblock export operations to
> > manufacture unique filehandles in the absence of inode number
> > stability.  The generic one uses inode numbers, but it doesn't have
> > to.
> >  I thought reiserfs (if we can go back that far) was the first
> > generally used filesystem that didn't guarantee stable inode
> > numbers,
> > so we have a lot of historical precedence.
> > 
> > Thanks to reiserfs, I thought we also iterated to weak stability
> > guarantees for inode numbers which mean no inconsistencies in
> > applications that use inode numbers for caching?  It's still not
> > POSIX,
> > but I thought it was good enough for most use cases.
> > 
> 
> Even plain tar extraction is sensitive to directory inode stability:
> http://git.savannah.gnu.org/cgit/tar.git/tree/src/extract.c?h=release
> _1_29#n867
> 
> This caused errors on overlayfs if the extraction churned through 
> enough of the dentry cache to evict the relevant directory (can be 
> forced to reproduce reliably via drop_caches).

Yes, I know the bug.  I think it's up to tar maintainers, but if they
want to support weakly POSIX filesystems, they should really be using
the filehandle for this check, not device and inode number.

That said, I believe reiserfs was our only other filesystem with weak
inode number stability guarantees and that's hardly in common use
today, so if we can find a solution that gives strong stability
guarantees for our current problem filesystems, there's no reason not
to use it generally.

James

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-06 15:18         ` James Bottomley
  2017-02-06 15:38           ` lkml
@ 2017-02-06 21:52           ` J. Bruce Fields
  2017-02-07  0:10             ` James Bottomley
  1 sibling, 1 reply; 82+ messages in thread
From: J. Bruce Fields @ 2017-02-06 21:52 UTC (permalink / raw)
  To: James Bottomley
  Cc: Theodore Ts'o, J. R. Okajima, Djalal Harouni, Chris Mason,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Mon, Feb 06, 2017 at 07:18:16AM -0800, James Bottomley wrote:
> On Mon, 2017-02-06 at 09:50 -0500, Theodore Ts'o wrote:
> > On Sun, Feb 05, 2017 at 10:46:23PM -0800, James Bottomley wrote:
> > > Yes, I know the problem.  However, I believe most current linux
> > > filesystems no longer guarantee stable, for the lifetime of the 
> > > file, inode numbers.  The usual docker container root is overlayfs,
> > > which, similarly doesn't support stable inode numbers.  I see the 
> > > odd complaint about docker with overlayfs having unstable inode
> > > numbers, but none seems to have any serious repercussions.
> > 
> > Um, no.  Most current linux file systems *do* guarantee stable inode
> > numbers.  For one thing, NFS would break horribly if you didn't have
> > stable inode numbers.  Never mind applications which depend on POSIX
> > semantics.  And you wouldn't be able to save games in rogue or
> > nethack, either.  :-)
> 
> I believe that's why we have the superblock export operations to
> manufacture unique filehandles in the absence of inode number
> stability.

Where did you hear that?

I'd expect an NFS client to handle non-unique filehandles
better than non-unique inode numbers.  I believe our client will -EIO on
encountering an inode number change (see nfs_check_inode_attributes().)

See also https://tools.ietf.org/html/rfc5661#section-10.3.4.

--b.

> The generic one uses inode numbers, but it doesn't have to.
>  I thought reiserfs (if we can go back that far) was the first
> generally used filesystem that didn't guarantee stable inode numbers,
> so we have a lot of historical precedence.
> 
> Thanks to reiserfs, I thought we also iterated to weak stability
> guarantees for inode numbers which mean no inconsistencies in
> applications that use inode numbers for caching?  It's still not POSIX,
> but I thought it was good enough for most use cases.
> 
> > Overlayfs may not, currently, but it's considered a bug.
> 
> James
> 

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-06 21:52           ` J. Bruce Fields
@ 2017-02-07  0:10             ` James Bottomley
  2017-02-07  1:35               ` J. Bruce Fields
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-07  0:10 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Theodore Ts'o, J. R. Okajima, Djalal Harouni, Chris Mason,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Mon, 2017-02-06 at 16:52 -0500, J. Bruce Fields wrote:
> On Mon, Feb 06, 2017 at 07:18:16AM -0800, James Bottomley wrote:
> > On Mon, 2017-02-06 at 09:50 -0500, Theodore Ts'o wrote:
> > > On Sun, Feb 05, 2017 at 10:46:23PM -0800, James Bottomley wrote:
> > > > Yes, I know the problem.  However, I believe most current linux
> > > > filesystems no longer guarantee stable, for the lifetime of the
> > > > file, inode numbers.  The usual docker container root is 
> > > > overlayfs, which, similarly doesn't support stable inode 
> > > > numbers.  I see the odd complaint about docker with overlayfs 
> > > > having unstable inode numbers, but none seems to have any
> > > > serious repercussions.
> > > 
> > > Um, no.  Most current linux file systems *do* guarantee stable 
> > > inode numbers.  For one thing, NFS would break horribly if you 
> > > didn't have stable inode numbers.  Never mind applications which 
> > > depend on POSIX semantics.  And you wouldn't be able to save 
> > > games in rogue or nethack, either.  :-)
> > 
> > I believe that's why we have the superblock export operations to
> > manufacture unique filehandles in the absence of inode number
> > stability.
> 
> Where did you hear that?
> 
> I'd expect an NFS client to handle non-unique filehandles
> better than non-unique inode numbers.  I believe our client will -EIO 
> on encountering an inode number change (see
> nfs_check_inode_attributes().)
> 
> See also https://tools.ietf.org/html/rfc5661#section-10.3.4.

Could you clarify your point a bit further, please?  Both the
nfs_check_inode_attributes() code and section 10.3.4 are talking about
fileids, which are the things that are constructed in the export_ops
... admittedly a lot of fileid_types are based on inode numbers, but
several aren't.  For those that aren't, I believe NFS doesn't care
about the underlying inode number of the exported file.

James

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07  0:10             ` James Bottomley
@ 2017-02-07  1:35               ` J. Bruce Fields
  2017-02-07 19:01                 ` James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: J. Bruce Fields @ 2017-02-07  1:35 UTC (permalink / raw)
  To: James Bottomley
  Cc: Theodore Ts'o, J. R. Okajima, Djalal Harouni, Chris Mason,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Mon, Feb 06, 2017 at 04:10:11PM -0800, James Bottomley wrote:
> On Mon, 2017-02-06 at 16:52 -0500, J. Bruce Fields wrote:
> > On Mon, Feb 06, 2017 at 07:18:16AM -0800, James Bottomley wrote:
> > > On Mon, 2017-02-06 at 09:50 -0500, Theodore Ts'o wrote:
> > > > On Sun, Feb 05, 2017 at 10:46:23PM -0800, James Bottomley wrote:
> > > > > Yes, I know the problem.  However, I believe most current linux
> > > > > filesystems no longer guarantee stable, for the lifetime of the
> > > > > file, inode numbers.  The usual docker container root is 
> > > > > overlayfs, which, similarly doesn't support stable inode 
> > > > > numbers.  I see the odd complaint about docker with overlayfs 
> > > > > having unstable inode numbers, but none seems to have any
> > > > > serious repercussions.
> > > > 
> > > > Um, no.  Most current linux file systems *do* guarantee stable 
> > > > inode numbers.  For one thing, NFS would break horribly if you 
> > > > didn't have stable inode numbers.  Never mind applications which 
> > > > depend on POSIX semantics.  And you wouldn't be able to save 
> > > > games in rogue or nethack, either.  :-)
> > > 
> > > I believe that's why we have the superblock export operations to
> > > manufacture unique filehandles in the absence of inode number
> > > stability.
> > 
> > Where did you hear that?
> > 
> > I'd expect an NFS client to handle non-unique filehandles
> > better than non-unique inode numbers.  I believe our client will -EIO 
> > on encountering an inode number change (see
> > nfs_check_inode_attributes().)
> > 
> > See also https://tools.ietf.org/html/rfc5661#section-10.3.4.
> 
> Could you clarify your point a bit further, please?  Both the
> check_inode_attributes() code and section 10.3.4 are talking about
> fileids, which are the things that are constructed in the export_ops

No, the filehandle structure isn't discussed in the rfc at all, that's
opaque to clients, and the "fileid" you see in the export code isn't
what's discussed here.

The "fileid" here is an NFS attribute, really just the NFS protocol's
name for the inode number.  The server code that returns fileids:

	if (bmval0 & FATTR4_WORD0_FILEID) {
		p = xdr_reserve_space(xdr, 8);
		if (!p)
			goto out_resource;
		p = xdr_encode_hyper(p, stat.ino);
	}

The client getattr code:

	stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));

--b.

> ... admittedly a lot of fileid_types are based on inode numbers, but
> several aren't.  For those that aren't, I believe NFS doesn't care
> about the underlying inode number of the exported file.

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-04 19:19 ` [RFC 1/1] shiftfs: uid/gid shifting bind mount James Bottomley
  2017-02-05  7:51   ` Amir Goldstein
  2017-02-06  3:25   ` J. R. Okajima
@ 2017-02-07  9:19   ` Christoph Hellwig
  2017-02-07  9:39     ` Djalal Harouni
  2017-02-07 16:37     ` James Bottomley
  2017-02-15 20:34   ` Vivek Goyal
  2017-02-17  2:29   ` Al Viro
  4 siblings, 2 replies; 82+ messages in thread
From: Christoph Hellwig @ 2017-02-07  9:19 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
> This allows any subtree to be uid/gid shifted and bound elsewhere.  It
> does this by operating similarly to overlayfs.  Its primary use is for
> shifting the underlying uids of filesystems used to support
> unprivileged (uid shifted) containers.  The usual use case here is
> that the container is operating with an uid shifted unprivileged root
> but sometimes needs to make use of or work with a filesystem image
> that has root at real uid 0.
> 
> The mechanism is to allow any subordinate mount namespace to mount a
> shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only allowing
> it to mount marked subtrees (using the -o mark option as root).  Once
> mounted, the subtree is mapped via the super block user namespace so
> that the interior ids of the mounting user namespace are the ids
> written to the filesystem.

Please move this into VFS instead of a stackable fs.  We might need
additional parameters to getattr/setattr to specify the ID translation,
but that's way better than a horrible hack like this.

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07  9:19   ` Christoph Hellwig
@ 2017-02-07  9:39     ` Djalal Harouni
  2017-02-07  9:53       ` Christoph Hellwig
  2017-02-07 16:37     ` James Bottomley
  1 sibling, 1 reply; 82+ messages in thread
From: Djalal Harouni @ 2017-02-07  9:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, Linux FS Devel,
	linux-kernel, LSM List, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

Hi,

On Tue, Feb 7, 2017 at 10:19 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
>> This allows any subtree to be uid/gid shifted and bound elsewhere.  It
>> does this by operating similarly to overlayfs.  Its primary use is for
>> shifting the underlying uids of filesystems used to support
>> unprivileged (uid shifted) containers.  The usual use case here is
>> that the container is operating with an uid shifted unprivileged root
>> but sometimes needs to make use of or work with a filesystem image
>> that has root at real uid 0.
>>
>> The mechanism is to allow any subordinate mount namespace to mount a
>> shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only allowing
>> it to mount marked subtrees (using the -o mark option as root).  Once
>> mounted, the subtree is mapped via the super block user namespace so
>> that the interior ids of the mounting user namespace are the ids
>> written to the filesystem.
>
> Please move this into VFS instead of a stackable fs.  We might need
> additional parameters to getattr/setattr to specify the ID translation,
> but that's way better than a horrible hack like this.

I proposed an RFC months ago which implements all of this at the VFS
layer [1]. I received some feedback, especially from Dave Chinner;
however, I failed to fix my bugs and improve it for lack of resources...

The problems discussed here about a new filesystem (inode numbers,
quota and many other things) were all noted in that thread and in
previous threads about shiftfs. We are turning this into a heavy problem
compared to all other namespaces... other namespaces integrate
perfectly with other subsystems and the rest of the layers, there is no
special treatment...

Christoph, for the getattr/setattr it won't work since internally the
resolved path may point to a different mount context where we do not
want the ID translation, and we may end up using the wrong vfsmount. A
simple getattr/setattr won't work unless there are bigger changes
too...

[1] https://lkml.org/lkml/2016/5/4/411

-- 
tixxdz

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07  9:39     ` Djalal Harouni
@ 2017-02-07  9:53       ` Christoph Hellwig
  0 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2017-02-07  9:53 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Christoph Hellwig, James Bottomley, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	Linux FS Devel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Tue, Feb 07, 2017 at 10:39:48AM +0100, Djalal Harouni wrote:
> I proposed an RFC months ago which implements all of this at the VFS
> layer [1]. I received some feedback, especially from Dave Chinner;
> however, I failed to fix my bugs and improve it for lack of resources...

And none of the issues goes away by hiding them in a stackable fs,
in fact many of them are getting worse.

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07  9:19   ` Christoph Hellwig
  2017-02-07  9:39     ` Djalal Harouni
@ 2017-02-07 16:37     ` James Bottomley
  2017-02-07 17:59       ` Amir Goldstein
  1 sibling, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-07 16:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Tue, 2017-02-07 at 01:19 -0800, Christoph Hellwig wrote:
> On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
> > This allows any subtree to be uid/gid shifted and bound elsewhere. 
> >  It does this by operating similarly to overlayfs.  Its primary use 
> > is for shifting the underlying uids of filesystems used to support
> > unprivileged (uid shifted) containers.  The usual use case here is
> > that the container is operating with an uid shifted unprivileged 
> > root but sometimes needs to make use of or work with a filesystem 
> > image that has root at real uid 0.
> > 
> > The mechanism is to allow any subordinate mount namespace to mount 
> > a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only
> > allowing it to mount marked subtrees (using the -o mark option as 
> > root).  Once mounted, the subtree is mapped via the super block 
> > user namespace so that the interior ids of the mounting user 
> > namespace are the ids written to the filesystem.
> 
> Please move this into VFS instead of a stackable fs.  We might need
> additional parameters to getattr/setattr to specify the ID 
> translation, but that's way better than a horrible hack like this.

I would need a lot more than that: getattr controls the cosmetic
permission display to the user, but enforcement is done in the core
permission checks which are inode based.  To make this a real bind
mount, the core permission checks will have to become subtree aware
because knowledge of whether we need a uid shift in the permission
check becomes a subtree property.  Effectively inode_permission would
become dentry_permission and generic_permission would take a dentry
instead of an inode.  This will be a huge amount of VFS and underlying
filesystem churn, since the permissions calls are threaded through a
huge chunk of code.

Is this the approach that you really want?

I suppose I could see the security people liking it because all the
security hooks in the permission code become path aware.

James

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07 16:37     ` James Bottomley
@ 2017-02-07 17:59       ` Amir Goldstein
  2017-02-07 18:10         ` Christoph Hellwig
  2017-02-07 18:20         ` James Bottomley
  0 siblings, 2 replies; 82+ messages in thread
From: Amir Goldstein @ 2017-02-07 17:59 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Tue, Feb 7, 2017 at 6:37 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Tue, 2017-02-07 at 01:19 -0800, Christoph Hellwig wrote:
>> On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
>> > This allows any subtree to be uid/gid shifted and bound elsewhere.
>> >  It does this by operating similarly to overlayfs.  Its primary use
>> > is for shifting the underlying uids of filesystems used to support
>> > unprivileged (uid shifted) containers.  The usual use case here is
>> > that the container is operating with an uid shifted unprivileged
>> > root but sometimes needs to make use of or work with a filesystem
>> > image that has root at real uid 0.
>> >
>> > The mechanism is to allow any subordinate mount namespace to mount
>> > a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only
>> > allowing it to mount marked subtrees (using the -o mark option as
>> > root).  Once mounted, the subtree is mapped via the super block
>> > user namespace so that the interior ids of the mounting user
>> > namespace are the ids written to the filesystem.
>>
>> Please move this into VFS instead of a stackable fs.  We might need
>> additional parameters to getattr/setattr to specify the ID
>> translation, but that's way better than a horrible hack like this.
>
> I would need a lot more than that: getattr controls the cosmetic
> permission display to the user, but enforcement is done in the core
> permission checks which are inode based.  To make this a real bind
> mount, the core permission checks will have to become subtree aware
> because knowledge of whether we need a uid shift in the permission
> check becomes a subtree property.  Effectively inode_permission would
> become dentry_permission and generic_permission would take a dentry
> instead of an inode.  This will be a huge amount of VFS and underlying
> filesystem churn, since the permissions calls are threaded through a
> huge chunk of code.
>

I am not even sure that would be enough.
A dentry does not contain information about the mount the user came from,
and sb contains only information about the user ns of the mounter of
the file system, not the mounter of the bind mount, right?
I think I am missing some big pieces of the big picture.
Would love to hear what Eric has to say.

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07 17:59       ` Amir Goldstein
@ 2017-02-07 18:10         ` Christoph Hellwig
  2017-02-07 19:02           ` James Bottomley
  2017-02-07 18:20         ` James Bottomley
  1 sibling, 1 reply; 82+ messages in thread
From: Christoph Hellwig @ 2017-02-07 18:10 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: James Bottomley, Christoph Hellwig, Djalal Harouni, Chris Mason,
	Theodore Tso, Josh Triplett, Eric W. Biederman, Andy Lutomirski,
	Seth Forshee, linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Tue, Feb 07, 2017 at 07:59:00PM +0200, Amir Goldstein wrote:
> I am not even sure that would be enough.
> dentry does not contain information about the mount user came from,
> and sb contains only information about the user ns of the mounter of
> the file system, not the mounter of the bind mount, right?
> I think I am missing some big pieces of the big picture.
> Would love to hear what Eric has to say.

IFF we want to do what shiftfs does properly we need vfsmount + inode,
no need for the dentry.

But maybe we need to go back and decide if we want to allow uid/gid
remapping for arbitrary subtrees anyway.  Another option would be
to require something like a project as used for project quotas as the
root.  This would also be convenient as it could store the
remapping tables in use.

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07 17:59       ` Amir Goldstein
  2017-02-07 18:10         ` Christoph Hellwig
@ 2017-02-07 18:20         ` James Bottomley
  2017-02-07 19:48           ` Djalal Harouni
  1 sibling, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-07 18:20 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christoph Hellwig, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Tue, 2017-02-07 at 19:59 +0200, Amir Goldstein wrote:
> On Tue, Feb 7, 2017 at 6:37 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > On Tue, 2017-02-07 at 01:19 -0800, Christoph Hellwig wrote:
> > > On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
> > > > This allows any subtree to be uid/gid shifted and bound
> > > > elsewhere.
> > > >  It does this by operating similarly to overlayfs.  Its primary
> > > > use
> > > > is for shifting the underlying uids of filesystems used to
> > > > support
> > > > unprivileged (uid shifted) containers.  The usual use case here
> > > > is
> > > > that the container is operating with an uid shifted
> > > > unprivileged
> > > > root but sometimes needs to make use of or work with a
> > > > filesystem
> > > > image that has root at real uid 0.
> > > > 
> > > > The mechanism is to allow any subordinate mount namespace to
> > > > mount
> > > > a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only
> > > > allowing it to mount marked subtrees (using the -o mark option
> > > > as
> > > > root).  Once mounted, the subtree is mapped via the super block
> > > > user namespace so that the interior ids of the mounting user
> > > > namespace are the ids written to the filesystem.
> > > 
> > > Please move this into VFS instead of a stackable fs.  We might
> > > need
> > > additional parameters to getattr/setattr to specify the ID
> > > translation, but that's way better than a horrible hack like
> > > this.
> > 
> > I would need a lot more than that: getattr controls the cosmetic
> > permission display to the user, but enforcement is done in the core
> > permission checks which are inode based.  To make this a real bind
> > mount, the core permission checks will have to become subtree aware
> > because knowledge of whether we need a uid shift in the permission
> > check becomes a subtree property.  Effectively inode_permission
> > would
> > become dentry_permission and generic_permission would take a dentry
> > instead of an inode.  This will be a huge amount of VFS and
> > underlying
> > filesystem churn, since the permissions calls are threaded through
> > a
> > huge chunk of code.
> > 
> 
> I am not even sure that would be enough.
> dentry does not contain information about the mount user came from,
> and sb contains only information about the user ns of the mounter of
> the file system, not the mounter of the bind mount, right?
> I think I am missing some big pieces of the big picture.
> Would love to hear what Eric has to say.

I'm not really sure until it gets prototyped, but I think the
filesystem user namespace would also have to become a subtree property.

The whole reason for shiftfs being a properly mounted filesystem is
because it needs a super block to capture the namespace it's being
mounted in.

However, when you have a container that you want remapping inside, you
must have a user namespace which owns a mount namespace, so we can
deduce the information from the mount namespace.  All we probably need
the subtree to tell us is if we're shifting or not.

James

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07  1:35               ` J. Bruce Fields
@ 2017-02-07 19:01                 ` James Bottomley
  2017-02-07 19:47                   ` Christoph Hellwig
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-07 19:01 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Theodore Ts'o, J. R. Okajima, Djalal Harouni, Chris Mason,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Mon, 2017-02-06 at 20:35 -0500, J. Bruce Fields wrote:
> On Mon, Feb 06, 2017 at 04:10:11PM -0800, James Bottomley wrote:
> > On Mon, 2017-02-06 at 16:52 -0500, J. Bruce Fields wrote:
> > > On Mon, Feb 06, 2017 at 07:18:16AM -0800, James Bottomley wrote:
> > > > On Mon, 2017-02-06 at 09:50 -0500, Theodore Ts'o wrote:
> > > > > On Sun, Feb 05, 2017 at 10:46:23PM -0800, James Bottomley
> > > > > wrote:
> > > > > > Yes, I know the problem.  However, I believe most current
> > > > > > linux
> > > > > > filesystems no longer guarantee stable, for the lifetime of
> > > > > > the
> > > > > > file, inode numbers.  The usual docker container root is 
> > > > > > overlayfs, which, similarly doesn't support stable inode 
> > > > > > numbers.  I see the odd complaint about docker with
> > > > > > overlayfs 
> > > > > > having unstable inode numbers, but none seems to have any
> > > > > > serious repercussions.
> > > > > 
> > > > > Um, no.  Most current linux file systems *do* guarantee
> > > > > stable 
> > > > > inode numbers.  For one thing, NFS would break horribly if
> > > > > you 
> > > > > didn't have stable inode numbers.  Never mind applications
> > > > > which 
> > > > > depend on POSIX semantics.  And you wouldn't be able to save 
> > > > > games in rogue or nethack, either.  :-)
> > > > 
> > > > I believe that's why we have the superblock export operations
> > > > to
> > > > manufacture unique filehandles in the absence of inode number
> > > > stability.
> > > 
> > > Where did you hear that?
> > > 
> > > I'd expect an NFS client to handle non-unique filehandles
> > > better than non-unique inode numbers.  I believe our client will 
> > > -EIO 
> > > on encountering an inode number change (see
> > > nfs_check_inode_attributes().)
> > > 
> > > See also https://tools.ietf.org/html/rfc5661#section-10.3.4.
> > 
> > Could you clarify your point a bit further, please?  Both the
> > check_inode_attributes() code and section 10.3.4 are talking about
> > fileids, which are the things that are constructed in the
> > export_ops
> 
> No, the filehandle structure isn't discussed in the rfc at all,
> that's
> opaque to clients, and the "fileid" you see in the export code isn't
> what's discussed here.
> 
> The "fileid" here is an NFS attribute, really just the NFS protocol's
> name for the inode number.  The server code that returns fileid's:
> 
> 	if (bmval0 & FATTR4_WORD0_FILEID) {
> 		p = xdr_reserve_space(xdr, 8);
> 		if (!p)
> 			goto out_resource;
> 		p = xdr_encode_hyper(p, stat.ino);
> 	}
> 
> The client getattr code:
> 
> 	stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));

OK, I now believe we may be talking about different things.  When I
said

> I believe that's why we have the superblock export operations to
> manufacture unique filehandles in the absence of inode number
> stability. 

I was talking about inode stability in the filesystem underlying the
export.  I believe you're talking about inode number stability
guarantees of the nfs client code itself, which are unrelated to the
inode number guarantees of the exported filesystem?

James

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07 18:10         ` Christoph Hellwig
@ 2017-02-07 19:02           ` James Bottomley
  2017-02-07 19:49             ` Christoph Hellwig
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-07 19:02 UTC (permalink / raw)
  To: Christoph Hellwig, Amir Goldstein
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, LSM List, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Tue, 2017-02-07 at 10:10 -0800, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 07:59:00PM +0200, Amir Goldstein wrote:
> > I am not even sure that would be enough.
> > dentry does not contain information about the mount user came from,
> > and sb contains only information about the user ns of the mounter
> > of
> > the file system, not the mounter of the bind mount, right?
> > I think I am missing some big pieces of the big picture.
> > Would love to hear what Eric has to say.
> 
> IFF we want to do what shiftfs does properly we need vfsmount + 
> inode, no need for the dentry.

Yes, sorry ... I was thinking the dentry contained the mnt, but it
doesn't, that's the path.  However, threading the mnt through looks
substantially harder.

> But maybe we need to go back and decide if we want to allow uid/gid
> remapping for arbitrary subtrees anyway.

So those were the original patches Djalal was referring to.  The
problem there is that a lot of orchestration systems don't store images
they want to bind mount into containers on separately mounted
filesystems, which is what's needed to avoid this being per-subtree.
However, the clinching argument for me is that the canonical container
image *is* a subtree (unlike a vm image which has to be mounted).  If
we don't make this work on subtrees people go back to daft stacks for
containers like copying the image subtree into a loopback mounted
filesystem just to make this all work (and then complain about
performance and caching and so on).

>   Another option would be to require something like a project as used 
> for project quotas as the root.  This would also be convenient as it
> could store the used remapping tables.

So this would be like the current project quota except set on a
subtree?  I could see it being done that way but I don't see what
advantage it has over using flags in the subtree itself (the mapping is
known based on the mount namespace, so there's really only a single bit
of information to store).

James

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07 19:01                 ` James Bottomley
@ 2017-02-07 19:47                   ` Christoph Hellwig
  0 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2017-02-07 19:47 UTC (permalink / raw)
  To: James Bottomley
  Cc: J. Bruce Fields, Theodore Ts'o, J. R. Okajima,
	Djalal Harouni, Chris Mason, Josh Triplett, Eric W. Biederman,
	Andy Lutomirski, Seth Forshee, linux-fsdevel, linux-kernel,
	linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Tue, Feb 07, 2017 at 11:01:08AM -0800, James Bottomley wrote:
> I was talking about inode stability in the filesystem underlying the
> export.  I believe you're talking about inode number stability
> guarantees of the nfs client code itself, which are unrelated to the
> inode number guarantees of the exported filesystem?

They are 1:1 correlated for a Linux server at least.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07 18:20         ` James Bottomley
@ 2017-02-07 19:48           ` Djalal Harouni
  0 siblings, 0 replies; 82+ messages in thread
From: Djalal Harouni @ 2017-02-07 19:48 UTC (permalink / raw)
  To: James Bottomley
  Cc: Amir Goldstein, Christoph Hellwig, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Tue, Feb 7, 2017 at 7:20 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Tue, 2017-02-07 at 19:59 +0200, Amir Goldstein wrote:
>> On Tue, Feb 7, 2017 at 6:37 PM, James Bottomley
>> <James.Bottomley@hansenpartnership.com> wrote:
>> > On Tue, 2017-02-07 at 01:19 -0800, Christoph Hellwig wrote:
>> > > On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
>> > > > This allows any subtree to be uid/gid shifted and bound
>> > > > elsewhere.
>> > > >  It does this by operating similarly to overlayfs.  Its primary
>> > > > use
>> > > > is for shifting the underlying uids of filesystems used to
>> > > > support
>> > > > unprivileged (uid shifted) containers.  The usual use case here
>> > > > is
>> > > > that the container is operating with an uid shifted
>> > > > unprivileged
>> > > > root but sometimes needs to make use of or work with a
>> > > > filesystem
>> > > > image that has root at real uid 0.
>> > > >
>> > > > The mechanism is to allow any subordinate mount namespace to
>> > > > mount
>> > > > a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only
>> > > > allowing it to mount marked subtrees (using the -o mark option
>> > > > as
>> > > > root).  Once mounted, the subtree is mapped via the super block
>> > > > user namespace so that the interior ids of the mounting user
>> > > > namespace are the ids written to the filesystem.
>> > >
>> > > Please move this into VFS instead of a stackable fs.  We might
>> > > need
>> > > additional parameters to getattr/setattr to specify the ID
>> > > translation, but that's way better than a horrible hack like
>> > > this.
>> >
>> > I would need a lot more than that: getattr controls the cosmetic
>> > permission display to the user, but enforcement is done in the core
>> > permission checks which are inode based.  To make this a real bind
>> > mount, the core permission checks will have to become subtree aware
>> > because knowledge of whether we need a uid shift in the permission
>> > check becomes a subtree property.  Effectively inode_permission
>> > would
>> > become dentry_permission and generic_permission would take a dentry
>> > instead of an inode.  This will be a huge amount of VFS and
>> > underlying
>> > filesystem churn, since the permissions calls are threaded through
>> > a
>> > huge chunk of code.
>> >
>>
>> I am not even sure that would be enough.
>> dentry does not contain information about the mount user came from,
>> and sb contains only information about the user ns of the mounter of
>> the file system, not the mounter of the bind mount, right?
>> I think I am missing some big pieces of the big picture.
>> Would love to hear what Eric has to say.
>
> I'm not really sure until it gets prototyped, but I think the
> filesystem user namespace would also have to become a subtree property.

Sorry, I don't want to derail the thread, but that was already prototyped.

> The whole reason for shiftfs being a properly mounted filesystem is
> because it needs a super block to capture the namespace it's being
> mounted in.
>
> However, when you have a container that you want remapping inside, you
> must have a user namespace which owns a mount namespace, so we can
> deduce the information from the mount namespace.  All we probably need
> the subtree to tell us is if we're shifting or not.

That's one of the use cases that you will definitely end up with... if
anyone did read that incomplete VFS RFC proposal:

"2) The solution is based on VFS and mount namespaces, we use the user
namespace of the containing mount namespace to check if we should shift
UIDs/GIDs from/to virtual <=> on-disk view.
If a filesystem was mounted with "vfs_shift_uids" and "vfs_shift_gids"
options, and if it shows up inside a mount namespace that supports VFS
UIDs/GIDs shifts then during each access we will remap UID/GID either
to virtual or to on-disk view using simple helper functions to allow the
access. In case the mount or current mount namespace do not support VFS
UID/GID shifts, we fallback to the old behaviour, no shift is performed." [1]

[1] https://lkml.org/lkml/2016/5/4/411
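
For readers who want the quoted rule in concrete form, here is a minimal
userspace sketch of it.  It is not code from the RFC: the struct names,
the helper names and the constant per-namespace offset "base" are all
assumptions made for illustration.  The only thing taken from the text
above is the gate: shift when both the mount and the mount namespace opt
in, otherwise fall back to no shift.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model types; not the structures from the RFC or the kernel. */
struct model_mount {
	bool shift_ids;		/* models "vfs_shift_uids"/"vfs_shift_gids" */
};

struct model_mnt_ns {
	bool supports_shift;	/* mount namespace opted in to VFS shifts */
	uint32_t base;		/* assumed constant offset for this namespace */
};

/* Shift only when both the mount and the mount namespace ask for it. */
static bool model_should_shift(const struct model_mount *mnt,
			       const struct model_mnt_ns *ns)
{
	assert(mnt && ns);
	return mnt->shift_ids && ns->supports_shift;
}

/* On-disk id -> virtual id; unchanged when no shift applies. */
uint32_t model_disk_to_virtual(const struct model_mount *mnt,
			       const struct model_mnt_ns *ns,
			       uint32_t disk_id)
{
	if (!model_should_shift(mnt, ns))
		return disk_id;
	return disk_id + ns->base;
}

/* Virtual id -> on-disk id; the inverse of the helper above. */
uint32_t model_virtual_to_disk(const struct model_mount *mnt,
			       const struct model_mnt_ns *ns,
			       uint32_t virtual_id)
{
	if (!model_should_shift(mnt, ns))
		return virtual_id;
	return virtual_id - ns->base;
}
```

With an assumed base of 100000, on-disk id 0 would map to 100000 and
back to 0, while a mount without the option passes ids through
untouched.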



-- 
tixxdz

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07 19:02           ` James Bottomley
@ 2017-02-07 19:49             ` Christoph Hellwig
  2017-02-07 20:05               ` James Bottomley
  2017-02-08  1:54               ` Josh Triplett
  0 siblings, 2 replies; 82+ messages in thread
From: Christoph Hellwig @ 2017-02-07 19:49 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, Amir Goldstein, Djalal Harouni, Chris Mason,
	Theodore Tso, Josh Triplett, Eric W. Biederman, Andy Lutomirski,
	Seth Forshee, linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> >   Another option would be to require something like a project as used 
> > for project quotas as the root.  This would also be convenient as it
> > could store the used remapping tables.
> 
> So this would be like the current project quota except set on a
> subtree?  I could see it being done that way but I don't see what
> advantage it has over using flags in the subtree itself (the mapping is
> known based on the mount namespace, so there's really only a single bit
> of information to store).

projects (which are the underlying concept for project quotas) are
per-subtree in practice - the flag is set on an inode and then
all directories and files underneath inherit the project ID;
hardlinking outside a project is prohibited.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07 19:49             ` Christoph Hellwig
@ 2017-02-07 20:05               ` James Bottomley
  2017-02-07 21:01                 ` Amir Goldstein
  2017-02-08  1:54               ` Josh Triplett
  1 sibling, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-07 20:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Amir Goldstein, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Tue, 2017-02-07 at 11:49 -0800, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> > >   Another option would be to require something like a project as
> > > used 
> > > for project quotas as the root.  This would also be convenient as
> > > it
> > > could store the used remapping tables.
> > 
> > So this would be like the current project quota except set on a
> > subtree?  I could see it being done that way but I don't see what
> > advantage it has over using flags in the subtree itself (the
> > mapping is
> > known based on the mount namespace, so there's really only a single
> > bit
> > of information to store).
> 
> projects (which are the underlying concept for project quotas) are
> per-subtree in practice - the flag is set on an inode and then
> all directories and files underneath inherit the project ID;
> hardlinking outside a project is prohibited.

OK, this is what I don't understand: how is something that's inode
based limited to be per-subtree?  The way I've seen the VFS operate it
seems that any given inode (and indeed dentry) can appear in many
subtrees so how do I limit them to just one?

James

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07 20:05               ` James Bottomley
@ 2017-02-07 21:01                 ` Amir Goldstein
  2017-02-07 22:25                   ` Christoph Hellwig
  0 siblings, 1 reply; 82+ messages in thread
From: Amir Goldstein @ 2017-02-07 21:01 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Tue, Feb 7, 2017 at 10:05 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Tue, 2017-02-07 at 11:49 -0800, Christoph Hellwig wrote:
>> On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
>> > >   Another option would be to require something like a project as
>> > > used
>> > > for project quotas as the root.  This would also be convenient as
>> > > it
>> > > could store the used remapping tables.
>> >
>> > So this would be like the current project quota except set on a
>> > subtree?  I could see it being done that way but I don't see what
>> > advantage it has over using flags in the subtree itself (the
>> > mapping is
>> > known based on the mount namespace, so there's really only a single
>> > bit
>> > of information to store).
>>
>> projects (which are the underlying concept for project quotas) are
>> per-subtree in practice - the flag is set on an inode and then
>> all directories and files underneath inherit the project ID;
>> hardlinking outside a project is prohibited.
>
> OK, this is what I don't understand: how is something that's inode
> based limited to be per-subtree?  The way I've seen the VFS operate it
> seems that any given inode (and indeed dentry) can appear in many
> subtrees so how do I limit them to just one?
>

Project id's are not exactly "subtree" semantic, but inheritance semantics,
which is not the same when non empty directories get their project id changed.
Here is a recap:
https://lwn.net/Articles/623835/

So if you created an empty directory and "marked" it for shiftuid and all
descendants inherited this property you would be able to check that property
on a per inode basis. Not sure that is what you are looking for?
I guess we should define the semantics for the required sub-tree marking,
before we can talk about solutions.
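
To make the difference between inheritance and true subtree semantics
concrete, here is a small userspace sketch.  It is only a model: the
struct and function names are made up for illustration and nothing here
is kernel or project-quota code.  The one behaviour it shows is that a
new object copies the mark from its parent at creation time, so changing
the mark on a non-empty directory later does not reach the children that
already exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an inode carrying an inheritable mark. */
struct model_node {
	bool shift_marked;			/* the inherited property */
	const struct model_node *parent;	/* NULL for a root */
};

/*
 * Create a child inside 'parent'.  The mark is copied once, at creation
 * time, which is what makes this "inheritance" rather than a property
 * evaluated over the whole subtree on every access.
 */
struct model_node model_create(const struct model_node *parent)
{
	assert(parent != NULL);

	struct model_node child = {
		.shift_marked = parent->shift_marked,
		.parent = parent,
	};
	return child;
}
```

Marking an empty directory and then populating it gives every
descendant the mark, so a per-inode check is enough in that case.
Flipping the mark on a directory that already has children leaves those
children with the old value, which is the case the semantics would have
to define.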

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07 21:01                 ` Amir Goldstein
@ 2017-02-07 22:25                   ` Christoph Hellwig
  2017-02-07 23:42                     ` James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: Christoph Hellwig @ 2017-02-07 22:25 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: James Bottomley, Christoph Hellwig, Djalal Harouni, Chris Mason,
	Theodore Tso, Josh Triplett, Eric W. Biederman, Andy Lutomirski,
	Seth Forshee, linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:
> Project id's are not exactly "subtree" semantic, but inheritance semantics,
> which is not the same when non empty directories get their project id changed.
> Here is a recap:
> https://lwn.net/Articles/623835/

Yes - but if we abuse them for containers we could refine the semantics
to simply not allow change of project ids from inside containers
based on say capabilities.

> I guess we should define the semantics for the required sub-tree marking,
> before we can talk about solutions.

Good plan.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07 22:25                   ` Christoph Hellwig
@ 2017-02-07 23:42                     ` James Bottomley
  2017-02-08  6:44                       ` Amir Goldstein
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-07 23:42 UTC (permalink / raw)
  To: Christoph Hellwig, Amir Goldstein
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, LSM List, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Tue, 2017-02-07 at 14:25 -0800, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:
> > Project id's are not exactly "subtree" semantic, but inheritance
> > semantics,
> > which is not the same when non empty directories get their project
> > id changed.
> > Here is a recap:
> > https://lwn.net/Articles/623835/
> 
> Yes - but if we abuse them for containers we could refine the 
> semantics to simply not allow change of project ids from inside 
> containers based on say capabilities.

We can't really abuse projectid, it's part of the user namespace
mapping (for project quota).  What we can do is have a new id that
behaves like it.

But like I said, we don't really need a full ID, it would basically just
be a single bit mark to say remap or not when doing permission checks
against this inode.  It would follow some of the project id semantics
(like inheritance from parent dir)

> > I guess we should define the semantics for the required sub-tree 
> > marking, before we can talk about solutions.
> 
> Good plan.

So I've been thinking about how to do this without subtree marking and
yet retain the subtree properties similar to project id.  The advantage
would be that if it can be done using only inode properties, then none
of the permission prototypes need change.  The only real subtree
property we need is ability to bind into an unprivileged mount
namespace, but we already have that.  The gotcha about marking inodes
is that they're all or nothing, so every subtree that gets access to
the inode inherits the mark.  This means that we cannot allow a user
access to a marked inode without the cover of an unprivileged user
namespace, but I think that's fixable in the permission check
(basically if the inode is marked you *only* get access if you have a
user_ns != init_user_ns and we do the permission shifts or you have
user_ns == init_user_ns and you are admin capable).
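
The rule in the parenthesis can be written down as a small decision
function.  This is a userspace sketch only, with hypothetical names
throughout; it models the three outcomes described above and says
nothing about how the real permission path would be wired up.  "Allow"
here means "go on to the usual permission checks", not "grant access".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; none of these names exist in the kernel. */
struct model_inode {
	bool shift_marked;	/* the single-bit "remap" mark */
};

struct model_cred {
	bool in_init_user_ns;	/* models user_ns == init_user_ns */
	bool admin_capable;	/* models being admin capable there */
};

enum model_access {
	MODEL_DENY,		/* marked inode, no cover of a user ns */
	MODEL_ALLOW_UNSHIFTED,	/* proceed with ordinary checks */
	MODEL_ALLOW_SHIFTED,	/* proceed with checks on shifted ids */
};

enum model_access model_marked_access(const struct model_inode *inode,
				      const struct model_cred *cred)
{
	assert(inode && cred);

	/* Unmarked inodes are not affected by any of this. */
	if (!inode->shift_marked)
		return MODEL_ALLOW_UNSHIFTED;

	/* Inside a non-initial user namespace: do the permission shifts. */
	if (!cred->in_init_user_ns)
		return MODEL_ALLOW_SHIFTED;

	/* In init_user_ns a marked inode is reachable only by an admin. */
	return cred->admin_capable ? MODEL_ALLOW_UNSHIFTED : MODEL_DENY;
}
```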

James

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07 19:49             ` Christoph Hellwig
  2017-02-07 20:05               ` James Bottomley
@ 2017-02-08  1:54               ` Josh Triplett
  2017-02-08 15:22                 ` James Bottomley
  1 sibling, 1 reply; 82+ messages in thread
From: Josh Triplett @ 2017-02-08  1:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, Amir Goldstein, Djalal Harouni, Chris Mason,
	Theodore Tso, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> > >   Another option would be to require something like a project as used 
> > > for project quotas as the root.  This would also be convenient as it
> > > could store the used remapping tables.
> > 
> > So this would be like the current project quota except set on a
> > subtree?  I could see it being done that way but I don't see what
> > advantage it has over using flags in the subtree itself (the mapping is
> > known based on the mount namespace, so there's really only a single bit
> > of information to store).
> 
> projects (which are the underlying concept for project quotas) are
> per-subtree in practice - the flag is set on an inode and then
> all directories and files underneath inherit the project ID;
> hardlinking outside a project is prohibited.

I'm interested in having a VFS-level way to do more than just a shift;
I'd like to be able to arbitrarily remap IDs between what's on disk and
the system IDs.  If we're talking about developing a VFS-level solution
for this, I'd like to avoid limiting it to just a shift.  (A shift/range
would definitely be the simplest solution for many common container
cases, but not all.)
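
As an illustration of the difference between a shift and an arbitrary
remap, here is a userspace sketch of an extent table, loosely modelled on
the (first, lower first, count) triples of /proc/<pid>/uid_map.  The
type and function names are hypothetical and this is not a proposed
interface.  A plain shift is the special case of a single extent; an
arbitrary remap is simply a table with more entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID_ID UINT32_MAX	/* stands in for "unmapped" */

/* One contiguous run of ids: disk_first..disk_first+count-1. */
struct model_extent {
	uint32_t disk_first;	/* first id as stored on disk */
	uint32_t system_first;	/* first id as seen by the system */
	uint32_t count;		/* number of ids in the run */
};

/* Translate an on-disk id via the table; unmapped ids are invalid. */
uint32_t model_map_id(const struct model_extent *map, size_t n,
		      uint32_t disk_id)
{
	assert(map != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (disk_id >= map[i].disk_first &&
		    disk_id - map[i].disk_first < map[i].count)
			return map[i].system_first +
			       (disk_id - map[i].disk_first);
	}
	return MODEL_INVALID_ID;
}
```

A table { 0, 100000, 65536 } reproduces the usual container shift,
whereas { 0, 100000, 1 } plus { 1000, 5000, 1 } maps two unrelated ids
and leaves everything else unmapped.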

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-07 23:42                     ` James Bottomley
@ 2017-02-08  6:44                       ` Amir Goldstein
  2017-02-08 11:45                         ` Konstantin Khlebnikov
                                           ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: Amir Goldstein @ 2017-02-08  6:44 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes, Konstantin Khlebnikov

On Wed, Feb 8, 2017 at 1:42 AM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Tue, 2017-02-07 at 14:25 -0800, Christoph Hellwig wrote:
>> On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:
>> > Project id's are not exactly "subtree" semantic, but inheritance
>> > semantics,
>> > which is not the same when non empty directories get their project
>> > id changed.
>> > Here is a recap:
>> > https://lwn.net/Articles/623835/
>>
>> Yes - but if we abuse them for containers we could refine the
>> semantics to simply not allow change of project ids from inside
>> containers based on say capabilities.
>

You mean something like this:
https://lwn.net/Articles/632917/

With the suggested protected_projects, projid 0 (also inside container)
gets a special meaning, much like user 0, so we may do interesting
things with the projid that is mapped to 0.

> We can't really abuse projectid, it's part of the user namespace
> mapping (for project quota).  What we can do is have a new id that
> behaves like it.
>

Perhaps we *can* use projid without abusing it.
userns already maps projids, but there is no concept of "owning project"
for a userns, nor does it make a lot of sense, because projid is not
part of the credentials.
But if we re-brand it as "container root projid", we can try to use it
for defining semantics to grant unprivileged access to a subtree.

The functionality you are trying to get with shiftfs mark does
sound a bit like "container root projid":
- inodes with mapped projid MAY be uid/gid shifted
- inodes with unmapped projid MAY NOT

I realize this may be very raw, but it's a start. If you like this
direction we can try to develop it.

> But like I said, we don't really need a full ID, it would basically just
> be a single bit mark to say remap or not when doing permission checks
> against this inode.  It would follow some of the project id semantics
> (like inheritance from parent dir)
>

But a single bit would only work for a single level of userns nesting, won't it?


>> > I guess we should define the semantics for the required sub-tree
>> > marking, before we can talk about solutions.
>>
>> Good plan.
>
> So I've been thinking about how to do this without subtree marking and
> yet retain the subtree properties similar to project id.  The advantage
> would be that if it can be done using only inode properties, then none
> of the permission prototypes need change.  The only real subtree
> property we need is ability to bind into an unprivileged mount
> namespace, but we already have that.  The gotcha about marking inodes
> is that they're all or nothing, so every subtree that gets access to
> the inode inherits the mark.  This means that we cannot allow a user
> access to a marked inode without the cover of an unprivileged user
> namespace, but I think that's fixable in the permission check
> (basically if the inode is marked you *only* get access if you have a
> user_ns != init_user_ns and we do the permission shifts or you have
> user_ns == init_user_ns and you are admin capable).
>

I didn't follow, but it sounds like your proposed solution is only
good for a single level of userns nesting.
Do you think you can redefine it in terms of "container root projid"?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-08  6:44                       ` Amir Goldstein
@ 2017-02-08 11:45                         ` Konstantin Khlebnikov
  2017-02-08 14:57                         ` James Bottomley
  2017-02-08 15:15                         ` James Bottomley
  2 siblings, 0 replies; 82+ messages in thread
From: Konstantin Khlebnikov @ 2017-02-08 11:45 UTC (permalink / raw)
  To: Amir Goldstein, James Bottomley
  Cc: Christoph Hellwig, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On 08.02.2017 09:44, Amir Goldstein wrote:
> On Wed, Feb 8, 2017 at 1:42 AM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
>> On Tue, 2017-02-07 at 14:25 -0800, Christoph Hellwig wrote:
>>> On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:
>>>> Project id's are not exactly "subtree" semantic, but inheritance
>>>> semantics,
>>>> which is not the same when non empty directories get their project
>>>> id changed.
>>>> Here is a recap:
>>>> https://lwn.net/Articles/623835/
>>>
>>> Yes - but if we abuse them for containers we could refine the
>>> semantics to simply not allow change of project ids from inside
>>> containers based on say capabilities.
>>
>
> You mean something like this:
> https://lwn.net/Articles/632917/
>
> With the suggested protected_projects, projid 0 (also inside container)
> gets a special meaning, much like user 0, so we may do interesting
> things with the projid that is mapped to 0.
>
>> We can't really abuse projectid, it's part of the user namespace
>> mapping (for project quota).  What we can do is have a new id that
>> behaves like it.
>>
>
> Perhaps we *can* use projid without abusing it.
> userns already maps projids, but there is no concept of "owning project"
> for a userns, nor does it make a lot of sense, because projid is not
> part of the credentials.
> But if we re-brand it as "container root projid", we can try to use it
> for defining semantics to grant unprivileged access to a subtree.
>
> The functionality you are trying to get with shiftfs mark does
> sound a bit like "container root projid":
> - inodes with mapped projid MAY be uid/gid shifted
> - inodes with unmapped projid MAY NOT
>
> I realize this may be very raw, but its a start. If you like this
> direction we can try to develop it.
>
>> But like I said, we don't really need a ful ID, it would basically just
>> be a single bit mark to say remap or not when doing permission checks
>> against this inode.  It would follow some of the project id semantics
>> (like inheritance from parent dir)
>>
>
> But a single bit would only work for single level of userns nesting won't it?
>
>
>>>> I guess we should define the semantics for the required sub-tree
>>>> marking, before we can talk about solutions.
>>>
>>> Good plan.
>>
>> So I've been thinking about how to do this without subtree marking and
>> yet retain the subtree properties similar to project id.  The advantage
>> would be that if it can be done using only inode properties, then none
>> of the permission prototypes need change.  The only real subtree
>> property we need is ability to bind into an unprivileged mount
>> namespace, but we already have that.  The gotcha about marking inodes
>> is that they're all or nothing, so every subtree that gets access to
>> the inode inherits the mark.  This means that we cannot allow a user
>> access to a marked inode without the cover of an unprivileged user
>> namespace, but I think that's fixable in the permission check
>> (basically if the inode is marked you *only* get access if you have a
>> user_ns != init_user_ns and we do the permission shifts or you have
>> user_ns == init_user_ns and you are admin capable).
>>
>
> I didn't follow, but it sounds like your proposed solutions is only
> good for single level of userns nesting.
> Do you think you can redefine it in terms of "container root projid".
>

It looks like all of this started from mangling uid/gid or some other metadata.
As usual, I have to propose funny/insane solutions:
proxy the filesystem through FUSE and mangle everything in userspace.
Or add some kind of userspace-driven remapping/mangling into overlay,
for example using a BPF script (I see it everywhere nowadays).

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-08  6:44                       ` Amir Goldstein
  2017-02-08 11:45                         ` Konstantin Khlebnikov
@ 2017-02-08 14:57                         ` James Bottomley
  2017-02-08 15:15                         ` James Bottomley
  2 siblings, 0 replies; 82+ messages in thread
From: James Bottomley @ 2017-02-08 14:57 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christoph Hellwig, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes, Konstantin Khlebnikov

On Wed, 2017-02-08 at 08:44 +0200, Amir Goldstein wrote:
> On Wed, Feb 8, 2017 at 1:42 AM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > On Tue, 2017-02-07 at 14:25 -0800, Christoph Hellwig wrote:
> > > On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:
> > > > Project id's are not exactly "subtree" semantic, but 
> > > > inheritance semantics,
> > > > which is not the same when non empty directories get their
> > > > project
> > > > id changed.
> > > > Here is a recap:
> > > > https://lwn.net/Articles/623835/
> > > 
> > > Yes - but if we abuse them for containers we could refine the
> > > semantics to simply not allow change of project ids from inside
> > > containers based on say capabilities.
> > 
> 
> You mean something like this:
> https://lwn.net/Articles/632917/
> 
> With the suggested protected_projects, projid 0 (also inside 
> container) gets a special meaning, much like user 0, so we may do 
> interesting things with the projid that is mapped to 0.
> 
> > We can't really abuse projectid, it's part of the user namespace
> > mapping (for project quota).  What we can do is have a new id that
> > behaves like it.
> > 
> 
> Perhaps we *can* use projid without abusing it. userns already maps 
> projids, but there is no concept of "owning project" for a userns, 
> nor does it make a lot of sense, because projid is not part of the 
> credentials. But if we re-brand it as "container root projid", we can 
> try to use it for defining semantics to grant unprivileged access to
> a subtree.
> 
> The functionality you are trying to get with shiftfs mark does
> sounds a bit like "container root projid":
> - inodes with mapped projid MAY be uid/gid shifted
> - inodes with unmapped projid MAY NOT
> 
> I realize this may be very raw, but its a start. If you like this
> direction we can try to develop it.

So I don't think hijacking project id is the way to go.  If we do that
we interfere with using project quotas within containers.  Now that
project quotas work for both xfs and ext4, it's no longer really an xfs
specific feature.

I could see adding a shift on a per projectid basis, so project id
still had its quota meaning, but you could get the uid/gid shift from a
given project id.  However, the big kicker is that the only filesystems
you can actually set a projectid on (via the fsxattr) are ext4 and xfs.
That's too few to make it work universally (we'd at least need btrfs
and possibly a few others).

However, that's just mechanism.  We can begin with a volatile mark and
work out how we want to store it later.  I think following projectid
properties is the important part, so the choice of whether to hijack, or
attach to, projectid is preserved but not mandated.

James

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-08  6:44                       ` Amir Goldstein
  2017-02-08 11:45                         ` Konstantin Khlebnikov
  2017-02-08 14:57                         ` James Bottomley
@ 2017-02-08 15:15                         ` James Bottomley
  2 siblings, 0 replies; 82+ messages in thread
From: James Bottomley @ 2017-02-08 15:15 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christoph Hellwig, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes, Konstantin Khlebnikov

On Wed, 2017-02-08 at 08:44 +0200, Amir Goldstein wrote:
> On Wed, Feb 8, 2017 at 1:42 AM, James Bottomley
[...]
> > So I've been thinking about how to do this without subtree marking 
> > and yet retain the subtree properties similar to project id.  The
> > advantage would be that if it can be done using only inode 
> > properties, then none of the permission prototypes need change. 
> >  The only real subtree property we need is ability to bind into an 
> > unprivileged mount namespace, but we already have that.  The gotcha 
> > about marking inodes is that they're all or nothing, so every 
> > subtree that gets access to the inode inherits the mark.  This 
> > means that we cannot allow a user access to a marked inode without 
> > the cover of an unprivileged user namespace, but I think that's 
> > fixable in the permission check (basically if the inode is marked 
> > you *only* get access if you have a user_ns != init_user_ns and we 
> > do the permission shifts or you have user_ns == init_user_ns and
> > you are admin capable).
> > 
> 
> I didn't follow, but it sounds like your proposed solutions is only
> good for single level of userns nesting. Do you think you can
> redefine it in terms of "container root projid".

I don't quite understand what you're getting at.  user_ns mappings
nest, but what we see depends on where you're trying to look at it.
Let's take the kernel's view as the primary one.  That's the kuid_t.
The user has a different view, the uid_t, and now we have the
filesystem view (no actual type for this).  The user view is produced
from the kernel view by chaining up all the maps from the
current_user_ns and the filesystem view is produced by doing the same
thing for the s_user_ns.  So however many levels of user namespace
nesting we have operating, we only have three views of what an id is:
the user view, the kernel view and the filesystem view.  All nesting
does is change how those views are mapped but it doesn't alter the
number of views.

What the original shiftfs patches (not the ones that use s_user_ns) did
was to introduce effectively an inode view and map between the kernel
and the inode view using the shift mapping parameters; then the inode
view would get mapped through the s_user_ns to become the filesystem
view.  In the s_user_ns version of shiftfs (the current patches),
there's still an inode view, but we know that what we want to write to
disk is the user view, so effectively the user view and the inode view
become the same if the filesystem is marked; otherwise the inode view
and the kernel view are the same.  That's why I only need a
single bit to tell me if I'm mapping or not and there are two separate
regimes to check the permissions in: the user == inode view and the
kernel == inode view.
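
To make the two regimes concrete, here is a minimal userspace sketch in C.
It is not code from the shiftfs patch: the struct names, the enum, and the
single-extent "shift" model of a user namespace are all assumptions made
for illustration.  It only encodes the rule described in this thread: an
unmarked inode is checked in the kernel view; a marked inode is checked in
the user view from inside a non-initial user namespace, and from the
initial namespace it is reachable only with admin capability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; these names do not come from the patch. */
struct model_ns {
	bool is_init;     /* true for the initial user namespace */
	uint32_t shift;   /* kernel id = user id + shift */
	uint32_t count;   /* number of ids covered by the single extent */
};

struct model_inode {
	bool marked;      /* the single "remap or not" bit */
	uint32_t uid;     /* the inode view of the owner */
};

enum regime { DENY, USER_EQ_INODE, KERNEL_EQ_INODE, ADMIN_OVERRIDE };

/* Decide which view the inode's ids are compared in. */
static enum regime pick_regime(const struct model_inode *inode,
			       const struct model_ns *ns, bool admin)
{
	assert(inode && ns);
	if (!inode->marked)
		return KERNEL_EQ_INODE;
	if (!ns->is_init)
		return USER_EQ_INODE;
	return admin ? ADMIN_OVERRIDE : DENY;
}

/* Ownership test for a caller identified by its kernel-view id. */
static bool is_owner(const struct model_inode *inode,
		     const struct model_ns *ns, uint32_t kuid, bool admin)
{
	switch (pick_regime(inode, ns, admin)) {
	case USER_EQ_INODE:
		/* Translate the caller into the user view, then compare. */
		if (kuid < ns->shift || kuid - ns->shift >= ns->count)
			return false;	/* caller has no id in this namespace */
		return kuid - ns->shift == inode->uid;
	case KERNEL_EQ_INODE:
		return kuid == inode->uid;
	case ADMIN_OVERRIDE:
		return true;
	default:
		return false;
	}
}
```

With a namespace that shifts by 100000 over 65536 ids, a marked inode
owned by 0 is owned by the caller whose kernel id is 100000, while the
same inode is simply denied to a non-admin caller in the initial
namespace.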

James

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-08  1:54               ` Josh Triplett
@ 2017-02-08 15:22                 ` James Bottomley
  2017-02-09 10:36                   ` Josh Triplett
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-08 15:22 UTC (permalink / raw)
  To: Josh Triplett, Christoph Hellwig
  Cc: Amir Goldstein, Djalal Harouni, Chris Mason, Theodore Tso,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, LSM List, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
> On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig wrote:
> > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> > > >   Another option would be to require something like a project
> > > > as used 
> > > > for project quotas as the root.  This would also be conveniant
> > > > as it 
> > > > could storge the used remapping tables.
> > > 
> > > So this would be like the current project quota except set on a
> > > subtree?  I could see it being done that way but I don't see what
> > > advantage it has over using flags in the subtree itself (the
> > > mapping is
> > > known based on the mount namespace, so there's really only a
> > > single bit
> > > of information to store).
> > 
> > projects (which are the underling concept for project quotas) are
> > per-subtree in practice - the flag is set on an inode and then
> > all directories and files underneath inherit the project ID,
> > hardlinking outside a project is prohinited.
> 
> I'm interested in having a VFS-level way to do more than just a 
> shift; I'd like to be able to arbitrarily remap IDs between what's on 
> disk and the system IDs.

OK, so the shift is effectively an arbitrary remap because it allows
multiple ranges to be mapped (although the userns currently imposes a
maximum of five extents, that limit is somewhat arbitrary and exists only
to limit the amount of space the parametrisation takes).  See
kernel/user_namespace.c:map_id_up/down()
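
For readers without the source to hand, the following is a simplified
userspace model of that extent lookup, written in C.  It mirrors the
first/lower_first/count extent layout, but the model_ names, the flat
array, and the sentinel value are assumptions of this sketch rather than
the kernel's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_EXTENTS 5			/* the limit discussed above */
#define MODEL_UNMAPPED ((uint32_t)-1)		/* sentinel: id has no mapping */

struct model_extent {
	uint32_t first;		/* first id as seen inside the namespace */
	uint32_t lower_first;	/* first id as seen by the kernel */
	uint32_t count;		/* length of the range */
};

struct model_map {
	uint32_t nr_extents;
	struct model_extent extent[MODEL_MAX_EXTENTS];
};

/* Namespace view -> kernel view. */
static uint32_t model_map_id_down(const struct model_map *map, uint32_t id)
{
	assert(map->nr_extents <= MODEL_MAX_EXTENTS);
	for (uint32_t i = 0; i < map->nr_extents; i++) {
		const struct model_extent *e = &map->extent[i];

		if (id >= e->first && id - e->first < e->count)
			return id - e->first + e->lower_first;
	}
	return MODEL_UNMAPPED;
}

/* Kernel view -> namespace view (the inverse lookup). */
static uint32_t model_map_id_up(const struct model_map *map, uint32_t id)
{
	assert(map->nr_extents <= MODEL_MAX_EXTENTS);
	for (uint32_t i = 0; i < map->nr_extents; i++) {
		const struct model_extent *e = &map->extent[i];

		if (id >= e->lower_first && id - e->lower_first < e->count)
			return id - e->lower_first + e->first;
	}
	return MODEL_UNMAPPED;
}
```

Because each extent is an independent (first, lower_first, count) triple,
several extents together describe a piecewise remap rather than one fixed
offset, which is the sense in which the shift is "effectively arbitrary".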

>   If we're talking about developing a VFS-level solution for this, 
> I'd like to avoid limiting it to just a shift.  (A shift/range
> would definitely be the simplest solution for many common container
> cases, but not all.)

I assume the above satisfies you on this point, but raises the
question: do you want an arbitrary shift not parametrised by a user
namespace?  If so how many such shifts do you want ... giving some
details of the use case would be helpful.

James

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-08 15:22                 ` James Bottomley
@ 2017-02-09 10:36                   ` Josh Triplett
  2017-02-09 15:34                     ` James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: Josh Triplett @ 2017-02-09 10:36 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, Amir Goldstein, Djalal Harouni, Chris Mason,
	Theodore Tso, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Wed, Feb 08, 2017 at 07:22:45AM -0800, James Bottomley wrote:
> On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
> > On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig wrote:
> > > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> > > > >   Another option would be to require something like a project
> > > > > as used 
> > > > > for project quotas as the root.  This would also be conveniant
> > > > > as it 
> > > > > could storge the used remapping tables.
> > > > 
> > > > So this would be like the current project quota except set on a
> > > > subtree?  I could see it being done that way but I don't see what
> > > > advantage it has over using flags in the subtree itself (the
> > > > mapping is
> > > > known based on the mount namespace, so there's really only a
> > > > single bit
> > > > of information to store).
> > > 
> > > projects (which are the underling concept for project quotas) are
> > > per-subtree in practice - the flag is set on an inode and then
> > > all directories and files underneath inherit the project ID,
> > > hardlinking outside a project is prohinited.
> > 
> > I'm interested in having a VFS-level way to do more than just a 
> > shift; I'd like to be able to arbitrarily remap IDs between what's on 
> > disk and the system IDs.
> 
> OK, so the shift is effectively an arbitrary remap because it allows
> multiple ranges to be mapped (althought the userns currently imposes a
> maximum number of five extents but that limit is a bit arbitrary just
> to try to limit the amount of space the parametrisation takes).  See
> kernel/user_namespace.c:map_id_up/down()
> 
> >   If we're talking about developing a VFS-level solution for this, 
> > I'd like to avoid limiting it to just a shift.  (A shift/range
> > would definitely be the simplest solution for many common container
> > cases, but not all.)
> 
> I assume the above satisfies you on this point, but raises the
> question: do you want an arbitrary shift not parametrised by a user
> namespace?  If so how many such shifts do you want ... giving some
> details of the use case would be helpful.

The limit of five extents means this may not work in the most general
case, no.

One use case: given an on-disk filesystem, its name-to-number mapping,
and your host name-to-number mapping, mount the filesystem with all the
UIDs bidirectionally mapped to those on your host system.

Another use case: given an on-disk filesystem with potentially arbitrary
UIDs (not necessarily in a clean contiguous block), and a pile of
unprivileged UIDs, mount the filesystem such that every on-disk UID gets
a unique unprivileged UID.
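
To illustrate why the extent limit matters for this case, here is a small
C sketch that packs a sorted list of on-disk UIDs into a contiguous pool
of unprivileged UIDs and reports how many extents the mapping needs.  The
function and struct names are hypothetical, and assigning pool ids in
sorted order is just one possible policy, not something proposed in this
thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pool_extent {
	uint32_t first;		/* first on-disk uid of the run */
	uint32_t lower_first;	/* first pool uid assigned to the run */
	uint32_t count;		/* length of the run */
};

/*
 * disk_uids must be sorted ascending with no duplicates.  The i-th uid
 * is assigned pool_base + i, so runs of consecutive on-disk uids merge
 * into one extent.  Returns the number of extents required; only the
 * first out_cap of them are written to out.
 */
static size_t build_pool_extents(const uint32_t *disk_uids, size_t n,
				 uint32_t pool_base,
				 struct pool_extent *out, size_t out_cap)
{
	size_t needed = 0;

	for (size_t i = 0; i < n; i++) {
		assert(i == 0 || disk_uids[i] > disk_uids[i - 1]);
		bool extends = i > 0 && disk_uids[i] == disk_uids[i - 1] + 1;

		if (extends) {
			/* Grow the current extent if it was stored. */
			if (needed <= out_cap)
				out[needed - 1].count++;
		} else {
			if (needed < out_cap)
				out[needed] = (struct pool_extent){
					disk_uids[i],
					pool_base + (uint32_t)i, 1 };
			needed++;
		}
	}
	return needed;
}
```

On-disk UIDs 0, 1, 2, 33, 1000, 1001 and 65534 collapse into four
extents, whereas six isolated UIDs such as 0, 2, 4, 6, 8 and 10 already
need six, one more than a five-extent map can hold.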

(I have some additional use cases, but they would require the ability to
extend the mapping on the fly without remounting.)

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-09 10:36                   ` Josh Triplett
@ 2017-02-09 15:34                     ` James Bottomley
  2017-02-13 10:15                       ` Eric W. Biederman
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-09 15:34 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Christoph Hellwig, Amir Goldstein, Djalal Harouni, Chris Mason,
	Theodore Tso, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Thu, 2017-02-09 at 02:36 -0800, Josh Triplett wrote:
> On Wed, Feb 08, 2017 at 07:22:45AM -0800, James Bottomley wrote:
> > On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
> > > On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig
> > > wrote:
> > > > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley
> > > > wrote:
> > > > > >   Another option would be to require something like a 
> > > > > > project as used for project quotas as the root.  This would 
> > > > > > also be conveniant as it could storge the used remapping
> > > > > > tables.
> > > > > 
> > > > > So this would be like the current project quota except set on 
> > > > > a subtree?  I could see it being done that way but I don't 
> > > > > see what advantage it has over using flags in the subtree 
> > > > > itself (the mapping is known based on the mount namespace, so 
> > > > > there's really only a single bit of information to store).
> > > > 
> > > > projects (which are the underling concept for project quotas) 
> > > > are per-subtree in practice - the flag is set on an inode and 
> > > > then all directories and files underneath inherit the project 
> > > > ID, hardlinking outside a project is prohinited.
> > > 
> > > I'm interested in having a VFS-level way to do more than just a 
> > > shift; I'd like to be able to arbitrarily remap IDs between 
> > > what's on disk and the system IDs.
> > 
> > OK, so the shift is effectively an arbitrary remap because it 
> > allows multiple ranges to be mapped (althought the userns currently
> > imposes a maximum number of five extents but that limit is a bit 
> > arbitrary just to try to limit the amount of space the 
> > parametrisation takes).  See
> > kernel/user_namespace.c:map_id_up/down()
> > 
> > >   If we're talking about developing a VFS-level solution for 
> > > this, I'd like to avoid limiting it to just a shift.  (A 
> > > shift/range would definitely be the simplest solution for many 
> > > common container cases, but not all.)
> > 
> > I assume the above satisfies you on this point, but raises the
> > question: do you want an arbitrary shift not parametrised by a user
> > namespace?  If so how many such shifts do you want ... giving some
> > details of the use case would be helpful.
> 
> The limit of five extents means this may not work in the most general
> case, no.

That's not an API limit, so it can be changed if there's a need.  The
problem was merely how to parametrise a mapping without taking too much
space.

> One use case: given an on-disk filesystem, its name-to-number 
> mapping, and your host name-to-number mapping, mount the filesystem 
> with all the UIDs bidirectionally mapped to those on your host
> system.

This is pretty much what the s_user_ns does.

> Another use case: given an on-disk filesystem with potentially 
> arbitrary UIDs (not necessarily in a clean contiguous block), and a 
> pile of unprivileged UIDs, mount the filesystem such that every on
> -disk UID gets a unique unprivileged UID.

So is this.  Basically anything that begins by mounting gets a super
block and can use the s_user_ns to map from the filesystem view to the
kernel view of ids.  Apart from greater sophistication in the
parametrisation, it sounds like we have all the machinery you need.
I'm sure the containers people will consider reasonable patches to
change this.

James

> (I have some additional use cases, but they would require the ability 
> to extend the mapping on the fly without remounting.)
> 

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-09 15:34                     ` James Bottomley
@ 2017-02-13 10:15                       ` Eric W. Biederman
  2017-02-15  9:33                         ` Djalal Harouni
  0 siblings, 1 reply; 82+ messages in thread
From: Eric W. Biederman @ 2017-02-13 10:15 UTC (permalink / raw)
  To: James Bottomley
  Cc: Josh Triplett, Christoph Hellwig, Amir Goldstein, Djalal Harouni,
	Chris Mason, Theodore Tso, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> On Thu, 2017-02-09 at 02:36 -0800, Josh Triplett wrote:
>> On Wed, Feb 08, 2017 at 07:22:45AM -0800, James Bottomley wrote:
>> > On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
>> > > On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig
>> > > wrote:
>> > > > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley
>> > > > wrote:
>> > > > > >   Another option would be to require something like a 
>> > > > > > project as used for project quotas as the root.  This would 
>> > > > > > also be conveniant as it could storge the used remapping
>> > > > > > tables.
>> > > > > 
>> > > > > So this would be like the current project quota except set on 
>> > > > > a subtree?  I could see it being done that way but I don't 
>> > > > > see what advantage it has over using flags in the subtree 
>> > > > > itself (the mapping is known based on the mount namespace, so 
>> > > > > there's really only a single bit of information to store).
>> > > > 
>> > > > projects (which are the underling concept for project quotas) 
>> > > > are per-subtree in practice - the flag is set on an inode and 
>> > > > then all directories and files underneath inherit the project 
>> > > > ID, hardlinking outside a project is prohinited.
>> > > 
>> > > I'm interested in having a VFS-level way to do more than just a 
>> > > shift; I'd like to be able to arbitrarily remap IDs between 
>> > > what's on disk and the system IDs.
>> > 
>> > OK, so the shift is effectively an arbitrary remap because it 
>> > allows multiple ranges to be mapped (althought the userns currently
>> > imposes a maximum number of five extents but that limit is a bit 
>> > arbitrary just to try to limit the amount of space the 
>> > parametrisation takes).  See
>> > kernel/user_namespace.c:map_id_up/down()
>> > 
>> > >   If we're talking about developing a VFS-level solution for 
>> > > this, I'd like to avoid limiting it to just a shift.  (A 
>> > > shift/range would definitely be the simplest solution for many 
>> > > common container cases, but not all.)
>> > 
>> > I assume the above satisfies you on this point, but raises the
>> > question: do you want an arbitrary shift not parametrised by a user
>> > namespace?  If so how many such shifts do you want ... giving some
>> > details of the use case would be helpful.
>> 
>> The limit of five extents means this may not work in the most general
>> case, no.
>
> That's not an API limit, so it can be changed if there's a need.  The
> problem was merely how to parametrise a mapping without taking too much
> space.
>
>> One use case: given an on-disk filesystem, its name-to-number 
>> mapping, and your host name-to-number mapping, mount the filesystem 
>> with all the UIDs bidirectionally mapped to those on your host
>> system.
>
> This is pretty much what the s_user_ns does.
>
>> Another use case: given an on-disk filesystem with potentially 
>> arbitrary UIDs (not necessarily in a clean contiguous block), and a 
>> pile of unprivileged UIDs, mount the filesystem such that every on
>> -disk UID gets a unique unprivileged UID.
>
> So is this.  Basically anything that begins by mounting gets a super
> block and can use the s_user_ns to map from the filesystem view to the
> kernel view of ids.  Apart from greater sophistication in the
> parametrisation, it sounds like we have all the machinery you need. 
>  I'm sure the containers people will consider reasonable patches to
> change this.

Yes.

And to be clear we have all of that merged now and mostly present and
hooked up in all filesystems without any shiftfs like changes needed.

To use this with a given filesystem, a final pass is needed to verify that
the cases where an id does not map are handled cleanly.

Eric

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-06  1:18     ` James Bottomley
  2017-02-06  6:59       ` Amir Goldstein
@ 2017-02-14 23:03       ` Vivek Goyal
  2017-02-14 23:45         ` James Bottomley
  1 sibling, 1 reply; 82+ messages in thread
From: Vivek Goyal @ 2017-02-14 23:03 UTC (permalink / raw)
  To: James Bottomley
  Cc: Amir Goldstein, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Sun, Feb 05, 2017 at 05:18:11PM -0800, James Bottomley wrote:

[..]
> >  shiftfs is going to miss out on overlayfs bug fixes related to user 
> > credentials differ from mounter credentials, like fd3220d ("ovl: 
> > update S_ISGID when setting posix ACLs"). I am not sure that this 
> > specific case is relevant to shiftfs, but there could be other.
> 
> OK, so shiftfs doesn't have this bug and the reason why is
> illustrative: basically shiftfs does three things
> 
>    1. lookups via a uid/gid shifted dentry cache
>    2. shifted credential inode operations permission checks on the
>       underlying filesystem
>    3. location marking for unprivileged mount
> 
> I think we've already seen that 1. isn't from overlayfs but the
> functionality could be added to overlayfs, I suppose.  The big problem
> is 2.  The overlayfs code emulates the permission checks, which makes
> it rather complex (this is where you get your bugs like the above
> from).  I did actually look at adding 2. to overlayfs on the theory
> that a single layer overlay might be closest to what this is, but
> eventually concluded I'd have to take the special cases and add a whole
> lot more to them ... it really would increase the maintenance burden
> substantially and make the code an unreadable rats nest.

Hi James,

If we merge this functionality in overlayfs, then we could avoid extra
copy of dentry/inode and that might be a significant advantage.

W.r.t. permission checks, I am wondering whether it would make sense for
shiftfs to do what overlayfs does. That is, permission is checked on
two inodes: we use the creds of the task for checking permission on the
shiftfs/overlay inode and the mounter's creds on the real inode.

Given we have already shifted the uid/gid for the shiftfs inode, I am
wondering why we can't simply call generic_permission(shiftfs_inode,
mask) directly in the context of the caller. Something like:

shiftfs_permission() {
  err = generic_permission(inode, mask);
  if (err)
    return err;

  switch_to_mounter_creds;
  err = inode_permission(reali, mask);
  revert_creds();

  return err;
}

Vivek

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-14 23:03       ` Vivek Goyal
@ 2017-02-14 23:45         ` James Bottomley
  2017-02-15 14:17           ` Vivek Goyal
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-14 23:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Amir Goldstein, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Tue, 2017-02-14 at 18:03 -0500, Vivek Goyal wrote:
> On Sun, Feb 05, 2017 at 05:18:11PM -0800, James Bottomley wrote:
> 
> [..]
> > >  shiftfs is going to miss out on overlayfs bug fixes related to
> > > user 
> > > credentials differ from mounter credentials, like fd3220d ("ovl: 
> > > update S_ISGID when setting posix ACLs"). I am not sure that this
> > > specific case is relevant to shiftfs, but there could be other.
> > 
> > OK, so shiftfs doesn't have this bug and the reason why is
> > illustrative: basically shiftfs does three things
> > 
> >    1. lookups via a uid/gid shifted dentry cache
> >    2. shifted credential inode operations permission checks on the
> >       underlying filesystem
> >    3. location marking for unprivileged mount
> > 
> > I think we've already seen that 1. isn't from overlayfs but the
> > functionality could be added to overlayfs, I suppose.  The big 
> > problem is 2.  The overlayfs code emulates the permission checks, 
> > which makes it rather complex (this is where you get your bugs like 
> > the above from).  I did actually look at adding 2. to overlayfs on 
> > the theory that a single layer overlay might be closest to what 
> > this is, but eventually concluded I'd have to take the special 
> > cases and add a whole lot more to them ... it really would increase 
> > the maintenance burden substantially and make the code an
> > unreadable rats nest.
> 
> Hi James,
> 
> If we merge this functionality in overlayfs, then we could avoid 
> extra copy of dentry/inode and that might be a significant advantage.

I made that argument to Viro originally when I tried to do all lookups
via the underlying cache.  In the end, it's 192 bytes per dentry and
584 per inode, all of which are reclaimable, so it's not much of an
advantage and it is a great simplification to the code.  In general if
you have a natural separation, you should make the layers reflect it.

My container use case doesn't use overlayfs currently, so to me it
wouldn't provide any advantage whatsoever.

> W.r.t permission checks, I am wondering will it make sense to do what
> overlayfs is doing for shiftfs. That is permission is checked on
> two inodes. We use creds of task for checking permission on
> shiftfs/overlay inode and mounter's creds on real inode.

The mounter's creds for overlay are usually admin ones, so its local
permission check asks "should I?" and the later one asks "can I?" (as in
would my original admin creds allow this).  In many ways, overlayfs is
ignoring the fact that the underlying ->permission() call might have
failed for some good reason on the current creds.  I don't think any
serious trouble results from this but it strikes me as icky.

> Given we have already shifted the uid/gid for shiftfs inode, I am 
> wondering that why can't we simply call generic_permission(shiftfs_in
> ode, mask) directly in the context of caller. Something like..
> 
> shiftfs_permission() {
>   err = generic_permission(inode, mask);
>   if (err)
>   	return err;
> 
>   switch_to_mounter_creds;
>   err = inode_permission(reali, mask);
>   revert_creds();
> 
>   return err;
> }

Because if reali->i_op->permission exists, you should use it.  You
could argue shiftfs_permission should be

	if (iop->permission) {
		oldcred = shiftfs_new_creds(&newcred, inode->i_sb);
		err = iop->permission(reali, mask);
		shiftfs_old_creds(oldcred, &newcred);
	} else
		err = generic_permission(inode, mask);

But really that's a small optimisation.

James

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-13 10:15                       ` Eric W. Biederman
@ 2017-02-15  9:33                         ` Djalal Harouni
  2017-02-15  9:37                           ` Eric W. Biederman
  0 siblings, 1 reply; 82+ messages in thread
From: Djalal Harouni @ 2017-02-15  9:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: James Bottomley, Josh Triplett, Christoph Hellwig,
	Amir Goldstein, Chris Mason, Theodore Tso, Andy Lutomirski,
	Seth Forshee, linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Mon, Feb 13, 2017 at 11:15 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
>
>> On Thu, 2017-02-09 at 02:36 -0800, Josh Triplett wrote:
>>> On Wed, Feb 08, 2017 at 07:22:45AM -0800, James Bottomley wrote:
>>> > On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
>>> > > On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig
>>> > > wrote:
>>> > > > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley
>>> > > > wrote:
>>> > > > > >   Another option would be to require something like a
>>> > > > > > project as used for project quotas as the root.  This would
>>> > > > > > also be convenient as it could store the used remapping
>>> > > > > > tables.
>>> > > > >
>>> > > > > So this would be like the current project quota except set on
>>> > > > > a subtree?  I could see it being done that way but I don't
>>> > > > > see what advantage it has over using flags in the subtree
>>> > > > > itself (the mapping is known based on the mount namespace, so
>>> > > > > there's really only a single bit of information to store).
>>> > > >
>>> > > > projects (which are the underlying concept for project quotas)
>>> > > > are per-subtree in practice - the flag is set on an inode and
>>> > > > then all directories and files underneath inherit the project
>>> > > > ID, hardlinking outside a project is prohibited.
>>> > >
>>> > > I'm interested in having a VFS-level way to do more than just a
>>> > > shift; I'd like to be able to arbitrarily remap IDs between
>>> > > what's on disk and the system IDs.
>>> >
>>> > OK, so the shift is effectively an arbitrary remap because it
>>> > allows multiple ranges to be mapped (although the userns currently
>>> > imposes a maximum number of five extents but that limit is a bit
>>> > arbitrary just to try to limit the amount of space the
>>> > parametrisation takes).  See
>>> > kernel/user_namespace.c:map_id_up/down()
>>> >
>>> > >   If we're talking about developing a VFS-level solution for
>>> > > this, I'd like to avoid limiting it to just a shift.  (A
>>> > > shift/range would definitely be the simplest solution for many
>>> > > common container cases, but not all.)
>>> >
>>> > I assume the above satisfies you on this point, but raises the
>>> > question: do you want an arbitrary shift not parametrised by a user
>>> > namespace?  If so how many such shifts do you want ... giving some
>>> > details of the use case would be helpful.
>>>
>>> The limit of five extents means this may not work in the most general
>>> case, no.
>>
>> That's not an API limit, so it can be changed if there's a need.  The
>> problem was merely how to parametrise a mapping without taking too much
>> space.
>>
>>> One use case: given an on-disk filesystem, its name-to-number
>>> mapping, and your host name-to-number mapping, mount the filesystem
>>> with all the UIDs bidirectionally mapped to those on your host
>>> system.
>>
>> This is pretty much what the s_user_ns does.
>>
>>> Another use case: given an on-disk filesystem with potentially
>>> arbitrary UIDs (not necessarily in a clean contiguous block), and a
>>> pile of unprivileged UIDs, mount the filesystem such that every
>>> on-disk UID gets a unique unprivileged UID.
>>
>> So is this.  Basically anything that begins by mounting gets a super
>> block and can use the s_user_ns to map from the filesystem view to the
>> kernel view of ids.  Apart from greater sophistication in the
>> parametrisation, it sounds like we have all the machinery you need.
>>  I'm sure the containers people will consider reasonable patches to
>> change this.
>
> Yes.
>
> And to be clear we have all of that merged now and mostly present and
> hooked up in all filesystems without any shiftfs like changes needed.
>
> To use this with a filesystem a last pass needs to be had to verify that
> the cases where something does not map are handled cleanly.

Still this does not answer the question how to dynamically
*attach/share* data or read-only volumes as defined by
orchestration/container tools into several containers. Am I missing
something or is the plan to have per superblock mount for each one ?

-- 
tixxdz

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-15  9:33                         ` Djalal Harouni
@ 2017-02-15  9:37                           ` Eric W. Biederman
  2017-02-15 10:04                             ` Djalal Harouni
  0 siblings, 1 reply; 82+ messages in thread
From: Eric W. Biederman @ 2017-02-15  9:37 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: James Bottomley, Josh Triplett, Christoph Hellwig,
	Amir Goldstein, Chris Mason, Theodore Tso, Andy Lutomirski,
	Seth Forshee, linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

Djalal Harouni <tixxdz@gmail.com> writes:

> On Mon, Feb 13, 2017 at 11:15 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> James Bottomley <James.Bottomley@HansenPartnership.com> writes:

>>> So is this.  Basically anything that begins by mounting gets a super
>>> block and can use the s_user_ns to map from the filesystem view to the
>>> kernel view of ids.  Apart from greater sophistication in the
>>> parametrisation, it sounds like we have all the machinery you need.
>>>  I'm sure the containers people will consider reasonable patches to
>>> change this.
>>
>> Yes.
>>
>> And to be clear we have all of that merged now and mostly present and
>> hooked up in all filesystems without any shiftfs like changes needed.
>>
>> To use this with a filesystem a last pass needs to be had to verify that
>> the cases where something does not map are handled cleanly.
>
> Still this does not answer the question how to dynamically
> *attach/share* data or read-only volumes as defined by
> orchestration/container tools into several containers. Am I missing
> something or is the plan to have per superblock mount for each one ?

Agreed.  That is a related problem and the problem that shiftfs
is working to solve.

If you only need a single mapping the infrastructure is basically done
in the kernel today.  If you need multiple mappings we need something
more.

Eric

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-15  9:37                           ` Eric W. Biederman
@ 2017-02-15 10:04                             ` Djalal Harouni
  0 siblings, 0 replies; 82+ messages in thread
From: Djalal Harouni @ 2017-02-15 10:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: James Bottomley, Josh Triplett, Christoph Hellwig,
	Amir Goldstein, Chris Mason, Theodore Tso, Andy Lutomirski,
	Seth Forshee, linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Wed, Feb 15, 2017 at 10:37 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Djalal Harouni <tixxdz@gmail.com> writes:
>
>> On Mon, Feb 13, 2017 at 11:15 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
>
>>>> So is this.  Basically anything that begins by mounting gets a super
>>>> block and can use the s_user_ns to map from the filesystem view to the
>>>> kernel view of ids.  Apart from greater sophistication in the
>>>> parametrisation, it sounds like we have all the machinery you need.
>>>>  I'm sure the containers people will consider reasonable patches to
>>>> change this.
>>>
>>> Yes.
>>>
>>> And to be clear we have all of that merged now and mostly present and
>>> hooked up in all filesystems without any shiftfs like changes needed.
>>>
>>> To use this with a filesystem a last pass needs to be had to verify that
>>> the cases where something does not map are handled cleanly.
>>
>> Still this does not answer the question how to dynamically
>> *attach/share* data or read-only volumes as defined by
>> orchestration/container tools into several containers. Am I missing
>> something or is the plan to have per superblock mount for each one ?
>
> Agreed.  That is a related problem and the problem that shiftfs
> is working to solve.
>
> If you only need a single mapping the infrastructure is basically done
> in the kernel today.  If you need multiple mappings we need something
> more.

Yes, I'm asking since there is that vfs+userns proposed approach that
I linked in this thread, that deals with this particular problem: in
which mount namespace<->container the volume appears, maybe that can
be used on top of the s_user_ns ...

Thanks!


-- 
tixxdz

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-14 23:45         ` James Bottomley
@ 2017-02-15 14:17           ` Vivek Goyal
  2017-02-16 15:51             ` James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: Vivek Goyal @ 2017-02-15 14:17 UTC (permalink / raw)
  To: James Bottomley
  Cc: Amir Goldstein, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Tue, Feb 14, 2017 at 03:45:55PM -0800, James Bottomley wrote:
> On Tue, 2017-02-14 at 18:03 -0500, Vivek Goyal wrote:
> > On Sun, Feb 05, 2017 at 05:18:11PM -0800, James Bottomley wrote:
> > 
> > [..]
> > > >  shiftfs is going to miss out on overlayfs bug fixes related to
> > > > user 
> > > > credentials differ from mounter credentials, like fd3220d ("ovl: 
> > > > update S_ISGID when setting posix ACLs"). I am not sure that this
> > > > specific case is relevant to shiftfs, but there could be other.
> > > 
> > > OK, so shiftfs doesn't have this bug and the reason why is
> > > illustrative: basically shiftfs does three things
> > > 
> > >    1. lookups via a uid/gid shifted dentry cache
> > >    2. shifted credential inode operations permission checks on the
> > >       underlying filesystem
> > >    3. location marking for unprivileged mount
> > > 
> > > I think we've already seen that 1. isn't from overlayfs but the
> > > functionality could be added to overlayfs, I suppose.  The big 
> > > problem is 2.  The overlayfs code emulates the permission checks, 
> > > which makes it rather complex (this is where you get your bugs like 
> > > the above from).  I did actually look at adding 2. to overlayfs on 
> > > the theory that a single layer overlay might be closest to what 
> > > this is, but eventually concluded I'd have to take the special 
> > > cases and add a whole lot more to them ... it really would increase 
> > > the maintenance burden substantially and make the code an
> > > unreadable rats nest.
> > 
> > Hi James,
> > 
> > If we merge this functionality in overlayfs, then we could avoid 
> > extra copy of dentry/inode and that might be a significant advantage.
> 
> I made that argument to Viro originally when I tried to do all lookups
> via the underlying cache.  In the end, it's 192 bytes per dentry and
> 584 per inode, all of which are reclaimable, so it's not much of an
> advantage and it is a great simplification to the code.  In general if
> you have a natural separation, you should make the layers reflect it.

ok.

> 
> My container use case doesn't use overlayfs currently, so to me it
> wouldn't provide any advantage whatsoever.

In docker and other use cases, this probably will be used in conjunction
with overlayfs as containers would like to write data and that should not
go back to image dir and should be sent to container specific dir.

> 
> > W.r.t permission checks, I am wondering will it make sense to do what
> > overlayfs is doing for shiftfs. That is permission is checked on
> > two inodes. We use creds of task for checking permission on
> > shiftfs/overlay inode and mounter's creds on real inode.
> 
> The mounter's creds for overlay are usually admin ones, so it's local
> permission check asks should I? and the later one asks can I? (as in
> would my original admin creds allow this).  In many ways, overlayfs is
> ignoring the fact that the underlying ->permissions() call might have
> failed for some good reason on the current creds.  I don't think any
> serious trouble results from this but it strikes me as icky.

So we do call ->permission() of underlying inode but with the creds of
mounter (as you noted). Given we don't call reali->permission() with
the creds of task, it resulted in issues with disk quota. mounter
had CAP_SYS_RESOURCE so disk quota was being ignored. But that's easily
fixable by taking away CAP_SYS_RESOURCE from mounter's creds if caller
does not have CAP_SYS_RESOURCE.

> 
> > Given we have already shifted the uid/gid for shiftfs inode, I am 
> > wondering that why can't we simply call
> > generic_permission(shiftfs_inode, mask) directly in the context of
> > caller. Something like..
> > 
> > shiftfs_permission() {
> >   err = generic_permission(inode, mask);
> >   if (err)
> >   	return err;
> > 
> >   switch_to_mounter_creds;
> >   err = inode_permission(reali, mask);
> >   revert_creds();
> > 
> >   return err;
> > }
> 
> Because if the reali->d_iop->permission exists, you should use it.  You
> could argue shiftfs_permission should be
> 
> 	if (iop->permission) {
> 		oldcred = shiftfs_new_creds(&newcred, inode->i_sb);
> 		err = iop->permission(reali, mask);
> 		shiftfs_old_creds(oldcred, &newcred);
> 	} else
> 		err = generic_permission(inode, mask);
> 
> But really that's a small optimisation.

ok. I thought using mounter's creds for real inode checks, will probably
do away with need of modifying caller's user namespace in
shiftfs_get_up_creds().

cred->fsuid = KUIDT_INIT(from_kuid(sb->s_user_ns, cred->fsuid));
cred->fsgid = KGIDT_INIT(from_kgid(sb->s_user_ns, cred->fsgid));
cred->user_ns = ssi->userns;

IIUC, we are shifting caller's fsuid and fsgid into caller's user
namespace but at the same time using the user_ns of reali->sb->sb_user_ns.
Feels a little twisted to me. Maybe I am misunderstanding it.

Two levels of checks will simplify this a bit. Top level inode will belong
to the user namespace of caller and checks should pass. And mounter's
creds will have ownership over the real inode so no additional namespace
shifting required there. We could also save these creds at mount time
once and don't have to prepare it for every call. (not sure if it has
significant performance issue or not).  Just thinking aloud...
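
To make the two-level idea concrete, here is a rough userspace model, not
kernel code. Every toy_* name is invented for illustration, and the mode
check is a much simplified stand-in for what generic_permission() does
(owner/other bits only, one override capability). It only sketches the
ordering: caller against the shifted inode first, then mounter against
the real inode.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MAY_WRITE 2
#define MAY_READ  4

/* Toy stand-ins for kernel objects; all names here are hypothetical. */
struct toy_cred  { unsigned uid; bool cap_dac_override; };
struct toy_inode { unsigned uid; unsigned mode; /* owner/other bits only */ };

/* Simplified owner/other mode check for one set of credentials. */
static int toy_permission(const struct toy_cred *c,
			  const struct toy_inode *i, int mask)
{
	unsigned bits = (c->uid == i->uid) ? (i->mode >> 6) & 7 : i->mode & 7;

	if (c->cap_dac_override || (bits & (unsigned)mask) == (unsigned)mask)
		return 0;
	return -EACCES;
}

/* Level 1: caller vs. shifted (upper) inode.
 * Level 2: mounter vs. real (lower) inode. */
int toy_two_level_permission(const struct toy_cred *caller,
			     const struct toy_cred *mounter,
			     const struct toy_inode *upper,
			     const struct toy_inode *real, int mask)
{
	int err = toy_permission(caller, upper, mask);

	if (err)
		return err;
	return toy_permission(mounter, real, mask);
}
```

In this model the second check passes when the mounter holds the override
capability, and fails when the mounter is an ordinary uid that does not own
the real inode.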

Vivek

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-04 19:19 ` [RFC 1/1] shiftfs: uid/gid shifting bind mount James Bottomley
                     ` (2 preceding siblings ...)
  2017-02-07  9:19   ` Christoph Hellwig
@ 2017-02-15 20:34   ` Vivek Goyal
  2017-02-16 15:56     ` James Bottomley
  2017-02-17  2:29   ` Al Viro
  4 siblings, 1 reply; 82+ messages in thread
From: Vivek Goyal @ 2017-02-15 20:34 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:

[..]
> +static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
> +				     unsigned int flags)
> +{
> +	struct dentry *real = dir->i_private, *new;
> +	struct inode *reali = real->d_inode, *newi;
> +	const struct cred *oldcred, *newcred;
> +
> +	inode_lock(reali);
> +	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
> +	new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
> +	shiftfs_old_creds(oldcred, &newcred);
> +	inode_unlock(reali);
> +
> +	if (IS_ERR(new))
> +		return new;
> +
> +	dentry->d_fsdata = new;
> +
> +	if (!new->d_inode)
> +		return NULL;
> +
> +	newi = shiftfs_new_inode(dentry->d_sb, new->d_inode->i_mode, new);
> +	if (!newi) {
> +		dput(new);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	d_splice_alias(newi, dentry);

Hi James,

Should it be "return d_splice_alias()" so that if we find an alias it is
returned back to caller and passed in dentry can be freed. Though I don't
know in what cases alias can be found. And if alias is found how do we
make sure alias_dentry->d_fsdata is pointing to new (real dentry).

> +
> +	return NULL;
> +}

Vivek

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-15 14:17           ` Vivek Goyal
@ 2017-02-16 15:51             ` James Bottomley
  2017-02-16 16:42               ` Vivek Goyal
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-16 15:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Amir Goldstein, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Wed, 2017-02-15 at 09:17 -0500, Vivek Goyal wrote:
> On Tue, Feb 14, 2017 at 03:45:55PM -0800, James Bottomley wrote:
> > On Tue, 2017-02-14 at 18:03 -0500, Vivek Goyal wrote:

[...]
> > > Given we have already shifted the uid/gid for shiftfs inode, I am
> > > wondering that why can't we simply call
> > > generic_permission(shiftfs_inode, mask) directly in the context
> > > of caller. Something like..
> > > 
> > > shiftfs_permission() {
> > >   err = generic_permission(inode, mask);
> > >   if (err)
> > >   	return err;
> > > 
> > >   switch_to_mounter_creds;
> > >   err = inode_permission(reali, mask);
> > >   revert_creds();
> > > 
> > >   return err;
> > > }
> > 
> > Because if the reali->d_iop->permission exists, you should use it. 
> >  You could argue shiftfs_permission should be
> > 
> > 	if (iop->permission) {
> > 		oldcred = shiftfs_new_creds(&newcred, inode->i_sb);
> > 		err = iop->permission(reali, mask);
> > 		shiftfs_old_creds(oldcred, &newcred);
> > 	} else
> > 		err = generic_permission(inode, mask);
> > 
> > But really that's a small optimisation.
> 
> ok. I thought using mounter's creds for real inode checks, will 
> probably do away with need of modifying caller's user namespace in
> shiftfs_get_up_creds().

Well, no ... the mounter of a marked superblock is container admin, but
the owner in the filesystem view is real root.  The unprivileged
mounter's credentials aren't sufficient, therefore. 

> cred->fsuid = KUIDT_INIT(from_kuid(sb->s_user_ns, cred->fsuid));
> cred->fsgid = KGIDT_INIT(from_kgid(sb->s_user_ns, cred->fsgid));
> cred->user_ns = ssi->userns;
> 
> IIUC, we are shifting caller's fsuid and fsgid into caller's user
> namespace but at the same time using the user_ns of
> reali->sb->sb_user_ns. Feels a little twisted to me. Maybe I am
> misunderstanding it.

Actually what we're doing is shifting the credentials into the
underlying mount's filesystem view.
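
As a worked illustration of that shift, here is a small userspace sketch
(not the patch itself). It assumes a namespace with a single mapping
extent, and toy_extent, toy_from_kuid and toy_shift_fsuid are made-up
names. The caller's kernel-view id is translated to its namespace-local
value, and that value is then reused unchanged as the id presented to the
underlying filesystem, which is what the quoted
KUIDT_INIT(from_kuid(...)) lines do.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_ID ((uint32_t)-1)

/* One mapping extent: namespace ids [first, first+count) correspond to
 * kernel-view ids [lower_first, lower_first+count). Hypothetical model. */
struct toy_extent { uint32_t first, lower_first, count; };

/* Kernel-view id -> id as seen inside the namespace (the from_kuid() step). */
uint32_t toy_from_kuid(const struct toy_extent *e, uint32_t kuid)
{
	if (kuid >= e->lower_first && kuid - e->lower_first < e->count)
		return e->first + (kuid - e->lower_first);
	return TOY_INVALID_ID;
}

/* Reuse the namespace-local number directly as the id handed to the
 * underlying filesystem (the KUIDT_INIT() step in the quoted lines). */
uint32_t toy_shift_fsuid(const struct toy_extent *e, uint32_t caller_kuid)
{
	return toy_from_kuid(e, caller_kuid);
}
```

With an extent of { 0, 100000, 65536 }, a caller whose kernel-view fsuid is
101000 is presented to the underlying mount as 1000, and an id outside the
extent has no valid shifted value.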

> Two levels of checks will simplify this a bit. Top level inode will 
> belong to the user namespace of caller and checks should pass. And 
> mounter's creds will have ownership over the real inode so no 
> additional namespace shifting required there.

That's the problem: for a marked mount, they don't.

>  We could also save these creds at mount time once and don't have to
> prepare it for every call. (not sure if it has significant
> performance issue or not).  Just thinking aloud...

If it proves to be an issue, I suppose the shift could be cached, but I
really don't think it matters that much.

James

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-15 20:34   ` Vivek Goyal
@ 2017-02-16 15:56     ` James Bottomley
  2017-02-17  2:55       ` Al Viro
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-16 15:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Wed, 2017-02-15 at 15:34 -0500, Vivek Goyal wrote:
> On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
> 
> [..]
> > +static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
> > +				     unsigned int flags)
> > +{
> > +	struct dentry *real = dir->i_private, *new;
> > +	struct inode *reali = real->d_inode, *newi;
> > +	const struct cred *oldcred, *newcred;
> > +
> > +	inode_lock(reali);
> > +	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
> > +	new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
> > +	shiftfs_old_creds(oldcred, &newcred);
> > +	inode_unlock(reali);
> > +
> > +	if (IS_ERR(new))
> > +		return new;
> > +
> > +	dentry->d_fsdata = new;
> > +
> > +	if (!new->d_inode)
> > +		return NULL;
> > +
> > +	newi = shiftfs_new_inode(dentry->d_sb, new->d_inode->i_mode, new);
> > +	if (!newi) {
> > +		dput(new);
> > +		return ERR_PTR(-ENOMEM);
> > +	}
> > +
> > +	d_splice_alias(newi, dentry);
> 
> Hi James,
> 
> Should it be "return d_splice_alias()" so that if we find an alias it 
> is returned back to caller and passed in dentry can be freed. Though 
> I don't know in what cases alias can be found. And if alias is found 
> how do we make sure alias_dentry->d_fsdata is pointing to new (real
> dentry).

It probably should be for the sake of the pattern.  In our case I don't
think we can have any root aliases because the root dentry is always
pinned in the cache, so cache lookup should always find it.

James

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-16 15:51             ` James Bottomley
@ 2017-02-16 16:42               ` Vivek Goyal
  2017-02-16 16:58                 ` James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: Vivek Goyal @ 2017-02-16 16:42 UTC (permalink / raw)
  To: James Bottomley
  Cc: Amir Goldstein, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:

[..]
> > Two levels of checks will simplify this a bit. Top level inode will 
> > belong to the user namespace of caller and checks should pass. And 
> > mounter's creds will have ownership over the real inode so no 
> > additional namespace shifting required there.
> 
> That's the problem: for a marked mount, they don't.

In this new model it does not fit directly. 

I was playing with a slightly different approach and modified patches so
that real root still does the mounting and takes an mount option which
specifies which user namespace we want to shift into. Thanks to Eric for
the idea.

mount -t shiftfs -o userns_fd=<fd> source shifted-fs

In this case real-root is mounter and notion of using mounter's creds on
real-inode works.

This requires a user namespace to be created before shiftfs can be mounted
and then container admin should be able to bind mount shifted-fs.

In this model, intervention of real-root is still required to setup
container and shiftfs. I guess that might not satisfy your needs where
unprivileged user should be able to launch container and be able to
make use of shiftfs, IIUC.

Vivek

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-16 16:42               ` Vivek Goyal
@ 2017-02-16 16:58                 ` James Bottomley
  2017-02-17  1:57                   ` Eric W. Biederman
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-16 16:58 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Amir Goldstein, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Thu, 2017-02-16 at 11:42 -0500, Vivek Goyal wrote:
> On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:
> 
> [..]
> > > Two levels of checks will simplify this a bit. Top level inode 
> > > will belong to the user namespace of caller and checks should 
> > > pass. And mounter's creds will have ownership over the real inode 
> > > so no additional namespace shifting required there.
> > 
> > That's the problem: for a marked mount, they don't.
> 
> In this new model it does not fit directly. 
> 
> I was playing with a slightly different approach and modified patches 
> so that real root still does the mounting and takes a mount option
> which specifies which user namespace we want to shift into. Thanks to 
> Eric for the idea.
> 
> mount -t shiftfs -o userns_fd=<fd> source shifted-fs

This is a non-starter because it doesn't work for the unprivileged use
case, which is what I'm really interested in.  For fully unprivileged
containers you don't have an orchestration system to ask to build the
container.  You can get init scripts to set stuff up for you, like the
marks, but ideally it should just work even without that (so an inode
flag following project semantics seems really appealing), but after
that the unprivileged user should be able to build their own
containers.

As you saw from the reply to Eric, this approach (which I have tried)
also opens up a whole can of worms for non-FS_USERNS_MOUNT filesystems.

James


> In this case real-root is mounter and notion of using mounter's creds 
> on real-inode works. 
> This requires a user namespace to be created before shiftfs can be
> mounted and then container admin should be able to bind mount
> shifted-fs.
> 
> In this model, intervention of real-root is still required to setup
> container and shiftfs. I guess that might not satisfy your needs 
> where unprivileged user should be able to launch container and be 
> able to make use of shiftfs, IIUC.
> 
> Vivek
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-16 16:58                 ` James Bottomley
@ 2017-02-17  1:57                   ` Eric W. Biederman
  2017-02-17  8:39                     ` Djalal Harouni
  2017-02-17 17:19                     ` James Bottomley
  0 siblings, 2 replies; 82+ messages in thread
From: Eric W. Biederman @ 2017-02-17  1:57 UTC (permalink / raw)
  To: James Bottomley
  Cc: Vivek Goyal, Amir Goldstein, Djalal Harouni, Chris Mason,
	Theodore Tso, Josh Triplett, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> On Thu, 2017-02-16 at 11:42 -0500, Vivek Goyal wrote:
>> On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:
>> 
>> [..]
>> > > Two levels of checks will simplify this a bit. Top level inode 
>> > > will belong to the user namespace of caller and checks should 
>> > > pass. And mounter's creds will have ownership over the real inode 
>> > > so no additional namespace shifting required there.
>> > 
>> > That's the problem: for a marked mount, they don't.
>> 
>> In this new model it does not fit directly. 
>> 
>> I was playing with a slightly different approach and modified patches 
>> so that real root still does the mounting and takes a mount option
>> which specifies which user namespace we want to shift into. Thanks to 
>> Eric for the idea.
>> 
>> mount -t shiftfs -o userns_fd=<fd> source shifted-fs

> This is a non-starter because it doesn't work for the unprivileged use
> case, which is what I'm really interested in.

But I believe it does.  It just requires a bit more work in the
shiftfs filesystem above.  It should be perfectly possible with the help
of newuidmap to create a user namespace with the desired mappings.

My understanding is that Vivek started with requiring root to mount the
filesystem only as a simplification during development, and that the
plan is to get the basic use case working and then allow unprivileged
mounting.

> For fully unprivileged
> containers you don't have an orchestration system to ask to build the
> container.  You can get init scripts to set stuff up for you, like the
> marks, but ideally it should just work even without that (so an inode
> flag following project semantics seems really appealing), but after
> that the unprivileged user should be able to build their own
> containers.
>
> As you saw from the reply to Eric, this approach (which I have tried)
> also opens up a whole can of worms for non-FS_USERNS_MOUNT filesystems.
>

From what I can see we have two cases we care about.
A) A non-default mapping from the filesystem to the rest of the system
   and roughly s_user_ns provides that but we need a review of the
   filesystems to make certain something has not been forgotten.

B) A filesystem image sitting around in a directory somewhere that
   we want to map differently into different user namespaces while
   using the same files as backing store.

For the second case what is interesting technically is that we want
multiple mappings.  A user namespace appears adequate to specify those
extra mappings (effectively from kuids to kuids).
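
To illustrate what "multiple mappings" over one backing store means, here
is a small userspace model, loosely patterned on the extent tables
discussed earlier in the thread (with the five-extent limit mentioned
there). It is a sketch only: toy_userns, toy_range and toy_map_down are
invented names, not kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_EXTENTS 5
#define TOY_NO_ID ((uint32_t)-1)

/* Namespace ids [first, first+count) map to kernel-view ids
 * [lower_first, lower_first+count). */
struct toy_range  { uint32_t first, lower_first, count; };
struct toy_userns { size_t nr; struct toy_range r[TOY_MAX_EXTENTS]; };

/* Namespace-local (or on-disk) id -> kernel-view id.
 * Returns TOY_NO_ID when no extent covers the id. */
uint32_t toy_map_down(const struct toy_userns *ns, uint32_t id)
{
	for (size_t i = 0; i < ns->nr; i++) {
		const struct toy_range *r = &ns->r[i];

		if (id >= r->first && id - r->first < r->count)
			return r->lower_first + (id - r->first);
	}
	return TOY_NO_ID;
}
```

Given one namespace with the extent { 0, 100000, 65536 } and another with
{ 0, 200000, 1000 } plus { 1000, 300000, 1000 }, the same on-disk owner 1000
shows up as kernel-view id 101000 through the first and 300000 through the
second, while the backing file itself is untouched.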

So we need something to associate the additional mapping with a
directory tree.  A stackable filesystem with its own s_user_ns field
appears a very straightforward way to do that.  Especially if it can
figure out how to assert that the underlying filesystem image is
read-only (doesn't overlayfs require that?).  Making the entire stack
read-only.

I don't see a problem with that for unprivileged use (except possibly
the read-only enforcement on the underlying directory tree).

What Vivek is talking about appears to be perfectly correct: performing
the underlying filesystem permission checks as the possibly unprivileged
user who mounted shiftfs, after performing a set of permission checks
(at the shiftfs level) as the user who is accessing the files.


. . .


I think I am missing something but I completely do not understand that
subthread that says use file marks and perform the work in the vfs.
The problem is that fundamentally we need multiple mappings and I don't
see a mark on a file (even an inherited mark) providing the mapping so I
don't see the point.

Which leaves two possible places to store the extra mapping.  In the
struct mount.  Or in a stacked filesystem super_block.  For a stacked
filesystem I can see where to store the extra translation.  In the upper
filesystem's upper inode.   And we can perform the practical permission
check on the upper inode as well.

For a vfs level solution it looks like we would have to change all of
the permission checking code in the kernel to have a special case for
this kind of mapping.  Which does not sound maintainable.

So at the moment I don't think a vfs level solution makes any sense.

And then if you have a stacked filesystem with FS_USERNS_MOUNT set so it
can be mounted by an unprivileged user, I think it makes sense to
check the mounter's creds against the real inode, to verify that the user
who mounted the filesystem has permission to perform the desired access.

Which only allows the mounter as much permission as the mounter
would have if they performed the work with fuse instead of a special
in-kernel filesystem.

In a DAC model of the world that makes lots of sense.  I don't know what
actually makes sense in a MAC world.  But I am certain that is something
that can be worked through.

Eric

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-04 19:19 ` [RFC 1/1] shiftfs: uid/gid shifting bind mount James Bottomley
                     ` (3 preceding siblings ...)
  2017-02-15 20:34   ` Vivek Goyal
@ 2017-02-17  2:29   ` Al Viro
  2017-02-17 17:24     ` James Bottomley
  4 siblings, 1 reply; 82+ messages in thread
From: Al Viro @ 2017-02-17  2:29 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Serge E. Hallyn, Phil Estes

On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:

> +static const struct dentry_operations shiftfs_dentry_ops = {
> +	.d_release	= shiftfs_d_release,
> +	.d_real		= shiftfs_d_real,
> +};

In other words, those dentries are *never* revalidated.  Nevermind that
underlying fs might be mounted elsewhere and be actively modified under
you.

> +static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
> +				     unsigned int flags)
> +{
> +	struct dentry *real = dir->i_private, *new;
> +	struct inode *reali = real->d_inode, *newi;
> +	const struct cred *oldcred, *newcred;
> +
> +	inode_lock(reali);
> +	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
> +	new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
> +	shiftfs_old_creds(oldcred, &newcred);
> +	inode_unlock(reali);
> +
> +	if (IS_ERR(new))
> +		return new;
> +
> +	dentry->d_fsdata = new;
> +
> +	if (!new->d_inode)
> +		return NULL;

What happens when somebody comes along and creates the damn thing on the
underlying fs?  _Not_ via your code, that is - using the underlying fs
mounted elsewhere.

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-16 15:56     ` James Bottomley
@ 2017-02-17  2:55       ` Al Viro
  2017-02-17 17:34         ` James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: Al Viro @ 2017-02-17  2:55 UTC (permalink / raw)
  To: James Bottomley
  Cc: Vivek Goyal, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Serge E. Hallyn,
	Phil Estes

On Thu, Feb 16, 2017 at 07:56:30AM -0800, James Bottomley wrote:

> > Hi James,
> > 
> > Should it be "return d_splice_alias()" so that if we find an alias it 
> > is returned back to caller and passed in dentry can be freed. Though 
> > I don't know in what cases alias can be found. And if alias is found 
> > how do we make sure alias_dentry->d_fsdata is pointing to new (real
> > dentry).
> 
> It probably should be for the sake of the pattern.  In our case I don't
> think we can have any root aliases because the root dentry is always
> pinned in the cache, so cache lookup should always find it.

What does that have to do with root dentry?  The real reason why that code
works (FVerySVO) is that the damn thing allocates a new inode every time.
Including the hardlinks, BTW.  So d_splice_alias() will always return NULL -
there's no way for any dentries to be pointing to in-core struct inode you've
just allocated.  Short of a use-after-free, that is...

Unless I'm missing something subtle, the whole thing is fucked
in head wrt cache coherency - its dentries are blindly assumed to be
forever valid, no matter what's happening with the underlying filesystem.

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-17  1:57                   ` Eric W. Biederman
@ 2017-02-17  8:39                     ` Djalal Harouni
  2017-02-17 17:19                     ` James Bottomley
  1 sibling, 0 replies; 82+ messages in thread
From: Djalal Harouni @ 2017-02-17  8:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: James Bottomley, Vivek Goyal, Amir Goldstein, Chris Mason,
	Theodore Tso, Josh Triplett, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Fri, Feb 17, 2017 at 2:57 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
>
>> On Thu, 2017-02-16 at 11:42 -0500, Vivek Goyal wrote:
>>> On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:
>>>
>>> [..]
>>> > > Two levels of checks will simplify this a bit. Top level inode
>>> > > will belong to the user namespace of caller and checks should
>>> > > pass. And mounter's creds will have ownership over the real inode
>>> > > so no additional namespace shifting required there.
>>> >
>>> > That's the problem: for a marked mount, they don't.
>>>
>>> In this new model it does not fit directly.
>>>
>>> I was playing with a slightly different approach and modified patches
>>> so that real root still does the mounting and takes an mount option
>>> which specifies which user namespace we want to shift into. Thanks to
>>> Eric for the idea.
>>>
>>> mount -t shiftfs -o userns_fd=<fd> source shifted-fs
>
>> This is a non-starter because it doesn't work for the unprivileged use
>> case, which is what I'm really interested in.
>
> But I believe it does.  It just requires a bit more work for in the
> shiftfs filesystem above.  It should be perfectly possible with the help
> of newuidmap to create a user namespace with the desired mappings.
>
> My understanding is that Vivek started with requiring root to mount the
> filesystem only as a simplification during development, and that the
> plan is to get the basic use case working and then allow unprivileged
> mounting.
>
>> For fully unprivileged
>> containers you don't have an orchestration system to ask to build the
>> container.  You can get init scripts to set stuff up for you, like the
>> marks, but ideally it should just work even without that (so an inode
>> flag following project semantics seems really appealing), but after
>> that the unprivileged user should be able to build their own
>> containers.
>>
>> As you saw from the reply to Eric, this approach (which I have tried)
>> also opens up a whole can of worms for non-FS_USERNS_MOUNT filesystems.
>>
>
> From what I can see we have two cases we care about.
> A) A non-default mapping from the filesystem to the rest of the system
>    and roughly s_user_ns provides that but we need a review of the
>    filesystems to make certain something has not been forgotten.
>
> B) A filesystem image sitting around in a directory somewhere that
>    we want to map differently into different user namespaces while
>    using the same files as backing store.
>
> For the second case what is interesting technically is that we want
> multiple mappings.  A user namespace appears adequate to specify those
> extra mappings (effectively from kuids to kuids).
>
> So we need something to associate the additional mapping with a
> directory tree.  A stackable filesystem with it's own s_user_ns field
> appears a very straight forward way to do that.  Especially if it can
> figure out how to assert that the underlying filesystem image is
> read-only (doesn't overlayfs require that?).  Making the entire stack
> read-only.
>
> I don't see a problem with that for unprivileged use (except possibly
> the read-only enforcement on the unerlying directory tree).
>
> What Vivek is talking about appears to be perfectly correct.  Performing
> the underlying filesystem permission checks as the possibly unprivileged
> user who mounted shiftfs.  After performing a set of permission checks
> (at the shiftfs level) as the user who is accessing the files.
>
>
> . . .
>
>
> I think I am missing something but I completely do not understand that
> subthread that says use file marks and perform the work in the vfs.
> The problem is that fundamentally we need multiple mappings and I don't
> see a mark on a file (even an inherited mark) providing the mapping so I
> don't see the point.
>
> Which leaves two possible places to store the extra mapping.  In the
> struct mount.  Or in a stacked filesystem super_block.  For a stacked
> filesystem I can see where to store the extra translation.  In the upper
> filesystems upper inode.   And we can perform the practical permission
> check on the upper inode as well.
>
> For a vfs level solution it looks like we would have to change all of
> the permission checking code in the kernel to have a special case for
> this kind of mapping.  Which does not sound maintainable.

Facts: for basic permissions: 3 files changed, 19 insertions(+), 6
deletions(-)  https://lkml.org/lkml/2016/5/4/417

That made permissions work for basically *all* filesystems. However
yes it does not handle xattr acls...

> So at the moment I don't think a vfs level solution makes any sense.
>
The permissions change was already done when userns were merged.  What
you may need is VFS accessors: instead of working directly on
inode->i_uid, ask the VFS to give you the right i_uid (which can also
cover the projectid case proposed by Christoph, iff I got it right...).
You need it both ways: to report to userspace, and the other way, to
pass it to the underlying filesystem for writes/quota, which Dave
Chinner pointed out.
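
To make the "accessors in both directions" idea concrete, here is a
small user-space sketch.  It is only loosely modelled on the kernel's
i_uid_read()/i_uid_write() helpers; the acc_ names, the single-extent
map and the ACC_INVALID_ID value are assumptions made for illustration,
not the real kernel data structures.

```c
#include <assert.h>
#include <stdint.h>

#define ACC_INVALID_ID ((uint32_t)-1)

/* Hypothetical single-extent id map standing in for a user namespace. */
struct acc_userns {
	uint32_t first_inside;	/* first id as seen inside the namespace */
	uint32_t first_outside;	/* first kernel-global id (kuid) */
	uint32_t count;		/* number of ids covered by the extent */
};

/* Hypothetical inode: stores the kernel-global owner plus its mapping. */
struct acc_inode {
	uint32_t kuid;
	const struct acc_userns *ns;	/* plays the role of s_user_ns */
};

/* kuid -> namespace-local uid, or ACC_INVALID_ID when unmapped. */
static uint32_t acc_from_kuid(const struct acc_userns *ns, uint32_t kuid)
{
	if (kuid < ns->first_outside || kuid - ns->first_outside >= ns->count)
		return ACC_INVALID_ID;
	return ns->first_inside + (kuid - ns->first_outside);
}

/* namespace-local uid -> kuid, or ACC_INVALID_ID when unmapped. */
static uint32_t acc_make_kuid(const struct acc_userns *ns, uint32_t uid)
{
	if (uid < ns->first_inside || uid - ns->first_inside >= ns->count)
		return ACC_INVALID_ID;
	return ns->first_outside + (uid - ns->first_inside);
}

/* Direction 1: what gets reported to userspace. */
static uint32_t acc_i_uid_read(const struct acc_inode *inode)
{
	assert(inode && inode->ns);
	return acc_from_kuid(inode->ns, inode->kuid);
}

/* Direction 2: what gets passed down for writes/quota. */
static void acc_i_uid_write(struct acc_inode *inode, uint32_t uid)
{
	assert(inode && inode->ns);
	inode->kuid = acc_make_kuid(inode->ns, uid);
}
```

The point of the sketch is only that a filesystem calling the two
accessors never touches the raw stored id, so the translation can live
in one place.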

Anyway, it seems the ship has settled, so my thoughts at that time were
to follow the change made for the i_uid_read(), i_gid_read() helpers when
userns were merged.  The code comment says: "Helper functions so that
in most cases filesystems will not need to deal directly with kuid_t
and kgid_t", so the start was from there: the VFS should be the one to
handle everything, using accessors for both directions.  Now if you guys
want multiple user namespace contexts for every container (mount
namespace user namespace, s_user_ns and shiftfs user ns ...) or
multiple APIs, that will just add confusion.  I can see this directly
with orchestration/container developers: they just don't understand
what's happening.  They want something like bind mounts!
A new filesystem is a new filesystem.

Maybe Eric you will find something useful from these comments.

Thanks!


-- 
tixxdz

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-17  1:57                   ` Eric W. Biederman
  2017-02-17  8:39                     ` Djalal Harouni
@ 2017-02-17 17:19                     ` James Bottomley
  2017-02-20  4:24                       ` Eric W. Biederman
  1 sibling, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-17 17:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Vivek Goyal, Amir Goldstein, Djalal Harouni, Chris Mason,
	Theodore Tso, Josh Triplett, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Fri, 2017-02-17 at 14:57 +1300, Eric W. Biederman wrote:
> I think I am missing something but I completely do not understand 
> that subthread that says use file marks and perform the work in the 
> vfs. The problem is that fundamentally we need multiple mappings and 
> I don't see a mark on a file (even an inherited mark) providing the 
> mapping so I don't see the point.

The point of the mark is that it's a statement by the system
administrator that the underlying subtree is safe to be mounted by an
unprivileged container in the container's user view (i.e. with
current_user_ns() == s_user_ns).  For the unprivileged container
there's no real arbitrary s_user_ns use case because the unprivileged
container must prove it can set up the mapping, so it would likely
always be mounting from within a user_ns with the mapping it wanted.

James

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-17  2:29   ` Al Viro
@ 2017-02-17 17:24     ` James Bottomley
  2017-02-17 17:51       ` Al Viro
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-17 17:24 UTC (permalink / raw)
  To: Al Viro
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Serge E. Hallyn, Phil Estes

On Fri, 2017-02-17 at 02:29 +0000, Al Viro wrote:
> On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
> 
> > +static const struct dentry_operations shiftfs_dentry_ops = {
> > +	.d_release	= shiftfs_d_release,
> > +	.d_real		= shiftfs_d_real,
> > +};
> 
> In other words, those dentries are *never* revalidated.  Nevermind 
> that underlying fs might be mounted elsewhere and be actively 
> modified under you.
> 
> > +static struct dentry *shiftfs_lookup(struct inode *dir, struct
> > dentry *dentry,
> > +				     unsigned int flags)
> > +{
> > +	struct dentry *real = dir->i_private, *new;
> > +	struct inode *reali = real->d_inode, *newi;
> > +	const struct cred *oldcred, *newcred;
> > +
> > +	inode_lock(reali);
> > +	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
> > +	new = lookup_one_len(dentry->d_name.name, real, dentry
> > ->d_name.len);
> > +	shiftfs_old_creds(oldcred, &newcred);
> > +	inode_unlock(reali);
> > +
> > +	if (IS_ERR(new))
> > +		return new;
> > +
> > +	dentry->d_fsdata = new;
> > +
> > +	if (!new->d_inode)
> > +		return NULL;
> 
> What happens when somebody comes along and creates the damn thing on 
> the underlying fs?  _Not_ via your code, that is - using the 
> underlying fs mounted elsewhere.

Point taken.  This, I think, fixes the dcache revalidation issue.

James

---

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index a4a1f98..1e71efe 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -118,9 +118,43 @@ static struct dentry *shiftfs_d_real(struct dentry *dentry,
 	return real;
 }
 
+static int shiftfs_d_weak_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct dentry *real = dentry->d_fsdata;
+
+	if (d_unhashed(real))
+		return 0;
+
+	if (!(real->d_flags & DCACHE_OP_WEAK_REVALIDATE))
+		return 1;
+
+	return real->d_op->d_weak_revalidate(real, flags);
+}
+
+static int shiftfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct dentry *real = dentry->d_fsdata;
+	int ret;
+
+	if (d_unhashed(real))
+		return 0;
+
+	if (!(real->d_flags & DCACHE_OP_REVALIDATE))
+		return 1;
+
+	ret = real->d_op->d_revalidate(real, flags);
+
+	if (ret == 0 && !(flags & LOOKUP_RCU))
+		d_invalidate(real);
+
+	return ret;
+}
+
 static const struct dentry_operations shiftfs_dentry_ops = {
 	.d_release	= shiftfs_d_release,
 	.d_real		= shiftfs_d_real,
+	.d_revalidate	= shiftfs_d_revalidate,
+	.d_weak_revalidate = shiftfs_d_weak_revalidate,
 };
 
 static int shiftfs_readlink(struct dentry *dentry, char __user *data,
@@ -431,9 +465,7 @@ static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
 		return ERR_PTR(-ENOMEM);
 	}
 
-	d_splice_alias(newi, dentry);
-
-	return NULL;
+	return d_splice_alias(newi, dentry);
 }
 
 static int shiftfs_permission(struct inode *inode, int mask)

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-17  2:55       ` Al Viro
@ 2017-02-17 17:34         ` James Bottomley
  2017-02-17 20:35           ` Vivek Goyal
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-17 17:34 UTC (permalink / raw)
  To: Al Viro
  Cc: Vivek Goyal, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Serge E. Hallyn,
	Phil Estes

On Fri, 2017-02-17 at 02:55 +0000, Al Viro wrote:
> On Thu, Feb 16, 2017 at 07:56:30AM -0800, James Bottomley wrote:
> 
> > > Hi James,
> > > 
> > > Should it be "return d_splice_alias()" so that if we find an 
> > > alias it is returned back to caller and passed in dentry can be 
> > > freed. Though I don't know in what cases alias can be found. And 
> > > if alias is found how do we make sure alias_dentry->d_fsdata is 
> > > pointing to new (real dentry).
> > 
> > It probably should be for the sake of the pattern.  In our case I 
> > don't think we can have any root aliases because the root dentry is
> > always pinned in the cache, so cache lookup should always find it.
> 
> What does that have to do with root dentry?  The real reason why that 
> code works (FVerySVO) is that the damn thing allocates a new inode 
> every time. Including the hardlinks, BTW.

Yes, this is a known characteristic of stacked filesystems.  Is there
some magic I don't know about that would make it easier to reflect hard
links as aliases?

>   So d_splice_alias() will always return NULL - there's no way for 
> any dentries to be pointing to in-core struct inode you've
> just allocated.  Short of a use-after-free, that is...
> 
> Unless I'm missing something subtle, the whole thing is fucked
> in head wrt cache coherency - its dentries are blindly assumed to be
> forever valid, no matter what's happening with the underlying 
> filesystem.

Hopefully the patch in the previous email fixes this.

James

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-17 17:24     ` James Bottomley
@ 2017-02-17 17:51       ` Al Viro
  2017-02-17 20:27         ` Vivek Goyal
  2017-02-17 20:50         ` James Bottomley
  0 siblings, 2 replies; 82+ messages in thread
From: Al Viro @ 2017-02-17 17:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Serge E. Hallyn, Phil Estes

On Fri, Feb 17, 2017 at 09:24:40AM -0800, James Bottomley wrote:

> > What happens when somebody comes along and creates the damn thing on 
> > the underlying fs?  _Not_ via your code, that is - using the 
> > underlying fs mounted elsewhere.
> 
> Point taken.  This, I think fixes the dcache revalidation issue.

No, it doesn't.  Consider a local filesystem.  Those do not have any
->d_revalidate() - the kernel bloody well knows what happens to
directories.  If e.g. a previously absent file gets created, it's
been done by the kernel itself and dentry has been made positive; if
a previously existing file has been removed, dentry has either become
negative or, if it had been pinned (e.g. file was opened at the time,
or your code had been holding a reference to it, etc.) it will be unhashed
so that new lookups won't find it, etc.  No need to revalidate anything.

Now, consider your code.  You've done a lookup in the underlying fs.
It has, at the time, come negative, so you have your (negative) dentry
pointing to that on the underlying fs.  If somebody comes and does
e.g. mkdir() via your fs, it will call vfs_mkdir() on the underlying
sucker, hopefully turning it positive and associate a new in-core inode
with your previously negative dentry.  But what happens if mkdir is done
via underlying fs, or via another instance of yours over the same tree?
Underlying dentry goes positive; yours is still negative.  The underlying
fs either doesn't have ->d_revalidate() or, if there is one it says that
the underlying dentry is valid, thank you very much, no need to invalidate
anything.

In other words, your patch does nothing for object getting created.
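
The scenario can be modelled in user space with a toy pair of dentries.
This is a sketch under stated assumptions, not kernel code: the dc_
names are hypothetical, "negative" is modelled as a NULL inode pointer,
and dc_revalidate_compare() is just one conceivable check that looks at
the positive/negative state of both layers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature inode and dentry. */
struct dc_inode { int ino; };

struct dc_dentry {
	struct dc_inode *inode;	/* NULL means the dentry is negative */
	struct dc_dentry *real;	/* upper only: cached lower dentry */
};

static bool dc_is_negative(const struct dc_dentry *d)
{
	return d->inode == NULL;
}

/*
 * Revalidation that only defers to the lower filesystem.  A local
 * filesystem has no revalidate hook, so this always reports "valid".
 */
static bool dc_revalidate_defer(const struct dc_dentry *upper)
{
	(void)upper;
	return true;
}

/*
 * Revalidation that also compares the two layers: the upper dentry is
 * stale as soon as the lower one changed between negative and positive.
 */
static bool dc_revalidate_compare(const struct dc_dentry *upper)
{
	assert(upper->real != NULL);
	return dc_is_negative(upper->real) == dc_is_negative(upper);
}

/* An object created directly on the lower fs, bypassing the upper one. */
static void dc_lower_create(struct dc_dentry *lower, struct dc_inode *inode)
{
	lower->inode = inode;
}
```

After dc_lower_create() the lower dentry is positive while the upper one
is still negative; dc_revalidate_defer() keeps reporting the stale upper
dentry as valid, whereas dc_revalidate_compare() rejects it and would
force a fresh lookup.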

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-17 17:51       ` Al Viro
@ 2017-02-17 20:27         ` Vivek Goyal
  2017-02-17 20:50         ` James Bottomley
  1 sibling, 0 replies; 82+ messages in thread
From: Vivek Goyal @ 2017-02-17 20:27 UTC (permalink / raw)
  To: Al Viro
  Cc: James Bottomley, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Serge E. Hallyn,
	Phil Estes

On Fri, Feb 17, 2017 at 05:51:18PM +0000, Al Viro wrote:
> On Fri, Feb 17, 2017 at 09:24:40AM -0800, James Bottomley wrote:
> 
> > > What happens when somebody comes along and creates the damn thing on 
> > > the underlying fs?  _Not_ via your code, that is - using the 
> > > underlying fs mounted elsewhere.
> > 
> > Point taken.  This, I think fixes the dcache revalidation issue.
> 
> No, it doesn't.  Consider a local filesystem.  Those do not have any
> ->d_revalidate() - the kernel bloody well knows what happens to
> directories.  If e.g. a previously absent file gets created, it's
> been done by the kernel itself and dentry has been made positive; if
> a previously existing file has been removed, dentry has either become
> negative or, if it had been pinned (e.g. file was opened at the time,
> or your code had been holding a reference to it, etc.) it will be unhashed
> so that new lookups won't find it, etc.  No need to revalidate anything.
> 
> Now, consider your code.  You've done a lookup in the underlying fs.
> It has, at the time, come negative, so you have your (negative) dentry
> pointing to that on the underlying fs.  If somebody comes and does
> e.g. mkdir() via your fs, it will call vfs_mkdir() on the underlying
> sucker, hopefully turning it positive and associate a new in-core inode
> with your previously negative dentry.  But what happens if mkdir is done
> via underlying fs, or via another instance of yours over the same tree?
> Underlying dentry goes positive; yours is still negative.  The underlying
> fs either doesn't have ->d_revalidate() or, if there is one it says that
> the underlying dentry is valid, thank you very much, no need to invalidate
> anything.
> 
> In other words, your patch does nothing for object getting created.

I thought the assumption here is that the underlying subtree is not
changed outside of shiftfs. IIUC, overlayfs has the same assumption.

Two shiftfs instances writing to the same dir will be a problem though.

Vivek

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-17 17:34         ` James Bottomley
@ 2017-02-17 20:35           ` Vivek Goyal
  2017-02-19  3:24             ` James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: Vivek Goyal @ 2017-02-17 20:35 UTC (permalink / raw)
  To: James Bottomley
  Cc: Al Viro, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Serge E. Hallyn,
	Phil Estes

On Fri, Feb 17, 2017 at 09:34:07AM -0800, James Bottomley wrote:
> On Fri, 2017-02-17 at 02:55 +0000, Al Viro wrote:
> > On Thu, Feb 16, 2017 at 07:56:30AM -0800, James Bottomley wrote:
> > 
> > > > Hi James,
> > > > 
> > > > Should it be "return d_splice_alias()" so that if we find an 
> > > > alias it is returned back to caller and passed in dentry can be 
> > > > freed. Though I don't know in what cases alias can be found. And 
> > > > if alias is found how do we make sure alias_dentry->d_fsdata is 
> > > > pointing to new (real dentry).
> > > 
> > > It probably should be for the sake of the pattern.  In our case I 
> > > don't think we can have any root aliases because the root dentry is
> > > always pinned in the cache, so cache lookup should always find it.
> > 
> > What does that have to do with root dentry?  The real reason why that 
> > code works (FVerySVO) is that the damn thing allocates a new inode 
> > every time. Including the hardlinks, BTW.
> 
> Yes, this is a known characteristic of stacked filesystems.  Is there
> some magic I don't know about that would make it easier to reflect hard
> links as aliases?

I think overlayfs had the same issue in the beginning and Miklos fixed it.

commit 51f7e52dc943468c6929fa0a82d4afac3c8e9636
Author: Miklos Szeredi <mszeredi@redhat.com>
Date:   Fri Jul 29 12:05:24 2016 +0200

    ovl: share inode for hard link

Vivek

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-17 17:51       ` Al Viro
  2017-02-17 20:27         ` Vivek Goyal
@ 2017-02-17 20:50         ` James Bottomley
  1 sibling, 0 replies; 82+ messages in thread
From: James Bottomley @ 2017-02-17 20:50 UTC (permalink / raw)
  To: Al Viro
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Serge E. Hallyn, Phil Estes

On Fri, 2017-02-17 at 17:51 +0000, Al Viro wrote:
> On Fri, Feb 17, 2017 at 09:24:40AM -0800, James Bottomley wrote:
> 
> > > What happens when somebody comes along and creates the damn thing
> > > on 
> > > the underlying fs?  _Not_ via your code, that is - using the 
> > > underlying fs mounted elsewhere.
> > 
> > Point taken.  This, I think fixes the dcache revalidation issue.
> 
> No, it doesn't.  Consider a local filesystem.  Those do not have any
> ->d_revalidate() - the kernel bloody well knows what happens to
> directories.  If e.g. a previously absent file gets created, it's
> been done by the kernel itself and dentry has been made positive; if
> a previously existing file has been removed, dentry has either become
> negative or, if it had been pinned (e.g. file was opened at the time,
> or your code had been holding a reference to it, etc.) it will be 
> unhashed so that new lookups won't find it, etc.  No need to 
> revalidate anything.
> 
> Now, consider your code.  You've done a lookup in the underlying fs.
> It has, at the time, come negative, so you have your (negative) 
> dentry pointing to that on the underlying fs.  If somebody comes and 
> does e.g. mkdir() via your fs, it will call vfs_mkdir() on the 
> underlying sucker, hopefully turning it positive and associate a new 
> in-core inode with your previously negative dentry.  But what happens 
> if mkdir is done via underlying fs, or via another instance of yours 
> over the same tree?
> Underlying dentry goes positive; yours is still negative.  The 
> underlying fs either doesn't have ->d_revalidate() or, if there is 
> one it says that the underlying dentry is valid, thank you very much, 
> no need to invalidate anything.
> 
> In other words, your patch does nothing for object getting created.

Right at the moment, the upper layer doesn't cache negative dentries,
but that's only a partial solution.  I assume you'd now like me to
cache negative dentries (principle of least surprise) and handle the
underlying negative to positive transition in d_revalidate?

I can do that.

James

---

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index a4a1f98..5b50447 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -118,9 +118,50 @@ static struct dentry *shiftfs_d_real(struct dentry *dentry,
 	return real;
 }
 
+static int shiftfs_d_weak_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct dentry *real = dentry->d_fsdata;
+
+	if (d_unhashed(real))
+		return 0;
+
+	if (!(real->d_flags & DCACHE_OP_WEAK_REVALIDATE))
+		return 1;
+
+	return real->d_op->d_weak_revalidate(real, flags);
+}
+
+static int shiftfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct dentry *real = dentry->d_fsdata;
+	int ret;
+
+	if (d_unhashed(real))
+		return 0;
+
+	/*
+	 * inode state of underlying changed from positive to negative
+	 * or vice versa; force a lookup to update our view
+	 */
+	if (d_is_negative(real) != d_is_negative(dentry))
+		return 0;
+
+	if (!(real->d_flags & DCACHE_OP_REVALIDATE))
+		return 1;
+
+	ret = real->d_op->d_revalidate(real, flags);
+
+	if (ret == 0 && !(flags & LOOKUP_RCU))
+		d_invalidate(real);
+
+	return ret;
+}
+
 static const struct dentry_operations shiftfs_dentry_ops = {
 	.d_release	= shiftfs_d_release,
 	.d_real		= shiftfs_d_real,
+	.d_revalidate	= shiftfs_d_revalidate,
+	.d_weak_revalidate = shiftfs_d_weak_revalidate,
 };
 
 static int shiftfs_readlink(struct dentry *dentry, char __user *data,
@@ -423,7 +464,7 @@ static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
 	dentry->d_fsdata = new;
 
 	if (!new->d_inode)
-		return NULL;
+		goto out;
 
 	newi = shiftfs_new_inode(dentry->d_sb, new->d_inode->i_mode, new);
 	if (!newi) {
@@ -431,9 +472,8 @@ static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
 		return ERR_PTR(-ENOMEM);
 	}
 
-	d_splice_alias(newi, dentry);
-
-	return NULL;
+ out:
+	return d_splice_alias(newi, dentry);
 }
 
 static int shiftfs_permission(struct inode *inode, int mask)

* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-17 20:35           ` Vivek Goyal
@ 2017-02-19  3:24             ` James Bottomley
  2017-02-20 19:26               ` Vivek Goyal
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-19  3:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Al Viro, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Serge E. Hallyn,
	Phil Estes

On Fri, 2017-02-17 at 15:35 -0500, Vivek Goyal wrote:
> On Fri, Feb 17, 2017 at 09:34:07AM -0800, James Bottomley wrote:
> > On Fri, 2017-02-17 at 02:55 +0000, Al Viro wrote:
> > > On Thu, Feb 16, 2017 at 07:56:30AM -0800, James Bottomley wrote:
> > > 
> > > > > Hi James,
> > > > > 
> > > > > Should it be "return d_splice_alias()" so that if we find an 
> > > > > alias it is returned back to caller and passed in dentry can 
> > > > > be freed. Though I don't know in what cases alias can be 
> > > > > found. And if alias is found how do we make sure alias_dentry
> > > > > ->d_fsdata is pointing to new (real dentry).
> > > > 
> > > > It probably should be for the sake of the pattern.  In our case 
> > > > I don't think we can have any root aliases because the root
> > > > dentry is always pinned in the cache, so cache lookup should 
> > > > always find it.
> > > 
> > > What does that have to do with root dentry?  The real reason why 
> > > that code works (FVerySVO) is that the damn thing allocates a new
> > > inode every time. Including the hardlinks, BTW.
> > 
> > Yes, this is a known characteristic of stacked filesystems.  Is 
> > there some magic I don't know about that would make it easier to 
> > reflect hard links as aliases?
> 
> I think overlayfs had the same issue in the beginning and miklos
> fixed it.
> 
> commit 51f7e52dc943468c6929fa0a82d4afac3c8e9636
> Author: Miklos Szeredi <mszeredi@redhat.com>
> Date:   Fri Jul 29 12:05:24 2016 +0200
> 
>     ovl: share inode for hard link

That's rather complex, but the principle is simple: use the inode hash
for all upper inodes that may have aliases.  Aliasable means the
underlying inode isn't a directory and has i_nlink > 1, so all I have
to do is perform a lookup through the hash if the underlying is
aliasable, invalidate the dentry in d_revalidate if the aliasing
conditions to the underlying change and manually handle hard links and
it should all work.

Like this?

James

---

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index 5b50447..c659812 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -134,6 +134,7 @@ static int shiftfs_d_weak_revalidate(struct dentry *dentry, unsigned int flags)
 static int shiftfs_d_revalidate(struct dentry *dentry, unsigned int flags)
 {
 	struct dentry *real = dentry->d_fsdata;
+	struct inode *reali = d_inode(real), *inode = d_inode(dentry);
 	int ret;
 
 	if (d_unhashed(real))
@@ -146,6 +147,15 @@ static int shiftfs_d_revalidate(struct dentry *dentry, unsigned int flags)
 	if (d_is_negative(real) != d_is_negative(dentry))
 		return 0;
 
+	/*
+	 * non dir link count is > 1 and our inode is currently not in
+	 * the inode hash => need to drop and reget our dentry to make
+	 * sure we're aliasing it correctly.
+	 */
+	if (reali && !S_ISDIR(reali->i_mode) && reali->i_nlink > 1 &&
+	    (!inode || inode_unhashed(inode)))
+		return 0;
+
 	if (!(real->d_flags & DCACHE_OP_REVALIDATE))
 		return 1;
 
@@ -285,7 +295,8 @@ static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
 			       umode_t mode, const char *symlink,
 			       struct dentry *hardlink, bool excl)
 {
-	struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
+	struct dentry *real = dir->i_private, *new = dentry->d_fsdata,
+		*realhardlink = NULL;
 	struct inode *reali = real->d_inode, *newi;
 	const struct inode_operations *iop = reali->i_op;
 	int err;
@@ -293,6 +304,7 @@ static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
 	bool op_ok = false;
 
 	if (hardlink) {
+		realhardlink = hardlink->d_fsdata;
 		op_ok = iop->link;
 	} else {
 		switch (mode & S_IFMT) {
@@ -310,7 +322,7 @@ static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
 		return -EINVAL;
 
 
-	newi = shiftfs_new_inode(dentry->d_sb, mode, NULL);
+	newi = shiftfs_new_inode(dentry->d_sb, mode, realhardlink);
 	if (!newi)
 		return -ENOMEM;
 
@@ -320,8 +332,6 @@ static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
 
 	err = -EINVAL;		/* shut gcc up about uninit var */
 	if (hardlink) {
-		struct dentry *realhardlink = hardlink->d_fsdata;
-
 		err = vfs_link(realhardlink, reali, new, NULL);
 	} else {
 		switch (mode & S_IFMT) {
@@ -341,7 +351,16 @@ static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
 	if (err)
 		goto out_dput;
 
-	shiftfs_fill_inode(newi, new);
+	if (!hardlink)
+		shiftfs_fill_inode(newi, new);
+	else if (inode_unhashed(newi) && !S_ISDIR(newi->i_mode))
+		/*
+		 * although dentry and hardlink now each point to
+		 * newi, the link count was 1 when they were created,
+		 * so insert into the inode cache now that the link
+		 * count has gone above one.
+		 */
+		__insert_inode_hash(newi, (unsigned long)d_inode(new));
 
 	d_instantiate(dentry, newi);
 
@@ -569,12 +588,55 @@ static const struct inode_operations shiftfs_inode_ops = {
 	.listxattr	= shiftfs_listxattr,
 };
 
+static int shiftfs_test(struct inode *inode, void *data)
+{
+	struct dentry *d1 = inode->i_private, *d2 = data;
+	struct inode *i1 = d_inode(d1), *i2 = d_inode(d2);
+
+	return i1 && i1 == i2;
+}
+
+static int shiftfs_set(struct inode *inode, void *data)
+{
+	struct dentry *dentry = data;
+
+	shiftfs_fill_inode(inode, dentry);
+
+	return 0;
+}
+
 static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
 				       struct dentry *dentry)
 {
 	struct inode *inode;
+	struct inode *reali = dentry ? d_inode(dentry) : NULL;
+	bool use_inode_hash = false;
+
+	/*
+	 * Here we hash the inode only if the underlying link count is
+	 * greater than one and it's not a directory (meaning the hash
+	 * contains all items that might be aliases).  We keep this
+	 * accurate by checking the underlying link count on
+	 * revalidation and forcing a new lookup if the underlying
+	 * link count is raised.
+	 *
+	 * Note: if the link count drops again, we don't remove the
+	 * inode from the hash, so the hash contains all inodes that
+	 * may be aliases plus a few others.
+	 */
+	if (reali)
+		use_inode_hash = ACCESS_ONCE(reali->i_nlink) > 1 &&
+			!S_ISDIR(reali->i_mode);
+
+	if (use_inode_hash) {
+		inode = iget5_locked(sb, (unsigned long)reali, shiftfs_test,
+				     shiftfs_set, dentry);
+		if (inode && !(inode->i_state & I_NEW))
+			return inode;
+	} else {
+		inode = new_inode(sb);
+	}
 
-	inode = new_inode(sb);
 	if (!inode)
 		return NULL;
 
@@ -586,7 +648,10 @@ static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
 
 	inode->i_op = &shiftfs_inode_ops;
 
-	shiftfs_fill_inode(inode, dentry);
+	if (use_inode_hash)
+		unlock_new_inode(inode);
+	else
+		shiftfs_fill_inode(inode, dentry);
 
 	return inode;
 }


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-17 17:19                     ` James Bottomley
@ 2017-02-20  4:24                       ` Eric W. Biederman
  2017-02-22 12:01                         ` James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: Eric W. Biederman @ 2017-02-20  4:24 UTC (permalink / raw)
  To: James Bottomley
  Cc: Vivek Goyal, Amir Goldstein, Djalal Harouni, Chris Mason,
	Theodore Tso, Josh Triplett, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> On Fri, 2017-02-17 at 14:57 +1300, Eric W. Biederman wrote:
>> I think I am missing something but I completely do not understand 
>> that subthread that says use file marks and perform the work in the 
>> vfs. The problem is that fundamentally we need multiple mappings and 
>> I don't see a mark on a file (even an inherited mark) providing the 
>> mapping so I don't see the point.
>
> The point of the mark is that it's a statement by the system
> administrator that the underlying subtree is safe to be mounted by an
> unprivileged container in the container's user view (i.e. with
> current_user_ns() == s_user_ns).  For the unprivileged container
> there's no real arbitrary s_user_ns use case because the unprivileged
> container must prove it can set up the mapping, so it would likely
> always be mounting from within a user_ns with the mapping it wanted.

As a statement that it is ok for the unprivileged mapping code to
operate that seems reasonable.  I don't currently see the need for such an
ok from the system administrator, but if you need it, a flag that
propagates to children and child directories seems reasonable.

Eric


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-19  3:24             ` James Bottomley
@ 2017-02-20 19:26               ` Vivek Goyal
  2017-02-21  0:38                 ` James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: Vivek Goyal @ 2017-02-20 19:26 UTC (permalink / raw)
  To: James Bottomley
  Cc: Al Viro, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Serge E. Hallyn,
	Phil Estes

On Sat, Feb 18, 2017 at 07:24:38PM -0800, James Bottomley wrote:

[..]
> > > Yes, this is a known characteristic of stacked filesystems.  Is 
> > > there some magic I don't know about that would make it easier to 
> > > reflect hard links as aliases?
> > 
> > I think overlayfs had the same issue in the beginning and miklos
> > fixed it.
> > 
> > commit 51f7e52dc943468c6929fa0a82d4afac3c8e9636
> > Author: Miklos Szeredi <mszeredi@redhat.com>
> > Date:   Fri Jul 29 12:05:24 2016 +0200
> > 
> >     ovl: share inode for hard link
> 
> That's rather complex, but the principle is simple: use the inode hash
> for all upper inodes that may have aliases.  Aliasable means the
> underlying inode isn't a directory and has i_nlink > 1, so all I have
> to do is perform a lookup through the hash if the underlying is
> aliasable, invalidate the dentry in d_revalidate if the aliasing
> conditions to the underlying change and manually handle hard links and
> it should all work.
> 
> Like this?

Sounds reasonable to me. I did basic testing and this seems to work for me.

In general, I am having random crashes. I just get the following on the
serial console:

------[Cut Here]----------

And nothing after that.

Still trying to narrow down.

Vivek


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-20 19:26               ` Vivek Goyal
@ 2017-02-21  0:38                 ` James Bottomley
  0 siblings, 0 replies; 82+ messages in thread
From: James Bottomley @ 2017-02-21  0:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Al Viro, Djalal Harouni, Chris Mason, Theodore Tso,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Serge E. Hallyn,
	Phil Estes

On Mon, 2017-02-20 at 14:26 -0500, Vivek Goyal wrote:
> On Sat, Feb 18, 2017 at 07:24:38PM -0800, James Bottomley wrote:
> 
> [..]
> > > > Yes, this is a known characteristic of stacked filesystems.  Is
> > > > there some magic I don't know about that would make it easier to
> > > > reflect hard links as aliases?
> > > 
> > > I think overlayfs had the same issue in the beginning and miklos
> > > fixed it.
> > > 
> > > commit 51f7e52dc943468c6929fa0a82d4afac3c8e9636
> > > Author: Miklos Szeredi <mszeredi@redhat.com>
> > > Date:   Fri Jul 29 12:05:24 2016 +0200
> > > 
> > >     ovl: share inode for hard link
> > 
> > That's rather complex, but the principle is simple: use the inode hash
> > for all upper inodes that may have aliases.  Aliasable means the
> > underlying inode isn't a directory and has i_nlink > 1, so all I have
> > to do is perform a lookup through the hash if the underlying is
> > aliasable, invalidate the dentry in d_revalidate if the aliasing
> > conditions to the underlying change and manually handle hard links and
> > it should all work.
> > 
> > Like this?
> 
> Sounds reasonable to me. I did basic testing and this seems to work
> for me.
> 
> In general, I am having random crashes. I just get the following on the
> serial console:
> 
> ------[Cut Here]----------
> 
> And nothing after that.

That's indicative of some hard lockup.  I don't see this, but I'm also
using a second laptop for testing, which is suboptimal.  I'm going to
try moving to xfstests inside a VM tomorrow (that's what long aeroplane
flights are for).

> Still trying to narrow down.

Thanks.  There've been a lot of patches flying around, so I'll do a
collected repost under a v2 header to make sure we're all in sync.

James


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-06 16:24       ` J. R. Okajima
@ 2017-02-21  0:48         ` James Bottomley
  2017-02-21  2:57           ` J. R. Okajima
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-21  0:48 UTC (permalink / raw)
  To: J. R. Okajima
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Tue, 2017-02-07 at 01:24 +0900, J. R. Okajima wrote:
> James Bottomley:
> > Yes, I know the problem.  However, I believe most current linux
> > filesystems no longer guarantee inode numbers that are stable for the
> > lifetime of the file.  The usual docker container root is overlayfs,
> > which similarly doesn't support stable inode numbers.  I see the
> > odd complaint about docker with overlayfs having unstable inode
> > numbers, but none seems to have any serious repercussions.
> 
> I think it is serious.
> Reusing the backend fs' inum, as Amir wrote, is a good approach.
> Based on this, I'd suggest that you support hard links.

I realised as I was trimming down the vestigial inode properties in the
patch that actually shiftfs does use the i_ino from the underlying for
userspace.  The reason why is that it comes from the getattr call in
stat and that's fully what the underlying filesystem returns (including
the inode number).

James


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-21  0:48         ` James Bottomley
@ 2017-02-21  2:57           ` J. R. Okajima
  2017-02-21  4:07             ` James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: J. R. Okajima @ 2017-02-21  2:57 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

James Bottomley:
> I realised as I was trimming down the vestigial inode properties in the
> patch that actually shiftfs does use the i_ino from the underlying for
> userspace.  The reason why is that it comes from the getattr call in
> stat and that's fully what the underlying filesystem returns (including
> the inode number).

Let me make sure.
- shiftfs has its own inode, but it will never be visible to userspace.
- the inode attributes visible to users are equivalent to the underlying
  ones, including the dev:ino pair.
Right?
If so, I am afraid it will confuse users. The dev:ino pair is a
system-wide identity, but shiftfs creates the same dev:ino pair with a
different owner. I don't know, though, whether any actual application or
LSM exists that would be harmed by this situation.
For the git-status case which I wrote about previously, it might not be a
problem as long as dev:ino is unchanged from the git index.
But such a filesystem looks weird.


J. R. Okajima


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-21  2:57           ` J. R. Okajima
@ 2017-02-21  4:07             ` James Bottomley
  2017-02-21  4:34               ` J. R. Okajima
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2017-02-21  4:07 UTC (permalink / raw)
  To: J. R. Okajima
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

On Tue, 2017-02-21 at 11:57 +0900, J. R. Okajima wrote:
> James Bottomley:
> > I realised as I was trimming down the vestigial inode properties in 
> > the patch that actually shiftfs does use the i_ino from the 
> > underlying for userspace.  The reason why is that it comes from the 
> > getattr call in stat and that's fully what the underlying 
> > filesystem returns (including the inode number).
> 
> Let me make sure.
> - shiftfs has its own inode, but it will never be visible to
>   userspace.
> - the inode attributes visible to users are equivalent to the
>   underlying ones, including the dev:ino pair.
> Right?

Yes, it behaves like a bind mount.

> If so, I am afraid it will make users confused. The dev:ino pair is a
> system-wide identity,

I don't believe it will, otherwise they'd have the same confusion over
a real bind mount.  The dev:inum pair identifies an inode.  An inode
may have many paths and shiftfs just adds a path.

>  but shiftfs creates the same dev:ino pair with different owner.

With a different owner view, but that's irrelevant to the underlying
inode.

>  I don't know, though, whether any actual application or LSM exists
> that would be harmed by this situation.
> For the git-status case which I wrote about previously, it might not be a
> problem as long as dev:ino is unchanged from the git index.
> But such a filesystem looks weird.

It behaves as much as possible like a bind mount and the user view is
standard behaviour, so it can't really be classified as "weird".  What
won't work like a classic bind mount in this scenario is NFS exporting,
but that's about the only thing.
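
The bind-mount comparison can be checked from userspace.  The following is
a minimal sketch, assuming only POSIX stat(); the helper name same_inode is
hypothetical.  It reports whether two paths resolve to the same dev:ino
pair, which is what would be expected for a file seen both directly and
through a bind mount (or, per the description above, through shiftfs).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Returns 1 if both paths resolve to the same inode (same st_dev and
 * st_ino), 0 if they resolve to different inodes, and -1 if either
 * stat() call fails.
 */
int same_inode(const char *a, const char *b)
{
	struct stat sa, sb;

	assert(a != NULL && b != NULL);
	if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
		return -1;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Ownership is deliberately not compared here: st_uid and st_gid may differ
between the two views while the dev:ino identity stays the same.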

James


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-21  4:07             ` James Bottomley
@ 2017-02-21  4:34               ` J. R. Okajima
  0 siblings, 0 replies; 82+ messages in thread
From: J. R. Okajima @ 2017-02-21  4:34 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Chris Mason, Theodore Tso, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro, Serge E. Hallyn,
	Phil Estes

James Bottomley:
> With a different owner view, but that's irrelevant to the underlying
> inode.

Ok, the different ownership is limited to within shiftfs (or the userns,
container). Good. I had forgotten that shiftfs wants to behave like a
bind mount.

And I noticed that shiftfs setattr() converts uid/gid before calling the
backend fs' ->setattr(). That is good too.
But how about ACLs? Won't such a conversion be necessary for ACLs too?
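
To make the uid/gid conversion concrete, here is a minimal userspace model
of a single id-mapping extent.  It is a sketch only: the struct and function
names are hypothetical, it handles one extent rather than a full map, and it
does not claim to show which direction shiftfs applies in setattr().  The
example values follow the 0:100000:65536 style of mapping used elsewhere in
this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ID_UNMAPPED ((uint32_t)-1)

/* One extent: ids [first, first + count) inside the namespace
 * correspond to [lower_first, lower_first + count) outside it. */
struct id_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Namespace id -> outer id, or ID_UNMAPPED if outside the extent. */
uint32_t extent_map_down(const struct id_extent *e, uint32_t id)
{
	assert(e != NULL);
	if (id >= e->first && id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return ID_UNMAPPED;
}

/* Outer id -> namespace id, or ID_UNMAPPED if outside the extent. */
uint32_t extent_map_up(const struct id_extent *e, uint32_t id)
{
	assert(e != NULL);
	if (id >= e->lower_first && id - e->lower_first < e->count)
		return e->first + (id - e->lower_first);
	return ID_UNMAPPED;
}
```

With the extent {0, 100000, 65536}, namespace uid 1000 maps down to 101000
and 101000 maps back up to 1000.  Presumably any uid or gid stored inside an
ACL entry would need the same per-id translation, which is the point of the
question above.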


J. R. Okajima


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2017-02-20  4:24                       ` Eric W. Biederman
@ 2017-02-22 12:01                         ` James Bottomley
  0 siblings, 0 replies; 82+ messages in thread
From: James Bottomley @ 2017-02-22 12:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Vivek Goyal, Amir Goldstein, Djalal Harouni, Chris Mason,
	Theodore Tso, Josh Triplett, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, LSM List, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro,
	Serge E. Hallyn, Phil Estes

On Mon, 2017-02-20 at 17:24 +1300, Eric W. Biederman wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> 
> > On Fri, 2017-02-17 at 14:57 +1300, Eric W. Biederman wrote:
> > > I think I am missing something but I completely do not understand
> > > that subthread that says use file marks and perform the work in 
> > > the vfs. The problem is that fundamentally we need multiple 
> > > mappings and I don't see a mark on a file (even an inherited 
> > > mark) providing the mapping so I don't see the point.
> > 
> > The point of the mark is that it's a statement by the system
> > administrator that the underlying subtree is safe to be mounted by 
> > > an unprivileged container in the container's user view (i.e. with
> > current_user_ns() == s_user_ns).  For the unprivileged container
> > there's no real arbitrary s_user_ns use case because the 
> > unprivileged container must prove it can set up the mapping, so it 
> > would likely always be mounting from within a user_ns with the 
> > mapping it wanted.
> 
> As a statement that it is ok for the unprivileged mapping code to
> operate that seems reasonable.  I don't currently see the need for such
> an ok from the system administrator, but if you need it, a flag that
> propagates to children and child directories seems reasonable.

The other way to do this is with an extended attribute.  I've played
around with that approach and quite like it: the advantage is that it's
sticky across system reboots; the downside is that it requires
additional VFS code to make sure you can't execute from the non-user_ns
view.
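
As a rough illustration of the extended-attribute idea, the sketch below
sets and queries a marker attribute from userspace using the Linux
setxattr()/getxattr() calls.  The attribute name is purely hypothetical and
is not taken from any shiftfs patch.  The user.* namespace is used only so
that the sketch can run without privilege; a real mark would presumably
need a namespace that only the administrator can write.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical attribute name, for illustration only. */
#define MARK_XATTR "user.shiftfs.example-mark"

/* Sets the marker on path; returns 0 on success, -1 on error. */
int set_mark(const char *path)
{
	assert(path != NULL);
	return setxattr(path, MARK_XATTR, "1", 1, 0);
}

/*
 * Returns 1 if the marker is present, 0 if the path exists but has no
 * marker, and -1 on any other error (missing path, xattrs unsupported).
 */
int has_mark(const char *path)
{
	ssize_t n;

	assert(path != NULL);
	/* A zero-sized buffer asks only for the size of the value. */
	n = getxattr(path, MARK_XATTR, NULL, 0);
	if (n >= 0)
		return 1;
	return errno == ENODATA ? 0 : -1;
}
```

Because the attribute lives in the filesystem, it survives a reboot, which
is the "sticky" property described above.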

James


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2016-05-19  2:28             ` Serge E. Hallyn
@ 2016-05-19 10:53               ` James Bottomley
  0 siblings, 0 replies; 82+ messages in thread
From: James Bottomley @ 2016-05-19 10:53 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Serge Hallyn, Djalal Harouni, Chris Mason, tytso, Serge Hallyn,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro

On Wed, 2016-05-18 at 21:28 -0500, Serge E. Hallyn wrote:
> Hey James,
> 
> yeah that's a lot better.  I do still get some syslog messages,
> but I was trivially able to bind a shiftfs into a container and
> use it the way I'd want.
> 
> [  209.452274] ------------[ cut here ]------------
> [  209.452296] WARNING: CPU: 0 PID: 3072 at fs/ext4/inode.c:3977 
> ext4_truncate+0x3f5/0x5b0

Heh, I really need to test with ext4; it seems much more careful.  XFS
doesn't warn on any of this.  These are both inode locking problems
with setattr.  It also looks like I'd have the same problem with
setxattr and removexattr.  Does this additional patch allow you to
operate without any warnings?

There's also something else you'll be running into soon: the xattr
calls aren't uid shifted.  I was a bit worried about how to do this
without leaking root attribute setting capability, but I'll think a bit
more carefully about how to do it.

Thanks,

James

---

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index d352377..29f343f 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -240,14 +240,17 @@ static int shiftfs_setxattr(struct dentry *dentry, const char *name,
 			    const void *value, size_t size, int flags)
 {
 	struct dentry *real = dentry->d_fsdata;
-	const struct inode_operations *iop = real->d_inode->i_op;
+	struct inode *reali = real->d_inode;
+	const struct inode_operations *iop = reali->i_op;
 	int err = -EOPNOTSUPP;
 
 	if (iop->setxattr) {
 		const struct cred *oldcred, *newcred;
 
 		oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+		inode_lock(reali);
 		err = iop->setxattr(real, name, value, size, flags);
+		inode_unlock(reali);
 		shiftfs_old_creds(oldcred, &newcred);
 	}
 
@@ -287,12 +290,17 @@ static ssize_t shiftfs_listxattr(struct dentry *dentry, char *list,
 static int shiftfs_removexattr(struct dentry *dentry, const char *name)
 {
 	struct dentry *real = dentry->d_fsdata;
-	const struct inode_operations *iop = real->d_inode->i_op;
+	struct inode *reali = real->d_inode;
+	const struct inode_operations *iop = reali->i_op;
+	int err = -EINVAL;
 
-	if (iop->removexattr)
-		return iop->removexattr(real, name);
+	if (iop->removexattr) {
+		inode_lock(reali);
+		err = iop->removexattr(real, name);
+		inode_unlock(reali);
+	}
 
-	return -EINVAL;
+	return err;
 }
 
 static void shiftfs_fill_inode(struct inode *inode, struct dentry *dentry)
@@ -548,11 +556,13 @@ static int shiftfs_setattr(struct dentry *dentry, struct iattr *attr)
 	newattr.ia_uid = KUIDT_INIT(map_id_up(&ssi->uid_map, __kuid_val(attr->ia_uid)));
 	newattr.ia_gid = KGIDT_INIT(map_id_up(&ssi->gid_map, __kgid_val(attr->ia_gid)));
 
+	inode_lock(reali);
 	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
 	if (iop->setattr)
 		err = iop->setattr(real, &newattr);
 	else
 		err = simple_setattr(real, &newattr);
+	inode_unlock(reali);
 	shiftfs_old_creds(oldcred, &newcred);
 
 	return err;


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2016-05-17 20:59           ` James Bottomley
@ 2016-05-19  2:28             ` Serge E. Hallyn
  2016-05-19 10:53               ` James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: Serge E. Hallyn @ 2016-05-19  2:28 UTC (permalink / raw)
  To: James Bottomley
  Cc: Serge E. Hallyn, Serge Hallyn, Djalal Harouni, Chris Mason,
	tytso, Serge Hallyn, Josh Triplett, Eric W. Biederman,
	Andy Lutomirski, Seth Forshee, linux-fsdevel, linux-kernel,
	linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro

Hey James,

yeah that's a lot better.  I do still get some syslog messages,
but I was trivially able to bind a shiftfs into a container and
use it the way I'd want.

[  209.452274] ------------[ cut here ]------------
[  209.452296] WARNING: CPU: 0 PID: 3072 at fs/ext4/inode.c:3977 ext4_truncate+0x3f5/0x5b0
[  209.452299] Modules linked in: binfmt_misc veth ip6t_MASQUERADE nf_nat_masquerade_ipv6 ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables ppdev kvm_intel kvm irqbypass nls_utf8 isofs joydev input_leds serio_raw i2c_piix4 pvpanic parport_pc 8250_fintek mac_hid parport ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cirrus ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
[  209.452388]  psmouse drm pata_acpi floppy
[  209.452401] CPU: 0 PID: 3072 Comm: bash Not tainted 4.6.0-rc5+ #11
[  209.452404] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  209.452407]  0000000000000286 00000000ccc8425d ffff88007a1cfa98 ffffffff8145dae3
[  209.452412]  0000000000000000 0000000000000000 ffff88007a1cfad8 ffffffff8108c25b
[  209.452416]  00000f897a1cfaf8 ffff880052efe340 ffff88007a1cfbb8 ffff880052efe560
[  209.452421] Call Trace:
[  209.452431]  [<ffffffff8145dae3>] dump_stack+0x85/0xc2
[  209.452437]  [<ffffffff8108c25b>] __warn+0xcb/0xf0
[  209.452440]  [<ffffffff8108c38d>] warn_slowpath_null+0x1d/0x20
[  209.452444]  [<ffffffff81306d45>] ext4_truncate+0x3f5/0x5b0
[  209.452447]  [<ffffffff81309447>] ext4_setattr+0x627/0xa40
[  209.452457]  [<ffffffff813b6483>] ? security_prepare_creds+0x43/0x60
[  209.452468]  [<ffffffff810b63d2>] ? creds_are_invalid.part.1+0x12/0x40
[  209.452478]  [<ffffffff81396491>] shiftfs_setattr+0x181/0x202
[  209.452492]  [<ffffffff812831f5>] notify_change+0x235/0x360
[  209.452500]  [<ffffffff8125f057>] do_truncate+0x77/0xc0
[  209.452505]  [<ffffffff81271959>] path_openat+0x269/0x1350
[  209.452509]  [<ffffffff81273f01>] do_filp_open+0x91/0x100
[  209.452517]  [<ffffffff819036d7>] ? _raw_spin_unlock+0x27/0x40
[  209.452522]  [<ffffffff81284799>] ? __alloc_fd+0xf9/0x210
[  209.452526]  [<ffffffff81260654>] do_sys_open+0x124/0x210
[  209.452529]  [<ffffffff8126075e>] SyS_open+0x1e/0x20
[  209.452534]  [<ffffffff81003f89>] do_syscall_64+0x69/0x160
[  209.452537]  [<ffffffff81904103>] entry_SYSCALL64_slow_path+0x25/0x25
[  209.452541] ---[ end trace b995e24e590f8b85 ]---
[  209.452790] ------------[ cut here ]------------
[  209.452800] WARNING: CPU: 0 PID: 3072 at fs/ext4/namei.c:2778 ext4_orphan_add+0x11a/0x290
[  209.452803] Modules linked in: binfmt_misc veth ip6t_MASQUERADE nf_nat_masquerade_ipv6 ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables ppdev kvm_intel kvm irqbypass nls_utf8 isofs joydev input_leds serio_raw i2c_piix4 pvpanic parport_pc 8250_fintek mac_hid parport ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cirrus ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
[  209.452896]  psmouse drm pata_acpi floppy
[  209.452903] CPU: 0 PID: 3072 Comm: bash Tainted: G        W       4.6.0-rc5+ #11
[  209.452905] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  209.452907]  0000000000000286 00000000ccc8425d ffff88007a1cfa30 ffffffff8145dae3
[  209.452912]  0000000000000000 0000000000000000 ffff88007a1cfa70 ffffffff8108c25b
[  209.452917]  00000ada00000008 ffff880052efe340 ffff88007c3ba0c0 ffff880036806000
[  209.452921] Call Trace:
[  209.452925]  [<ffffffff8145dae3>] dump_stack+0x85/0xc2
[  209.452929]  [<ffffffff8108c25b>] __warn+0xcb/0xf0
[  209.452933]  [<ffffffff8108c38d>] warn_slowpath_null+0x1d/0x20
[  209.452936]  [<ffffffff813126ca>] ext4_orphan_add+0x11a/0x290
[  209.452940]  [<ffffffff81306a9e>] ? ext4_truncate+0x14e/0x5b0
[  209.452948]  [<ffffffff81338b98>] ? __ext4_journal_start_sb+0x88/0x1f0
[  209.452953]  [<ffffffff81306ad1>] ext4_truncate+0x181/0x5b0
[  209.452956]  [<ffffffff81309447>] ext4_setattr+0x627/0xa40
[  209.452960]  [<ffffffff813b6483>] ? security_prepare_creds+0x43/0x60
[  209.452964]  [<ffffffff810b63d2>] ? creds_are_invalid.part.1+0x12/0x40
[  209.452967]  [<ffffffff81396491>] shiftfs_setattr+0x181/0x202
[  209.452971]  [<ffffffff812831f5>] notify_change+0x235/0x360
[  209.452975]  [<ffffffff8125f057>] do_truncate+0x77/0xc0
[  209.452978]  [<ffffffff81271959>] path_openat+0x269/0x1350
[  209.452982]  [<ffffffff81273f01>] do_filp_open+0x91/0x100
[  209.452986]  [<ffffffff819036d7>] ? _raw_spin_unlock+0x27/0x40
[  209.452989]  [<ffffffff81284799>] ? __alloc_fd+0xf9/0x210
[  209.452993]  [<ffffffff81260654>] do_sys_open+0x124/0x210
[  209.452997]  [<ffffffff8126075e>] SyS_open+0x1e/0x20
[  209.453001]  [<ffffffff81003f89>] do_syscall_64+0x69/0x160
[  209.453004]  [<ffffffff81904103>] entry_SYSCALL64_slow_path+0x25/0x25
[  209.453007] ---[ end trace b995e24e590f8b86 ]---
[  209.453541] ------------[ cut here ]------------
[  209.453548] WARNING: CPU: 0 PID: 3072 at fs/ext4/namei.c:2860 ext4_orphan_del+0x18c/0x2a0
[  209.453550] Modules linked in: binfmt_misc veth ip6t_MASQUERADE nf_nat_masquerade_ipv6 ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables ppdev kvm_intel kvm irqbypass nls_utf8 isofs joydev input_leds serio_raw i2c_piix4 pvpanic parport_pc 8250_fintek mac_hid parport ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cirrus ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
[  209.453625]  psmouse drm pata_acpi floppy
[  209.453632] CPU: 0 PID: 3072 Comm: bash Tainted: G        W       4.6.0-rc5+ #11
[  209.453635] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  209.453637]  0000000000000286 00000000ccc8425d ffff88007a1cfa18 ffffffff8145dae3
[  209.453641]  0000000000000000 0000000000000000 ffff88007a1cfa58 ffffffff8108c25b
[  209.453646]  00000b2c8103fca9 ffff880052efe340 ffff88007c3ba0c0 ffff88007c3ba0c0
[  209.453650] Call Trace:
[  209.453655]  [<ffffffff8145dae3>] dump_stack+0x85/0xc2
[  209.453658]  [<ffffffff8108c25b>] __warn+0xcb/0xf0
[  209.453662]  [<ffffffff8108c38d>] warn_slowpath_null+0x1d/0x20
[  209.453665]  [<ffffffff81313d0c>] ext4_orphan_del+0x18c/0x2a0
[  209.453668]  [<ffffffff81903cf7>] ? _raw_write_unlock+0x27/0x40
[  209.453673]  [<ffffffff81306d72>] ext4_truncate+0x422/0x5b0
[  209.453692]  [<ffffffff81309447>] ext4_setattr+0x627/0xa40
[  209.453697]  [<ffffffff813b6483>] ? security_prepare_creds+0x43/0x60
[  209.453701]  [<ffffffff810b63d2>] ? creds_are_invalid.part.1+0x12/0x40
[  209.453705]  [<ffffffff81396491>] shiftfs_setattr+0x181/0x202
[  209.453709]  [<ffffffff812831f5>] notify_change+0x235/0x360
[  209.453712]  [<ffffffff8125f057>] do_truncate+0x77/0xc0
[  209.453716]  [<ffffffff81271959>] path_openat+0x269/0x1350
[  209.453720]  [<ffffffff81273f01>] do_filp_open+0x91/0x100
[  209.453724]  [<ffffffff819036d7>] ? _raw_spin_unlock+0x27/0x40
[  209.453727]  [<ffffffff81284799>] ? __alloc_fd+0xf9/0x210
[  209.453731]  [<ffffffff81260654>] do_sys_open+0x124/0x210
[  209.453734]  [<ffffffff8126075e>] SyS_open+0x1e/0x20
[  209.453738]  [<ffffffff81003f89>] do_syscall_64+0x69/0x160
[  209.453741]  [<ffffffff81904103>] entry_SYSCALL64_slow_path+0x25/0x25
[  209.453745] ---[ end trace b995e24e590f8b87 ]---


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2016-05-17 10:23         ` James Bottomley
@ 2016-05-17 20:59           ` James Bottomley
  2016-05-19  2:28             ` Serge E. Hallyn
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2016-05-17 20:59 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Serge Hallyn, Djalal Harouni, Chris Mason, tytso, Serge Hallyn,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro

On Tue, 2016-05-17 at 06:23 -0400, James Bottomley wrote:
> On Mon, 2016-05-16 at 22:47 -0500, Serge E. Hallyn wrote:
> > On Mon, May 16, 2016 at 10:28:32PM -0400, James Bottomley wrote:
> > > On Mon, 2016-05-16 at 19:41 +0000, Serge Hallyn wrote:
> > > > Hey James,
> > > > 
> > > > I probably did something wrong - but i applied your patch onto 4.6,
> > > > compiled in shiftfs, did
> > > > 
> > > > mount -t shiftfs -o uidmap=0:100000:65536,gidmap=0:100000:65536
> > > > /home/ubuntu /mnt
> > > > 
> > > > and ls segfaults and gives me kernel syslog msgs like:
> > > 
> > > Hm, it looks to be something IMA related, since the SUSE default is no
> > > IMA and this BUG in the filesystem is to do with the IMA version of
> > > i_readcount_dec.  I'll recompile my kernel to see if I can reproduce.
> > > Just in case, what's the underlying filesystem on /home/ubuntu?
> > 
> > It was ext4
> 
> Thanks.  I've got it to reproduce with CONFIG_IMA set ... just
> debugging now.

OK, I think this is the fix; can you apply it on top of what you have?
(It's two fixes: one for the RCU lookup and the other for the IMA
problem.)

This probably has to be fixed in the VFS, but at least it will prove
I've got the correct problem and diagnosis.

Thanks,

James

---

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index d352377..2699b95 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -525,6 +525,9 @@ static int shiftfs_permission(struct inode *inode, int mask)
 	int err;
 	const struct cred *oldcred, *newcred;
 
+	if (mask & MAY_NOT_BLOCK)
+		return -ECHILD;
+
 	oldcred = shiftfs_new_creds(&newcred, inode->i_sb);
 	if (iop->permission)
 		err = iop->permission(reali, mask);
@@ -598,6 +601,15 @@ static int shiftfs_release(struct inode *inode, struct file *file)
 	if (sfc->release)
 		err = sfc->release(inode, file);
 
+#ifdef CONFIG_IMA
+	/* FIXME: IMA calls aren't balanced across ->open ->release
+	 * they occur after ->open and after ->release, so manually
+	 * swizzle here */
+
+	if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
+		i_readcount_dec(sfc->inode);
+#endif
+
 	file->f_inode = sfc->inode;
 	file->f_op = sfc->inode->i_fop;
 	fops_put(inode->i_fop);
@@ -631,6 +643,16 @@ static int shiftfs_open(struct inode *inode, struct file *file)
 	file->f_op = &sfc->fop;
 	file->f_inode = reali;
 
+#ifdef CONFIG_IMA
+	/* FIXME: IMA calls always operate on a saved copy of the
+	 * inode so they increment the above and decrement the
+	 * underlying. fix that here */
+
+	if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
+		i_readcount_inc(reali);
+#endif
+
+
 	if (fop->open)
 		err = fop->open(reali, file);
 


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2016-05-17  3:47       ` Serge E. Hallyn
@ 2016-05-17 10:23         ` James Bottomley
  2016-05-17 20:59           ` James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2016-05-17 10:23 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Serge Hallyn, Djalal Harouni, Chris Mason, tytso, Serge Hallyn,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro

On Mon, 2016-05-16 at 22:47 -0500, Serge E. Hallyn wrote:
> On Mon, May 16, 2016 at 10:28:32PM -0400, James Bottomley wrote:
> > On Mon, 2016-05-16 at 19:41 +0000, Serge Hallyn wrote:
> > > Hey James,
> > > 
> > > I probably did something wrong - but i applied your patch onto 4.6,
> > > compiled in shiftfs, did
> > > 
> > > mount -t shiftfs -o uidmap=0:100000:65536,gidmap=0:100000:65536
> > > /home/ubuntu /mnt
> > > 
> > > and ls segfaults and gives me kernel syslog msgs like:
> > 
> > Hm, it looks to be something IMA related, since the SUSE default is no
> > IMA and this BUG in the filesystem is to do with the IMA version of
> > i_readcount_dec.  I'll recompile my kernel to see if I can reproduce.
> > Just in case, what's the underlying filesystem on /home/ubuntu?
> 
> It was ext4

Thanks.  I've got it to reproduce with CONFIG_IMA set ... just
debugging now.

James


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2016-05-17  2:28     ` James Bottomley
@ 2016-05-17  3:47       ` Serge E. Hallyn
  2016-05-17 10:23         ` James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: Serge E. Hallyn @ 2016-05-17  3:47 UTC (permalink / raw)
  To: James Bottomley
  Cc: Serge Hallyn, Djalal Harouni, Chris Mason, tytso, Serge Hallyn,
	Josh Triplett, Eric W. Biederman, Andy Lutomirski, Seth Forshee,
	linux-fsdevel, linux-kernel, linux-security-module, Dongsu Park,
	David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro

On Mon, May 16, 2016 at 10:28:32PM -0400, James Bottomley wrote:
> On Mon, 2016-05-16 at 19:41 +0000, Serge Hallyn wrote:
> > Hey James,
> > 
> > I probably did something wrong - but i applied your patch onto 4.6,
> > compiled in shiftfs, did
> > 
> > mount -t shiftfs -o uidmap=0:100000:65536,gidmap=0:100000:65536
> > /home/ubuntu /mnt
> > 
> > and ls segfaults and gives me kernel syslog msgs like:
> 
> Hm, it looks to be something IMA related, since the SUSE default is no
> IMA and this BUG in the filesystem is to do with the IMA version of
> i_readcount_dec.  I'll recompile my kernel to see if I can reproduce. 
>  Just in case, what's the underlying filesystem on /home/ubuntu?

It was ext4


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2016-05-16 19:41   ` Serge Hallyn
@ 2016-05-17  2:28     ` James Bottomley
  2016-05-17  3:47       ` Serge E. Hallyn
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2016-05-17  2:28 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Djalal Harouni, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro

On Mon, 2016-05-16 at 19:41 +0000, Serge Hallyn wrote:
> Hey James,
> 
> I probably did something wrong - but i applied your patch onto 4.6,
> compiled in shiftfs, did
> 
> mount -t shiftfs -o uidmap=0:100000:65536,gidmap=0:100000:65536
> /home/ubuntu /mnt
> 
> and ls segfaults and gives me kernel syslog msgs like:

Hm, it looks to be something IMA related, since the SUSE default is no
IMA and this BUG in the filesystem is to do with the IMA version of
i_readcount_dec.  I'll recompile my kernel to see if I can reproduce. 
 Just in case, what's the underlying filesystem on /home/ubuntu?

Thanks,

James


* Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2016-05-12 19:07 ` [RFC 1/1] shiftfs: uid/gid shifting bind mount James Bottomley
@ 2016-05-16 19:41   ` Serge Hallyn
  2016-05-17  2:28     ` James Bottomley
  0 siblings, 1 reply; 82+ messages in thread
From: Serge Hallyn @ 2016-05-16 19:41 UTC (permalink / raw)
  To: James Bottomley
  Cc: Djalal Harouni, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro

Hey James,

I probably did something wrong - but i applied your patch onto 4.6,
compiled in shiftfs, did

mount -t shiftfs -o uidmap=0:100000:65536,gidmap=0:100000:65536 /home/ubuntu /mnt

and ls segfaults and gives me kernel syslog msgs like:


[ 1089.744726] ===============================
[ 1089.748851] [ INFO: suspicious RCU usage. ]
[ 1089.752901] 4.6.0-rc5+ #10 Not tainted
[ 1089.756315] -------------------------------
[ 1089.760021] include/linux/rcupdate.h:569 Illegal context switch in RCU read-side critical section!
[ 1089.767348]
               other info that might help us debug this:

[ 1089.773401]
               rcu_scheduler_active = 1, debug_locks = 0
[ 1089.778417] 1 lock held by ls/3053:
[ 1089.781112]  #0:  (rcu_read_lock){......}, at: [<ffffffff81270907>] path_init+0x667/0x770
[ 1089.787492]
               stack backtrace:
[ 1089.790827] CPU: 0 PID: 3053 Comm: ls Not tainted 4.6.0-rc5+ #10
[ 1089.795304] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 1089.801376]  0000000000000286 000000005ed87b3e ffff88007a70bb10 ffffffff8145daa3
[ 1089.807098]  ffff88007a688000 0000000000000001 ffff88007a70bb40 ffffffff810e7587
[ 1089.812793]  0000000000000000 ffffffff81ca8baf 0000000000000184 ffff88007d08f640
[ 1089.818320] Call Trace:
[ 1089.820205]  [<ffffffff8145daa3>] dump_stack+0x85/0xc2
[ 1089.824046]  [<ffffffff810e7587>] lockdep_rcu_suspicious+0xd7/0x110
[ 1089.828871]  [<ffffffff810baf97>] ___might_sleep+0xa7/0x230
[ 1089.833024]  [<ffffffff810bb169>] __might_sleep+0x49/0x80
[ 1089.837118]  [<ffffffff81238109>] kmem_cache_alloc+0x1d9/0x2d0
[ 1089.841725]  [<ffffffff810b667a>] prepare_creds+0x3a/0x130
[ 1089.845827]  [<ffffffff813954a7>] shiftfs_new_creds+0x17/0x120
[ 1089.850170]  [<ffffffff81395cb2>] shiftfs_permission+0x42/0xd0
[ 1089.854507]  [<ffffffff8126d58b>] __inode_permission+0x6b/0xb0
[ 1089.858925]  [<ffffffff8126d5e4>] inode_permission+0x14/0x50
[ 1089.863190]  [<ffffffff812710cd>] link_path_walk+0x7d/0x510
[ 1089.867454]  [<ffffffff812707cb>] ? path_init+0x52b/0x770
[ 1089.871570]  [<ffffffff81270907>] ? path_init+0x667/0x770
[ 1089.875577]  [<ffffffff8127165c>] path_lookupat+0x7c/0x110
[ 1089.879830]  [<ffffffff812732c1>] filename_lookup+0xb1/0x180
[ 1089.883937]  [<ffffffff81272ec6>] ? getname_flags+0x56/0x1f0
[ 1089.888042]  [<ffffffff8110a25d>] ? rcu_read_lock_sched_held+0x6d/0x80
[ 1089.892841]  [<ffffffff81238193>] ? kmem_cache_alloc+0x263/0x2d0
[ 1089.897282]  [<ffffffff81272ee2>] ? getname_flags+0x72/0x1f0
[ 1089.901483]  [<ffffffff81273466>] user_path_at_empty+0x36/0x40
[ 1089.905768]  [<ffffffff81267166>] vfs_fstatat+0x66/0xc0
[ 1089.909596]  [<ffffffff81267761>] SYSC_newlstat+0x31/0x60
[ 1089.913616]  [<ffffffff81202d16>] ? __might_fault+0x96/0xa0
[ 1089.917684]  [<ffffffff81202ccd>] ? __might_fault+0x4d/0xa0
[ 1089.922750]  [<ffffffff810e9879>] ? trace_hardirqs_on_caller+0x129/0x1b0
[ 1089.928605]  [<ffffffff8100301b>] ? trace_hardirqs_on_thunk+0x1b/0x1d
[ 1089.934347]  [<ffffffff8126789e>] SyS_newlstat+0xe/0x10
[ 1089.939193]  [<ffffffff81904000>] entry_SYSCALL_64_fastpath+0x23/0xc1
[ 1089.945045] BUG: sleeping function called from invalid context at mm/slab.h:388
[ 1089.951474] in_atomic(): 1, irqs_disabled(): 0, pid: 3053, name: ls
[ 1089.957214] INFO: lockdep is turned off.
[ 1089.961166] CPU: 0 PID: 3053 Comm: ls Not tainted 4.6.0-rc5+ #10
[ 1089.966739] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 1089.973975]  0000000000000286 000000005ed87b3e ffff88007a70bb40 ffffffff8145daa3
[ 1089.980644]  ffff88007a688000 ffffffff81ca8baf ffff88007a70bb68 ffffffff810bb069
[ 1089.987297]  ffffffff81ca8baf 0000000000000184 0000000000000000 ffff88007a70bb90
[ 1089.994180] Call Trace:
[ 1089.997097]  [<ffffffff8145daa3>] dump_stack+0x85/0xc2
[ 1090.002051]  [<ffffffff810bb069>] ___might_sleep+0x179/0x230
[ 1090.007255]  [<ffffffff810bb169>] __might_sleep+0x49/0x80
[ 1090.012290]  [<ffffffff81238109>] kmem_cache_alloc+0x1d9/0x2d0
[ 1090.017679]  [<ffffffff810b667a>] prepare_creds+0x3a/0x130
[ 1090.022736]  [<ffffffff813954a7>] shiftfs_new_creds+0x17/0x120
[ 1090.028090]  [<ffffffff81395cb2>] shiftfs_permission+0x42/0xd0
[ 1090.033454]  [<ffffffff8126d58b>] __inode_permission+0x6b/0xb0
[ 1090.039006]  [<ffffffff8126d5e4>] inode_permission+0x14/0x50
[ 1090.044304]  [<ffffffff812710cd>] link_path_walk+0x7d/0x510
[ 1090.049593]  [<ffffffff812707cb>] ? path_init+0x52b/0x770
[ 1090.054795]  [<ffffffff81270907>] ? path_init+0x667/0x770
[ 1090.059950]  [<ffffffff8127165c>] path_lookupat+0x7c/0x110
[ 1090.065218]  [<ffffffff812732c1>] filename_lookup+0xb1/0x180
[ 1090.070629]  [<ffffffff81272ec6>] ? getname_flags+0x56/0x1f0
[ 1090.076265]  [<ffffffff8110a25d>] ? rcu_read_lock_sched_held+0x6d/0x80
[ 1090.082559]  [<ffffffff81238193>] ? kmem_cache_alloc+0x263/0x2d0
[ 1090.088153]  [<ffffffff81272ee2>] ? getname_flags+0x72/0x1f0
[ 1090.093478]  [<ffffffff81273466>] user_path_at_empty+0x36/0x40
[ 1090.099164]  [<ffffffff81267166>] vfs_fstatat+0x66/0xc0
[ 1090.104236]  [<ffffffff81267761>] SYSC_newlstat+0x31/0x60
[ 1090.109449]  [<ffffffff81202d16>] ? __might_fault+0x96/0xa0
[ 1090.115506]  [<ffffffff81202ccd>] ? __might_fault+0x4d/0xa0
[ 1090.120418]  [<ffffffff810e9879>] ? trace_hardirqs_on_caller+0x129/0x1b0
[ 1090.126325]  [<ffffffff8100301b>] ? trace_hardirqs_on_thunk+0x1b/0x1d
[ 1090.133230]  [<ffffffff8126789e>] SyS_newlstat+0xe/0x10
[ 1090.138320]  [<ffffffff81904000>] entry_SYSCALL_64_fastpath+0x23/0xc1
[ 1090.146513] ------------[ cut here ]------------
[ 1090.151061] kernel BUG at include/linux/fs.h:2574!
[ 1090.155883] invalid opcode: 0000 [#1] SMP
[ 1090.160131] Modules linked in: binfmt_misc veth ip6t_MASQUERADE nf_nat_masquerade_ipv6 ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables ppdev kvm_intel kvm irqbypass joydev input_leds serio_raw nls_utf8 isofs i2c_piix4 mac_hid parport_pc parport 8250_fintek pvpanic ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cirrus ttm drm_kms_helper syscopyarea sysfillrect sysimgblt psmouse
[ 1090.223228]  fb_sys_fops drm pata_acpi floppy
[ 1090.226948] CPU: 0 PID: 3053 Comm: ls Not tainted 4.6.0-rc5+ #10
[ 1090.232806] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 1090.240377] task: ffff88007a688000 ti: ffff88007a708000 task.ti: ffff88007a708000
[ 1090.247359] RIP: 0010:[<ffffffff81263ef5>]  [<ffffffff81263ef5>] __fput+0x235/0x240
[ 1090.254759] RSP: 0018:ffff88007a70be70  EFLAGS: 00010246
[ 1090.260430] RAX: 0000000000000000 RBX: ffff880035739a00 RCX: 000000000007937c
[ 1090.267476] RDX: 0000000000000001 RSI: ffff88007fddada0 RDI: 0000000000000000
[ 1090.274538] RBP: ffff88007a70bea8 R08: 0000000000000000 R09: ffff8800367ff270
[ 1090.281637] R10: ffff880079d66c10 R11: ffff880035739a10 R12: 0000000040000010
[ 1090.288731] R13: ffff880079d66c10 R14: ffff88007a1b63a0 R15: ffff880050e6b000
[ 1090.295648] FS:  00007fec3f20c800(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 1090.303194] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1090.308945] CR2: 00007f7fe394c000 CR3: 000000007a72e000 CR4: 00000000000006f0
[ 1090.315954] Stack:
[ 1090.318947]  ffff880079d66c10 ffff880035739a10 ffffffff822ebab0 ffff88007a688710
[ 1090.326268]  ffff88007a688000 0000000000000000 ffff88007a688000 ffff88007a70beb8
[ 1090.333392]  ffffffff81263f3e ffff88007a70bee8 ffffffff810b2153 0000000000000002
[ 1090.340618] Call Trace:
[ 1090.343863]  [<ffffffff81263f3e>] ____fput+0xe/0x10
[ 1090.349178]  [<ffffffff810b2153>] task_work_run+0x73/0xa0
[ 1090.354941]  [<ffffffff810032bc>] exit_to_usermode_loop+0xcc/0xd0
[ 1090.361297]  [<ffffffff81003f0c>] syscall_return_slowpath+0xcc/0xe0
[ 1090.367735]  [<ffffffff8190409c>] entry_SYSCALL_64_fastpath+0xbf/0xc1
[ 1090.374412] Code: 00 e9 be fe ff ff 48 8b 43 28 48 8b 80 80 00 00 00 48 85 c0 0f 84 bf fe ff ff 31 d2 48 89 de bf ff ff ff ff ff d0 e9 ae fe ff ff <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 31 ff 48 87 3d
[ 1090.394163] RIP  [<ffffffff81263ef5>] __fput+0x235/0x240
[ 1090.399624]  RSP <ffff88007a70be70>
[ 1090.406515] ---[ end trace 909301922855c45e ]---
[ 1121.390946] audit: type=1400 audit(1463427449.647:19): apparmor="STATUS" operation="profile_load" name="lxd-x1_</var/lib/lxd>" pid=3076 comm="apparmor_parser"
[ 1121.427553] lxdbr0: port 1(vethBUS8OC) entered blocking state
[ 1121.432842] lxdbr0: port 1(vethBUS8OC) entered disabled state
[ 1121.439138] device vethBUS8OC entered promiscuous mode
[ 1121.449963] IPv6: ADDRCONF(NETDEV_UP): vethBUS8OC: link is not ready
[ 1121.494963] eth0: renamed from vethVNDWLE
[ 1121.502817] IPv6: ADDRCONF(NETDEV_CHANGE): vethBUS8OC: link becomes ready
[ 1121.512573] lxdbr0: port 1(vethBUS8OC) entered blocking state
[ 1121.518224] lxdbr0: port 1(vethBUS8OC) entered forwarding state
[ 1125.274210] BUG: sleeping function called from invalid context at mm/slab.h:388
[ 1125.280904] in_atomic(): 1, irqs_disabled(): 0, pid: 3760, name: ls
[ 1125.286508] INFO: lockdep is turned off.
[ 1125.290856] CPU: 0 PID: 3760 Comm: ls Tainted: G      D         4.6.0-rc5+ #10
[ 1125.298026] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 1125.305921]  0000000000000286 00000000323611df ffff88003099bb20 ffffffff8145daa3
[ 1125.313356]  ffff88002f1fe500 ffffffff81ca8baf ffff88003099bb48 ffffffff810bb069
[ 1125.320806]  ffffffff81ca8baf 0000000000000184 0000000000000000 ffff88003099bb70
[ 1125.328228] Call Trace:
[ 1125.331545]  [<ffffffff8145daa3>] dump_stack+0x85/0xc2
[ 1125.336984]  [<ffffffff810bb069>] ___might_sleep+0x179/0x230
[ 1125.342816]  [<ffffffff810bb169>] __might_sleep+0x49/0x80
[ 1125.348595]  [<ffffffff81238109>] kmem_cache_alloc+0x1d9/0x2d0
[ 1125.354678]  [<ffffffff810b667a>] prepare_creds+0x3a/0x130
[ 1125.360259]  [<ffffffff813954a7>] shiftfs_new_creds+0x17/0x120
[ 1125.366258]  [<ffffffff81395cb2>] shiftfs_permission+0x42/0xd0
[ 1125.372281]  [<ffffffff8126d58b>] __inode_permission+0x6b/0xb0
[ 1125.378283]  [<ffffffff8126d5e4>] inode_permission+0x14/0x50
[ 1125.384105]  [<ffffffff812710cd>] link_path_walk+0x7d/0x510
[ 1125.389733]  [<ffffffff812707cb>] ? path_init+0x52b/0x770
[ 1125.395147]  [<ffffffff81270907>] ? path_init+0x667/0x770
[ 1125.400481]  [<ffffffff8127165c>] path_lookupat+0x7c/0x110
[ 1125.405974]  [<ffffffff812732c1>] filename_lookup+0xb1/0x180
[ 1125.411831]  [<ffffffff81238126>] ? kmem_cache_alloc+0x1f6/0x2d0
[ 1125.417833]  [<ffffffff81273466>] user_path_at_empty+0x36/0x40
[ 1125.423601]  [<ffffffff81267166>] vfs_fstatat+0x66/0xc0
[ 1125.428933]  [<ffffffff81267761>] SYSC_newlstat+0x31/0x60
[ 1125.434390]  [<ffffffff81003a68>] ? syscall_trace_enter_phase1+0xc8/0x140
[ 1125.441067]  [<ffffffff8126789e>] SyS_newlstat+0xe/0x10
[ 1125.446541]  [<ffffffff81003f89>] do_syscall_64+0x69/0x160
[ 1125.452315]  [<ffffffff819040c3>] entry_SYSCALL64_slow_path+0x25/0x25
[ 1125.791437] ------------[ cut here ]------------
[ 1125.795754] kernel BUG at include/linux/fs.h:2574!
[ 1125.800529] invalid opcode: 0000 [#2] SMP
[ 1125.804923] Modules linked in: binfmt_misc veth ip6t_MASQUERADE nf_nat_masquerade_ipv6 ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables ppdev kvm_intel kvm irqbypass joydev input_leds serio_raw nls_utf8 isofs i2c_piix4 mac_hid parport_pc parport 8250_fintek pvpanic ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cirrus ttm drm_kms_helper syscopyarea sysfillrect sysimgblt psmouse
[ 1125.871862]  fb_sys_fops drm pata_acpi floppy
[ 1125.875745] CPU: 0 PID: 3760 Comm: ls Tainted: G      D         4.6.0-rc5+ #10
[ 1125.882927] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 1125.890945] task: ffff88002f1fe500 ti: ffff880030998000 task.ti: ffff880030998000
[ 1125.898617] RIP: 0010:[<ffffffff81263ef5>]  [<ffffffff81263ef5>] __fput+0x235/0x240
[ 1125.906342] RSP: 0018:ffff88003099be70  EFLAGS: 00010246
[ 1125.912078] RAX: 0000000000000000 RBX: ffff880030846600 RCX: 0000000000085f05
[ 1125.919331] RDX: 0000000000000001 RSI: ffff88007fddada0 RDI: 0000000000000000
[ 1125.926545] RBP: ffff88003099bea8 R08: 0000000000000000 R09: ffff8800770bc2a8
[ 1125.933706] R10: 000000000010000f R11: ffff880030846601 R12: 0000000040000010
[ 1125.940782] R13: ffff880079d66c10 R14: ffff88007990cc60 R15: ffff880050e6b000
[ 1125.947844] FS:  00007f8297abc800(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 1125.955772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1125.961908] CR2: 000055918a8d9018 CR3: 00000000309a4000 CR4: 00000000000006f0
[ 1125.969232] Stack:
[ 1125.972341]  ffff880079d66c10 ffff880030846610 ffffffff822ebab0 ffff88002f1fec10
[ 1125.979890]  ffff88002f1fe500 0000000000000000 ffff88002f1fe500 ffff88003099beb8
[ 1125.987279]  ffffffff81263f3e ffff88003099bee8 ffffffff810b2153 0000000000000102
[ 1125.994850] Call Trace:
[ 1125.998345]  [<ffffffff81263f3e>] ____fput+0xe/0x10
[ 1126.003695]  [<ffffffff810b2153>] task_work_run+0x73/0xa0
[ 1126.009377]  [<ffffffff810032bc>] exit_to_usermode_loop+0xcc/0xd0
[ 1126.015880]  [<ffffffff81004000>] do_syscall_64+0xe0/0x160
[ 1126.021848]  [<ffffffff819040c3>] entry_SYSCALL64_slow_path+0x25/0x25
[ 1126.028612] Code: 00 e9 be fe ff ff 48 8b 43 28 48 8b 80 80 00 00 00 48 85 c0 0f 84 bf fe ff ff 31 d2 48 89 de bf ff ff ff ff ff d0 e9 ae fe ff ff <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 31 ff 48 87 3d
[ 1126.049139] RIP  [<ffffffff81263ef5>] __fput+0x235/0x240
[ 1126.055150]  RSP <ffff88003099be70>
[ 1126.059746] ---[ end trace 909301922855c45f ]---
root@shiftfs:~#


* [RFC 1/1] shiftfs: uid/gid shifting bind mount
  2016-05-12 19:06 [RFC 0/1] shiftfs: uid/gid shifting filesystem James Bottomley
@ 2016-05-12 19:07 ` James Bottomley
  2016-05-16 19:41   ` Serge Hallyn
  0 siblings, 1 reply; 82+ messages in thread
From: James Bottomley @ 2016-05-12 19:07 UTC (permalink / raw)
  To: Djalal Harouni, Chris Mason, tytso, Serge Hallyn, Josh Triplett,
	Eric W. Biederman, Andy Lutomirski, Seth Forshee, linux-fsdevel,
	linux-kernel, linux-security-module, Dongsu Park, David Herrmann,
	Miklos Szeredi, Alban Crequy, Al Viro

This allows any subtree to be uid/gid shifted and bound elsewhere.  It
does this by operating similarly to overlayfs, except that since
there's only a single underlying layer, all dentry lookups go through
it.  Its primary use is for shifting the underlying uids of
filesystems used to support unprivileged (uid shifted) containers.
The usual use case here is that the container is operating with a uid
shifted unprivileged root but sometimes needs to make use of or work
with a filesystem image that has root at real uid 0.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

---

Changes so far: fixed up locking and addressed viro's comments

use negative dentries on the underlying cached in d_fsdata to remove
the extra lookup_one_len() calls

Add show_options/statfs callbacks

Add proper Kconfig plumbing

diff --git a/fs/Kconfig b/fs/Kconfig
index 6725f59..a9b0834 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -94,6 +94,14 @@ source "fs/autofs4/Kconfig"
 source "fs/fuse/Kconfig"
 source "fs/overlayfs/Kconfig"
 
+config SHIFT_FS
+	tristate "UID/GID shifting overlay filesystem for containers"
+	help
+	  This filesystem can overlay any mounted filesystem and shift
+	  the uid/gid at which the files appear.  The idea is that
+	  unprivileged containers can use this technique to mount
+	  their root volumes.
+
 menu "Caches"
 
 source "fs/fscache/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 85b6e13..ff9890e 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -128,3 +128,4 @@ obj-y				+= exofs/ # Multiple modules
 obj-$(CONFIG_CEPH_FS)		+= ceph/
 obj-$(CONFIG_PSTORE)		+= pstore/
 obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
+obj-$(CONFIG_SHIFT_FS)		+= shiftfs.o
diff --git a/fs/shiftfs.c b/fs/shiftfs.c
new file mode 100644
index 0000000..d352377
--- /dev/null
+++ b/fs/shiftfs.c
@@ -0,0 +1,833 @@
+#include <linux/cred.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/magic.h>
+#include <linux/parser.h>
+#include <linux/seq_file.h>
+#include <linux/statfs.h>
+#include <linux/slab.h>
+#include <linux/user_namespace.h>
+#include <linux/uidgid.h>
+
+struct shiftfs_super_info {
+	struct vfsmount *mnt;
+	struct uid_gid_map uid_map, gid_map;
+};
+
+static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
+				       struct dentry *dentry);
+
+enum {
+	OPT_UIDMAP,
+	OPT_GIDMAP,
+	OPT_LAST,
+};
+
+/* global filesystem options */
+static const match_table_t tokens = {
+	{ OPT_UIDMAP, "uidmap=%u:%u:%u" },
+	{ OPT_GIDMAP, "gidmap=%u:%u:%u" },
+	{ OPT_LAST, NULL }
+};
+
+/*
+ * code stolen from user_namespace.c ... except that these functions
+ * return the same id back if unmapped ... should probably have a
+ * library?
+ */
+static u32 map_id_down(struct uid_gid_map *map, u32 id)
+{
+	unsigned idx, extents;
+	u32 first, last;
+
+	/* Find the matching extent */
+	extents = map->nr_extents;
+	smp_rmb();
+	for (idx = 0; idx < extents; idx++) {
+		first = map->extent[idx].first;
+		last = first + map->extent[idx].count - 1;
+		if (id >= first && id <= last)
+			break;
+	}
+	/* Map the id or note failure */
+	if (idx < extents)
+		id = (id - first) + map->extent[idx].lower_first;
+
+	return id;
+}
+
+static u32 map_id_up(struct uid_gid_map *map, u32 id)
+{
+	unsigned idx, extents;
+	u32 first, last;
+
+	/* Find the matching extent */
+	extents = map->nr_extents;
+	smp_rmb();
+	for (idx = 0; idx < extents; idx++) {
+		first = map->extent[idx].lower_first;
+		last = first + map->extent[idx].count - 1;
+		if (id >= first && id <= last)
+			break;
+	}
+	/* Map the id or note failure */
+	if (idx < extents)
+		id = (id - first) + map->extent[idx].first;
+
+	return id;
+}
+
+static bool mappings_overlap(struct uid_gid_map *new_map,
+			     struct uid_gid_extent *extent)
+{
+	u32 upper_first, lower_first, upper_last, lower_last;
+	unsigned idx;
+
+	upper_first = extent->first;
+	lower_first = extent->lower_first;
+	upper_last = upper_first + extent->count - 1;
+	lower_last = lower_first + extent->count - 1;
+
+	for (idx = 0; idx < new_map->nr_extents; idx++) {
+		u32 prev_upper_first, prev_lower_first;
+		u32 prev_upper_last, prev_lower_last;
+		struct uid_gid_extent *prev;
+
+		prev = &new_map->extent[idx];
+
+		prev_upper_first = prev->first;
+		prev_lower_first = prev->lower_first;
+		prev_upper_last = prev_upper_first + prev->count - 1;
+		prev_lower_last = prev_lower_first + prev->count - 1;
+
+		/* Does the upper range intersect a previous extent? */
+		if ((prev_upper_first <= upper_last) &&
+		    (prev_upper_last >= upper_first))
+			return true;
+
+		/* Does the lower range intersect a previous extent? */
+		if ((prev_lower_first <= lower_last) &&
+		    (prev_lower_last >= lower_first))
+			return true;
+	}
+	return false;
+}
+/* end code stolen from user_namespace.c */
+
+static const struct cred *shiftfs_get_up_creds(struct super_block *sb)
+{
+	struct cred *cred = prepare_creds();
+	struct shiftfs_super_info *ssi = sb->s_fs_info;
+
+	if (!cred)
+		return NULL;
+
+	cred->fsuid = KUIDT_INIT(map_id_up(&ssi->uid_map, __kuid_val(cred->fsuid)));
+	cred->fsgid = KGIDT_INIT(map_id_up(&ssi->gid_map, __kgid_val(cred->fsgid)));
+
+	return cred;
+}
+
+static const struct cred *shiftfs_new_creds(const struct cred **newcred,
+					    struct super_block *sb)
+{
+	const struct cred *cred = shiftfs_get_up_creds(sb);
+
+	*newcred = cred;
+
+	if (cred)
+		cred = override_creds(cred);
+	else
+		printk(KERN_ERR "Credential override failed: no memory\n");
+
+	return cred;
+}
+
+static void shiftfs_old_creds(const struct cred *oldcred,
+			      const struct cred **newcred)
+{
+	if (!*newcred)
+		return;
+
+	revert_creds(oldcred);
+	put_cred(*newcred);
+}
+
+static int shiftfs_parse_options(struct shiftfs_super_info *ssi, char *options)
+{
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int from, to, count;
+	struct uid_gid_map *map, *maps[2] = {
+		[OPT_UIDMAP] = &ssi->uid_map,
+		[OPT_GIDMAP] = &ssi->gid_map,
+	};
+
+	while ((p = strsep(&options, ",")) != NULL) {
+		int token;
+		struct uid_gid_extent ext;
+
+		if (!*p)
+			continue;
+
+		token = match_token(p, tokens, args);
+		if (token != OPT_UIDMAP && token != OPT_GIDMAP)
+			return -EINVAL;
+		if (match_int(&args[0], &from) ||
+		    match_int(&args[1], &to) ||
+		    match_int(&args[2], &count))
+			return -EINVAL;
+		map = maps[token];
+		if (map->nr_extents >= UID_GID_MAP_MAX_EXTENTS)
+			return -EINVAL;
+		ext.first = from;
+		ext.lower_first = to;
+		ext.count = count;
+		if (mappings_overlap(map, &ext))
+			return -EINVAL;
+		map->extent[map->nr_extents++] = ext;
+	}
+	return 0;
+}
+
+static void shiftfs_d_release(struct dentry *dentry)
+{
+	struct dentry *real = dentry->d_fsdata;
+
+	dput(real);
+}
+
+static const struct dentry_operations shiftfs_dentry_ops = {
+	.d_release	= shiftfs_d_release,
+};
+
+static int shiftfs_readlink(struct dentry *dentry, char __user *data,
+			    int flags)
+{
+	struct dentry *real = dentry->d_fsdata;
+	const struct inode_operations *iop = real->d_inode->i_op;
+
+	if (iop->readlink)
+		return iop->readlink(real, data, flags);
+
+	return -EINVAL;
+}
+
+static const char *shiftfs_get_link(struct dentry *dentry, struct inode *inode,
+				    struct delayed_call *done)
+{
+	if (dentry) {
+		struct dentry *real = dentry->d_fsdata;
+		struct inode *reali = real->d_inode;
+		const struct inode_operations *iop = reali->i_op;
+		const char *res = ERR_PTR(-EPERM);
+
+		if (iop->get_link)
+			res = iop->get_link(real, reali, done);
+
+		return res;
+	} else {
+		/* RCU lookup not supported */
+		return ERR_PTR(-ECHILD);
+	}
+}
+
+static int shiftfs_setxattr(struct dentry *dentry, const char *name,
+			    const void *value, size_t size, int flags)
+{
+	struct dentry *real = dentry->d_fsdata;
+	const struct inode_operations *iop = real->d_inode->i_op;
+	int err = -EOPNOTSUPP;
+
+	if (iop->setxattr) {
+		const struct cred *oldcred, *newcred;
+
+		oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+		err = iop->setxattr(real, name, value, size, flags);
+		shiftfs_old_creds(oldcred, &newcred);
+	}
+
+	return err;
+}
+
+static ssize_t shiftfs_getxattr(struct dentry *dentry, const char *name,
+				void *value, size_t size)
+{
+	struct dentry *real = dentry->d_fsdata;
+	const struct inode_operations *iop = real->d_inode->i_op;
+	int err = -EOPNOTSUPP;
+
+	if (iop->getxattr) {
+		const struct cred *oldcred, *newcred;
+
+		oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+		err = iop->getxattr(real, name, value, size);
+		shiftfs_old_creds(oldcred, &newcred);
+	}
+
+	return err;
+}
+
+static ssize_t shiftfs_listxattr(struct dentry *dentry, char *list,
+				 size_t size)
+{
+	struct dentry *real = dentry->d_fsdata;
+	const struct inode_operations *iop = real->d_inode->i_op;
+
+	if (iop->listxattr)
+		return iop->listxattr(real, list, size);
+
+	return -EINVAL;
+}
+
+static int shiftfs_removexattr(struct dentry *dentry, const char *name)
+{
+	struct dentry *real = dentry->d_fsdata;
+	const struct inode_operations *iop = real->d_inode->i_op;
+
+	if (iop->removexattr)
+		return iop->removexattr(real, name);
+
+	return -EINVAL;
+}
+
+static void shiftfs_fill_inode(struct inode *inode, struct dentry *dentry)
+{
+	struct inode *reali;
+	struct shiftfs_super_info *ssi = inode->i_sb->s_fs_info;
+
+	if (!dentry)
+		return;
+
+	reali = dentry->d_inode;
+
+	if (!reali->i_op->get_link)
+		inode->i_opflags |= IOP_NOFOLLOW;
+
+	inode->i_mapping = reali->i_mapping;
+	inode->i_private = dentry;
+
+	inode->i_uid = KUIDT_INIT(map_id_down(&ssi->uid_map, __kuid_val(reali->i_uid)));
+	inode->i_gid = KGIDT_INIT(map_id_down(&ssi->gid_map, __kgid_val(reali->i_gid)));
+}
+
+static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
+			       umode_t mode, const char *symlink,
+			       struct dentry *hardlink, bool excl)
+{
+	struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
+	struct inode *reali = real->d_inode, *newi;
+	const struct inode_operations *iop = reali->i_op;
+	int err;
+	const struct cred *oldcred, *newcred;
+	bool op_ok = false;
+
+	if (hardlink) {
+		op_ok = iop->link;
+	} else {
+		switch (mode & S_IFMT) {
+		case S_IFDIR:
+			op_ok = iop->mkdir;
+			break;
+		case S_IFREG:
+			op_ok = iop->create;
+			break;
+		case S_IFLNK:
+			op_ok = iop->symlink;
+		}
+	}
+	if (!op_ok)
+		return -EINVAL;
+
+
+	newi = shiftfs_new_inode(dentry->d_sb, mode, NULL);
+	if (!newi)
+		return -ENOMEM;
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+
+	inode_lock_nested(reali, I_MUTEX_PARENT);
+
+	err = -EINVAL;		/* shut gcc up about uninit var */
+	if (hardlink) {
+		struct dentry *realhardlink = hardlink->d_fsdata;
+
+		err = vfs_link(realhardlink, reali, new, NULL);
+	} else {
+		switch (mode & S_IFMT) {
+		case S_IFDIR:
+			err = vfs_mkdir(reali, new, mode);
+			break;
+		case S_IFREG:
+			err = vfs_create(reali, new, mode, excl);
+			break;
+		case S_IFLNK:
+			err = vfs_symlink(reali, new, symlink);
+		}
+	}
+
+	shiftfs_old_creds(oldcred, &newcred);
+
+	if (err)
+		goto out_dput;
+
+	shiftfs_fill_inode(newi, new);
+
+	d_instantiate(dentry, newi);
+
+	new = NULL;
+	newi = NULL;
+
+ out_dput:
+	dput(new);
+	iput(newi);
+	inode_unlock(reali);
+
+	return err;
+}
+
+static int shiftfs_create(struct inode *dir, struct dentry *dentry,
+			  umode_t mode, bool excl)
+{
+	mode |= S_IFREG;
+
+	return shiftfs_make_object(dir, dentry, mode, NULL, NULL, excl);
+}
+
+static int shiftfs_mkdir(struct inode *dir, struct dentry *dentry,
+			 umode_t mode)
+{
+	mode |= S_IFDIR;
+
+	return shiftfs_make_object(dir, dentry, mode, NULL, NULL, false);
+}
+
+static int shiftfs_link(struct dentry *hardlink, struct inode *dir,
+			struct dentry *dentry)
+{
+	return shiftfs_make_object(dir, dentry, 0, NULL, hardlink, false);
+}
+
+static int shiftfs_symlink(struct inode *dir, struct dentry *dentry,
+			   const char *symlink)
+{
+	return shiftfs_make_object(dir, dentry, S_IFLNK, symlink, NULL, false);
+}
+
+static int shiftfs_rm(struct inode *dir, struct dentry *dentry, bool rmdir)
+{
+	struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
+	struct inode *reali = real->d_inode;
+	int err;
+	const struct cred *oldcred, *newcred;
+
+	inode_lock_nested(reali, I_MUTEX_PARENT);
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+
+	if (rmdir)
+		err = vfs_rmdir(reali, new);
+	else
+		err = vfs_unlink(reali, new, NULL);
+
+	shiftfs_old_creds(oldcred, &newcred);
+	inode_unlock(reali);
+
+	return err;
+}
+
+static int shiftfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	return shiftfs_rm(dir, dentry, false);
+}
+
+static int shiftfs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	return shiftfs_rm(dir, dentry, true);
+}
+
+static int shiftfs_rename2(struct inode *olddir, struct dentry *old,
+			   struct inode *newdir, struct dentry *new,
+			   unsigned int flags)
+{
+	struct dentry *rodd = olddir->i_private, *rndd = newdir->i_private,
+		*realold = old->d_fsdata,
+		*realnew = new->d_fsdata, *trap;
+	struct inode *realolddir = rodd->d_inode, *realnewdir = rndd->d_inode;
+	int err = -EINVAL;
+	const struct cred *oldcred, *newcred;
+
+	trap = lock_rename(rndd, rodd);
+
+	if (trap == realold || trap == realnew)
+		goto out_unlock;
+
+	oldcred = shiftfs_new_creds(&newcred, old->d_sb);
+
+	err = vfs_rename(realolddir, realold, realnewdir,
+			 realnew, NULL, flags);
+
+	shiftfs_old_creds(oldcred, &newcred);
+
+ out_unlock:
+	unlock_rename(rndd, rodd);
+
+	return err;
+}
+
+static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry *dentry,
+				     unsigned int flags)
+{
+	struct dentry *real = dir->i_private, *new;
+	struct inode *reali = real->d_inode, *newi;
+	const struct cred *oldcred, *newcred;
+
+	/* Note: this violates the usual fs rules: dentries are never
+	 * added with d_add, because we want no dentry cache for
+	 * shiftfs.  All lookups proceed through the dentry cache of
+	 * the underlying filesystem, so we always see any changes
+	 * made to the underlying filesystem. */
+
+	inode_lock(reali);
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+	new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
+	shiftfs_old_creds(oldcred, &newcred);
+	inode_unlock(reali);
+
+	if (IS_ERR(new))
+		return new;
+
+	dentry->d_fsdata = new;
+
+	if (!new->d_inode)
+		return NULL;
+
+	newi = shiftfs_new_inode(dentry->d_sb, new->d_inode->i_mode, new);
+	if (!newi) {
+		dput(new);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	d_instantiate(dentry, newi);
+
+	return NULL;
+}
+
+static int shiftfs_permission(struct inode *inode, int mask)
+{
+	struct dentry *real = inode->i_private;
+	struct inode *reali = real->d_inode;
+	const struct inode_operations *iop = reali->i_op;
+	int err;
+	const struct cred *oldcred, *newcred;
+
+	oldcred = shiftfs_new_creds(&newcred, inode->i_sb);
+	if (iop->permission)
+		err = iop->permission(reali, mask);
+	else
+		err = generic_permission(reali, mask);
+	shiftfs_old_creds(oldcred, &newcred);
+
+	return err;
+}
+
+static int shiftfs_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	struct dentry *real = dentry->d_fsdata;
+	struct inode *reali = real->d_inode;
+	const struct inode_operations *iop = reali->i_op;
+	struct iattr newattr = *attr;
+	const struct cred *oldcred, *newcred;
+	struct shiftfs_super_info *ssi = dentry->d_sb->s_fs_info;
+	int err;
+
+	newattr.ia_uid = KUIDT_INIT(map_id_up(&ssi->uid_map, __kuid_val(attr->ia_uid)));
+	newattr.ia_gid = KGIDT_INIT(map_id_up(&ssi->gid_map, __kgid_val(attr->ia_gid)));
+
+	oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
+	if (iop->setattr)
+		err = iop->setattr(real, &newattr);
+	else
+		err = simple_setattr(real, &newattr);
+	shiftfs_old_creds(oldcred, &newcred);
+
+	return err;
+}
+
+static int shiftfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
+			   struct kstat *stat)
+{
+	struct inode *inode = dentry->d_inode;
+	struct dentry *real = inode->i_private;
+	struct inode *reali = real->d_inode;
+	const struct inode_operations *iop = reali->i_op;
+	int err = 0;
+
+	mnt = ((struct shiftfs_super_info *)dentry->d_sb->s_fs_info)->mnt;
+
+	if (iop->getattr)
+		err = iop->getattr(mnt, real, stat);
+	else
+		generic_fillattr(reali, stat);
+
+	if (err)
+		return err;
+
+	stat->uid = inode->i_uid;
+	stat->gid = inode->i_gid;
+	return 0;
+}
+
+struct shiftfs_fop_carrier {
+	struct inode *inode;
+	int (*release)(struct inode *, struct file *);
+	struct file_operations fop;
+};
+
+static int shiftfs_release(struct inode *inode, struct file *file)
+{
+	struct shiftfs_fop_carrier *sfc;
+	int err = 0;
+
+	sfc = container_of(file->f_op, struct shiftfs_fop_carrier, fop);
+
+	if (sfc->release)
+		err = sfc->release(inode, file);
+
+	file->f_inode = sfc->inode;
+	file->f_op = sfc->inode->i_fop;
+	fops_put(inode->i_fop);
+
+	kfree(sfc);
+
+	return err;
+}
+
+static int shiftfs_open(struct inode *inode, struct file *file)
+{
+	struct dentry *real = inode->i_private;
+	struct inode *reali = real->d_inode;
+	const struct file_operations *fop;
+	struct shiftfs_fop_carrier *sfc;
+	int err = 0;
+
+	sfc = kmalloc(sizeof(*sfc), GFP_KERNEL);
+	if (!sfc)
+		return -ENOMEM;
+
+	if (real->d_flags & DCACHE_OP_SELECT_INODE)
+		reali = real->d_op->d_select_inode(real, file->f_flags);
+
+	fop = fops_get(reali->i_fop);
+	sfc->inode = inode;
+	memcpy(&sfc->fop, fop, sizeof(*fop));
+	sfc->release = sfc->fop.release;
+	sfc->fop.release = shiftfs_release;
+
+	file->f_op = &sfc->fop;
+	file->f_inode = reali;
+
+	if (fop->open)
+		err = fop->open(reali, file);
+
+	return err;
+}
+
+static const struct inode_operations shiftfs_inode_ops = {
+	/* intercepted */
+	.lookup		= shiftfs_lookup,
+	.getattr	= shiftfs_getattr,
+	.setattr	= shiftfs_setattr,
+	.permission	= shiftfs_permission,
+
+	/* pass through */
+	.mkdir		= shiftfs_mkdir,
+	.symlink	= shiftfs_symlink,
+	.get_link	= shiftfs_get_link,
+	.readlink	= shiftfs_readlink,
+	.unlink		= shiftfs_unlink,
+	.rmdir		= shiftfs_rmdir,
+	.rename2	= shiftfs_rename2,
+	.link		= shiftfs_link,
+	.create		= shiftfs_create,
+	.mknod		= NULL,	/* no special files currently */
+	.setxattr	= shiftfs_setxattr,
+	.getxattr	= shiftfs_getxattr,
+	.listxattr	= shiftfs_listxattr,
+	.removexattr	= shiftfs_removexattr,
+};
+
+static const struct file_operations shiftfs_file_ops = {
+	.open		= shiftfs_open,
+};
+
+static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
+				       struct dentry *dentry)
+{
+	struct inode *inode;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return NULL;
+
+	mode &= S_IFMT;
+
+	inode->i_ino = get_next_ino();
+	inode->i_mode = mode;
+	inode->i_flags |= S_NOATIME | S_NOCMTIME;
+
+	inode->i_op = &shiftfs_inode_ops;
+	inode->i_fop = &shiftfs_file_ops;
+
+	shiftfs_fill_inode(inode, dentry);
+
+	return inode;
+}
+
+static int shiftfs_show_options(struct seq_file *m, struct dentry *dentry)
+{
+	struct super_block *sb = dentry->d_sb;
+	struct shiftfs_super_info *ssi = sb->s_fs_info;
+
+	static const char *options[] = { "uidmap", "gidmap" };
+	const struct uid_gid_map *map[ARRAY_SIZE(options)] =
+		{ &ssi->uid_map, &ssi->gid_map };
+	int i, j;
+
+	for (i = 0; i < ARRAY_SIZE(options); i++) {
+		for (j = 0; j < map[i]->nr_extents; j++) {
+			const struct uid_gid_extent *ext = &map[i]->extent[j];
+
+			seq_show_option(m, options[i], NULL);
+			seq_printf(m, "=%u:%u:%u", ext->first,
+				   ext->lower_first, ext->count);
+		}
+	}
+
+	return 0;
+}
+
+static int shiftfs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+	struct super_block *sb = dentry->d_sb;
+	struct shiftfs_super_info *ssi = sb->s_fs_info;
+	struct dentry *root = sb->s_root;
+	struct dentry *realroot = root->d_fsdata;
+	struct path realpath = { .mnt = ssi->mnt, .dentry = realroot };
+	int err;
+
+	err = vfs_statfs(&realpath, buf);
+	if (err)
+		return err;
+
+	buf->f_type = sb->s_magic;
+
+	return 0;
+}
+
+static void shiftfs_put_super(struct super_block *sb)
+{
+	struct shiftfs_super_info *ssi = sb->s_fs_info;
+
+	mntput(ssi->mnt);
+	kfree(ssi);
+}
+
+static const struct super_operations shiftfs_super_ops = {
+	.put_super	= shiftfs_put_super,
+	.show_options	= shiftfs_show_options,
+	.statfs		= shiftfs_statfs,
+};
+
+struct shiftfs_data {
+	void *data;
+	const char *path;
+};
+
+static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
+			      int silent)
+{
+	struct shiftfs_data *data = raw_data;
+	char *name = kstrdup(data->path, GFP_KERNEL);
+	int err = -ENOMEM;
+	struct shiftfs_super_info *ssi = NULL;
+	struct path path;
+
+	if (!name)
+		goto out;
+
+	ssi = kzalloc(sizeof(*ssi), GFP_KERNEL);
+	if (!ssi)
+		goto out;
+
+	err = -EPERM;
+	if (!capable(CAP_SYS_ADMIN))
+		goto out;
+
+	err = shiftfs_parse_options(ssi, data->data);
+	if (err)
+		goto out;
+
+	err = kern_path(name, LOOKUP_FOLLOW, &path);
+	if (err)
+		goto out;
+
+	if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
+		err = -ENOTDIR;
+		goto out_put;
+	}
+	ssi->mnt = path.mnt;
+
+	sb->s_fs_info = ssi;
+	sb->s_magic = SHIFTFS_MAGIC;
+	sb->s_op = &shiftfs_super_ops;
+	sb->s_d_op = &shiftfs_dentry_ops;
+	sb->s_root = d_make_root(shiftfs_new_inode(sb, S_IFDIR, path.dentry));
+	sb->s_root->d_fsdata = path.dentry;
+	/* success: err is 0 here, so out: only frees the duplicated name */
+	goto out;
+
+ out_put:
+	path_put(&path);
+ out:
+	kfree(name);
+	if (err)
+		kfree(ssi);
+	return err;
+}
+
+static struct dentry *shiftfs_mount(struct file_system_type *fs_type,
+				    int flags, const char *dev_name, void *data)
+{
+	struct shiftfs_data d = { data, dev_name };
+
+	return mount_nodev(fs_type, flags, &d, shiftfs_fill_super);
+}
+
+static struct file_system_type shiftfs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "shiftfs",
+	.mount		= shiftfs_mount,
+	.kill_sb	= kill_anon_super,
+};
+
+static int __init shiftfs_init(void)
+{
+	return register_filesystem(&shiftfs_type);
+}
+
+static void __exit shiftfs_exit(void)
+{
+	unregister_filesystem(&shiftfs_type);
+}
+
+MODULE_ALIAS_FS("shiftfs");
+MODULE_AUTHOR("James Bottomley");
+MODULE_DESCRIPTION("uid/gid shifting bind filesystem");
+MODULE_LICENSE("GPL v2");
+module_init(shiftfs_init)
+module_exit(shiftfs_exit)
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 0de181a..d7992f5 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -79,4 +79,6 @@
 #define NSFS_MAGIC		0x6e736673
 #define BPF_FS_MAGIC		0xcafe4a11
 
+#define SHIFTFS_MAGIC		0x6a656a62
+
 #endif /* __LINUX_MAGIC_H__ */
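
The id shifting that shiftfs_fill_inode() and shiftfs_setattr() perform through map_id_down() and map_id_up() can be illustrated with a minimal userspace sketch. This is not the kernel implementation: the struct and function names below are hypothetical stand-ins, and the extent layout (first, lower_first, count) is assumed to match what shiftfs_show_options() prints for the uidmap/gidmap options.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for one uidmap/gidmap extent (illustrative names). */
struct shift_extent {
	uint32_t first;		/* first id on the underlying filesystem */
	uint32_t lower_first;	/* first id as presented by the shifted mount */
	uint32_t count;		/* number of consecutive ids in the range */
};

#define SHIFT_NO_ID ((uint32_t)-1)	/* returned when no extent matches */

/* Underlying id -> shifted id: the direction used when filling an inode. */
static uint32_t shift_down(const struct shift_extent *map, size_t n,
			   uint32_t id)
{
	for (size_t i = 0; i < n; i++) {
		/* id >= first is tested first, so id - first cannot wrap */
		if (id >= map[i].first && id - map[i].first < map[i].count)
			return map[i].lower_first + (id - map[i].first);
	}
	return SHIFT_NO_ID;
}

/* Shifted id -> underlying id: the direction used for setattr. */
static uint32_t shift_up(const struct shift_extent *map, size_t n,
			 uint32_t id)
{
	for (size_t i = 0; i < n; i++) {
		if (id >= map[i].lower_first &&
		    id - map[i].lower_first < map[i].count)
			return map[i].first + (id - map[i].lower_first);
	}
	return SHIFT_NO_ID;
}
```

With a single illustrative extent { 0, 100000, 1000 }, shift_down() turns underlying id 0 into 100000, and shift_up() turns 100042 back into 42; ids outside the extent yield SHIFT_NO_ID in both directions.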

end of thread, other threads:[~2017-02-22 12:01 UTC | newest]

Thread overview: 82+ messages
2017-02-04 19:18 [RFC 0/1] shiftfs: uid/gid shifting filesystem (s_user_ns version) James Bottomley
2017-02-04 19:19 ` [RFC 1/1] shiftfs: uid/gid shifting bind mount James Bottomley
2017-02-05  7:51   ` Amir Goldstein
2017-02-06  1:18     ` James Bottomley
2017-02-06  6:59       ` Amir Goldstein
2017-02-06 14:41         ` James Bottomley
2017-02-14 23:03       ` Vivek Goyal
2017-02-14 23:45         ` James Bottomley
2017-02-15 14:17           ` Vivek Goyal
2017-02-16 15:51             ` James Bottomley
2017-02-16 16:42               ` Vivek Goyal
2017-02-16 16:58                 ` James Bottomley
2017-02-17  1:57                   ` Eric W. Biederman
2017-02-17  8:39                     ` Djalal Harouni
2017-02-17 17:19                     ` James Bottomley
2017-02-20  4:24                       ` Eric W. Biederman
2017-02-22 12:01                         ` James Bottomley
2017-02-06  3:25   ` J. R. Okajima
2017-02-06  6:38     ` Amir Goldstein
2017-02-06 16:29       ` James Bottomley
2017-02-06  6:46     ` James Bottomley
2017-02-06 14:50       ` Theodore Ts'o
2017-02-06 15:18         ` James Bottomley
2017-02-06 15:38           ` lkml
2017-02-06 17:32             ` James Bottomley
2017-02-06 21:52           ` J. Bruce Fields
2017-02-07  0:10             ` James Bottomley
2017-02-07  1:35               ` J. Bruce Fields
2017-02-07 19:01                 ` James Bottomley
2017-02-07 19:47                   ` Christoph Hellwig
2017-02-06 16:24       ` J. R. Okajima
2017-02-21  0:48         ` James Bottomley
2017-02-21  2:57           ` J. R. Okajima
2017-02-21  4:07             ` James Bottomley
2017-02-21  4:34               ` J. R. Okajima
2017-02-07  9:19   ` Christoph Hellwig
2017-02-07  9:39     ` Djalal Harouni
2017-02-07  9:53       ` Christoph Hellwig
2017-02-07 16:37     ` James Bottomley
2017-02-07 17:59       ` Amir Goldstein
2017-02-07 18:10         ` Christoph Hellwig
2017-02-07 19:02           ` James Bottomley
2017-02-07 19:49             ` Christoph Hellwig
2017-02-07 20:05               ` James Bottomley
2017-02-07 21:01                 ` Amir Goldstein
2017-02-07 22:25                   ` Christoph Hellwig
2017-02-07 23:42                     ` James Bottomley
2017-02-08  6:44                       ` Amir Goldstein
2017-02-08 11:45                         ` Konstantin Khlebnikov
2017-02-08 14:57                         ` James Bottomley
2017-02-08 15:15                         ` James Bottomley
2017-02-08  1:54               ` Josh Triplett
2017-02-08 15:22                 ` James Bottomley
2017-02-09 10:36                   ` Josh Triplett
2017-02-09 15:34                     ` James Bottomley
2017-02-13 10:15                       ` Eric W. Biederman
2017-02-15  9:33                         ` Djalal Harouni
2017-02-15  9:37                           ` Eric W. Biederman
2017-02-15 10:04                             ` Djalal Harouni
2017-02-07 18:20         ` James Bottomley
2017-02-07 19:48           ` Djalal Harouni
2017-02-15 20:34   ` Vivek Goyal
2017-02-16 15:56     ` James Bottomley
2017-02-17  2:55       ` Al Viro
2017-02-17 17:34         ` James Bottomley
2017-02-17 20:35           ` Vivek Goyal
2017-02-19  3:24             ` James Bottomley
2017-02-20 19:26               ` Vivek Goyal
2017-02-21  0:38                 ` James Bottomley
2017-02-17  2:29   ` Al Viro
2017-02-17 17:24     ` James Bottomley
2017-02-17 17:51       ` Al Viro
2017-02-17 20:27         ` Vivek Goyal
2017-02-17 20:50         ` James Bottomley
  -- strict thread matches above, loose matches on Subject: below --
2016-05-12 19:06 [RFC 0/1] shiftfs: uid/gid shifting filesystem James Bottomley
2016-05-12 19:07 ` [RFC 1/1] shiftfs: uid/gid shifting bind mount James Bottomley
2016-05-16 19:41   ` Serge Hallyn
2016-05-17  2:28     ` James Bottomley
2016-05-17  3:47       ` Serge E. Hallyn
2016-05-17 10:23         ` James Bottomley
2016-05-17 20:59           ` James Bottomley
2016-05-19  2:28             ` Serge E. Hallyn
2016-05-19 10:53               ` James Bottomley
