linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 00/10] mount ownership and unprivileged mount syscall (v7)
@ 2008-01-16 12:31 Miklos Szeredi
  2008-01-16 12:31 ` [patch 01/10] unprivileged mounts: add user mounts to the kernel Miklos Szeredi
                   ` (9 more replies)
  0 siblings, 10 replies; 28+ messages in thread
From: Miklos Szeredi @ 2008-01-16 12:31 UTC (permalink / raw)
  To: akpm, hch, serue, viro, kzak
  Cc: linux-fsdevel, linux-kernel, containers, util-linux-ng

Thanks to everyone for the comments on the previous submission.

Christoph, could you please look through the patches if they are
acceptable from the VFS point of view?

Thanks,
Miklos

v6 -> v7:

 - add '/proc/sys/fs/types/<type>/usermount_safe' tunable (new patch)
 - do not make FUSE safe by default, describe possible problems
   associated with unprivileged FUSE mounts in patch header
 - return EMFILE instead of EPERM, if maximum user mount count is exceeded
 - rename option 'nomnt' -> 'nosubmnt'
 - clean up error propagation in dup_mnt_ns
 - update util-linux-ng patch

v5 -> v6:

 - update to latest -mm
 - preliminary util-linux-ng support (will post right after this series)

v4 -> v5:

 - fold back Andrew's changes
 - fold back my update patch:
    o use fsuid instead of ruid
    o allow forced unpriv. unmounts for "safe" filesystems
    o allow mounting over special files, but not over symlinks
    o set nosuid and nodev based on lack of specific capability
 - patch header updates
 - new patch: on propagation inherit owner from parent
 - new patch: add "no submounts" mount flag

v3 -> v4:

 - simplify interface as much as possible, now only a single option
   ("user=UID") is used to control everything
 - no longer allow/deny mounting based on file/directory permissions,
   that approach does not always make sense

v1 -> v3:

 - add mount flags to set/clear mnt_flags individually
 - add "usermnt" mount flag.  If it is set, then allow unprivileged
   submounts under this mount
 - make max number of user mounts default to 1024, since now the
   usermnt flag will prevent user mounts by default

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 01/10] unprivileged mounts: add user mounts to the kernel
  2008-01-16 12:31 [patch 00/10] mount ownership and unprivileged mount syscall (v7) Miklos Szeredi
@ 2008-01-16 12:31 ` Miklos Szeredi
  2008-01-16 12:31 ` [patch 02/10] unprivileged mounts: allow unprivileged umount Miklos Szeredi
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2008-01-16 12:31 UTC (permalink / raw)
  To: akpm, hch, serue, viro, kzak
  Cc: linux-fsdevel, linux-kernel, containers, util-linux-ng

[-- Attachment #1: unprivileged-mounts-add-user-mounts-to-the-kernel.patch --]
[-- Type: text/plain, Size: 8602 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

This patchset adds support for keeping mount ownership information in the
kernel, and allow unprivileged mount(2) and umount(2) in certain cases.

The mount owner has the following privileges:

  - unmount the owned mount
  - create a submount under the owned mount

The sysadmin can set the owner explicitly on mount and remount.  When an
unprivileged user creates a mount, then the owner is automatically set to the
user.

The following use cases are envisioned:

1) Private namespace, with selected mounts owned by user.  E.g.
   /home/$USER is a good candidate for allowing unpriv mounts and unmounts
   within.

2) Private namespace, with all mounts owned by user and having the "nosuid"
   flag.  User can mount and umount anywhere within the namespace, but suid
   programs will not work.

3) Global namespace, with a designated directory, which is a mount owned by
   the user.  E.g.  /mnt/users/$USER is set up so that it is bind mounted onto
   itself, and set to be owned by $USER.  The user can add/remove mounts only
   under this directory.

The following extra security measures are taken for unprivileged mounts:

 - usermounts are limited by a sysctl tunable
 - force "nosuid,nodev" mount options on the created mount

For testing unprivileged mounts (and for other purposes) simple
mount/umount utilities are available from:

  http://www.kernel.org/pub/linux/kernel/people/mszeredi/mmount/

After this series I'll be posting a preliminary patch for util-linux-ng,
to add the same functionality to mount(8) and umount(8).

This patch:

A new mount flag, MS_SETUSER is used to make a mount owned by a user.  If this
flag is specified, then the owner will be set to the current fsuid and the
mount will be marked with the MNT_USER flag.  On remount don't preserve
previous owner, and treat MS_SETUSER as for a new mount.  The MS_SETUSER flag
is ignored on mount move.

The MNT_USER flag is not copied on any kind of mount cloning: namespace
creation, binding or propagation.  For bind mounts the cloned mount(s) are set
to MNT_USER depending on the MS_SETUSER mount flag.  In all the other cases
MNT_USER is always cleared.

For MNT_USER mounts a "user=UID" option is added to /proc/PID/mounts.  This is
compatible with how mount ownership is stored in /etc/mtab.

The rationale for using MS_SETUSER and MNT_USER, to distinguish "user"
mounts from "non-user" or "legacy" mounts are follows:

  a) Mount(2) and umount(2) on legacy mounts always need CAP_SYS_ADMIN
     capability.  As opposed to user mounts, which will only require,
     that the mount owner matches the current fsuid.  So a process
     with fsuid=0 should not be able to mount/umount legacy mounts
     without the CAP_SYS_ADMIN capability.

  b) Legacy userspace programs may set fsuid to nonzero before calling
     mount(2).  In such an unlikely case, this patchset would cause
     an unintended side effect of making the mount owned by the fsuid.

  c) For legacy mounts, no "user=UID" option should be shown in
     /proc/mounts for backwards compatibility.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---

Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2008-01-16 13:24:53.000000000 +0100
+++ linux/fs/namespace.c	2008-01-16 13:25:05.000000000 +0100
@@ -477,6 +477,13 @@ static struct vfsmount *skip_mnt_tree(st
 	return p;
 }
 
+static void set_mnt_user(struct vfsmount *mnt)
+{
+	WARN_ON(mnt->mnt_flags & MNT_USER);
+	mnt->mnt_uid = current->fsuid;
+	mnt->mnt_flags |= MNT_USER;
+}
+
 static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
 					int flag)
 {
@@ -491,6 +498,11 @@ static struct vfsmount *clone_mnt(struct
 		mnt->mnt_mountpoint = mnt->mnt_root;
 		mnt->mnt_parent = mnt;
 
+		/* don't copy the MNT_USER flag */
+		mnt->mnt_flags &= ~MNT_USER;
+		if (flag & CL_SETUSER)
+			set_mnt_user(mnt);
+
 		if (flag & CL_SLAVE) {
 			list_add(&mnt->mnt_slave, &old->mnt_slave_list);
 			mnt->mnt_master = old;
@@ -644,6 +656,8 @@ static int show_vfsmnt(struct seq_file *
 		if (mnt->mnt_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
 	}
+	if (mnt->mnt_flags & MNT_USER)
+		seq_printf(m, ",user=%i", mnt->mnt_uid);
 	if (mnt->mnt_sb->s_op->show_options)
 		err = mnt->mnt_sb->s_op->show_options(m, mnt);
 	seq_puts(m, " 0 0\n");
@@ -1181,8 +1195,9 @@ static int do_change_type(struct nameida
 /*
  * do loopback mount.
  */
-static int do_loopback(struct nameidata *nd, char *old_name, int recurse)
+static int do_loopback(struct nameidata *nd, char *old_name, int flags)
 {
+	int clone_fl;
 	struct nameidata old_nd;
 	struct vfsmount *mnt = NULL;
 	int err = mount_is_safe(nd);
@@ -1202,11 +1217,12 @@ static int do_loopback(struct nameidata 
 	if (!check_mnt(nd->path.mnt) || !check_mnt(old_nd.path.mnt))
 		goto out;
 
+	clone_fl = (flags & MS_SETUSER) ? CL_SETUSER : 0;
 	err = -ENOMEM;
-	if (recurse)
-		mnt = copy_tree(old_nd.path.mnt, old_nd.path.dentry, 0);
+	if (flags & MS_REC)
+		mnt = copy_tree(old_nd.path.mnt, old_nd.path.dentry, clone_fl);
 	else
-		mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, 0);
+		mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, clone_fl);
 
 	if (!mnt)
 		goto out;
@@ -1268,8 +1284,11 @@ static int do_remount(struct nameidata *
 		err = change_mount_flags(nd->path.mnt, flags);
 	else
 		err = do_remount_sb(sb, flags, data, 0);
-	if (!err)
+	if (!err) {
 		nd->path.mnt->mnt_flags = mnt_flags;
+		if (flags & MS_SETUSER)
+			set_mnt_user(nd->path.mnt);
+	}
 	up_write(&sb->s_umount);
 	if (!err)
 		security_sb_post_remount(nd->path.mnt, flags, data);
@@ -1378,10 +1397,13 @@ static int do_new_mount(struct nameidata
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	mnt = do_kern_mount(type, flags, name, data);
+	mnt = do_kern_mount(type, flags & ~MS_SETUSER, name, data);
 	if (IS_ERR(mnt))
 		return PTR_ERR(mnt);
 
+	if (flags & MS_SETUSER)
+		set_mnt_user(mnt);
+
 	return do_add_mount(mnt, nd, mnt_flags, NULL);
 }
 
@@ -1413,7 +1435,8 @@ int do_add_mount(struct vfsmount *newmnt
 	if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
 		goto unlock;
 
-	newmnt->mnt_flags = mnt_flags;
+	/* MNT_USER was set earlier */
+	newmnt->mnt_flags |= mnt_flags;
 	if ((err = graft_tree(newmnt, nd)))
 		goto unlock;
 
@@ -1735,7 +1758,7 @@ long do_mount(char *dev_name, char *dir_
 		retval = do_remount(&nd, flags & ~MS_REMOUNT, mnt_flags,
 				    data_page);
 	else if (flags & MS_BIND)
-		retval = do_loopback(&nd, dev_name, flags & MS_REC);
+		retval = do_loopback(&nd, dev_name, flags);
 	else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
 		retval = do_change_type(&nd, flags);
 	else if (flags & MS_MOVE)
Index: linux/fs/pnode.h
===================================================================
--- linux.orig/fs/pnode.h	2008-01-16 13:24:53.000000000 +0100
+++ linux/fs/pnode.h	2008-01-16 13:25:05.000000000 +0100
@@ -23,6 +23,7 @@
 #define CL_MAKE_SHARED 		0x08
 #define CL_PROPAGATION 		0x10
 #define CL_PRIVATE 		0x20
+#define CL_SETUSER		0x40
 
 static inline void set_mnt_shared(struct vfsmount *mnt)
 {
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h	2008-01-16 13:24:53.000000000 +0100
+++ linux/include/linux/fs.h	2008-01-16 13:25:05.000000000 +0100
@@ -125,6 +125,7 @@ extern int dir_notify_enable;
 #define MS_RELATIME	(1<<21)	/* Update atime relative to mtime/ctime. */
 #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
+#define MS_SETUSER	(1<<24) /* set mnt_uid to current user */
 #define MS_ACTIVE	(1<<30)
 #define MS_NOUSER	(1<<31)
 
Index: linux/include/linux/mount.h
===================================================================
--- linux.orig/include/linux/mount.h	2008-01-16 13:24:53.000000000 +0100
+++ linux/include/linux/mount.h	2008-01-16 13:25:05.000000000 +0100
@@ -33,6 +33,7 @@ struct mnt_namespace;
 
 #define MNT_SHRINKABLE	0x100
 #define MNT_IMBALANCED_WRITE_COUNT	0x200 /* just for debugging */
+#define MNT_USER	0x400
 
 #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
 #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
@@ -69,6 +70,8 @@ struct vfsmount {
 	 * are held, and all mnt_writer[]s on this mount have 0 as their ->count
 	 */
 	atomic_t __mnt_writers;
+
+	uid_t mnt_uid;			/* owner of the mount */
 };
 
 static inline struct vfsmount *mntget(struct vfsmount *mnt)

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 02/10] unprivileged mounts: allow unprivileged umount
  2008-01-16 12:31 [patch 00/10] mount ownership and unprivileged mount syscall (v7) Miklos Szeredi
  2008-01-16 12:31 ` [patch 01/10] unprivileged mounts: add user mounts to the kernel Miklos Szeredi
@ 2008-01-16 12:31 ` Miklos Szeredi
  2008-01-16 12:31 ` [patch 03/10] unprivileged mounts: propagate error values from clone_mnt Miklos Szeredi
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2008-01-16 12:31 UTC (permalink / raw)
  To: akpm, hch, serue, viro, kzak
  Cc: linux-fsdevel, linux-kernel, containers, util-linux-ng

[-- Attachment #1: unprivileged-mounts-allow-unprivileged-umount.patch --]
[-- Type: text/plain, Size: 1519 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

The owner doesn't need sysadmin capabilities to call umount().

Similar behavior as umount(8) on mounts having "user=UID" option in /etc/mtab.
The difference is that umount also checks /etc/fstab, presumably to exclude
another mount on the same mountpoint.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---

Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2008-01-16 13:25:05.000000000 +0100
+++ linux/fs/namespace.c	2008-01-16 13:25:06.000000000 +0100
@@ -894,6 +894,27 @@ static int do_umount(struct vfsmount *mn
 	return retval;
 }
 
+static bool is_mount_owner(struct vfsmount *mnt, uid_t uid)
+{
+	return (mnt->mnt_flags & MNT_USER) && mnt->mnt_uid == uid;
+}
+
+/*
+ * umount is permitted for
+ *  - sysadmin
+ *  - mount owner, if not forced umount
+ */
+static bool permit_umount(struct vfsmount *mnt, int flags)
+{
+	if (capable(CAP_SYS_ADMIN))
+		return true;
+
+	if (flags & MNT_FORCE)
+		return false;
+
+	return is_mount_owner(mnt, current->fsuid);
+}
+
 /*
  * Now umount can handle mount points as well as block devices.
  * This is important for filesystems which use unnamed block devices.
@@ -917,7 +938,7 @@ asmlinkage long sys_umount(char __user *
 		goto dput_and_out;
 
 	retval = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
+	if (!permit_umount(nd.path.mnt, flags))
 		goto dput_and_out;
 
 	retval = do_umount(nd.path.mnt, flags);

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 03/10] unprivileged mounts: propagate error values from clone_mnt
  2008-01-16 12:31 [patch 00/10] mount ownership and unprivileged mount syscall (v7) Miklos Szeredi
  2008-01-16 12:31 ` [patch 01/10] unprivileged mounts: add user mounts to the kernel Miklos Szeredi
  2008-01-16 12:31 ` [patch 02/10] unprivileged mounts: allow unprivileged umount Miklos Szeredi
@ 2008-01-16 12:31 ` Miklos Szeredi
  2008-01-16 12:31 ` [patch 04/10] unprivileged mounts: account user mounts Miklos Szeredi
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2008-01-16 12:31 UTC (permalink / raw)
  To: akpm, hch, serue, viro, kzak
  Cc: linux-fsdevel, linux-kernel, containers, util-linux-ng

[-- Attachment #1: unprivileged-mounts-propagate-error-values-from-clone_mnt.patch --]
[-- Type: text/plain, Size: 5394 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Allow clone_mnt() to return errors other than ENOMEM.  This will be used for
returning a different error value when the number of user mounts goes over the
limit.

Fix copy_tree() to return EPERM for unbindable mounts.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---

Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2008-01-16 13:25:06.000000000 +0100
+++ linux/fs/namespace.c	2008-01-16 13:25:07.000000000 +0100
@@ -490,41 +490,42 @@ static struct vfsmount *clone_mnt(struct
 	struct super_block *sb = old->mnt_sb;
 	struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);
 
-	if (mnt) {
-		mnt->mnt_flags = old->mnt_flags;
-		atomic_inc(&sb->s_active);
-		mnt->mnt_sb = sb;
-		mnt->mnt_root = dget(root);
-		mnt->mnt_mountpoint = mnt->mnt_root;
-		mnt->mnt_parent = mnt;
-
-		/* don't copy the MNT_USER flag */
-		mnt->mnt_flags &= ~MNT_USER;
-		if (flag & CL_SETUSER)
-			set_mnt_user(mnt);
-
-		if (flag & CL_SLAVE) {
-			list_add(&mnt->mnt_slave, &old->mnt_slave_list);
-			mnt->mnt_master = old;
-			CLEAR_MNT_SHARED(mnt);
-		} else if (!(flag & CL_PRIVATE)) {
-			if ((flag & CL_PROPAGATION) || IS_MNT_SHARED(old))
-				list_add(&mnt->mnt_share, &old->mnt_share);
-			if (IS_MNT_SLAVE(old))
-				list_add(&mnt->mnt_slave, &old->mnt_slave);
-			mnt->mnt_master = old->mnt_master;
-		}
-		if (flag & CL_MAKE_SHARED)
-			set_mnt_shared(mnt);
+	if (!mnt)
+		return ERR_PTR(-ENOMEM);
 
-		/* stick the duplicate mount on the same expiry list
-		 * as the original if that was on one */
-		if (flag & CL_EXPIRE) {
-			spin_lock(&vfsmount_lock);
-			if (!list_empty(&old->mnt_expire))
-				list_add(&mnt->mnt_expire, &old->mnt_expire);
-			spin_unlock(&vfsmount_lock);
-		}
+	mnt->mnt_flags = old->mnt_flags;
+	atomic_inc(&sb->s_active);
+	mnt->mnt_sb = sb;
+	mnt->mnt_root = dget(root);
+	mnt->mnt_mountpoint = mnt->mnt_root;
+	mnt->mnt_parent = mnt;
+
+	/* don't copy the MNT_USER flag */
+	mnt->mnt_flags &= ~MNT_USER;
+	if (flag & CL_SETUSER)
+		set_mnt_user(mnt);
+
+	if (flag & CL_SLAVE) {
+		list_add(&mnt->mnt_slave, &old->mnt_slave_list);
+		mnt->mnt_master = old;
+		CLEAR_MNT_SHARED(mnt);
+	} else if (!(flag & CL_PRIVATE)) {
+		if ((flag & CL_PROPAGATION) || IS_MNT_SHARED(old))
+			list_add(&mnt->mnt_share, &old->mnt_share);
+		if (IS_MNT_SLAVE(old))
+			list_add(&mnt->mnt_slave, &old->mnt_slave);
+		mnt->mnt_master = old->mnt_master;
+	}
+	if (flag & CL_MAKE_SHARED)
+		set_mnt_shared(mnt);
+
+	/* stick the duplicate mount on the same expiry list
+	 * as the original if that was on one */
+	if (flag & CL_EXPIRE) {
+		spin_lock(&vfsmount_lock);
+		if (!list_empty(&old->mnt_expire))
+			list_add(&mnt->mnt_expire, &old->mnt_expire);
+		spin_unlock(&vfsmount_lock);
 	}
 	return mnt;
 }
@@ -998,11 +999,11 @@ struct vfsmount *copy_tree(struct vfsmou
 	struct nameidata nd;
 
 	if (!(flag & CL_COPY_ALL) && IS_MNT_UNBINDABLE(mnt))
-		return NULL;
+		return ERR_PTR(-EPERM);
 
 	res = q = clone_mnt(mnt, dentry, flag);
-	if (!q)
-		goto Enomem;
+	if (IS_ERR(q))
+		goto error;
 	q->mnt_mountpoint = mnt->mnt_mountpoint;
 
 	p = mnt;
@@ -1023,8 +1024,8 @@ struct vfsmount *copy_tree(struct vfsmou
 			nd.path.mnt = q;
 			nd.path.dentry = p->mnt_mountpoint;
 			q = clone_mnt(p, p->mnt_root, flag);
-			if (!q)
-				goto Enomem;
+			if (IS_ERR(q))
+				goto error;
 			spin_lock(&vfsmount_lock);
 			list_add_tail(&q->mnt_list, &res->mnt_list);
 			attach_mnt(q, &nd);
@@ -1032,7 +1033,7 @@ struct vfsmount *copy_tree(struct vfsmou
 		}
 	}
 	return res;
-Enomem:
+ error:
 	if (res) {
 		LIST_HEAD(umount_list);
 		spin_lock(&vfsmount_lock);
@@ -1040,7 +1041,7 @@ Enomem:
 		spin_unlock(&vfsmount_lock);
 		release_mounts(&umount_list);
 	}
-	return NULL;
+	return q;
 }
 
 struct vfsmount *collect_mounts(struct vfsmount *mnt, struct dentry *dentry)
@@ -1239,13 +1240,13 @@ static int do_loopback(struct nameidata 
 		goto out;
 
 	clone_fl = (flags & MS_SETUSER) ? CL_SETUSER : 0;
-	err = -ENOMEM;
 	if (flags & MS_REC)
 		mnt = copy_tree(old_nd.path.mnt, old_nd.path.dentry, clone_fl);
 	else
 		mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, clone_fl);
 
-	if (!mnt)
+	err = PTR_ERR(mnt);
+	if (IS_ERR(mnt))
 		goto out;
 
 	err = graft_tree(mnt, nd);
@@ -1816,10 +1817,10 @@ static struct mnt_namespace *dup_mnt_ns(
 	/* First pass: copy the tree topology */
 	new_ns->root = copy_tree(mnt_ns->root, mnt_ns->root->mnt_root,
 					CL_COPY_ALL | CL_EXPIRE);
-	if (!new_ns->root) {
+	if (IS_ERR(new_ns->root)) {
 		up_write(&namespace_sem);
 		kfree(new_ns);
-		return ERR_PTR(-ENOMEM);;
+		return ERR_CAST(new_ns->root);
 	}
 	spin_lock(&vfsmount_lock);
 	list_add_tail(&new_ns->list, &new_ns->root->mnt_list);
Index: linux/fs/pnode.c
===================================================================
--- linux.orig/fs/pnode.c	2008-01-16 13:24:53.000000000 +0100
+++ linux/fs/pnode.c	2008-01-16 13:25:07.000000000 +0100
@@ -189,8 +189,9 @@ int propagate_mnt(struct vfsmount *dest_
 
 		source =  get_source(m, prev_dest_mnt, prev_src_mnt, &type);
 
-		if (!(child = copy_tree(source, source->mnt_root, type))) {
-			ret = -ENOMEM;
+		child = copy_tree(source, source->mnt_root, type);
+		if (IS_ERR(child)) {
+			ret = PTR_ERR(child);
 			list_splice(tree_list, tmp_list.prev);
 			goto out;
 		}

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 04/10] unprivileged mounts: account user mounts
  2008-01-16 12:31 [patch 00/10] mount ownership and unprivileged mount syscall (v7) Miklos Szeredi
                   ` (2 preceding siblings ...)
  2008-01-16 12:31 ` [patch 03/10] unprivileged mounts: propagate error values from clone_mnt Miklos Szeredi
@ 2008-01-16 12:31 ` Miklos Szeredi
  2008-01-16 12:31 ` [patch 05/10] unprivileged mounts: allow unprivileged bind mounts Miklos Szeredi
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2008-01-16 12:31 UTC (permalink / raw)
  To: akpm, hch, serue, viro, kzak
  Cc: linux-fsdevel, linux-kernel, containers, util-linux-ng

[-- Attachment #1: unprivileged-mounts-account-user-mounts.patch --]
[-- Type: text/plain, Size: 5802 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Add sysctl variables for accounting and limiting the number of user
mounts.

The maximum number of user mounts is set to 1024 by default.  This
won't in itself enable user mounts, setting a mount to be owned by a
user is first needed.

[akpm]
 - don't use enumerated sysctls

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---

Index: linux/Documentation/filesystems/proc.txt
===================================================================
--- linux.orig/Documentation/filesystems/proc.txt	2008-01-16 13:24:53.000000000 +0100
+++ linux/Documentation/filesystems/proc.txt	2008-01-16 13:25:07.000000000 +0100
@@ -1012,6 +1012,15 @@ reaches aio-max-nr then io_setup will fa
 raising aio-max-nr does not result in the pre-allocation or re-sizing
 of any kernel data structures.
 
+nr_user_mounts and max_user_mounts
+----------------------------------
+
+These represent the number of "user" mounts and the maximum number of
+"user" mounts respectively.  User mounts may be created by
+unprivileged users.  User mounts may also be created with sysadmin
+privileges on behalf of a user, in which case nr_user_mounts may
+exceed max_user_mounts.
+
 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
 -----------------------------------------------------------
 
Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2008-01-16 13:25:07.000000000 +0100
+++ linux/fs/namespace.c	2008-01-16 13:25:07.000000000 +0100
@@ -44,6 +44,9 @@ static struct list_head *mount_hashtable
 static struct kmem_cache *mnt_cache __read_mostly;
 static struct rw_semaphore namespace_sem;
 
+int nr_user_mounts;
+int max_user_mounts = 1024;
+
 /* /sys/fs */
 struct kobject *fs_kobj;
 EXPORT_SYMBOL_GPL(fs_kobj);
@@ -477,21 +480,70 @@ static struct vfsmount *skip_mnt_tree(st
 	return p;
 }
 
-static void set_mnt_user(struct vfsmount *mnt)
+static void dec_nr_user_mounts(void)
+{
+	spin_lock(&vfsmount_lock);
+	nr_user_mounts--;
+	spin_unlock(&vfsmount_lock);
+}
+
+static int reserve_user_mount(void)
+{
+	int err = 0;
+
+	spin_lock(&vfsmount_lock);
+	/*
+	 * EMFILE was error returned by mount(2) in the old days, when
+	 * the mount count was limited.  Reuse this error value to
+	 * mean, that the maximum number of user mounts has been
+	 * exceeded.
+	 */
+	if (nr_user_mounts >= max_user_mounts && !capable(CAP_SYS_ADMIN))
+		err = -EMFILE;
+	else
+		nr_user_mounts++;
+	spin_unlock(&vfsmount_lock);
+	return err;
+}
+
+static void __set_mnt_user(struct vfsmount *mnt)
 {
 	WARN_ON(mnt->mnt_flags & MNT_USER);
 	mnt->mnt_uid = current->fsuid;
 	mnt->mnt_flags |= MNT_USER;
 }
 
+static void set_mnt_user(struct vfsmount *mnt)
+{
+	__set_mnt_user(mnt);
+	spin_lock(&vfsmount_lock);
+	nr_user_mounts++;
+	spin_unlock(&vfsmount_lock);
+}
+
+static void clear_mnt_user(struct vfsmount *mnt)
+{
+	if (mnt->mnt_flags & MNT_USER) {
+		mnt->mnt_uid = 0;
+		mnt->mnt_flags &= ~MNT_USER;
+		dec_nr_user_mounts();
+	}
+}
+
 static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
 					int flag)
 {
 	struct super_block *sb = old->mnt_sb;
-	struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);
+	struct vfsmount *mnt;
 
+	if (flag & CL_SETUSER) {
+		int err = reserve_user_mount();
+		if (err)
+			return ERR_PTR(err);
+	}
+	mnt = alloc_vfsmnt(old->mnt_devname);
 	if (!mnt)
-		return ERR_PTR(-ENOMEM);
+		goto alloc_failed;
 
 	mnt->mnt_flags = old->mnt_flags;
 	atomic_inc(&sb->s_active);
@@ -503,7 +555,7 @@ static struct vfsmount *clone_mnt(struct
 	/* don't copy the MNT_USER flag */
 	mnt->mnt_flags &= ~MNT_USER;
 	if (flag & CL_SETUSER)
-		set_mnt_user(mnt);
+		__set_mnt_user(mnt);
 
 	if (flag & CL_SLAVE) {
 		list_add(&mnt->mnt_slave, &old->mnt_slave_list);
@@ -528,6 +580,11 @@ static struct vfsmount *clone_mnt(struct
 		spin_unlock(&vfsmount_lock);
 	}
 	return mnt;
+
+ alloc_failed:
+	if (flag & CL_SETUSER)
+		dec_nr_user_mounts();
+	return ERR_PTR(-ENOMEM);
 }
 
 static inline void __mntput(struct vfsmount *mnt)
@@ -543,6 +600,7 @@ static inline void __mntput(struct vfsmo
 	 */
 	WARN_ON(atomic_read(&mnt->__mnt_writers));
 	dput(mnt->mnt_root);
+	clear_mnt_user(mnt);
 	free_vfsmnt(mnt);
 	deactivate_super(sb);
 }
@@ -1307,6 +1365,7 @@ static int do_remount(struct nameidata *
 	else
 		err = do_remount_sb(sb, flags, data, 0);
 	if (!err) {
+		clear_mnt_user(nd->path.mnt);
 		nd->path.mnt->mnt_flags = mnt_flags;
 		if (flags & MS_SETUSER)
 			set_mnt_user(nd->path.mnt);
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h	2008-01-16 13:25:05.000000000 +0100
+++ linux/include/linux/fs.h	2008-01-16 13:25:07.000000000 +0100
@@ -50,6 +50,9 @@ extern struct inodes_stat_t inodes_stat;
 
 extern int leases_enable, lease_break_time;
 
+extern int nr_user_mounts;
+extern int max_user_mounts;
+
 #ifdef CONFIG_DNOTIFY
 extern int dir_notify_enable;
 #endif
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c	2008-01-16 13:24:53.000000000 +0100
+++ linux/kernel/sysctl.c	2008-01-16 13:25:07.000000000 +0100
@@ -1288,6 +1288,22 @@ static struct ctl_table fs_table[] = {
 #endif	
 #endif
 	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "nr_user_mounts",
+		.data		= &nr_user_mounts,
+		.maxlen		= sizeof(int),
+		.mode		= 0444,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_user_mounts",
+		.data		= &max_user_mounts,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.ctl_name	= KERN_SETUID_DUMPABLE,
 		.procname	= "suid_dumpable",
 		.data		= &suid_dumpable,

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 05/10] unprivileged mounts: allow unprivileged bind mounts
  2008-01-16 12:31 [patch 00/10] mount ownership and unprivileged mount syscall (v7) Miklos Szeredi
                   ` (3 preceding siblings ...)
  2008-01-16 12:31 ` [patch 04/10] unprivileged mounts: account user mounts Miklos Szeredi
@ 2008-01-16 12:31 ` Miklos Szeredi
  2008-01-16 12:31 ` [patch 06/10] unprivileged mounts: allow unprivileged mounts Miklos Szeredi
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2008-01-16 12:31 UTC (permalink / raw)
  To: akpm, hch, serue, viro, kzak
  Cc: linux-fsdevel, linux-kernel, containers, util-linux-ng

[-- Attachment #1: unprivileged-mounts-allow-unprivileged-bind-mounts.patch --]
[-- Type: text/plain, Size: 2541 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Allow bind mounts to unprivileged users if the following conditions are met:

  - mountpoint is not a symlink
  - parent mount is owned by the user
  - the number of user mounts is below the maximum

Unprivileged mounts imply MS_SETUSER, and will also have the "nosuid" and
"nodev" mount flags set.

In particular, if mounting process doesn't have CAP_SETUID capability,
then the "nosuid" flag will be added, and if it doesn't have CAP_MKNOD
capability, then the "nodev" flag will be added.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---

Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2008-01-16 13:25:07.000000000 +0100
+++ linux/fs/namespace.c	2008-01-16 13:25:08.000000000 +0100
@@ -511,6 +511,11 @@ static void __set_mnt_user(struct vfsmou
 	WARN_ON(mnt->mnt_flags & MNT_USER);
 	mnt->mnt_uid = current->fsuid;
 	mnt->mnt_flags |= MNT_USER;
+
+	if (!capable(CAP_SETUID))
+		mnt->mnt_flags |= MNT_NOSUID;
+	if (!capable(CAP_MKNOD))
+		mnt->mnt_flags |= MNT_NODEV;
 }
 
 static void set_mnt_user(struct vfsmount *mnt)
@@ -1021,22 +1026,26 @@ asmlinkage long sys_oldumount(char __use
 
 #endif
 
-static int mount_is_safe(struct nameidata *nd)
+/*
+ * Conditions for unprivileged mounts are:
+ * - mountpoint is not a symlink
+ * - mountpoint is in a mount owned by the user
+ */
+static bool permit_mount(struct nameidata *nd, int *flags)
 {
+	struct inode *inode = nd->path.dentry->d_inode;
+
 	if (capable(CAP_SYS_ADMIN))
-		return 0;
-	return -EPERM;
-#ifdef notyet
-	if (S_ISLNK(nd->path.dentry->d_inode->i_mode))
-		return -EPERM;
-	if (nd->path.dentry->d_inode->i_mode & S_ISVTX) {
-		if (current->uid != nd->path.dentry->d_inode->i_uid)
-			return -EPERM;
-	}
-	if (vfs_permission(nd, MAY_WRITE))
-		return -EPERM;
-	return 0;
-#endif
+		return true;
+
+	if (S_ISLNK(inode->i_mode))
+		return false;
+
+	if (!is_mount_owner(nd->path.mnt, current->fsuid))
+		return false;
+
+	*flags |= MS_SETUSER;
+	return true;
 }
 
 static int lives_below_in_same_fs(struct dentry *d, struct dentry *dentry)
@@ -1280,9 +1289,10 @@ static int do_loopback(struct nameidata 
 	int clone_fl;
 	struct nameidata old_nd;
 	struct vfsmount *mnt = NULL;
-	int err = mount_is_safe(nd);
-	if (err)
-		return err;
+	int err;
+
+	if (!permit_mount(nd, &flags))
+		return -EPERM;
 	if (!old_name || !*old_name)
 		return -EINVAL;
 	err = path_lookup(old_name, LOOKUP_FOLLOW, &old_nd);

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 06/10] unprivileged mounts: allow unprivileged mounts
  2008-01-16 12:31 [patch 00/10] mount ownership and unprivileged mount syscall (v7) Miklos Szeredi
                   ` (4 preceding siblings ...)
  2008-01-16 12:31 ` [patch 05/10] unprivileged mounts: allow unprivileged bind mounts Miklos Szeredi
@ 2008-01-16 12:31 ` Miklos Szeredi
  2008-01-16 12:31 ` [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property Miklos Szeredi
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2008-01-16 12:31 UTC (permalink / raw)
  To: akpm, hch, serue, viro, kzak
  Cc: linux-fsdevel, linux-kernel, containers, util-linux-ng

[-- Attachment #1: unprivileged-mounts-allow-unprivileged-mounts.patch --]
[-- Type: text/plain, Size: 6191 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

For "safe" filesystems allow unprivileged mounting and forced
unmounting.

A filesystem type is considered "safe", if mounting it by an
unprivileged user may not cause a security problem.  This is somewhat
subjective, so setting this property is left to userspace (implemented
in the next patch).

Since most filesystems haven't been designed with unprivileged
mounting in mind, a thorough audit is recommended before setting this
property.

Make this a separate integer member in 'struct file_system_type'
instead of a flag, since that is easier to handle by sysctl code.

Move subtype handling from do_kern_mount() into do_new_mount().  All
other callers are kernel-internal and do not need subtype support.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---

Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2008-01-16 13:25:08.000000000 +0100
+++ linux/fs/namespace.c	2008-01-16 13:25:09.000000000 +0100
@@ -966,14 +966,16 @@ static bool is_mount_owner(struct vfsmou
 /*
  * umount is permitted for
  *  - sysadmin
- *  - mount owner, if not forced umount
+ *  - mount owner
+ *    o if not forced umount,
+ *    o if forced umount, and filesystem is "safe"
  */
 static bool permit_umount(struct vfsmount *mnt, int flags)
 {
 	if (capable(CAP_SYS_ADMIN))
 		return true;
 
-	if (flags & MNT_FORCE)
+	if ((flags & MNT_FORCE) && !(mnt->mnt_sb->s_type->fs_safe))
 		return false;
 
 	return is_mount_owner(mnt, current->fsuid);
@@ -1031,13 +1033,17 @@ asmlinkage long sys_oldumount(char __use
  * - mountpoint is not a symlink
  * - mountpoint is in a mount owned by the user
  */
-static bool permit_mount(struct nameidata *nd, int *flags)
+static bool permit_mount(struct nameidata *nd, struct file_system_type *type,
+			 int *flags)
 {
 	struct inode *inode = nd->path.dentry->d_inode;
 
 	if (capable(CAP_SYS_ADMIN))
 		return true;
 
+	if (type && !type->fs_safe)
+		return false;
+
 	if (S_ISLNK(inode->i_mode))
 		return false;
 
@@ -1291,7 +1297,7 @@ static int do_loopback(struct nameidata 
 	struct vfsmount *mnt = NULL;
 	int err;
 
-	if (!permit_mount(nd, &flags))
+	if (!permit_mount(nd, NULL, &flags))
 		return -EPERM;
 	if (!old_name || !*old_name)
 		return -EINVAL;
@@ -1472,30 +1478,76 @@ out:
 	return err;
 }
 
+static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
+{
+	int err;
+	const char *subtype = strchr(fstype, '.');
+	if (subtype) {
+		subtype++;
+		err = -EINVAL;
+		if (!subtype[0])
+			goto err;
+	} else
+		subtype = "";
+
+	mnt->mnt_sb->s_subtype = kstrdup(subtype, GFP_KERNEL);
+	err = -ENOMEM;
+	if (!mnt->mnt_sb->s_subtype)
+		goto err;
+	return mnt;
+
+ err:
+	mntput(mnt);
+	return ERR_PTR(err);
+}
+
 /*
  * create a new mount for userspace and request it to be added into the
  * namespace's tree
  */
-static int do_new_mount(struct nameidata *nd, char *type, int flags,
+static int do_new_mount(struct nameidata *nd, char *fstype, int flags,
 			int mnt_flags, char *name, void *data)
 {
+	int err;
 	struct vfsmount *mnt;
+	struct file_system_type *type;
 
-	if (!type || !memchr(type, 0, PAGE_SIZE))
+	if (!fstype || !memchr(fstype, 0, PAGE_SIZE))
 		return -EINVAL;
 
-	/* we need capabilities... */
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
-
-	mnt = do_kern_mount(type, flags & ~MS_SETUSER, name, data);
-	if (IS_ERR(mnt))
+	type = get_fs_type(fstype);
+	if (!type)
+		return -ENODEV;
+
+	err = -EPERM;
+	if (!permit_mount(nd, type, &flags))
+		goto out_put_filesystem;
+
+	if (flags & MS_SETUSER) {
+		err = reserve_user_mount();
+		if (err)
+			goto out_put_filesystem;
+	}
+
+	mnt = vfs_kern_mount(type, flags & ~MS_SETUSER, name, data);
+	if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
+	    !mnt->mnt_sb->s_subtype)
+		mnt = fs_set_subtype(mnt, fstype);
+	put_filesystem(type);
+	if (IS_ERR(mnt)) {
+		if (flags & MS_SETUSER)
+			dec_nr_user_mounts();
 		return PTR_ERR(mnt);
+	}
 
 	if (flags & MS_SETUSER)
-		set_mnt_user(mnt);
+		__set_mnt_user(mnt);
 
 	return do_add_mount(mnt, nd, mnt_flags, NULL);
+
+ out_put_filesystem:
+	put_filesystem(type);
+	return err;
 }
 
 /*
@@ -1526,7 +1578,7 @@ int do_add_mount(struct vfsmount *newmnt
 	if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
 		goto unlock;
 
-	/* MNT_USER was set earlier */
+	/* some flags may have been set earlier */
 	newmnt->mnt_flags |= mnt_flags;
 	if ((err = graft_tree(newmnt, nd)))
 		goto unlock;
Index: linux/fs/super.c
===================================================================
--- linux.orig/fs/super.c	2008-01-16 13:24:52.000000000 +0100
+++ linux/fs/super.c	2008-01-16 13:25:09.000000000 +0100
@@ -906,29 +906,6 @@ out:
 
 EXPORT_SYMBOL_GPL(vfs_kern_mount);
 
-static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
-{
-	int err;
-	const char *subtype = strchr(fstype, '.');
-	if (subtype) {
-		subtype++;
-		err = -EINVAL;
-		if (!subtype[0])
-			goto err;
-	} else
-		subtype = "";
-
-	mnt->mnt_sb->s_subtype = kstrdup(subtype, GFP_KERNEL);
-	err = -ENOMEM;
-	if (!mnt->mnt_sb->s_subtype)
-		goto err;
-	return mnt;
-
- err:
-	mntput(mnt);
-	return ERR_PTR(err);
-}
-
 struct vfsmount *
 do_kern_mount(const char *fstype, int flags, const char *name, void *data)
 {
@@ -937,9 +914,6 @@ do_kern_mount(const char *fstype, int fl
 	if (!type)
 		return ERR_PTR(-ENODEV);
 	mnt = vfs_kern_mount(type, flags, name, data);
-	if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
-	    !mnt->mnt_sb->s_subtype)
-		mnt = fs_set_subtype(mnt, fstype);
 	put_filesystem(type);
 	return mnt;
 }
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h	2008-01-16 13:25:07.000000000 +0100
+++ linux/include/linux/fs.h	2008-01-16 13:25:09.000000000 +0100
@@ -1430,6 +1430,7 @@ int sync_inode(struct inode *inode, stru
 struct file_system_type {
 	const char *name;
 	int fs_flags;
+	int fs_safe;
 	int (*get_sb) (struct file_system_type *, int,
 		       const char *, void *, struct vfsmount *);
 	void (*kill_sb) (struct super_block *);

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-01-16 12:31 [patch 00/10] mount ownership and unprivileged mount syscall (v7) Miklos Szeredi
                   ` (5 preceding siblings ...)
  2008-01-16 12:31 ` [patch 06/10] unprivileged mounts: allow unprivileged mounts Miklos Szeredi
@ 2008-01-16 12:31 ` Miklos Szeredi
  2008-01-21 20:32   ` Serge E. Hallyn
  2008-01-16 12:31 ` [patch 08/10] unprivileged mounts: make fuse safe Miklos Szeredi
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2008-01-16 12:31 UTC (permalink / raw)
  To: akpm, hch, serue, viro, kzak
  Cc: linux-fsdevel, linux-kernel, containers, util-linux-ng

[-- Attachment #1: unprivileged-mounts-add-sysctl-tunable-for-safe-property.patch --]
[-- Type: text/plain, Size: 4290 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Add the following:

  /proc/sys/fs/types/${FS_TYPE}/usermount_safe

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/filesystems.c
===================================================================
--- linux.orig/fs/filesystems.c	2008-01-16 13:24:52.000000000 +0100
+++ linux/fs/filesystems.c	2008-01-16 13:25:09.000000000 +0100
@@ -12,6 +12,7 @@
 #include <linux/kmod.h>
 #include <linux/init.h>
 #include <linux/module.h>
+#include <linux/sysctl.h>
 #include <asm/uaccess.h>
 
 /*
@@ -51,6 +52,57 @@ static struct file_system_type **find_fi
 	return p;
 }
 
+#define MAX_FILESYSTEM_VARS 1
+
+struct filesystem_sysctl_table {
+	struct ctl_table_header *header;
+	struct ctl_table table[MAX_FILESYSTEM_VARS + 1];
+};
+
+/*
+ * Create /sys/fs/types/${FSNAME} directory with per fs-type tunables.
+ */
+static int filesystem_sysctl_register(struct file_system_type *fs)
+{
+	struct filesystem_sysctl_table *t;
+	struct ctl_path path[] = {
+		{ .procname = "fs", .ctl_name = CTL_FS },
+		{ .procname = "types", .ctl_name = CTL_UNNUMBERED },
+		{ .procname = fs->name, .ctl_name = CTL_UNNUMBERED },
+		{ }
+	};
+
+	t = kzalloc(sizeof(*t), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+
+	t->table[0].ctl_name = CTL_UNNUMBERED;
+	t->table[0].procname = "usermount_safe";
+	t->table[0].maxlen = sizeof(int);
+	t->table[0].data = &fs->fs_safe;
+	t->table[0].mode = 0644;
+	t->table[0].proc_handler = &proc_dointvec;
+
+	t->header = register_sysctl_paths(path, t->table);
+	if (!t->header) {
+		kfree(t);
+		return -ENOMEM;
+	}
+
+	fs->sysctl_table = t;
+
+	return 0;
+}
+
+static void filesystem_sysctl_unregister(struct file_system_type *fs)
+{
+	struct filesystem_sysctl_table *t = fs->sysctl_table;
+
+	unregister_sysctl_table(t->header);
+	kfree(t);
+}
+
 /**
  *	register_filesystem - register a new filesystem
  *	@fs: the file system structure
@@ -80,6 +132,13 @@ int register_filesystem(struct file_syst
 	else
 		*p = fs;
 	write_unlock(&file_systems_lock);
+
+	if (res == 0) {
+		res = filesystem_sysctl_register(fs);
+		if (res != 0)
+			unregister_filesystem(fs);
+	}
+
 	return res;
 }
 
@@ -108,6 +167,7 @@ int unregister_filesystem(struct file_sy
 			*tmp = fs->next;
 			fs->next = NULL;
 			write_unlock(&file_systems_lock);
+			filesystem_sysctl_unregister(fs);
 			return 0;
 		}
 		tmp = &(*tmp)->next;
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h	2008-01-16 13:25:09.000000000 +0100
+++ linux/include/linux/fs.h	2008-01-16 13:25:09.000000000 +0100
@@ -1437,6 +1437,7 @@ struct file_system_type {
 	struct module *owner;
 	struct file_system_type * next;
 	struct list_head fs_supers;
+	struct filesystem_sysctl_table *sysctl_table;
 
 	struct lock_class_key s_lock_key;
 	struct lock_class_key s_umount_key;
Index: linux/Documentation/filesystems/proc.txt
===================================================================
--- linux.orig/Documentation/filesystems/proc.txt	2008-01-16 13:25:07.000000000 +0100
+++ linux/Documentation/filesystems/proc.txt	2008-01-16 13:25:09.000000000 +0100
@@ -43,6 +43,7 @@ Table of Contents
   2.13	/proc/<pid>/oom_score - Display current oom-killer score
   2.14	/proc/<pid>/io - Display the IO accounting fields
   2.15	/proc/<pid>/coredump_filter - Core dump filtering settings
+  2.16	/proc/sys/fs/types - File system type specific parameters
 
 ------------------------------------------------------------------------------
 Preface
@@ -2283,4 +2284,21 @@ For example:
   $ echo 0x7 > /proc/self/coredump_filter
   $ ./some_program
 
+2.16 /proc/sys/fs/types/ - File system type specific parameters
+----------------------------------------------------------------
+
+There's a separate directory /proc/sys/fs/types/<type>/ for each
+filesystem type, containing the following files:
+
+usermount_safe
+--------------
+
+Setting this to non-zero will allow filesystems of this type to be
+mounted by unprivileged users (note, that there are other
+prerequisites as well).
+
+Care should be taken when enabling this, since most
+filesystems haven't been designed with unprivileged mounting
+in mind.
+
 ------------------------------------------------------------------------------

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 08/10] unprivileged mounts: make fuse safe
  2008-01-16 12:31 [patch 00/10] mount ownership and unprivileged mount syscall (v7) Miklos Szeredi
                   ` (6 preceding siblings ...)
  2008-01-16 12:31 ` [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property Miklos Szeredi
@ 2008-01-16 12:31 ` Miklos Szeredi
  2008-01-21 20:41   ` Serge E. Hallyn
  2008-01-16 12:31 ` [patch 09/10] unprivileged mounts: propagation: inherit owner from parent Miklos Szeredi
  2008-01-16 12:31 ` [patch 10/10] unprivileged mounts: add "no submounts" flag Miklos Szeredi
  9 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2008-01-16 12:31 UTC (permalink / raw)
  To: akpm, hch, serue, viro, kzak
  Cc: linux-fsdevel, linux-kernel, containers, util-linux-ng

[-- Attachment #1: unprivileged-mounts-allow-unprivileged-fuse-mounts.patch --]
[-- Type: text/plain, Size: 3125 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Don't require the "user_id=" and "group_id=" options for unprivileged mounts,
but if they are present, verify them for sanity.

Disallow the "allow_other" option for unprivileged mounts.

FUSE was designed from the beginning to be safe for unprivileged
users.  This has also been verified in practice over many years, with
some distributions enabling unprivileged FUSE mounts by default.

However there are some properties of FUSE, that could make it unsafe
for certain situations (e.g. multiuser system with untrusted users):

 - It is not always possible to use kill(2) (not even with SIGKILL) to
   terminate a process using a FUSE filesystem.  However it is
   possible to use any of the following instead:
     o kill the filesystem daemon
     o use forced umounting
     o use the "fusectl" control filesystem

 - As a special case of the above, killing a self-deadlocked FUSE
   process is not possible, and even killall5 will not terminate it.

 - Due to the design of the process freezer, a hanging (due to network
   problems, etc) or malicious filesystem may prevent suspending to
   ram or hibernation to succeed.  This is not actually unique to
   FUSE, as any hanging network filesystem will have the same affect.

If the above could pose a threat to the system, it is recommended,
that the '/proc/sys/fs/types/fuse/safe' sysctl tunable is not turned
on, and/or '/dev/fuse' is not made world-readable and writable.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fuse/inode.c
===================================================================
--- linux.orig/fs/fuse/inode.c	2008-01-16 13:24:52.000000000 +0100
+++ linux/fs/fuse/inode.c	2008-01-16 13:25:10.000000000 +0100
@@ -357,6 +357,19 @@ static int parse_fuse_opt(char *opt, str
 	d->max_read = ~0;
 	d->blksize = 512;
 
+	/*
+	 * For unprivileged mounts use current uid/gid.  Still allow
+	 * "user_id" and "group_id" options for compatibility, but
+	 * only if they match these values.
+	 */
+	if (!capable(CAP_SYS_ADMIN)) {
+		d->user_id = current->uid;
+		d->user_id_present = 1;
+		d->group_id = current->gid;
+		d->group_id_present = 1;
+
+	}
+
 	while ((p = strsep(&opt, ",")) != NULL) {
 		int token;
 		int value;
@@ -385,6 +398,8 @@ static int parse_fuse_opt(char *opt, str
 		case OPT_USER_ID:
 			if (match_int(&args[0], &value))
 				return 0;
+			if (d->user_id_present && d->user_id != value)
+				return 0;
 			d->user_id = value;
 			d->user_id_present = 1;
 			break;
@@ -392,6 +407,8 @@ static int parse_fuse_opt(char *opt, str
 		case OPT_GROUP_ID:
 			if (match_int(&args[0], &value))
 				return 0;
+			if (d->group_id_present && d->group_id != value)
+				return 0;
 			d->group_id = value;
 			d->group_id_present = 1;
 			break;
@@ -596,6 +613,10 @@ static int fuse_fill_super(struct super_
 	if (!parse_fuse_opt((char *) data, &d, is_bdev))
 		return -EINVAL;
 
+	/* This is a privileged option */
+	if ((d.flags & FUSE_ALLOW_OTHER) && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	if (is_bdev) {
 #ifdef CONFIG_BLOCK
 		if (!sb_set_blocksize(sb, d.blksize))

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 09/10] unprivileged mounts: propagation: inherit owner from parent
  2008-01-16 12:31 [patch 00/10] mount ownership and unprivileged mount syscall (v7) Miklos Szeredi
                   ` (7 preceding siblings ...)
  2008-01-16 12:31 ` [patch 08/10] unprivileged mounts: make fuse safe Miklos Szeredi
@ 2008-01-16 12:31 ` Miklos Szeredi
  2008-01-21 20:23   ` Serge E. Hallyn
  2008-01-16 12:31 ` [patch 10/10] unprivileged mounts: add "no submounts" flag Miklos Szeredi
  9 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2008-01-16 12:31 UTC (permalink / raw)
  To: akpm, hch, serue, viro, kzak
  Cc: linux-fsdevel, linux-kernel, containers, util-linux-ng

[-- Attachment #1: unprivileged-mounts-propagation-inherit-owner-from-parent.patch --]
[-- Type: text/plain, Size: 7330 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

On mount propagation, let the owner of the clone be inherited from the
parent into which it has been propagated.

If the parent has the "nosuid" flag, set this flag for the child as
well.  This is needed for the suid-less namespace (use case #2 in the
first patch header), where all mounts are owned by the user and have
the nosuid flag set.  In this case the propagated mount needs to have
nosuid, otherwise a suid executable may be misused by the user.

Similar treatment is not needed for "nodev", because devices can't be
abused this way: the user is not able to gain privileges to devices by
rearranging the mount namespace.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2008-01-16 13:25:09.000000000 +0100
+++ linux/fs/namespace.c	2008-01-16 13:25:11.000000000 +0100
@@ -506,10 +506,10 @@ static int reserve_user_mount(void)
 	return err;
 }
 
-static void __set_mnt_user(struct vfsmount *mnt)
+static void __set_mnt_user(struct vfsmount *mnt, uid_t owner)
 {
 	WARN_ON(mnt->mnt_flags & MNT_USER);
-	mnt->mnt_uid = current->fsuid;
+	mnt->mnt_uid = owner;
 	mnt->mnt_flags |= MNT_USER;
 
 	if (!capable(CAP_SETUID))
@@ -520,7 +520,7 @@ static void __set_mnt_user(struct vfsmou
 
 static void set_mnt_user(struct vfsmount *mnt)
 {
-	__set_mnt_user(mnt);
+	__set_mnt_user(mnt, current->fsuid);
 	spin_lock(&vfsmount_lock);
 	nr_user_mounts++;
 	spin_unlock(&vfsmount_lock);
@@ -536,7 +536,7 @@ static void clear_mnt_user(struct vfsmou
 }
 
 static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
-					int flag)
+					int flag, uid_t owner)
 {
 	struct super_block *sb = old->mnt_sb;
 	struct vfsmount *mnt;
@@ -560,7 +560,10 @@ static struct vfsmount *clone_mnt(struct
 	/* don't copy the MNT_USER flag */
 	mnt->mnt_flags &= ~MNT_USER;
 	if (flag & CL_SETUSER)
-		__set_mnt_user(mnt);
+		__set_mnt_user(mnt, owner);
+
+	if (flag & CL_NOSUID)
+		mnt->mnt_flags |= MNT_NOSUID;
 
 	if (flag & CL_SLAVE) {
 		list_add(&mnt->mnt_slave, &old->mnt_slave_list);
@@ -1066,7 +1069,7 @@ static int lives_below_in_same_fs(struct
 }
 
 struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry,
-					int flag)
+					int flag, uid_t owner)
 {
 	struct vfsmount *res, *p, *q, *r, *s;
 	struct nameidata nd;
@@ -1074,7 +1077,7 @@ struct vfsmount *copy_tree(struct vfsmou
 	if (!(flag & CL_COPY_ALL) && IS_MNT_UNBINDABLE(mnt))
 		return ERR_PTR(-EPERM);
 
-	res = q = clone_mnt(mnt, dentry, flag);
+	res = q = clone_mnt(mnt, dentry, flag, owner);
 	if (IS_ERR(q))
 		goto error;
 	q->mnt_mountpoint = mnt->mnt_mountpoint;
@@ -1096,7 +1099,7 @@ struct vfsmount *copy_tree(struct vfsmou
 			p = s;
 			nd.path.mnt = q;
 			nd.path.dentry = p->mnt_mountpoint;
-			q = clone_mnt(p, p->mnt_root, flag);
+			q = clone_mnt(p, p->mnt_root, flag, owner);
 			if (IS_ERR(q))
 				goto error;
 			spin_lock(&vfsmount_lock);
@@ -1121,7 +1124,7 @@ struct vfsmount *collect_mounts(struct v
 {
 	struct vfsmount *tree;
 	down_read(&namespace_sem);
-	tree = copy_tree(mnt, dentry, CL_COPY_ALL | CL_PRIVATE);
+	tree = copy_tree(mnt, dentry, CL_COPY_ALL | CL_PRIVATE, 0);
 	up_read(&namespace_sem);
 	return tree;
 }
@@ -1292,7 +1295,8 @@ static int do_change_type(struct nameida
  */
 static int do_loopback(struct nameidata *nd, char *old_name, int flags)
 {
-	int clone_fl;
+	int clone_fl = 0;
+	uid_t owner = 0;
 	struct nameidata old_nd;
 	struct vfsmount *mnt = NULL;
 	int err;
@@ -1313,11 +1317,17 @@ static int do_loopback(struct nameidata 
 	if (!check_mnt(nd->path.mnt) || !check_mnt(old_nd.path.mnt))
 		goto out;
 
-	clone_fl = (flags & MS_SETUSER) ? CL_SETUSER : 0;
+	if (flags & MS_SETUSER) {
+		clone_fl |= CL_SETUSER;
+		owner = current->fsuid;
+	}
+
 	if (flags & MS_REC)
-		mnt = copy_tree(old_nd.path.mnt, old_nd.path.dentry, clone_fl);
+		mnt = copy_tree(old_nd.path.mnt, old_nd.path.dentry, clone_fl,
+				owner);
 	else
-		mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, clone_fl);
+		mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, clone_fl,
+				owner);
 
 	err = PTR_ERR(mnt);
 	if (IS_ERR(mnt))
@@ -1541,7 +1551,7 @@ static int do_new_mount(struct nameidata
 	}
 
 	if (flags & MS_SETUSER)
-		__set_mnt_user(mnt);
+		__set_mnt_user(mnt, current->fsuid);
 
 	return do_add_mount(mnt, nd, mnt_flags, NULL);
 
@@ -1937,7 +1947,7 @@ static struct mnt_namespace *dup_mnt_ns(
 	down_write(&namespace_sem);
 	/* First pass: copy the tree topology */
 	new_ns->root = copy_tree(mnt_ns->root, mnt_ns->root->mnt_root,
-					CL_COPY_ALL | CL_EXPIRE);
+					CL_COPY_ALL | CL_EXPIRE, 0);
 	if (IS_ERR(new_ns->root)) {
 		up_write(&namespace_sem);
 		kfree(new_ns);
Index: linux/fs/pnode.c
===================================================================
--- linux.orig/fs/pnode.c	2008-01-16 13:25:07.000000000 +0100
+++ linux/fs/pnode.c	2008-01-16 13:25:11.000000000 +0100
@@ -181,15 +181,28 @@ int propagate_mnt(struct vfsmount *dest_
 
 	for (m = propagation_next(dest_mnt, dest_mnt); m;
 			m = propagation_next(m, dest_mnt)) {
-		int type;
+		int clflags;
+		uid_t owner = 0;
 		struct vfsmount *source;
 
 		if (IS_MNT_NEW(m))
 			continue;
 
-		source =  get_source(m, prev_dest_mnt, prev_src_mnt, &type);
+		source =  get_source(m, prev_dest_mnt, prev_src_mnt, &clflags);
 
-		child = copy_tree(source, source->mnt_root, type);
+		if (m->mnt_flags & MNT_USER) {
+			clflags |= CL_SETUSER;
+			owner = m->mnt_uid;
+
+			/*
+			 * If propagating into a user mount which doesn't
+			 * allow suid, then make sure, the child(ren) won't
+			 * allow suid either
+			 */
+			if (m->mnt_flags & MNT_NOSUID)
+				clflags |= CL_NOSUID;
+		}
+		child = copy_tree(source, source->mnt_root, clflags, owner);
 		if (IS_ERR(child)) {
 			ret = PTR_ERR(child);
 			list_splice(tree_list, tmp_list.prev);
Index: linux/fs/pnode.h
===================================================================
--- linux.orig/fs/pnode.h	2008-01-16 13:25:05.000000000 +0100
+++ linux/fs/pnode.h	2008-01-16 13:25:11.000000000 +0100
@@ -24,6 +24,7 @@
 #define CL_PROPAGATION 		0x10
 #define CL_PRIVATE 		0x20
 #define CL_SETUSER		0x40
+#define CL_NOSUID		0x80
 
 static inline void set_mnt_shared(struct vfsmount *mnt)
 {
@@ -36,4 +37,6 @@ int propagate_mnt(struct vfsmount *, str
 		struct list_head *);
 int propagate_umount(struct list_head *);
 int propagate_mount_busy(struct vfsmount *, int);
+struct vfsmount *copy_tree(struct vfsmount *, struct dentry *, int, uid_t);
+
 #endif /* _LINUX_PNODE_H */
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h	2008-01-16 13:25:09.000000000 +0100
+++ linux/include/linux/fs.h	2008-01-16 13:25:11.000000000 +0100
@@ -1493,7 +1493,6 @@ extern int may_umount(struct vfsmount *)
 extern void umount_tree(struct vfsmount *, int, struct list_head *);
 extern void release_mounts(struct list_head *);
 extern long do_mount(char *, char *, char *, unsigned long, void *);
-extern struct vfsmount *copy_tree(struct vfsmount *, struct dentry *, int);
 extern void mnt_set_mountpoint(struct vfsmount *, struct dentry *,
 				  struct vfsmount *);
 extern struct vfsmount *collect_mounts(struct vfsmount *, struct dentry *);

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 10/10] unprivileged mounts: add "no submounts" flag
  2008-01-16 12:31 [patch 00/10] mount ownership and unprivileged mount syscall (v7) Miklos Szeredi
                   ` (8 preceding siblings ...)
  2008-01-16 12:31 ` [patch 09/10] unprivileged mounts: propagation: inherit owner from parent Miklos Szeredi
@ 2008-01-16 12:31 ` Miklos Szeredi
  9 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2008-01-16 12:31 UTC (permalink / raw)
  To: akpm, hch, serue, viro, kzak
  Cc: linux-fsdevel, linux-kernel, containers, util-linux-ng

[-- Attachment #1: unprivileged-mounts-add-no-submounts-flag.patch --]
[-- Type: text/plain, Size: 2927 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Add a new mount flag "nosubmnt", which denies submounts for the owner.
This would be useful, if we want to support traditional /etc/fstab
based user mounts.

In this case mount(8) would still have to be suid-root, to check the
mountpoint against the user/users flag in /etc/fstab, but /etc/mtab
would no longer be mandatory for storing the actual owner of the
mount.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---

Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2008-01-16 13:25:11.000000000 +0100
+++ linux/fs/namespace.c	2008-01-16 13:25:12.000000000 +0100
@@ -700,6 +700,7 @@ static int show_vfsmnt(struct seq_file *
 		{ MNT_NOATIME, ",noatime" },
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
+		{ MNT_NOSUBMNT, ",nosubmnt" },
 		{ 0, NULL }
 	};
 	struct proc_fs_info *fs_infop;
@@ -1050,6 +1051,9 @@ static bool permit_mount(struct nameidat
 	if (S_ISLNK(inode->i_mode))
 		return false;
 
+	if (nd->path.mnt->mnt_flags & MNT_NOSUBMNT)
+		return false;
+
 	if (!is_mount_owner(nd->path.mnt, current->fsuid))
 		return false;
 
@@ -1894,9 +1898,11 @@ long do_mount(char *dev_name, char *dir_
 		mnt_flags |= MNT_RELATIME;
 	if (flags & MS_RDONLY)
 		mnt_flags |= MNT_READONLY;
+	if (flags & MS_NOSUBMNT)
+		mnt_flags |= MNT_NOSUBMNT;
 
-	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
-		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT);
+	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_NOATIME |
+		   MS_NODIRATIME | MS_RELATIME | MS_KERNMOUNT | MS_NOSUBMNT);
 
 	/* ... and get the mountpoint */
 	retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h	2008-01-16 13:25:11.000000000 +0100
+++ linux/include/linux/fs.h	2008-01-16 13:25:12.000000000 +0100
@@ -129,6 +129,7 @@ extern int dir_notify_enable;
 #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
 #define MS_SETUSER	(1<<24) /* set mnt_uid to current user */
+#define MS_NOSUBMNT	(1<<25) /* don't allow unprivileged submounts */
 #define MS_ACTIVE	(1<<30)
 #define MS_NOUSER	(1<<31)
 
Index: linux/include/linux/mount.h
===================================================================
--- linux.orig/include/linux/mount.h	2008-01-16 13:25:05.000000000 +0100
+++ linux/include/linux/mount.h	2008-01-16 13:25:12.000000000 +0100
@@ -30,6 +30,7 @@ struct mnt_namespace;
 #define MNT_NODIRATIME	0x10
 #define MNT_RELATIME	0x20
 #define MNT_READONLY	0x40	/* does the user want this to be r/o? */
+#define MNT_NOSUBMNT	0x80
 
 #define MNT_SHRINKABLE	0x100
 #define MNT_IMBALANCED_WRITE_COUNT	0x200 /* just for debugging */

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 09/10] unprivileged mounts: propagation: inherit owner from parent
  2008-01-16 12:31 ` [patch 09/10] unprivileged mounts: propagation: inherit owner from parent Miklos Szeredi
@ 2008-01-21 20:23   ` Serge E. Hallyn
  0 siblings, 0 replies; 28+ messages in thread
From: Serge E. Hallyn @ 2008-01-21 20:23 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: akpm, hch, serue, viro, kzak, linux-fsdevel, linux-kernel,
	containers, util-linux-ng

Quoting Miklos Szeredi (miklos@szeredi.hu):
> From: Miklos Szeredi <mszeredi@suse.cz>
> 
> On mount propagation, let the owner of the clone be inherited from the
> parent into which it has been propagated.
> 
> If the parent has the "nosuid" flag, set this flag for the child as
> well.  This is needed for the suid-less namespace (use case #2 in the
> first patch header), where all mounts are owned by the user and have
> the nosuid flag set.  In this case the propagated mount needs to have
> nosuid, otherwise a suid executable may be misused by the user.
> 
> Similar treatment is not needed for "nodev", because devices can't be
> abused this way: the user is not able to gain privileges to devices by
> rearranging the mount namespace.
> 
> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>

As discussed many months ago this does seem like the most appropriate
behavior for propagation.

Acked-by: Serge Hallyn <serue@us.ibm.com>

> ---
> 
> Index: linux/fs/namespace.c
> ===================================================================
> --- linux.orig/fs/namespace.c	2008-01-16 13:25:09.000000000 +0100
> +++ linux/fs/namespace.c	2008-01-16 13:25:11.000000000 +0100
> @@ -506,10 +506,10 @@ static int reserve_user_mount(void)
>  	return err;
>  }
> 
> -static void __set_mnt_user(struct vfsmount *mnt)
> +static void __set_mnt_user(struct vfsmount *mnt, uid_t owner)
>  {
>  	WARN_ON(mnt->mnt_flags & MNT_USER);
> -	mnt->mnt_uid = current->fsuid;
> +	mnt->mnt_uid = owner;
>  	mnt->mnt_flags |= MNT_USER;
> 
>  	if (!capable(CAP_SETUID))
> @@ -520,7 +520,7 @@ static void __set_mnt_user(struct vfsmou
> 
>  static void set_mnt_user(struct vfsmount *mnt)
>  {
> -	__set_mnt_user(mnt);
> +	__set_mnt_user(mnt, current->fsuid);
>  	spin_lock(&vfsmount_lock);
>  	nr_user_mounts++;
>  	spin_unlock(&vfsmount_lock);
> @@ -536,7 +536,7 @@ static void clear_mnt_user(struct vfsmou
>  }
> 
>  static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
> -					int flag)
> +					int flag, uid_t owner)
>  {
>  	struct super_block *sb = old->mnt_sb;
>  	struct vfsmount *mnt;
> @@ -560,7 +560,10 @@ static struct vfsmount *clone_mnt(struct
>  	/* don't copy the MNT_USER flag */
>  	mnt->mnt_flags &= ~MNT_USER;
>  	if (flag & CL_SETUSER)
> -		__set_mnt_user(mnt);
> +		__set_mnt_user(mnt, owner);
> +
> +	if (flag & CL_NOSUID)
> +		mnt->mnt_flags |= MNT_NOSUID;
> 
>  	if (flag & CL_SLAVE) {
>  		list_add(&mnt->mnt_slave, &old->mnt_slave_list);
> @@ -1066,7 +1069,7 @@ static int lives_below_in_same_fs(struct
>  }
> 
>  struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry,
> -					int flag)
> +					int flag, uid_t owner)
>  {
>  	struct vfsmount *res, *p, *q, *r, *s;
>  	struct nameidata nd;
> @@ -1074,7 +1077,7 @@ struct vfsmount *copy_tree(struct vfsmou
>  	if (!(flag & CL_COPY_ALL) && IS_MNT_UNBINDABLE(mnt))
>  		return ERR_PTR(-EPERM);
> 
> -	res = q = clone_mnt(mnt, dentry, flag);
> +	res = q = clone_mnt(mnt, dentry, flag, owner);
>  	if (IS_ERR(q))
>  		goto error;
>  	q->mnt_mountpoint = mnt->mnt_mountpoint;
> @@ -1096,7 +1099,7 @@ struct vfsmount *copy_tree(struct vfsmou
>  			p = s;
>  			nd.path.mnt = q;
>  			nd.path.dentry = p->mnt_mountpoint;
> -			q = clone_mnt(p, p->mnt_root, flag);
> +			q = clone_mnt(p, p->mnt_root, flag, owner);
>  			if (IS_ERR(q))
>  				goto error;
>  			spin_lock(&vfsmount_lock);
> @@ -1121,7 +1124,7 @@ struct vfsmount *collect_mounts(struct v
>  {
>  	struct vfsmount *tree;
>  	down_read(&namespace_sem);
> -	tree = copy_tree(mnt, dentry, CL_COPY_ALL | CL_PRIVATE);
> +	tree = copy_tree(mnt, dentry, CL_COPY_ALL | CL_PRIVATE, 0);
>  	up_read(&namespace_sem);
>  	return tree;
>  }
> @@ -1292,7 +1295,8 @@ static int do_change_type(struct nameida
>   */
>  static int do_loopback(struct nameidata *nd, char *old_name, int flags)
>  {
> -	int clone_fl;
> +	int clone_fl = 0;
> +	uid_t owner = 0;
>  	struct nameidata old_nd;
>  	struct vfsmount *mnt = NULL;
>  	int err;
> @@ -1313,11 +1317,17 @@ static int do_loopback(struct nameidata 
>  	if (!check_mnt(nd->path.mnt) || !check_mnt(old_nd.path.mnt))
>  		goto out;
> 
> -	clone_fl = (flags & MS_SETUSER) ? CL_SETUSER : 0;
> +	if (flags & MS_SETUSER) {
> +		clone_fl |= CL_SETUSER;
> +		owner = current->fsuid;
> +	}
> +
>  	if (flags & MS_REC)
> -		mnt = copy_tree(old_nd.path.mnt, old_nd.path.dentry, clone_fl);
> +		mnt = copy_tree(old_nd.path.mnt, old_nd.path.dentry, clone_fl,
> +				owner);
>  	else
> -		mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, clone_fl);
> +		mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, clone_fl,
> +				owner);
> 
>  	err = PTR_ERR(mnt);
>  	if (IS_ERR(mnt))
> @@ -1541,7 +1551,7 @@ static int do_new_mount(struct nameidata
>  	}
> 
>  	if (flags & MS_SETUSER)
> -		__set_mnt_user(mnt);
> +		__set_mnt_user(mnt, current->fsuid);
> 
>  	return do_add_mount(mnt, nd, mnt_flags, NULL);
> 
> @@ -1937,7 +1947,7 @@ static struct mnt_namespace *dup_mnt_ns(
>  	down_write(&namespace_sem);
>  	/* First pass: copy the tree topology */
>  	new_ns->root = copy_tree(mnt_ns->root, mnt_ns->root->mnt_root,
> -					CL_COPY_ALL | CL_EXPIRE);
> +					CL_COPY_ALL | CL_EXPIRE, 0);
>  	if (IS_ERR(new_ns->root)) {
>  		up_write(&namespace_sem);
>  		kfree(new_ns);
> Index: linux/fs/pnode.c
> ===================================================================
> --- linux.orig/fs/pnode.c	2008-01-16 13:25:07.000000000 +0100
> +++ linux/fs/pnode.c	2008-01-16 13:25:11.000000000 +0100
> @@ -181,15 +181,28 @@ int propagate_mnt(struct vfsmount *dest_
> 
>  	for (m = propagation_next(dest_mnt, dest_mnt); m;
>  			m = propagation_next(m, dest_mnt)) {
> -		int type;
> +		int clflags;
> +		uid_t owner = 0;
>  		struct vfsmount *source;
> 
>  		if (IS_MNT_NEW(m))
>  			continue;
> 
> -		source =  get_source(m, prev_dest_mnt, prev_src_mnt, &type);
> +		source =  get_source(m, prev_dest_mnt, prev_src_mnt, &clflags);
> 
> -		child = copy_tree(source, source->mnt_root, type);
> +		if (m->mnt_flags & MNT_USER) {
> +			clflags |= CL_SETUSER;
> +			owner = m->mnt_uid;
> +
> +			/*
> +			 * If propagating into a user mount which doesn't
> +			 * allow suid, then make sure, the child(ren) won't
> +			 * allow suid either
> +			 */
> +			if (m->mnt_flags & MNT_NOSUID)
> +				clflags |= CL_NOSUID;
> +		}
> +		child = copy_tree(source, source->mnt_root, clflags, owner);
>  		if (IS_ERR(child)) {
>  			ret = PTR_ERR(child);
>  			list_splice(tree_list, tmp_list.prev);
> Index: linux/fs/pnode.h
> ===================================================================
> --- linux.orig/fs/pnode.h	2008-01-16 13:25:05.000000000 +0100
> +++ linux/fs/pnode.h	2008-01-16 13:25:11.000000000 +0100
> @@ -24,6 +24,7 @@
>  #define CL_PROPAGATION 		0x10
>  #define CL_PRIVATE 		0x20
>  #define CL_SETUSER		0x40
> +#define CL_NOSUID		0x80
> 
>  static inline void set_mnt_shared(struct vfsmount *mnt)
>  {
> @@ -36,4 +37,6 @@ int propagate_mnt(struct vfsmount *, str
>  		struct list_head *);
>  int propagate_umount(struct list_head *);
>  int propagate_mount_busy(struct vfsmount *, int);
> +struct vfsmount *copy_tree(struct vfsmount *, struct dentry *, int, uid_t);
> +
>  #endif /* _LINUX_PNODE_H */
> Index: linux/include/linux/fs.h
> ===================================================================
> --- linux.orig/include/linux/fs.h	2008-01-16 13:25:09.000000000 +0100
> +++ linux/include/linux/fs.h	2008-01-16 13:25:11.000000000 +0100
> @@ -1493,7 +1493,6 @@ extern int may_umount(struct vfsmount *)
>  extern void umount_tree(struct vfsmount *, int, struct list_head *);
>  extern void release_mounts(struct list_head *);
>  extern long do_mount(char *, char *, char *, unsigned long, void *);
> -extern struct vfsmount *copy_tree(struct vfsmount *, struct dentry *, int);
>  extern void mnt_set_mountpoint(struct vfsmount *, struct dentry *,
>  				  struct vfsmount *);
>  extern struct vfsmount *collect_mounts(struct vfsmount *, struct dentry *);
> 
> --
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-01-16 12:31 ` [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property Miklos Szeredi
@ 2008-01-21 20:32   ` Serge E. Hallyn
  2008-01-21 21:37     ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: Serge E. Hallyn @ 2008-01-21 20:32 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: akpm, hch, serue, viro, kzak, linux-fsdevel, linux-kernel,
	containers, util-linux-ng

Quoting Miklos Szeredi (miklos@szeredi.hu):
> From: Miklos Szeredi <mszeredi@suse.cz>
> 
> Add the following:
> 
>   /proc/sys/fs/types/${FS_TYPE}/usermount_safe
> 
> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> ---
> 
> Index: linux/fs/filesystems.c
> ===================================================================
> --- linux.orig/fs/filesystems.c	2008-01-16 13:24:52.000000000 +0100
> +++ linux/fs/filesystems.c	2008-01-16 13:25:09.000000000 +0100
> @@ -12,6 +12,7 @@
>  #include <linux/kmod.h>
>  #include <linux/init.h>
>  #include <linux/module.h>
> +#include <linux/sysctl.h>
>  #include <asm/uaccess.h>
> 
>  /*
> @@ -51,6 +52,57 @@ static struct file_system_type **find_fi
>  	return p;
>  }
> 
> +#define MAX_FILESYSTEM_VARS 1
> +
> +struct filesystem_sysctl_table {
> +	struct ctl_table_header *header;
> +	struct ctl_table table[MAX_FILESYSTEM_VARS + 1];
> +};
> +
> +/*
> + * Create /sys/fs/types/${FSNAME} directory with per fs-type tunables.
> + */
> +static int filesystem_sysctl_register(struct file_system_type *fs)
> +{
> +	struct filesystem_sysctl_table *t;
> +	struct ctl_path path[] = {
> +		{ .procname = "fs", .ctl_name = CTL_FS },
> +		{ .procname = "types", .ctl_name = CTL_UNNUMBERED },
> +		{ .procname = fs->name, .ctl_name = CTL_UNNUMBERED },
> +		{ }
> +	};
> +
> +	t = kzalloc(sizeof(*t), GFP_KERNEL);
> +	if (!t)
> +		return -ENOMEM;
> +
> +
> +	t->table[0].ctl_name = CTL_UNNUMBERED;
> +	t->table[0].procname = "usermount_safe";
> +	t->table[0].maxlen = sizeof(int);
> +	t->table[0].data = &fs->fs_safe;
> +	t->table[0].mode = 0644;
> +	t->table[0].proc_handler = &proc_dointvec;
> +
> +	t->header = register_sysctl_paths(path, t->table);
> +	if (!t->header) {
> +		kfree(t);
> +		return -ENOMEM;
> +	}
> +
> +	fs->sysctl_table = t;
> +
> +	return 0;
> +}
> +
> +static void filesystem_sysctl_unregister(struct file_system_type *fs)
> +{
> +	struct filesystem_sysctl_table *t = fs->sysctl_table;
> +
> +	unregister_sysctl_table(t->header);
> +	kfree(t);
> +}
> +
>  /**
>   *	register_filesystem - register a new filesystem
>   *	@fs: the file system structure
> @@ -80,6 +132,13 @@ int register_filesystem(struct file_syst
>  	else
>  		*p = fs;
>  	write_unlock(&file_systems_lock);
> +
> +	if (res == 0) {
> +		res = filesystem_sysctl_register(fs);

What do you think about doing this only if FS_SAFE is also set,
so for instance at first only FUSE would allow itself to be
made user-mountable?

A safe thing to do, or overly intrusive?

> +		if (res != 0)
> +			unregister_filesystem(fs);
> +	}
> +
>  	return res;
>  }
> 
> @@ -108,6 +167,7 @@ int unregister_filesystem(struct file_sy
>  			*tmp = fs->next;
>  			fs->next = NULL;
>  			write_unlock(&file_systems_lock);
> +			filesystem_sysctl_unregister(fs);
>  			return 0;
>  		}
>  		tmp = &(*tmp)->next;
> Index: linux/include/linux/fs.h
> ===================================================================
> --- linux.orig/include/linux/fs.h	2008-01-16 13:25:09.000000000 +0100
> +++ linux/include/linux/fs.h	2008-01-16 13:25:09.000000000 +0100
> @@ -1437,6 +1437,7 @@ struct file_system_type {
>  	struct module *owner;
>  	struct file_system_type * next;
>  	struct list_head fs_supers;
> +	struct filesystem_sysctl_table *sysctl_table;
> 
>  	struct lock_class_key s_lock_key;
>  	struct lock_class_key s_umount_key;
> Index: linux/Documentation/filesystems/proc.txt
> ===================================================================
> --- linux.orig/Documentation/filesystems/proc.txt	2008-01-16 13:25:07.000000000 +0100
> +++ linux/Documentation/filesystems/proc.txt	2008-01-16 13:25:09.000000000 +0100
> @@ -43,6 +43,7 @@ Table of Contents
>    2.13	/proc/<pid>/oom_score - Display current oom-killer score
>    2.14	/proc/<pid>/io - Display the IO accounting fields
>    2.15	/proc/<pid>/coredump_filter - Core dump filtering settings
> +  2.16	/proc/sys/fs/types - File system type specific parameters
> 
>  ------------------------------------------------------------------------------
>  Preface
> @@ -2283,4 +2284,21 @@ For example:
>    $ echo 0x7 > /proc/self/coredump_filter
>    $ ./some_program
> 
> +2.16 /proc/sys/fs/types/ - File system type specific parameters
> +----------------------------------------------------------------
> +
> +There's a separate directory /proc/sys/fs/types/<type>/ for each
> +filesystem type, containing the following files:
> +
> +usermount_safe
> +--------------
> +
> +Setting this to non-zero will allow filesystems of this type to be
> +mounted by unprivileged users (note, that there are other
> +prerequisites as well).
> +
> +Care should be taken when enabling this, since most
> +filesystems haven't been designed with unprivileged mounting
> +in mind.
> +
>  ------------------------------------------------------------------------------
> 
> --
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 08/10] unprivileged mounts: make fuse safe
  2008-01-16 12:31 ` [patch 08/10] unprivileged mounts: make fuse safe Miklos Szeredi
@ 2008-01-21 20:41   ` Serge E. Hallyn
  0 siblings, 0 replies; 28+ messages in thread
From: Serge E. Hallyn @ 2008-01-21 20:41 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: akpm, hch, serue, viro, kzak, linux-fsdevel, linux-kernel,
	containers, util-linux-ng

Quoting Miklos Szeredi (miklos@szeredi.hu):
> From: Miklos Szeredi <mszeredi@suse.cz>
> 
> Don't require the "user_id=" and "group_id=" options for unprivileged mounts,
> but if they are present, verify them for sanity.
> 
> Disallow the "allow_other" option for unprivileged mounts.
> 
> FUSE was designed from the beginning to be safe for unprivileged
> users.  This has also been verified in practice over many years, with
> some distributions enabling unprivileged FUSE mounts by default.
> 
> However there are some properties of FUSE, that could make it unsafe
> for certain situations (e.g. multiuser system with untrusted users):
> 
>  - It is not always possible to use kill(2) (not even with SIGKILL) to
>    terminate a process using a FUSE filesystem.  However it is
>    possible to use any of the following instead:
>      o kill the filesystem daemon
>      o use forced umounting
>      o use the "fusectl" control filesystem
> 
>  - As a special case of the above, killing a self-deadlocked FUSE
>    process is not possible, and even killall5 will not terminate it.
> 
>  - Due to the design of the process freezer, a hanging (due to network
>    problems, etc) or malicious filesystem may prevent suspending to
>    ram or hibernation to succeed.  This is not actually unique to
>    FUSE, as any hanging network filesystem will have the same affect.
> 
> If the above could pose a threat to the system, it is recommended,
> that the '/proc/sys/fs/types/fuse/safe' sysctl tunable is not turned
> on, and/or '/dev/fuse' is not made world-readable and writable.
> 
> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>

I was going to say "this should of course be acked by a fuse
maintainer", then I look at MAINTAINERS :)  So never mind.

Acked-by: Serge Hallyn <serue@us.ibm.com>

> ---
> 
> Index: linux/fs/fuse/inode.c
> ===================================================================
> --- linux.orig/fs/fuse/inode.c	2008-01-16 13:24:52.000000000 +0100
> +++ linux/fs/fuse/inode.c	2008-01-16 13:25:10.000000000 +0100
> @@ -357,6 +357,19 @@ static int parse_fuse_opt(char *opt, str
>  	d->max_read = ~0;
>  	d->blksize = 512;
> 
> +	/*
> +	 * For unprivileged mounts use current uid/gid.  Still allow
> +	 * "user_id" and "group_id" options for compatibility, but
> +	 * only if they match these values.
> +	 */
> +	if (!capable(CAP_SYS_ADMIN)) {
> +		d->user_id = current->uid;
> +		d->user_id_present = 1;
> +		d->group_id = current->gid;
> +		d->group_id_present = 1;
> +
> +	}
> +
>  	while ((p = strsep(&opt, ",")) != NULL) {
>  		int token;
>  		int value;
> @@ -385,6 +398,8 @@ static int parse_fuse_opt(char *opt, str
>  		case OPT_USER_ID:
>  			if (match_int(&args[0], &value))
>  				return 0;
> +			if (d->user_id_present && d->user_id != value)
> +				return 0;
>  			d->user_id = value;
>  			d->user_id_present = 1;
>  			break;
> @@ -392,6 +407,8 @@ static int parse_fuse_opt(char *opt, str
>  		case OPT_GROUP_ID:
>  			if (match_int(&args[0], &value))
>  				return 0;
> +			if (d->group_id_present && d->group_id != value)
> +				return 0;
>  			d->group_id = value;
>  			d->group_id_present = 1;
>  			break;
> @@ -596,6 +613,10 @@ static int fuse_fill_super(struct super_
>  	if (!parse_fuse_opt((char *) data, &d, is_bdev))
>  		return -EINVAL;
> 
> +	/* This is a privileged option */
> +	if ((d.flags & FUSE_ALLOW_OTHER) && !capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
>  	if (is_bdev) {
>  #ifdef CONFIG_BLOCK
>  		if (!sb_set_blocksize(sb, d.blksize))
> 
> --
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-01-21 20:32   ` Serge E. Hallyn
@ 2008-01-21 21:37     ` Miklos Szeredi
  2008-01-22 20:48       ` Serge E. Hallyn
  0 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2008-01-21 21:37 UTC (permalink / raw)
  To: serue
  Cc: miklos, akpm, hch, serue, viro, kzak, linux-fsdevel,
	linux-kernel, containers, util-linux-ng

> What do you think about doing this only if FS_SAFE is also set,
> so for instance at first only FUSE would allow itself to be
> made user-mountable?
> 
> A safe thing to do, or overly intrusive?

It goes somewhat against the "no policy in kernel" policy ;).  I think
the warning in the documentation should be enough to make sysadmins
think twice before doing anything foolish:

> +Care should be taken when enabling this, since most
> +filesystems haven't been designed with unprivileged mounting
> +in mind.
> +

BTW, filesystems like 'proc' and 'sysfs' should also be safe, although
the only use for them being marked safe is if the users are allowed to
umount them from their private namespace (otherwise a 'mount --bind'
has the same effect as a new mount).

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-01-21 21:37     ` Miklos Szeredi
@ 2008-01-22 20:48       ` Serge E. Hallyn
  2008-01-22 22:59         ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: Serge E. Hallyn @ 2008-01-22 20:48 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: serue, akpm, hch, viro, kzak, linux-fsdevel, linux-kernel,
	containers, util-linux-ng

Quoting Miklos Szeredi (miklos@szeredi.hu):
> > What do you think about doing this only if FS_SAFE is also set,
> > so for instance at first only FUSE would allow itself to be
> > made user-mountable?
> > 
> > A safe thing to do, or overly intrusive?
> 
> It goes somewhat against the "no policy in kernel" policy ;).  I think
> the warning in the documentation should be enough to make sysadmins
> think twice before doing anything foolish:

Warning in which documentation?  A sysadmin considering setting fs_safe
for ext2 or xfs isn't going to be looking at fuse docs, which I think is
what you're talking about.  Are you going to add a file under
Documentation/filesystems?

> > +Care should be taken when enabling this, since most
> > +filesystems haven't been designed with unprivileged mounting
> > +in mind.
> > +
> 
> BTW, filesystems like 'proc' and 'sysfs' should also be safe, although
> the only use for them being marked safe is if the users are allowed to
> umount them from their private namespace (otherwise a 'mount --bind'
> has the same effect as a new mount).
> 
> Thanks,
> Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-01-22 20:48       ` Serge E. Hallyn
@ 2008-01-22 22:59         ` Miklos Szeredi
  2008-01-23  0:00           ` Serge E. Hallyn
  0 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2008-01-22 22:59 UTC (permalink / raw)
  To: serue
  Cc: miklos, serue, akpm, hch, viro, kzak, linux-fsdevel,
	linux-kernel, containers, util-linux-ng

> > > What do you think about doing this only if FS_SAFE is also set,
> > > so for instance at first only FUSE would allow itself to be
> > > made user-mountable?
> > > 
> > > A safe thing to do, or overly intrusive?
> > 
> > It goes somewhat against the "no policy in kernel" policy ;).  I think
> > the warning in the documentation should be enough to make sysadmins
> > think twice before doing anything foolish:
> 
> Warning in which documentation?  A sysadmin considering setting fs_safe
> for ext2 or xfs isn't going to be looking at fuse docs, which I think is
> what you're talking about.  Are you going to add a file under
> Documentation/filesystems?

Yes, I meant documentation of the new sysctl tunable in
Documentation/filesystems/proc.txt:

> Index: linux/Documentation/filesystems/proc.txt
> ===================================================================
> --- linux.orig/Documentation/filesystems/proc.txt	2008-01-16 13:25:07.000000000 +0100
> +++ linux/Documentation/filesystems/proc.txt	2008-01-16 13:25:09.000000000 +0100
> @@ -43,6 +43,7 @@ Table of Contents
>    2.13	/proc/<pid>/oom_score - Display current oom-killer score
>    2.14	/proc/<pid>/io - Display the IO accounting fields
>    2.15	/proc/<pid>/coredump_filter - Core dump filtering settings
> +  2.16	/proc/sys/fs/types - File system type specific parameters
>  
>  ------------------------------------------------------------------------------
>  Preface
> @@ -2283,4 +2284,21 @@ For example:
>    $ echo 0x7 > /proc/self/coredump_filter
>    $ ./some_program
>  
> +2.16 /proc/sys/fs/types/ - File system type specific parameters
> +----------------------------------------------------------------
> +
> +There's a separate directory /proc/sys/fs/types/<type>/ for each
> +filesystem type, containing the following files:
> +
> +usermount_safe
> +--------------
> +
> +Setting this to non-zero will allow filesystems of this type to be
> +mounted by unprivileged users (note, that there are other
> +prerequisites as well).
> +
> +Care should be taken when enabling this, since most
> +filesystems haven't been designed with unprivileged mounting
> +in mind.
> +
>  ------------------------------------------------------------------------------
> 

Do you think this is enough?  Or do we need something more, to prevent
sysadmin inadvertently setting this for an unsafe filesystem?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-01-22 22:59         ` Miklos Szeredi
@ 2008-01-23  0:00           ` Serge E. Hallyn
  0 siblings, 0 replies; 28+ messages in thread
From: Serge E. Hallyn @ 2008-01-23  0:00 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: serue, akpm, hch, viro, kzak, linux-fsdevel, linux-kernel,
	containers, util-linux-ng

Quoting Miklos Szeredi (miklos@szeredi.hu):
> > > > What do you think about doing this only if FS_SAFE is also set,
> > > > so for instance at first only FUSE would allow itself to be
> > > > made user-mountable?
> > > > 
> > > > A safe thing to do, or overly intrusive?
> > > 
> > > It goes somewhat against the "no policy in kernel" policy ;).  I think
> > > the warning in the documentation should be enough to make sysadmins
> > > think twice before doing anything foolish:
> > 
> > Warning in which documentation?  A sysadmin considering setting fs_safe
> > for ext2 or xfs isn't going to be looking at fuse docs, which I think is
> > what you're talking about.  Are you going to add a file under
> > Documentation/filesystems?
> 
> Yes, I meant documentation of the new sysctl tunable in
> Documentation/filesystems/proc.txt:

Argh, sorry.

> > Index: linux/Documentation/filesystems/proc.txt
> > ===================================================================
> > --- linux.orig/Documentation/filesystems/proc.txt	2008-01-16 13:25:07.000000000 +0100
> > +++ linux/Documentation/filesystems/proc.txt	2008-01-16 13:25:09.000000000 +0100
> > @@ -43,6 +43,7 @@ Table of Contents
> >    2.13	/proc/<pid>/oom_score - Display current oom-killer score
> >    2.14	/proc/<pid>/io - Display the IO accounting fields
> >    2.15	/proc/<pid>/coredump_filter - Core dump filtering settings
> > +  2.16	/proc/sys/fs/types - File system type specific parameters
> >  
> >  ------------------------------------------------------------------------------
> >  Preface
> > @@ -2283,4 +2284,21 @@ For example:
> >    $ echo 0x7 > /proc/self/coredump_filter
> >    $ ./some_program
> >  
> > +2.16 /proc/sys/fs/types/ - File system type specific parameters
> > +----------------------------------------------------------------
> > +
> > +There's a separate directory /proc/sys/fs/types/<type>/ for each
> > +filesystem type, containing the following files:
> > +
> > +usermount_safe
> > +--------------
> > +
> > +Setting this to non-zero will allow filesystems of this type to be
> > +mounted by unprivileged users (note, that there are other
> > +prerequisites as well).
> > +
> > +Care should be taken when enabling this, since most
> > +filesystems haven't been designed with unprivileged mounting
> > +in mind.
> > +
> >  ------------------------------------------------------------------------------
> > 
> 
> Do you think this is enough?  Or do we need something more, to prevent
> sysadmin inadvertently setting this for an unsafe filesystem?

I would think something more would be good.  First explaining
that fuse should be safe modulo warnings in the fuse documentation,
procfs and sysfs may be safe, while other filesystems are not known safe
at all.

Then explaining the dangers with not-known-safe filesystems and what is
needed to make them safe.  Clearly making sure input validation is
properly done so for instance getsb() doesn't turn into a buffer
overflow, etc.

Such a checklist also would be useful for holding a meaningful discussion
about the other filesystems and maybe turning some people loose on
an audit of other filesystems.

thanks,
-serge

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-02-07 14:36             ` Miklos Szeredi
@ 2008-02-07 16:57               ` Serge E. Hallyn
  0 siblings, 0 replies; 28+ messages in thread
From: Serge E. Hallyn @ 2008-02-07 16:57 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: serue, akpm, hch, linux-fsdevel, linux-kernel

Quoting Miklos Szeredi (miklos@szeredi.hu):
> > > > > Maybe sysctls just need to check capabilities, instead of uids.  I
> > > > > think that would make a lot of sense anyway.
> > > > 
> > > > Would it be as simple as tagging the inodes with capability sets?  One
> > > > set for writing, or one each for reading and writing?
> > > 
> > > Yes, or something even simpler, like mapping the owner permission bits
> > > to CAP_SYS_ADMIN.  There seem to be very few different permissions
> > > under /proc/sys:
> > > 
> > > --w-------
> > > -r--r--r--
> > > -rw-------
> > > -rw-r--r--
> > > 
> > > As long as the group and other bits are always the same, and we accept
> > > that the owner bits really mean CAP_SYS_ADMIN and not something else,
> > 
> > But I would assume some things under /proc/sys/net/ipv4 or
> > /proc/sys/net/ath0 require CAP_NET_ADMIN rather than CAP_SYS_ADMIN?
> 
> I guess so.  I'm not very familiar with the different capabilities :)
> 
> How about this patch then: a hybrid solution between just relying on
> permission bits, and specifying separate capability sets for read and
> write in addition to the permission bits.
> 
> Untested, the 'cap' field obviously still needs to be filled in where
> appropriate.
> 
> Miklos
> ----
> 
> Index: linux/include/linux/sysctl.h
> ===================================================================
> --- linux.orig/include/linux/sysctl.h	2008-02-04 12:29:01.000000000 +0100
> +++ linux/include/linux/sysctl.h	2008-02-07 15:19:06.000000000 +0100
> @@ -1041,6 +1041,7 @@ struct ctl_table 
>  	void *data;
>  	int maxlen;
>  	mode_t mode;
> +	int cap;			/* Capability needed to read/write */
>  	struct ctl_table *child;
>  	struct ctl_table *parent;	/* Automatically set */
>  	proc_handler *proc_handler;	/* Callback for text formatting */
> Index: linux/kernel/sysctl.c
> ===================================================================
> --- linux.orig/kernel/sysctl.c	2008-02-05 22:17:05.000000000 +0100
> +++ linux/kernel/sysctl.c	2008-02-07 15:30:45.000000000 +0100
> @@ -1527,14 +1527,26 @@ out:
>   * some sysctl variables are readonly even to root.
>   */
> 
> -static int test_perm(int mode, int op)
> +static int test_perm(struct ctl_table *table, int op)
>  {
> -	if (!current->euid)
> -		mode >>= 6;
> -	else if (in_egroup_p(0))
> -		mode >>= 3;
> +	int cap = table->cap;
> +	mode_t mode = table->mode;
> +
> +	if (!cap)
> +		cap = CAP_SYS_ADMIN;
> +
> +	if ((op & MAY_READ) && !(mode & S_IRUGO))
> +		return -EACCES;
> +
> +	if ((op & MAY_WRITE) && !(mode & S_IWUGO))
> +		return -EACCES;
> +
> +	if (capable(cap))
> +		return 0;
> +
>  	if ((mode & op & 0007) == op)
>  		return 0;
> +
>  	return -EACCES;

I like how simple it appears to be :)

At first I missed the fact that owning uid is always 0 so I thought the
uid processing wasn't quite enough.  But since it's always 0, the only
question is whether there are any /proc/sys files whose users currently
depend on being setgid 0 and setgid non-0 with no capabilities.

On my laptop, 'find /proc/sys -type f -perm -020' gives me no results,
so that is promising.

So this certainly seems like a good first step.  In fact, combined with
/proc/sys/ being partially remounted per container like /proc/sys/net is
doing, we may not even need to do anything with CAP_NS_OVERRIDE.

thanks,
-serge

>  }
> 
> @@ -1544,7 +1556,7 @@ int sysctl_perm(struct ctl_table *table,
>  	error = security_sysctl(table, op);
>  	if (error)
>  		return error;
> -	return test_perm(table->mode, op);
> +	return test_perm(table, op);
>  }
> 
>  #ifdef CONFIG_SYSCTL_SYSCALL

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-02-07 15:33   ` Aneesh Kumar K.V
@ 2008-02-07 16:24     ` Miklos Szeredi
  0 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2008-02-07 16:24 UTC (permalink / raw)
  To: aneesh.kumar; +Cc: miklos, akpm, hch, serue, linux-fsdevel, linux-kernel

> On Tue, Feb 05, 2008 at 10:36:23PM +0100, Miklos Szeredi wrote:
> > From: Miklos Szeredi <mszeredi@suse.cz>
> > 
> > Add the following:
> > 
> >   /proc/sys/fs/types/${FS_TYPE}/usermount_safe
> > 
> 
> 
> There is  /proc/fs/<type>/ already. Since it is file system specific
> shouldn't it go there ?

The problem is exactly that it's filesystem specific, each filesystem
creates it's own stuff there.

Also we have a rule tp not create new things under /proc.  Things
should either go into /sys or into /proc/sys.  And I think a sysctl is
more appropriate for this than something under /sys.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-02-05 21:36 ` [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property Miklos Szeredi
  2008-02-06 20:21   ` Serge E. Hallyn
@ 2008-02-07 15:33   ` Aneesh Kumar K.V
  2008-02-07 16:24     ` Miklos Szeredi
  1 sibling, 1 reply; 28+ messages in thread
From: Aneesh Kumar K.V @ 2008-02-07 15:33 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, hch, serue, linux-fsdevel, linux-kernel

On Tue, Feb 05, 2008 at 10:36:23PM +0100, Miklos Szeredi wrote:
> From: Miklos Szeredi <mszeredi@suse.cz>
> 
> Add the following:
> 
>   /proc/sys/fs/types/${FS_TYPE}/usermount_safe
> 


There is  /proc/fs/<type>/ already. Since it is file system specific
shouldn't it go there ?

-aneesh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-02-07 14:05           ` Serge E. Hallyn
@ 2008-02-07 14:36             ` Miklos Szeredi
  2008-02-07 16:57               ` Serge E. Hallyn
  0 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2008-02-07 14:36 UTC (permalink / raw)
  To: serue; +Cc: miklos, serue, akpm, hch, linux-fsdevel, linux-kernel

> > > > Maybe sysctls just need to check capabilities, instead of uids.  I
> > > > think that would make a lot of sense anyway.
> > > 
> > > Would it be as simple as tagging the inodes with capability sets?  One
> > > set for writing, or one each for reading and writing?
> > 
> > Yes, or something even simpler, like mapping the owner permission bits
> > to CAP_SYS_ADMIN.  There seem to be very few different permissions
> > under /proc/sys:
> > 
> > --w-------
> > -r--r--r--
> > -rw-------
> > -rw-r--r--
> > 
> > As long as the group and other bits are always the same, and we accept
> > that the owner bits really mean CAP_SYS_ADMIN and not something else,
> 
> But I would assume some things under /proc/sys/net/ipv4 or
> /proc/sys/net/ath0 require CAP_NET_ADMIN rather than CAP_SYS_ADMIN?

I guess so.  I'm not very familiar with the different capabilities :)

How about this patch then: a hybrid solution between just relying on
permission bits, and specifying separate capability sets for read and
write in addition to the permission bits.

Untested, the 'cap' field obviously still needs to be filled in where
appropriate.

Miklos
----

Index: linux/include/linux/sysctl.h
===================================================================
--- linux.orig/include/linux/sysctl.h	2008-02-04 12:29:01.000000000 +0100
+++ linux/include/linux/sysctl.h	2008-02-07 15:19:06.000000000 +0100
@@ -1041,6 +1041,7 @@ struct ctl_table 
 	void *data;
 	int maxlen;
 	mode_t mode;
+	int cap;			/* Capability needed to read/write */
 	struct ctl_table *child;
 	struct ctl_table *parent;	/* Automatically set */
 	proc_handler *proc_handler;	/* Callback for text formatting */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c	2008-02-05 22:17:05.000000000 +0100
+++ linux/kernel/sysctl.c	2008-02-07 15:30:45.000000000 +0100
@@ -1527,14 +1527,26 @@ out:
  * some sysctl variables are readonly even to root.
  */
 
-static int test_perm(int mode, int op)
+static int test_perm(struct ctl_table *table, int op)
 {
-	if (!current->euid)
-		mode >>= 6;
-	else if (in_egroup_p(0))
-		mode >>= 3;
+	int cap = table->cap;
+	mode_t mode = table->mode;
+
+	if (!cap)
+		cap = CAP_SYS_ADMIN;
+
+	if ((op & MAY_READ) && !(mode & S_IRUGO))
+		return -EACCES;
+
+	if ((op & MAY_WRITE) && !(mode & S_IWUGO))
+		return -EACCES;
+
+	if (capable(cap))
+		return 0;
+
 	if ((mode & op & 0007) == op)
 		return 0;
+
 	return -EACCES;
 }
 
@@ -1544,7 +1556,7 @@ int sysctl_perm(struct ctl_table *table,
 	error = security_sysctl(table, op);
 	if (error)
 		return error;
-	return test_perm(table->mode, op);
+	return test_perm(table, op);
 }
 
 #ifdef CONFIG_SYSCTL_SYSCALL

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-02-07  8:09         ` Miklos Szeredi
@ 2008-02-07 14:05           ` Serge E. Hallyn
  2008-02-07 14:36             ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: Serge E. Hallyn @ 2008-02-07 14:05 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: serue, akpm, hch, linux-fsdevel, linux-kernel

Quoting Miklos Szeredi (miklos@szeredi.hu):
> > > Maybe sysctls just need to check capabilities, instead of uids.  I
> > > think that would make a lot of sense anyway.
> > 
> > Would it be as simple as tagging the inodes with capability sets?  One
> > set for writing, or one each for reading and writing?
> 
> Yes, or something even simpler, like mapping the owner permission bits
> to CAP_SYS_ADMIN.  There seem to be very few different permissions
> under /proc/sys:
> 
> --w-------
> -r--r--r--
> -rw-------
> -rw-r--r--
> 
> As long as the group and other bits are always the same, and we accept
> that the owner bits really mean CAP_SYS_ADMIN and not something else,

But I would assume some things under /proc/sys/net/ipv4 or
/proc/sys/net/ath0 require CAP_NET_ADMIN rather than CAP_SYS_ADMIN?

> then the permission check would not need to look at uids or gids at
> all.
> 
> Miklos
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-02-06 22:45       ` Serge E. Hallyn
@ 2008-02-07  8:09         ` Miklos Szeredi
  2008-02-07 14:05           ` Serge E. Hallyn
  0 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2008-02-07  8:09 UTC (permalink / raw)
  To: serue; +Cc: miklos, serue, akpm, hch, linux-fsdevel, linux-kernel

> > Maybe sysctls just need to check capabilities, instead of uids.  I
> > think that would make a lot of sense anyway.
> 
> Would it be as simple as tagging the inodes with capability sets?  One
> set for writing, or one each for reading and writing?

Yes, or something even simpler, like mapping the owner permission bits
to CAP_SYS_ADMIN.  There seem to be very few different permissions
under /proc/sys:

--w-------
-r--r--r--
-rw-------
-rw-r--r--

As long as the group and other bits are always the same, and we accept
that the owner bits really mean CAP_SYS_ADMIN and not something else,
then the permission check would not need to look at uids or gids at
all.

Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-02-06 21:11     ` Miklos Szeredi
@ 2008-02-06 22:45       ` Serge E. Hallyn
  2008-02-07  8:09         ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: Serge E. Hallyn @ 2008-02-06 22:45 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: serue, akpm, hch, linux-fsdevel, linux-kernel

Quoting Miklos Szeredi (miklos@szeredi.hu):
> > > +	t->table[0].mode = 0644;
> > 
> > Yikes, this could be a problem for containers, as it's simply tied to
> > uid 0, whereas tying it to a capability would let us solve it with
> > capability bounds.
> > 
> > This might mean more urgency to get user namespaces working at least
> > with sysfs, else this is a quick way around having CAP_SYS_ADMIN taken
> > out of a container's capability bounding set.
> 
> I think I understand the problem, but not the solution.  How do user
> namespaces going to help?

Well it somewhat depends on how we implement userns for filesystems
in the first place, and whether we end up splitting sysfs into
sub-filesystems as I think Eric Biederman has been advocating.  My
thoughts had been running along the lines of just tagging vfsmounts
with userns of the mounting process.  A task from outside the mounting
process' namespace would get user other permissions whether or not
its uid was the owning uid or uid 0 (unless the task had CAP_NS_OVERRIDE).

But really it gets more complicated for sysfs than something like ext2
since we really want to be able to filter files and directories for
different namespaces...  Handling sysfs user namespaces before we sort
out the rest of the sysfs stuff (being hashed out with network
namespaces) seems like jumping the gun a bit.

> Maybe sysctls just need to check capabilities, instead of uids.  I
> think that would make a lot of sense anyway.

Would it be as simple as tagging the inodes with capability sets?  One
set for writing, or one each for reading and writing?

thanks,
-serge

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-02-06 20:21   ` Serge E. Hallyn
@ 2008-02-06 21:11     ` Miklos Szeredi
  2008-02-06 22:45       ` Serge E. Hallyn
  0 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2008-02-06 21:11 UTC (permalink / raw)
  To: serue; +Cc: miklos, akpm, hch, serue, linux-fsdevel, linux-kernel

> > +	t->table[0].mode = 0644;
> 
> Yikes, this could be a problem for containers, as it's simply tied to
> uid 0, whereas tying it to a capability would let us solve it with
> capability bounds.
> 
> This might mean more urgency to get user namespaces working at least
> with sysfs, else this is a quick way around having CAP_SYS_ADMIN taken
> out of a container's capability bounding set.

I think I understand the problem, but not the solution.  How do user
namespaces going to help?

Maybe sysctls just need to check capabilities, instead of uids.  I
think that would make a lot of sense anyway.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-02-05 21:36 ` [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property Miklos Szeredi
@ 2008-02-06 20:21   ` Serge E. Hallyn
  2008-02-06 21:11     ` Miklos Szeredi
  2008-02-07 15:33   ` Aneesh Kumar K.V
  1 sibling, 1 reply; 28+ messages in thread
From: Serge E. Hallyn @ 2008-02-06 20:21 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, hch, serue, linux-fsdevel, linux-kernel

Quoting Miklos Szeredi (miklos@szeredi.hu):
> From: Miklos Szeredi <mszeredi@suse.cz>
> 
> Add the following:
> 
>   /proc/sys/fs/types/${FS_TYPE}/usermount_safe
> 
> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>

Thanks, Miklos, good explanations in the docs.

Acked-by: Serge Hallyn <serue@us.ibm.com>

One comment inline, but not imo your problem :)

> ---
> 
> Index: linux/fs/filesystems.c
> ===================================================================
> --- linux.orig/fs/filesystems.c	2008-02-04 23:47:46.000000000 +0100
> +++ linux/fs/filesystems.c	2008-02-04 23:48:04.000000000 +0100
> @@ -12,6 +12,7 @@
>  #include <linux/kmod.h>
>  #include <linux/init.h>
>  #include <linux/module.h>
> +#include <linux/sysctl.h>
>  #include <asm/uaccess.h>
> 
>  /*
> @@ -51,6 +52,57 @@ static struct file_system_type **find_fi
>  	return p;
>  }
> 
> +#define MAX_FILESYSTEM_VARS 1
> +
> +struct filesystem_sysctl_table {
> +	struct ctl_table_header *header;
> +	struct ctl_table table[MAX_FILESYSTEM_VARS + 1];
> +};
> +
> +/*
> + * Create /sys/fs/types/${FSNAME} directory with per fs-type tunables.
> + */
> +static int filesystem_sysctl_register(struct file_system_type *fs)
> +{
> +	struct filesystem_sysctl_table *t;
> +	struct ctl_path path[] = {
> +		{ .procname = "fs", .ctl_name = CTL_FS },
> +		{ .procname = "types", .ctl_name = CTL_UNNUMBERED },
> +		{ .procname = fs->name, .ctl_name = CTL_UNNUMBERED },
> +		{ }
> +	};
> +
> +	t = kzalloc(sizeof(*t), GFP_KERNEL);
> +	if (!t)
> +		return -ENOMEM;
> +
> +
> +	t->table[0].ctl_name = CTL_UNNUMBERED;
> +	t->table[0].procname = "usermount_safe";
> +	t->table[0].maxlen = sizeof(int);
> +	t->table[0].data = &fs->fs_safe;
> +	t->table[0].mode = 0644;

Yikes, this could be a problem for containers, as it's simply tied to
uid 0, whereas tying it to a capability would let us solve it with
capability bounds.

This might mean more urgency to get user namespaces working at least
with sysfs, else this is a quick way around having CAP_SYS_ADMIN taken
out of a container's capability bounding set.

> +	t->table[0].proc_handler = &proc_dointvec;
> +
> +	t->header = register_sysctl_paths(path, t->table);
> +	if (!t->header) {
> +		kfree(t);
> +		return -ENOMEM;
> +	}
> +
> +	fs->sysctl_table = t;
> +
> +	return 0;
> +}
> +
> +static void filesystem_sysctl_unregister(struct file_system_type *fs)
> +{
> +	struct filesystem_sysctl_table *t = fs->sysctl_table;
> +
> +	unregister_sysctl_table(t->header);
> +	kfree(t);
> +}
> +
>  /**
>   *	register_filesystem - register a new filesystem
>   *	@fs: the file system structure
> @@ -80,6 +132,13 @@ int register_filesystem(struct file_syst
>  	else
>  		*p = fs;
>  	write_unlock(&file_systems_lock);
> +
> +	if (res == 0) {
> +		res = filesystem_sysctl_register(fs);
> +		if (res != 0)
> +			unregister_filesystem(fs);
> +	}
> +
>  	return res;
>  }
> 
> @@ -108,6 +167,7 @@ int unregister_filesystem(struct file_sy
>  			*tmp = fs->next;
>  			fs->next = NULL;
>  			write_unlock(&file_systems_lock);
> +			filesystem_sysctl_unregister(fs);
>  			return 0;
>  		}
>  		tmp = &(*tmp)->next;
> Index: linux/include/linux/fs.h
> ===================================================================
> --- linux.orig/include/linux/fs.h	2008-02-04 23:48:02.000000000 +0100
> +++ linux/include/linux/fs.h	2008-02-04 23:48:04.000000000 +0100
> @@ -1444,6 +1444,7 @@ struct file_system_type {
>  	struct module *owner;
>  	struct file_system_type * next;
>  	struct list_head fs_supers;
> +	struct filesystem_sysctl_table *sysctl_table;
> 
>  	struct lock_class_key s_lock_key;
>  	struct lock_class_key s_umount_key;
> Index: linux/Documentation/filesystems/proc.txt
> ===================================================================
> --- linux.orig/Documentation/filesystems/proc.txt	2008-02-04 23:47:58.000000000 +0100
> +++ linux/Documentation/filesystems/proc.txt	2008-02-04 23:48:04.000000000 +0100
> @@ -44,6 +44,7 @@ Table of Contents
>    2.14	/proc/<pid>/io - Display the IO accounting fields
>    2.15	/proc/<pid>/coredump_filter - Core dump filtering settings
>    2.16	/proc/<pid>/mountinfo - Information about mounts
> +  2.17	/proc/sys/fs/types - File system type specific parameters
> 
>  ------------------------------------------------------------------------------
>  Preface
> @@ -2392,4 +2393,34 @@ For more information see:
>    Documentation/filesystems/sharedsubtree.txt
> 
> 
> +2.17 /proc/sys/fs/types/ - File system type specific parameters
> +----------------------------------------------------------------
> +
> +There's a separate directory /proc/sys/fs/types/<type>/ for each
> +filesystem type, containing the following files:
> +
> +usermount_safe
> +--------------
> +
> +Setting this to non-zero will allow filesystems of this type to be
> +mounted by unprivileged users (note, that there are other
> +prerequisites as well).
> +
> +Fuse has been designed to be as safe as possible, and some
> +distributions already ship with unprivileged fuse mounts enabled by
> +default.  There are still some situations (multi-user systems with
> +untrusted users in particular), where enabling this for fuse might not
> +be appropriate.  For more details, see Documentation/filesystems/fuse.txt
> +
> +Procfs is also safe, but unprivileged mounting of it is not usually
> +necessary (bind mounting is equivalent).
> +
> +Most other filesystems are unsafe.  Here are just some of the issues,
> +that must be resolved before a filesystem can be declared safe:
> +
> + - no strict input checking (buffer overruns, directory loops, etc)
> + - network filesystem deadlocks when mounting from localhost
> + - no permission checking when opening the device
> + - changing mount options when mounting a new instance of a filesystem
> +
>  ------------------------------------------------------------------------------
> 
> --

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property
  2008-02-05 21:36 [patch 00/10] mount ownership and unprivileged mount syscall (v8) Miklos Szeredi
@ 2008-02-05 21:36 ` Miklos Szeredi
  2008-02-06 20:21   ` Serge E. Hallyn
  2008-02-07 15:33   ` Aneesh Kumar K.V
  0 siblings, 2 replies; 28+ messages in thread
From: Miklos Szeredi @ 2008-02-05 21:36 UTC (permalink / raw)
  To: akpm, hch, serue; +Cc: linux-fsdevel, linux-kernel

[-- Attachment #1: unprivileged-mounts-add-sysctl-tunable-for-safe-property.patch --]
[-- Type: text/plain, Size: 5007 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Add the following:

  /proc/sys/fs/types/${FS_TYPE}/usermount_safe

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/filesystems.c
===================================================================
--- linux.orig/fs/filesystems.c	2008-02-04 23:47:46.000000000 +0100
+++ linux/fs/filesystems.c	2008-02-04 23:48:04.000000000 +0100
@@ -12,6 +12,7 @@
 #include <linux/kmod.h>
 #include <linux/init.h>
 #include <linux/module.h>
+#include <linux/sysctl.h>
 #include <asm/uaccess.h>
 
 /*
@@ -51,6 +52,57 @@ static struct file_system_type **find_fi
 	return p;
 }
 
+#define MAX_FILESYSTEM_VARS 1
+
+struct filesystem_sysctl_table {
+	struct ctl_table_header *header;
+	struct ctl_table table[MAX_FILESYSTEM_VARS + 1];
+};
+
+/*
+ * Create /sys/fs/types/${FSNAME} directory with per fs-type tunables.
+ */
+static int filesystem_sysctl_register(struct file_system_type *fs)
+{
+	struct filesystem_sysctl_table *t;
+	struct ctl_path path[] = {
+		{ .procname = "fs", .ctl_name = CTL_FS },
+		{ .procname = "types", .ctl_name = CTL_UNNUMBERED },
+		{ .procname = fs->name, .ctl_name = CTL_UNNUMBERED },
+		{ }
+	};
+
+	t = kzalloc(sizeof(*t), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+
+	t->table[0].ctl_name = CTL_UNNUMBERED;
+	t->table[0].procname = "usermount_safe";
+	t->table[0].maxlen = sizeof(int);
+	t->table[0].data = &fs->fs_safe;
+	t->table[0].mode = 0644;
+	t->table[0].proc_handler = &proc_dointvec;
+
+	t->header = register_sysctl_paths(path, t->table);
+	if (!t->header) {
+		kfree(t);
+		return -ENOMEM;
+	}
+
+	fs->sysctl_table = t;
+
+	return 0;
+}
+
+static void filesystem_sysctl_unregister(struct file_system_type *fs)
+{
+	struct filesystem_sysctl_table *t = fs->sysctl_table;
+
+	unregister_sysctl_table(t->header);
+	kfree(t);
+}
+
 /**
  *	register_filesystem - register a new filesystem
  *	@fs: the file system structure
@@ -80,6 +132,13 @@ int register_filesystem(struct file_syst
 	else
 		*p = fs;
 	write_unlock(&file_systems_lock);
+
+	if (res == 0) {
+		res = filesystem_sysctl_register(fs);
+		if (res != 0)
+			unregister_filesystem(fs);
+	}
+
 	return res;
 }
 
@@ -108,6 +167,7 @@ int unregister_filesystem(struct file_sy
 			*tmp = fs->next;
 			fs->next = NULL;
 			write_unlock(&file_systems_lock);
+			filesystem_sysctl_unregister(fs);
 			return 0;
 		}
 		tmp = &(*tmp)->next;
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h	2008-02-04 23:48:02.000000000 +0100
+++ linux/include/linux/fs.h	2008-02-04 23:48:04.000000000 +0100
@@ -1444,6 +1444,7 @@ struct file_system_type {
 	struct module *owner;
 	struct file_system_type * next;
 	struct list_head fs_supers;
+	struct filesystem_sysctl_table *sysctl_table;
 
 	struct lock_class_key s_lock_key;
 	struct lock_class_key s_umount_key;
Index: linux/Documentation/filesystems/proc.txt
===================================================================
--- linux.orig/Documentation/filesystems/proc.txt	2008-02-04 23:47:58.000000000 +0100
+++ linux/Documentation/filesystems/proc.txt	2008-02-04 23:48:04.000000000 +0100
@@ -44,6 +44,7 @@ Table of Contents
   2.14	/proc/<pid>/io - Display the IO accounting fields
   2.15	/proc/<pid>/coredump_filter - Core dump filtering settings
   2.16	/proc/<pid>/mountinfo - Information about mounts
+  2.17	/proc/sys/fs/types - File system type specific parameters
 
 ------------------------------------------------------------------------------
 Preface
@@ -2392,4 +2393,34 @@ For more information see:
   Documentation/filesystems/sharedsubtree.txt
 
 
+2.17 /proc/sys/fs/types/ - File system type specific parameters
+----------------------------------------------------------------
+
+There's a separate directory /proc/sys/fs/types/<type>/ for each
+filesystem type, containing the following files:
+
+usermount_safe
+--------------
+
+Setting this to non-zero will allow filesystems of this type to be
+mounted by unprivileged users (note, that there are other
+prerequisites as well).
+
+Fuse has been designed to be as safe as possible, and some
+distributions already ship with unprivileged fuse mounts enabled by
+default.  There are still some situations (multi-user systems with
+untrusted users in particular), where enabling this for fuse might not
+be appropriate.  For more details, see Documentation/filesystems/fuse.txt
+
+Procfs is also safe, but unprivileged mounting of it is not usually
+necessary (bind mounting is equivalent).
+
+Most other filesystems are unsafe.  Here are just some of the issues,
+that must be resolved before a filesystem can be declared safe:
+
+ - no strict input checking (buffer overruns, directory loops, etc)
+ - network filesystem deadlocks when mounting from localhost
+ - no permission checking when opening the device
+ - changing mount options when mounting a new instance of a filesystem
+
 ------------------------------------------------------------------------------

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2008-02-07 16:58 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-16 12:31 [patch 00/10] mount ownership and unprivileged mount syscall (v7) Miklos Szeredi
2008-01-16 12:31 ` [patch 01/10] unprivileged mounts: add user mounts to the kernel Miklos Szeredi
2008-01-16 12:31 ` [patch 02/10] unprivileged mounts: allow unprivileged umount Miklos Szeredi
2008-01-16 12:31 ` [patch 03/10] unprivileged mounts: propagate error values from clone_mnt Miklos Szeredi
2008-01-16 12:31 ` [patch 04/10] unprivileged mounts: account user mounts Miklos Szeredi
2008-01-16 12:31 ` [patch 05/10] unprivileged mounts: allow unprivileged bind mounts Miklos Szeredi
2008-01-16 12:31 ` [patch 06/10] unprivileged mounts: allow unprivileged mounts Miklos Szeredi
2008-01-16 12:31 ` [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property Miklos Szeredi
2008-01-21 20:32   ` Serge E. Hallyn
2008-01-21 21:37     ` Miklos Szeredi
2008-01-22 20:48       ` Serge E. Hallyn
2008-01-22 22:59         ` Miklos Szeredi
2008-01-23  0:00           ` Serge E. Hallyn
2008-01-16 12:31 ` [patch 08/10] unprivileged mounts: make fuse safe Miklos Szeredi
2008-01-21 20:41   ` Serge E. Hallyn
2008-01-16 12:31 ` [patch 09/10] unprivileged mounts: propagation: inherit owner from parent Miklos Szeredi
2008-01-21 20:23   ` Serge E. Hallyn
2008-01-16 12:31 ` [patch 10/10] unprivileged mounts: add "no submounts" flag Miklos Szeredi
2008-02-05 21:36 [patch 00/10] mount ownership and unprivileged mount syscall (v8) Miklos Szeredi
2008-02-05 21:36 ` [patch 07/10] unprivileged mounts: add sysctl tunable for "safe" property Miklos Szeredi
2008-02-06 20:21   ` Serge E. Hallyn
2008-02-06 21:11     ` Miklos Szeredi
2008-02-06 22:45       ` Serge E. Hallyn
2008-02-07  8:09         ` Miklos Szeredi
2008-02-07 14:05           ` Serge E. Hallyn
2008-02-07 14:36             ` Miklos Szeredi
2008-02-07 16:57               ` Serge E. Hallyn
2008-02-07 15:33   ` Aneesh Kumar K.V
2008-02-07 16:24     ` Miklos Szeredi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).