linux-fsdevel.vger.kernel.org archive mirror
* [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts
@ 2015-05-14 17:30 Eric W. Biederman
  2015-05-14 17:33 ` [CFT][PATCH 04/10] fs: Add helper functions for permanently empty directories Eric W. Biederman
                   ` (4 more replies)
  0 siblings, 5 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-14 17:30 UTC (permalink / raw)
  To: Linux Containers
  Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo


The code is currently available at:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

   HEAD: a524faf520600968e58bbc732063fccf2fdf9199 mnt: Update fs_fully_visible to test for permanently empty directories

The problem:  Mounting a new instance of proc or sysfs can allow things
that a bind mount of those filesystems would not.

That is, the cases I am dealing with are:
     unshare --user --net --mount ; mount -t sysfs ...
     unshare --user --pid --mount ; mount -t proc ...

The big change is that this set of changes enforces the preservation of
locked mount flags from the existing mount to the new mount.  For
example, if proc was mounted read-only, the current code will allow
a new instance of proc to be mounted read-write, whereas this set of
changes enforces that proc remain read-only.

The other gotcha is that the current code does not properly detect empty
directories, so to prevent things slipping through the cracks this set of
changes annotates all mount points where nothing will be revealed if
the filesystem mounted on top is removed.

Enforcing the administrator's policy can actually matter in the real
world, as has been shown by the recent docker issue.

With this patchset I have two concerns:
- The enforcement of mount flag preservation on proc and sysfs may break
  things.  (I am especially worried about the implicit adding of nodev).

- I may have missed a filesystem mountpoint on proc or sysfs, which would
  make a fresh copy unmountable for no good reason.

I don't want to break userspace if I can help it, and the code has been
this way for a while so I figure there is time to find any pitfalls and
address them before this code gets merged.

So if this works for you, please give me your Tested-by.

The well known mountpoints for pseudo filesystems that I could find are:
/dev/ffs*/                 functionfs
/dev/gadget/               gadgetfs
/dev/mqueue                mqueue
/dev/oprofile/             oprofilefs
/dev/pts/                  devpts
/dlm/                      ocfs2_dlmfs
/ipath/                    ipathfs
/proc/fs/nfsd/             nfsd
/proc/openprom/            openpromfs
/proc/sys/fs/binfmt_misc/  binfmt_misc
/spu/                      spufs
/sys/firmware/efi/efivars/ efivarfs
/sys/fs/cgroup/            cgroup
/sys/fs/fuse/connections/  fusectl
/sys/fs/pstore/            pstore
/sys/fs/selinux/           selinuxfs
/sys/fs/smackfs/           smackfs
/sys/hypervisor/s390/      s390_hypfs
/sys/kernel/config/        configfs
/sys/kernel/debug/         debugfs
/sys/kernel/security/      securityfs
/sys/kernel/tracing/       tracefs
/var/lib/ibmasm/           ibmasmfs
/var/lib/nfs/rpc_pipefs/   rpc_pipefs

Eric W. Biederman (10):
      mnt: Refactor the logic for mounting sysfs and proc in a user namespace
      mnt: Modify fs_fully_visible to deal with mount attributes
      vfs: Ignore unlocked mounts in fs_fully_visible
      fs: Add helper functions for permanently empty directories.
      sysctl: Allow creating permanently empty directories.
      proc: Allow creating permanently empty directories.
      kernfs: Add support for always empty directories.
      sysfs: Add support for permanently empty directories.
      sysfs: Create mountpoints with sysfs_create_empty_dir
      mnt: Update fs_fully_visible to test for permanently empty directories

 arch/s390/hypfs/inode.c      | 12 ++----
 drivers/firmware/efi/efi.c   |  6 +--
 fs/configfs/mount.c          | 10 ++---
 fs/debugfs/inode.c           | 11 ++---
 fs/fuse/inode.c              |  9 ++---
 fs/kernfs/dir.c              | 38 +++++++++++++++++-
 fs/kernfs/inode.c            |  2 +
 fs/libfs.c                   | 96 ++++++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c               | 47 +++++++++++++++++++---
 fs/proc/generic.c            | 23 +++++++++++
 fs/proc/inode.c              |  3 ++
 fs/proc/internal.h           |  1 +
 fs/proc/proc_sysctl.c        | 37 +++++++++++++++++
 fs/proc/root.c               |  9 ++---
 fs/pstore/inode.c            | 12 ++----
 fs/sysfs/dir.c               | 34 ++++++++++++++++
 fs/sysfs/mount.c             |  5 +--
 fs/tracefs/inode.c           |  6 +--
 include/linux/fs.h           |  4 +-
 include/linux/kernfs.h       |  3 ++
 include/linux/sysctl.h       |  3 ++
 include/linux/sysfs.h        | 16 ++++++++
 kernel/cgroup.c              | 10 ++---
 kernel/sysctl.c              |  8 +---
 security/inode.c             | 10 ++---
 security/selinux/selinuxfs.c | 11 +++--
 security/smack/smackfs.c     |  8 ++--
 27 files changed, 344 insertions(+), 90 deletions(-)


* [CFT][PATCH 01/10] mnt: Refactor the logic for mounting sysfs and proc in a user namespace
       [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-05-14 17:31   ` Eric W. Biederman
  2015-05-14 17:32   ` [CFT][PATCH 02/10] mnt: Modify fs_fully_visible to deal with mount attributes Eric W. Biederman
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Linux Containers
  Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo


Fresh mounts of proc and sysfs are a very special case that works very
much like a bind mount.  Unfortunately the current structure cannot
preserve the MNT_LOCK... mount flags.  Therefore refactor the logic
into a form that can be modified to preserve those lock bits.

Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount
of the filesystem be fully visible in the current mount namespace,
before the filesystem may be mounted.

Move the logic for calling fs_fully_visible from proc and sysfs into
fs/namespace.c where it has greater access to mount namespace state.
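
As a rough illustration of the gate described above, here is a minimal
userspace sketch.  It is a toy model, not kernel code: struct toy_fs_type
and toy_userns_mount_check() are hypothetical names, the already_visible
argument stands in for the result of fs_fully_visible(), and the
FS_USERNS_MOUNT test is assumed to be the pre-existing check that
surrounds the new one.  Only the two flag values are taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values copied from the patch; everything else is a toy model. */
#define FS_USERNS_MOUNT   8
#define FS_USERNS_VISIBLE 32

struct toy_fs_type {
	const char *name;
	int fs_flags;
};

/*
 * Hypothetical stand-in for the user namespace branch of do_new_mount().
 * "already_visible" plays the role of fs_fully_visible()'s result.
 */
static int toy_userns_mount_check(const struct toy_fs_type *type,
				  bool already_visible)
{
	assert(type != NULL);

	/* Assumed existing rule: the type must opt in to userns mounts. */
	if (!(type->fs_flags & FS_USERNS_MOUNT))
		return -EPERM;
	/* New rule: some mount of this type must already be fully visible. */
	if ((type->fs_flags & FS_USERNS_VISIBLE) && !already_visible)
		return -EPERM;
	return 0;
}
```

In this model a type that sets only FS_USERNS_MOUNT is unaffected by the
visibility requirement, which matches the intent that only proc and sysfs
opt in to FS_USERNS_VISIBLE.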

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c     | 8 +++++++-
 fs/proc/root.c     | 5 +----
 fs/sysfs/mount.c   | 5 +----
 include/linux/fs.h | 2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 1b9e11167bae..8e7edaf60fe1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2332,6 +2332,8 @@ unlock:
 	return err;
 }
 
+static bool fs_fully_visible(struct file_system_type *fs_type);
+
 /*
  * create a new mount for userspace and request it to be added into the
  * namespace's tree
@@ -2363,6 +2365,10 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
 			flags |= MS_NODEV;
 			mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
 		}
+		if (type->fs_flags & FS_USERNS_VISIBLE) {
+			if (!fs_fully_visible(type))
+				return -EPERM;
+		}
 	}
 
 	mnt = vfs_kern_mount(type, flags, name, data);
@@ -3164,7 +3170,7 @@ bool current_chrooted(void)
 	return chrooted;
 }
 
-bool fs_fully_visible(struct file_system_type *type)
+static bool fs_fully_visible(struct file_system_type *type)
 {
 	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
 	struct mount *mnt;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index b7fa4bfe896a..64e1ab64bde6 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -112,9 +112,6 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 		ns = task_active_pid_ns(current);
 		options = data;
 
-		if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
-			return ERR_PTR(-EPERM);
-
 		/* Does the mounter have privilege over the pid namespace? */
 		if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
 			return ERR_PTR(-EPERM);
@@ -159,7 +156,7 @@ static struct file_system_type proc_fs_type = {
 	.name		= "proc",
 	.mount		= proc_mount,
 	.kill_sb	= proc_kill_sb,
-	.fs_flags	= FS_USERNS_MOUNT,
+	.fs_flags	= FS_USERNS_VISIBLE | FS_USERNS_MOUNT,
 };
 
 void __init proc_root_init(void)
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 8a49486bf30c..1c6ac6fcee9f 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -31,9 +31,6 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 	bool new_sb;
 
 	if (!(flags & MS_KERNMOUNT)) {
-		if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
-			return ERR_PTR(-EPERM);
-
 		if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
 			return ERR_PTR(-EPERM);
 	}
@@ -58,7 +55,7 @@ static struct file_system_type sysfs_fs_type = {
 	.name		= "sysfs",
 	.mount		= sysfs_mount,
 	.kill_sb	= sysfs_kill_sb,
-	.fs_flags	= FS_USERNS_MOUNT,
+	.fs_flags	= FS_USERNS_VISIBLE | FS_USERNS_MOUNT,
 };
 
 int __init sysfs_init(void)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 35ec87e490b1..2d24eeb8e59c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1897,6 +1897,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_USERNS_DEV_MOUNT	16 /* A userns mount does not imply MNT_NODEV */
+#define FS_USERNS_VISIBLE	32	/* FS must already be visible */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
@@ -1984,7 +1985,6 @@ extern int vfs_ustat(dev_t, struct kstatfs *);
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
 extern bool our_mnt(struct vfsmount *mnt);
-extern bool fs_fully_visible(struct file_system_type *);
 
 extern int current_umask(void);
 
-- 
2.2.1


* [CFT][PATCH 02/10] mnt: Modify fs_fully_visible to deal with mount attributes
       [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-05-14 17:31   ` [CFT][PATCH 01/10] mnt: Refactor the logic for mounting sysfs and proc in a user namespace Eric W. Biederman
@ 2015-05-14 17:32   ` Eric W. Biederman
  2015-05-14 17:32   ` [CFT][PATCH 03/10] vfs: Ignore unlocked mounts in fs_fully_visible Eric W. Biederman
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-14 17:32 UTC (permalink / raw)
  To: Linux Containers
  Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo


Ignore an existing mount if its locked attributes are less permissive
than the new mount's attributes.

On success ensure the new mount locks all of the same attributes as
the old mount.
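
The two rules above can be sketched in userspace as follows.  This is
a simplified model covering only the read-only and nodev attributes;
the TOY_* bit values are illustrative and are not the kernel's MNT_*
constants, and toy_flags_compatible() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values; the kernel's MNT_* constants differ. */
#define TOY_READONLY      0x01
#define TOY_NODEV         0x02
#define TOY_LOCK_READONLY 0x10
#define TOY_LOCK_NODEV    0x20
#define TOY_LOCK_MASK     (TOY_LOCK_READONLY | TOY_LOCK_NODEV)

/*
 * Returns true if a new mount requesting *new_flags may rely on an
 * existing mount with old_flags.  On success the existing mount's
 * lock bits are copied into *new_flags; on failure it is untouched.
 */
static bool toy_flags_compatible(int old_flags, int *new_flags)
{
	assert(new_flags != NULL);

	/* A locked attribute on the old mount must be requested again. */
	if ((old_flags & TOY_LOCK_READONLY) && !(*new_flags & TOY_READONLY))
		return false;
	if ((old_flags & TOY_LOCK_NODEV) && !(*new_flags & TOY_NODEV))
		return false;

	/* Preserve the locked attributes on the new mount. */
	*new_flags |= old_flags & TOY_LOCK_MASK;
	return true;
}
```

A read-write request against a locked read-only mount is skipped, while
a read-only request succeeds and inherits the lock bit.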

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 8e7edaf60fe1..fccee9924e8c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2332,7 +2332,7 @@ unlock:
 	return err;
 }
 
-static bool fs_fully_visible(struct file_system_type *fs_type);
+static bool fs_fully_visible(struct file_system_type *fs_type, int *new_mnt_flags);
 
 /*
  * create a new mount for userspace and request it to be added into the
@@ -2366,7 +2366,7 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
 			mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
 		}
 		if (type->fs_flags & FS_USERNS_VISIBLE) {
-			if (!fs_fully_visible(type))
+			if (!fs_fully_visible(type, &mnt_flags))
 				return -EPERM;
 		}
 	}
@@ -3170,9 +3170,10 @@ bool current_chrooted(void)
 	return chrooted;
 }
 
-static bool fs_fully_visible(struct file_system_type *type)
+static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 {
 	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+	int new_flags = *new_mnt_flags;
 	struct mount *mnt;
 	bool visible = false;
 
@@ -3191,6 +3192,25 @@ static bool fs_fully_visible(struct file_system_type *type)
 		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
 			continue;
 
+		/* Verify the mount flags are equal to or more permissive
+		 * than the proposed new mount.
+		 */
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) &&
+		    !(new_flags & MNT_READONLY))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
+		    !(new_flags & MNT_NODEV))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
+		    !(new_flags & MNT_NOSUID))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) &&
+		    !(new_flags & MNT_NOEXEC))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
+		    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
+			continue;
+
 		/* This mount is not fully visible if there are any child mounts
 		 * that cover anything except for empty directories.
 		 */
@@ -3201,6 +3221,12 @@ static bool fs_fully_visible(struct file_system_type *type)
 			if (inode->i_nlink > 2)
 				goto next;
 		}
+		/* Preserve the locked attributes */
+		*new_mnt_flags |= mnt->mnt.mnt_flags & (MNT_LOCK_READONLY | \
+							MNT_LOCK_NODEV    | \
+							MNT_LOCK_NOSUID   | \
+							MNT_LOCK_NOEXEC   | \
+							MNT_LOCK_ATIME);
 		visible = true;
 		goto found;
 	next:	;
-- 
2.2.1


* [CFT][PATCH 03/10] vfs: Ignore unlocked mounts in fs_fully_visible
       [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-05-14 17:31   ` [CFT][PATCH 01/10] mnt: Refactor the logic for mounting sysfs and proc in a user namespace Eric W. Biederman
  2015-05-14 17:32   ` [CFT][PATCH 02/10] mnt: Modify fs_fully_visible to deal with mount attributes Eric W. Biederman
@ 2015-05-14 17:32   ` Eric W. Biederman
  2015-05-14 17:34   ` [CFT][PATCH 06/10] proc: Allow creating permanently empty directories Eric W. Biederman
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-14 17:32 UTC (permalink / raw)
  To: Linux Containers
  Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo


Limit the mounts fs_fully_visible considers to locked mounts.
Unlocked mounts can always be unmounted, so considering them adds hassle
but no security benefit.
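
A minimal userspace sketch of that filtering, assuming the lock bit that
matters is the one on each child mount.  struct toy_child and
toy_fully_visible() are hypothetical names; the nlink > 2 test mirrors
the existing emptiness heuristic in the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_child {
	bool locked;		/* models MNT_LOCKED on the child mount */
	bool covers_dir;	/* the mountpoint is a directory */
	unsigned int nlink;	/* link count of the mountpoint inode */
};

/* True if no locked child mount can be hiding real content. */
static bool toy_fully_visible(const struct toy_child *children, size_t n)
{
	assert(children != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct toy_child *c = &children[i];

		if (!c->locked)
			continue;	/* can be unmounted, so it hides nothing for good */
		if (!c->covers_dir || c->nlink > 2)
			return false;	/* may cover something other than an empty dir */
	}
	return true;
}
```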

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index fccee9924e8c..3ede0669b8d2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3211,11 +3211,15 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 		    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
 			continue;
 
-		/* This mount is not fully visible if there are any child mounts
-		 * that cover anything except for empty directories.
+		/* This mount is not fully visible if there are any
+		 * locked child mounts that cover anything except for
+		 * empty directories.
 		 */
 		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
 			struct inode *inode = child->mnt_mountpoint->d_inode;
+			/* Only worry about locked mounts */
+			if (!(child->mnt.mnt_flags & MNT_LOCKED))
+				continue;
 			if (!S_ISDIR(inode->i_mode))
 				goto next;
 			if (inode->i_nlink > 2)
-- 
2.2.1


* [CFT][PATCH 04/10] fs: Add helper functions for permanently empty directories.
  2015-05-14 17:30 [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts Eric W. Biederman
@ 2015-05-14 17:33 ` Eric W. Biederman
  2015-05-14 17:33 ` [CFT][PATCH 05/10] sysctl: Allow creating " Eric W. Biederman
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-14 17:33 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Linux API, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages,
	Stéphane Graber, Eric Windisch, Greg Kroah-Hartman,
	Tejun Heo


To ensure it is safe to mount proc and sysfs I need to check if
filesystems that are mounted on top of them are mounted on truly empty
directories.  Given that some directories can gain entries over time,
knowing that a directory is empty right now is insufficient.

Therefore add supporting infrastructure for permanently empty
directories that proc and sysfs can use when they create mount points
for filesystems and fs_fully_visible can use to test for permanently
empty directories to ensure that nothing will be gained by mounting a
fresh copy of proc or sysfs.
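
The difference between "empty right now" and "permanently empty" can be
sketched in userspace like this.  It is a toy model with hypothetical
names (struct toy_inode, toy_make_empty_dir(), toy_is_empty_dir()); the
point is only that the test is an identity check against a private
operations table rather than a look at the current link count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_ops {
	const char *name;
};

/* One private table marks every permanently empty directory. */
static const struct toy_ops toy_empty_dir_ops = { "empty" };
static const struct toy_ops toy_regular_dir_ops = { "regular" };

struct toy_inode {
	const struct toy_ops *ops;
	unsigned int nlink;
};

static void toy_make_empty_dir(struct toy_inode *inode)
{
	assert(inode != NULL);
	inode->nlink = 2;			/* only "." and ".." */
	inode->ops = &toy_empty_dir_ops;
}

static bool toy_is_empty_dir(const struct toy_inode *inode)
{
	assert(inode != NULL);
	/* Identity test: true only after toy_make_empty_dir(). */
	return inode->ops == &toy_empty_dir_ops;
}
```

A regular directory with a link count of 2 is empty at the moment but is
not reported as permanently empty, because entries could still be added
to it later.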

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/libfs.c         | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  2 ++
 2 files changed, 98 insertions(+)

diff --git a/fs/libfs.c b/fs/libfs.c
index cb1fb4b9b637..02813592e121 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1093,3 +1093,99 @@ simple_nosetlease(struct file *filp, long arg, struct file_lock **flp,
 	return -EINVAL;
 }
 EXPORT_SYMBOL(simple_nosetlease);
+
+
+/*
+ * Operations for a permanently empty directory.
+ */
+static struct dentry *empty_dir_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	return ERR_PTR(-ENOENT);
+}
+
+static int empty_dir_getattr(struct vfsmount *mnt, struct dentry *dentry,
+				 struct kstat *stat)
+{
+	struct inode *inode = d_inode(dentry);
+	generic_fillattr(inode, stat);
+	return 0;
+}
+
+static int empty_dir_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	return -EPERM;
+}
+
+static int empty_dir_setxattr(struct dentry *dentry, const char *name,
+			      const void *value, size_t size, int flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static ssize_t empty_dir_getxattr(struct dentry *dentry, const char *name,
+				  void *value, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+
+static int empty_dir_removexattr(struct dentry *dentry, const char *name)
+{
+	return -EOPNOTSUPP;
+}
+
+static ssize_t empty_dir_listxattr(struct dentry *dentry, char *list, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct inode_operations empty_dir_inode_operations = {
+	.lookup		= empty_dir_lookup,
+	.permission	= generic_permission,
+	.setattr	= empty_dir_setattr,
+	.getattr	= empty_dir_getattr,
+	.setxattr	= empty_dir_setxattr,
+	.getxattr	= empty_dir_getxattr,
+	.removexattr	= empty_dir_removexattr,
+	.listxattr	= empty_dir_listxattr,
+};
+
+static loff_t empty_dir_llseek(struct file *file, loff_t offset, int whence)
+{
+	/* An empty directory has two entries . and .. at offsets 0 and 1 */
+	return generic_file_llseek_size(file, offset, whence, 2, 2);
+}
+
+static int empty_dir_readdir(struct file *file, struct dir_context *ctx)
+{
+	dir_emit_dots(file, ctx);
+	return 0;
+}
+
+static const struct file_operations empty_dir_operations = {
+	.llseek		= empty_dir_llseek,
+	.read		= generic_read_dir,
+	.iterate	= empty_dir_readdir,
+	.fsync		= noop_fsync,
+};
+
+
+void make_empty_dir_inode(struct inode *inode)
+{
+	set_nlink(inode, 2);
+	inode->i_mode = S_IFDIR | S_IRUGO | S_IXUGO;
+	inode->i_uid = GLOBAL_ROOT_UID;
+	inode->i_gid = GLOBAL_ROOT_GID;
+	inode->i_rdev = 0;
+	inode->i_size = 2;
+	inode->i_blkbits = PAGE_SHIFT;
+	inode->i_blocks = 0;
+
+	inode->i_op = &empty_dir_inode_operations;
+	inode->i_fop = &empty_dir_operations;
+}
+
+bool is_empty_dir_inode(struct inode *inode)
+{
+	return (inode->i_fop == &empty_dir_operations) &&
+		(inode->i_op == &empty_dir_inode_operations);
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2d24eeb8e59c..571aab91bfc0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2780,6 +2780,8 @@ extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned in
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
 extern const struct file_operations simple_dir_operations;
 extern const struct inode_operations simple_dir_inode_operations;
+extern void make_empty_dir_inode(struct inode *inode);
+extern bool is_empty_dir_inode(struct inode *inode);
 struct tree_descr { char *name; const struct file_operations *ops; int mode; };
 struct dentry *d_alloc_name(struct dentry *, const char *);
 extern int simple_fill_super(struct super_block *, unsigned long, struct tree_descr *);
-- 
2.2.1



* [CFT][PATCH 05/10] sysctl: Allow creating permanently empty directories.
  2015-05-14 17:30 [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts Eric W. Biederman
  2015-05-14 17:33 ` [CFT][PATCH 04/10] fs: Add helper functions for permanently empty directories Eric W. Biederman
@ 2015-05-14 17:33 ` Eric W. Biederman
       [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-14 17:33 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Linux API, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages,
	Stéphane Graber, Eric Windisch, Greg Kroah-Hartman,
	Tejun Heo


Add a magic sysctl table, permanently_empty_table, that when used to
create a directory forces that directory to be permanently empty.

Update the code to use make_empty_dir_inode when accessing permanently
empty directories.

Update the code to not allow adding to permanently empty directories.

Update /proc/sys/fs/binfmt_misc to be a permanently empty directory.
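
The sentinel idea can be sketched in userspace as below.  This is a toy
model, not the sysctl code: struct toy_dir, toy_insert() and the
nr_entries counter are hypothetical simplifications, and only the two
error codes follow the patch.  What matters is that the address of the
sentinel table, not its contents, carries the meaning.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_table {
	const char *procname;
};

/* Sentinel: compared by address only. */
static struct toy_table toy_permanently_empty[] = { { NULL } };

struct toy_dir {
	struct toy_table *marker;	/* points at the sentinel once sealed */
	int nr_entries;
};

static int toy_insert(struct toy_dir *dir, struct toy_table *table)
{
	assert(dir != NULL && table != NULL);

	/* Is this a permanently empty directory? */
	if (dir->marker == toy_permanently_empty)
		return -EROFS;

	/* Is the caller creating a permanently empty directory? */
	if (table == toy_permanently_empty) {
		if (dir->nr_entries != 0)
			return -EINVAL;	/* cannot seal a populated directory */
		dir->marker = toy_permanently_empty;
		return 0;
	}

	dir->nr_entries++;
	return 0;
}
```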

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/proc_sysctl.c  | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/sysctl.h |  3 +++
 kernel/sysctl.c        |  8 +-------
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index fea2561d773b..f9ade2caf438 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -19,6 +19,28 @@ static const struct inode_operations proc_sys_inode_operations;
 static const struct file_operations proc_sys_dir_file_operations;
 static const struct inode_operations proc_sys_dir_operations;
 
+/* Support for permanently empty directories */
+
+struct ctl_table permanently_empty_table[] = {
+	{ }
+};
+
+static bool is_empty_dir(struct ctl_table_header *head)
+{
+	return head->ctl_table[0].child == permanently_empty_table;
+}
+
+static void set_empty_dir(struct ctl_dir *dir)
+{
+	dir->header.ctl_table[0].child = permanently_empty_table;
+}
+
+static void clear_empty_dir(struct ctl_dir *dir)
+
+{
+	dir->header.ctl_table[0].child = NULL;
+}
+
 void proc_sys_poll_notify(struct ctl_table_poll *poll)
 {
 	if (!poll)
@@ -187,6 +209,17 @@ static int insert_header(struct ctl_dir *dir, struct ctl_table_header *header)
 	struct ctl_table *entry;
 	int err;
 
+	/* Is this a permanently empty directory? */
+	if (is_empty_dir(&dir->header))
+		return -EROFS;
+
+	/* Am I creating a permanently empty directory? */
+	if (header->ctl_table == permanently_empty_table) {
+		if (!RB_EMPTY_ROOT(&dir->root))
+			return -EINVAL;
+		set_empty_dir(dir);
+	}
+
 	dir->header.nreg++;
 	header->parent = dir;
 	err = insert_links(header);
@@ -202,6 +235,8 @@ fail:
 	erase_header(header);
 	put_links(header);
 fail_links:
+	if (header->ctl_table == permanently_empty_table)
+		clear_empty_dir(dir);
 	header->parent = NULL;
 	drop_sysctl_table(&dir->header);
 	return err;
@@ -419,6 +454,8 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 		inode->i_mode |= S_IFDIR;
 		inode->i_op = &proc_sys_dir_operations;
 		inode->i_fop = &proc_sys_dir_file_operations;
+		if (is_empty_dir(head))
+			make_empty_dir_inode(inode);
 	}
 out:
 	return inode;
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 795d5fea5697..71fd81994a82 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -188,6 +188,9 @@ struct ctl_table_header *register_sysctl_paths(const struct ctl_path *path,
 void unregister_sysctl_table(struct ctl_table_header * table);
 
 extern int sysctl_init(void);
+
+extern struct ctl_table permanently_empty_table[];
+
 #else /* CONFIG_SYSCTL */
 static inline struct ctl_table_header *register_sysctl_table(struct ctl_table * table)
 {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2082b1a88fb9..92f41a43875e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1531,12 +1531,6 @@ static struct ctl_table vm_table[] = {
 	{ }
 };
 
-#if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
-static struct ctl_table binfmt_misc_table[] = {
-	{ }
-};
-#endif
-
 static struct ctl_table fs_table[] = {
 	{
 		.procname	= "inode-nr",
@@ -1690,7 +1684,7 @@ static struct ctl_table fs_table[] = {
 	{
 		.procname	= "binfmt_misc",
 		.mode		= 0555,
-		.child		= binfmt_misc_table,
+		.child		= permanently_empty_table,
 	},
 #endif
 	{
-- 
2.2.1



* [CFT][PATCH 06/10] proc: Allow creating permanently empty directories.
       [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-05-14 17:32   ` [CFT][PATCH 03/10] vfs: Ignore unlocked mounts in fs_fully_visible Eric W. Biederman
@ 2015-05-14 17:34   ` Eric W. Biederman
  2015-05-14 17:34   ` [CFT][PATCH 07/10] kernfs: Add support for always " Eric W. Biederman
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-14 17:34 UTC (permalink / raw)
  To: Linux Containers
  Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo


Add a new function proc_mk_empty_dir that creates a directory that
cannot be added to.

Update the code to use make_empty_dir_inode when reporting
a permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Update /proc/openprom and /proc/fs/nfsd to be permanently empty directories.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/proc/generic.c  | 23 +++++++++++++++++++++++
 fs/proc/inode.c    |  3 +++
 fs/proc/internal.h |  1 +
 fs/proc/root.c     |  4 ++--
 4 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index df6327a2b865..e235c1544b22 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -373,6 +373,10 @@ static struct proc_dir_entry *__proc_create(struct proc_dir_entry **parent,
 		WARN(1, "create '/proc/%s' by hand\n", qstr.name);
 		return NULL;
 	}
+	if (S_ISDIR((*parent)->mode) && ((*parent)->proc_fops == NULL)) {
+		WARN(1, "attempt to add to permanently empty directory");
+		return NULL;
+	}
 
 	ent = kzalloc(sizeof(struct proc_dir_entry) + qstr.len + 1, GFP_KERNEL);
 	if (!ent)
@@ -455,6 +459,25 @@ struct proc_dir_entry *proc_mkdir(const char *name,
 }
 EXPORT_SYMBOL(proc_mkdir);
 
+struct proc_dir_entry *proc_mk_empty_dir(const char *name)
+{
+	umode_t mode = S_IFDIR | S_IRUGO | S_IXUGO;
+	struct proc_dir_entry *ent, *parent = NULL;
+
+	ent = __proc_create(&parent, name, mode, 2);
+	if (ent) {
+		ent->data = NULL;
+		ent->proc_fops = NULL;
+		ent->proc_iops = NULL;
+		if (proc_register(parent, ent) < 0) {
+			kfree(ent);
+			parent->nlink--;
+			ent = NULL;
+		}
+	}
+	return ent;
+}
+
 struct proc_dir_entry *proc_create_data(const char *name, umode_t mode,
 					struct proc_dir_entry *parent,
 					const struct file_operations *proc_fops,
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 8272aaba1bb0..b957ec618bda 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -445,6 +445,9 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 					inode->i_fop = &proc_reg_file_ops;
 			} else {
 				inode->i_fop = de->proc_fops;
+				if (S_ISDIR(inode->i_mode) &&
+				    (de->proc_fops == NULL))
+					make_empty_dir_inode(inode);
 			}
 		}
 	} else
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c835b94c0cd3..6bc2e7a12912 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -190,6 +190,7 @@ static inline struct proc_dir_entry *pde_get(struct proc_dir_entry *pde)
 	return pde;
 }
 extern void pde_put(struct proc_dir_entry *);
+struct proc_dir_entry *proc_mk_empty_dir(const char *name);
 
 /*
  * inode.c
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 64e1ab64bde6..b031fc3991c3 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -179,10 +179,10 @@ void __init proc_root_init(void)
 #endif
 	proc_mkdir("fs", NULL);
 	proc_mkdir("driver", NULL);
-	proc_mkdir("fs/nfsd", NULL); /* somewhere for the nfsd filesystem to be mounted */
+	proc_mk_empty_dir("fs/nfsd"); /* somewhere for the nfsd filesystem to be mounted */
 #if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
 	/* just give it a mountpoint */
-	proc_mkdir("openprom", NULL);
+	proc_mk_empty_dir("openprom");
 #endif
 	proc_tty_init();
 	proc_mkdir("bus", NULL);
-- 
2.2.1


* [CFT][PATCH 07/10] kernfs: Add support for always empty directories.
       [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                     ` (3 preceding siblings ...)
  2015-05-14 17:34   ` [CFT][PATCH 06/10] proc: Allow creating permanently empty directories Eric W. Biederman
@ 2015-05-14 17:34   ` Eric W. Biederman
  2015-05-14 17:35   ` [CFT][PATCH 08/10] sysfs: Add support for permanently " Eric W. Biederman
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-14 17:34 UTC (permalink / raw)
  To: Linux Containers
  Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo


Add a new function kernfs_create_empty_dir that can be used to create
a directory that cannot be modified.

Update the code to use make_empty_dir_inode when reporting a
permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.
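
For kernfs the marker is a flag bit on the node.  The sketch below is
a userspace toy, with hypothetical names (struct toy_node, toy_add_one(),
toy_rename_into()) and a made-up child counter; the flag value and the
-ENOENT result follow the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_EMPTY_DIR 0x1000	/* same bit the patch uses for KERNFS_EMPTY_DIR */

struct toy_node {
	unsigned int flags;
	int nr_children;
};

/* Adding a child to a flagged directory fails, as in kernfs_add_one(). */
static int toy_add_one(struct toy_node *parent)
{
	assert(parent != NULL);

	if (parent->flags & TOY_EMPTY_DIR)
		return -ENOENT;
	parent->nr_children++;
	return 0;
}

/* Renaming a node into a flagged directory is refused the same way. */
static int toy_rename_into(struct toy_node *old_parent,
			   struct toy_node *new_parent)
{
	assert(old_parent != NULL && new_parent != NULL);

	if (new_parent->flags & TOY_EMPTY_DIR)
		return -ENOENT;
	old_parent->nr_children--;
	new_parent->nr_children++;
	return 0;
}
```

Both paths have to be covered, since a rename is the other way an entry
could appear in a directory that is supposed to stay empty.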

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/kernfs/dir.c        | 38 +++++++++++++++++++++++++++++++++++++-
 fs/kernfs/inode.c      |  2 ++
 include/linux/kernfs.h |  3 +++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index f131fc23ffc4..8643e70536f8 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -585,6 +585,9 @@ int kernfs_add_one(struct kernfs_node *kn)
 		goto out_unlock;
 
 	ret = -ENOENT;
+	if (parent->flags & KERNFS_EMPTY_DIR)
+		goto out_unlock;
+
 	if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active(parent))
 		goto out_unlock;
 
@@ -776,6 +779,38 @@ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 	return ERR_PTR(rc);
 }
 
+/**
+ * kernfs_create_empty_dir - create an always empty directory
+ * @parent: parent in which to create a new directory
+ * @name: name of the new directory
+ *
+ * Returns the created node on success, ERR_PTR() value on failure.
+ */
+struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
+					    const char *name, void *priv)
+{
+	struct kernfs_node *kn;
+	int rc;
+
+	/* allocate */
+	kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR, KERNFS_DIR);
+	if (!kn)
+		return ERR_PTR(-ENOMEM);
+
+	kn->flags |= KERNFS_EMPTY_DIR;
+	kn->dir.root = parent->dir.root;
+	kn->ns = NULL;
+	kn->priv = priv;
+
+	/* link in */
+	rc = kernfs_add_one(kn);
+	if (!rc)
+		return kn;
+
+	kernfs_put(kn);
+	return ERR_PTR(rc);
+}
+
 static struct dentry *kernfs_iop_lookup(struct inode *dir,
 					struct dentry *dentry,
 					unsigned int flags)
@@ -1247,7 +1282,8 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 	mutex_lock(&kernfs_mutex);
 
 	error = -ENOENT;
-	if (!kernfs_active(kn) || !kernfs_active(new_parent))
+	if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
+	    (new_parent->flags & KERNFS_EMPTY_DIR))
 		goto out;
 
 	error = 0;
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 2da8493a380b..756dd56aaf60 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -296,6 +296,8 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
 	case KERNFS_DIR:
 		inode->i_op = &kernfs_dir_iops;
 		inode->i_fop = &kernfs_dir_fops;
+		if (kn->flags & KERNFS_EMPTY_DIR)
+			make_empty_dir_inode(inode);
 		break;
 	case KERNFS_FILE:
 		inode->i_size = kn->attr.size;
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 71ecdab1671b..4b479a0b3d61 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -45,6 +45,7 @@ enum kernfs_node_flag {
 	KERNFS_LOCKDEP		= 0x0100,
 	KERNFS_SUICIDAL		= 0x0400,
 	KERNFS_SUICIDED		= 0x0800,
+	KERNFS_EMPTY_DIR	= 0x1000,
 };
 
 /* @flags for kernfs_create_root() */
@@ -285,6 +286,8 @@ void kernfs_destroy_root(struct kernfs_root *root);
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 void *priv, const void *ns);
+struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
+					    const char *name, void *priv);
 struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 const char *name,
 					 umode_t mode, loff_t size,
-- 
2.2.1


* [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories.
       [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                     ` (4 preceding siblings ...)
  2015-05-14 17:34   ` [CFT][PATCH 07/10] kernfs: Add support for always " Eric W. Biederman
@ 2015-05-14 17:35   ` Eric W. Biederman
       [not found]     ` <87fv6zhxkp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-05-14 17:36   ` [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir Eric W. Biederman
  2015-05-14 17:37   ` [CFT][PATCH 10/10] mnt: Update fs_fully_visible to test for permanently empty directories Eric W. Biederman
  7 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-14 17:35 UTC (permalink / raw)
  To: Linux Containers
  Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo


Add two functions sysfs_create_empty_dir and sysfs_remove_empty_dir
that hang a permanently empty directory off of a kobject or remove
a permanently empty directory hanging from a kobject.  Export
these new functions so modular filesystems can use them.

As all permanently empty directories are, are names and used
for mounting other filesystems this seems like the right abstraction.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/sysfs/dir.c        | 34 ++++++++++++++++++++++++++++++++++
 include/linux/sysfs.h | 16 ++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 0b45ff42f374..8244741474d7 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -121,3 +121,37 @@ int sysfs_move_dir_ns(struct kobject *kobj, struct kobject *new_parent_kobj,
 
 	return kernfs_rename_ns(kn, new_parent, kn->name, new_ns);
 }
+
+/**
+ * sysfs_create_empty_dir - create an always empty directory
+ * @parent_kobj:  kobject that will contain this always empty directory
+ * @name: The name of the always empty directory to add
+ */
+int sysfs_create_empty_dir(struct kobject *parent_kobj, const char *name)
+{
+	struct kernfs_node *kn, *parent = parent_kobj->sd;
+
+	kn = kernfs_create_empty_dir(parent, name, NULL);
+	if (IS_ERR(kn)) {
+		if (PTR_ERR(kn) == -EEXIST)
+			sysfs_warn_dup(parent, name);
+		return PTR_ERR(kn);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(sysfs_create_empty_dir);
+
+/**
+ *	sysfs_remove_empty_dir - remove an always empty directory.
+ *	@parent_kobj: kobject that will contain this always empty directory
+ *	@name: The name of the always empty directory to remove
+ *
+ */
+void sysfs_remove_empty_dir(struct kobject *parent_kobj, const char *name)
+{
+	struct kernfs_node *parent = parent_kobj->sd;
+
+	kernfs_remove_by_name_ns(parent, name, NULL);
+}
+EXPORT_SYMBOL_GPL(sysfs_remove_empty_dir);
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 99382c0df17e..e156d419de75 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -210,6 +210,10 @@ int __must_check sysfs_rename_dir_ns(struct kobject *kobj, const char *new_name,
 int __must_check sysfs_move_dir_ns(struct kobject *kobj,
 				   struct kobject *new_parent_kobj,
 				   const void *new_ns);
+int __must_check sysfs_create_empty_dir(struct kobject *parent_kobj,
+					const char *name);
+void sysfs_remove_empty_dir(struct kobject *parent_kobj,
+			    const char *name);
 
 int __must_check sysfs_create_file_ns(struct kobject *kobj,
 				      const struct attribute *attr,
@@ -298,6 +302,18 @@ static inline int sysfs_move_dir_ns(struct kobject *kobj,
 	return 0;
 }
 
+static inline int sysfs_create_empty_dir(struct kobject *parent_kobj,
+					 const char *name)
+{
+	return 0;
+}
+
+static inline void sysfs_remove_empty_dir(struct kobject *parent_kobj,
+					  const char *name)
+{
+	return;
+}
+
 static inline int sysfs_create_file_ns(struct kobject *kobj,
 				       const struct attribute *attr,
 				       const void *ns)
-- 
2.2.1


* [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir
       [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                     ` (5 preceding siblings ...)
  2015-05-14 17:35   ` [CFT][PATCH 08/10] sysfs: Add support for permanently " Eric W. Biederman
@ 2015-05-14 17:36   ` Eric W. Biederman
       [not found]     ` <878ucrhxi9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-05-14 17:37   ` [CFT][PATCH 10/10] mnt: Update fs_fully_visible to test for permanently empty directories Eric W. Biederman
  7 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-14 17:36 UTC (permalink / raw)
  To: Linux Containers
  Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo


This allows for better documentation in the code and
it allows for a simpler and fully correct version of
fs_fully_visible to be written.

The mount points converted and their filesystems are:
/sys/hypervisor/s390/       s390_hypfs
/sys/kernel/config/         configfs
/sys/kernel/debug/          debugfs
/sys/firmware/efi/efivars/  efivarfs
/sys/fs/fuse/connections/   fusectl
/sys/fs/pstore/             pstore
/sys/kernel/tracing/        tracefs
/sys/fs/cgroup/             cgroup
/sys/kernel/security/       securityfs
/sys/fs/selinux/            selinuxfs
/sys/fs/smackfs/            smackfs
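
The conversions in this patch all follow one shape: create the mount
point, register the filesystem, and remove the mount point again if
registration fails.  A minimal userspace sketch of that shape follows.
The name table and every model_* function are hypothetical stand-ins,
and register_result simply plays the role of the value that
register_filesystem() would return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_MAX_DIRS 8

/* Hypothetical stand-in for a sysfs parent directory: a small name table. */
struct model_sysfs {
	const char *names[MODEL_MAX_DIRS];
};

static bool model_has_dir(const struct model_sysfs *s, const char *name)
{
	int i;

	for (i = 0; i < MODEL_MAX_DIRS; i++)
		if (s->names[i] && strcmp(s->names[i], name) == 0)
			return true;
	return false;
}

/* Create the mount point; duplicates are reported as -EEXIST. */
static int model_create_empty_dir(struct model_sysfs *s, const char *name)
{
	int i;

	if (model_has_dir(s, name))
		return -EEXIST;
	for (i = 0; i < MODEL_MAX_DIRS; i++) {
		if (!s->names[i]) {
			s->names[i] = name;
			return 0;
		}
	}
	return -ENOMEM;	/* table full */
}

/* Remove by parent and name, so no handle needs to be kept around. */
static void model_remove_empty_dir(struct model_sysfs *s, const char *name)
{
	int i;

	for (i = 0; i < MODEL_MAX_DIRS; i++)
		if (s->names[i] && strcmp(s->names[i], name) == 0)
			s->names[i] = NULL;
}

/* Init pattern: propagate the create error, unwind if registration fails. */
static int model_fs_init(struct model_sysfs *s, const char *name,
			 int register_result)
{
	int err;

	assert(name != NULL);
	err = model_create_empty_dir(s, name);
	if (err)
		return err;

	err = register_result;	/* stands in for register_filesystem() */
	if (err)
		model_remove_empty_dir(s, name);
	return err;
}
```

Because removal is keyed on the parent and the name, the caller no longer
needs a static kobject pointer, which is why each converted file drops
one.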

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 arch/s390/hypfs/inode.c      | 12 ++++--------
 drivers/firmware/efi/efi.c   |  6 ++----
 fs/configfs/mount.c          | 10 ++++------
 fs/debugfs/inode.c           | 11 ++++-------
 fs/fuse/inode.c              |  9 +++------
 fs/pstore/inode.c            | 12 ++++--------
 fs/tracefs/inode.c           |  6 ++----
 kernel/cgroup.c              | 10 ++++------
 security/inode.c             | 10 ++++------
 security/selinux/selinuxfs.c | 11 +++++------
 security/smack/smackfs.c     |  8 ++++----
 11 files changed, 40 insertions(+), 65 deletions(-)

diff --git a/arch/s390/hypfs/inode.c b/arch/s390/hypfs/inode.c
index d3f896a35b98..d943d36076cc 100644
--- a/arch/s390/hypfs/inode.c
+++ b/arch/s390/hypfs/inode.c
@@ -456,8 +456,6 @@ static const struct super_operations hypfs_s_ops = {
 	.show_options	= hypfs_show_options,
 };
 
-static struct kobject *s390_kobj;
-
 static int __init hypfs_init(void)
 {
 	int rc;
@@ -481,18 +479,16 @@ static int __init hypfs_init(void)
 		rc = -ENODATA;
 		goto fail_hypfs_sprp_exit;
 	}
-	s390_kobj = kobject_create_and_add("s390", hypervisor_kobj);
-	if (!s390_kobj) {
-		rc = -ENOMEM;
+	rc = sysfs_create_empty_dir(hypervisor_kobj, "s390");
+	if (rc)
 		goto fail_hypfs_diag0c_exit;
-	}
 	rc = register_filesystem(&hypfs_type);
 	if (rc)
 		goto fail_filesystem;
 	return 0;
 
 fail_filesystem:
-	kobject_put(s390_kobj);
+	sysfs_remove_empty_dir(hypervisor_kobj, "s390");
 fail_hypfs_diag0c_exit:
 	hypfs_diag0c_exit();
 fail_hypfs_sprp_exit:
@@ -510,7 +506,7 @@ fail_dbfs_exit:
 static void __exit hypfs_exit(void)
 {
 	unregister_filesystem(&hypfs_type);
-	kobject_put(s390_kobj);
+	sysfs_remove_empty_dir(hypervisor_kobj, "s390");
 	hypfs_diag0c_exit();
 	hypfs_sprp_exit();
 	hypfs_vm_exit();
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 3061bb8629dc..98523650efd9 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -65,7 +65,6 @@ static int __init parse_efi_cmdline(char *str)
 early_param("efi", parse_efi_cmdline);
 
 static struct kobject *efi_kobj;
-static struct kobject *efivars_kobj;
 
 /*
  * Let's not leave out systab information that snuck into
@@ -212,10 +211,9 @@ static int __init efisubsys_init(void)
 		goto err_remove_group;
 
 	/* and the standard mountpoint for efivarfs */
-	efivars_kobj = kobject_create_and_add("efivars", efi_kobj);
-	if (!efivars_kobj) {
+	error = sysfs_create_empty_dir(efi_kobj, "efivars");
+	if (error) {
 		pr_err("efivars: Subsystem registration failed.\n");
-		error = -ENOMEM;
 		goto err_remove_group;
 	}
 
diff --git a/fs/configfs/mount.c b/fs/configfs/mount.c
index da94e41bdbf6..b4d1580a6602 100644
--- a/fs/configfs/mount.c
+++ b/fs/configfs/mount.c
@@ -129,8 +129,6 @@ void configfs_release_fs(void)
 }
 
 
-static struct kobject *config_kobj;
-
 static int __init configfs_init(void)
 {
 	int err = -ENOMEM;
@@ -141,8 +139,8 @@ static int __init configfs_init(void)
 	if (!configfs_dir_cachep)
 		goto out;
 
-	config_kobj = kobject_create_and_add("config", kernel_kobj);
-	if (!config_kobj)
+	err = sysfs_create_empty_dir(kernel_kobj, "config");
+	if (err)
 		goto out2;
 
 	err = register_filesystem(&configfs_fs_type);
@@ -152,7 +150,7 @@ static int __init configfs_init(void)
 	return 0;
 out3:
 	pr_err("Unable to register filesystem!\n");
-	kobject_put(config_kobj);
+	sysfs_remove_empty_dir(kernel_kobj, "config");
 out2:
 	kmem_cache_destroy(configfs_dir_cachep);
 	configfs_dir_cachep = NULL;
@@ -163,7 +161,7 @@ out:
 static void __exit configfs_exit(void)
 {
 	unregister_filesystem(&configfs_fs_type);
-	kobject_put(config_kobj);
+	sysfs_remove_empty_dir(kernel_kobj, "config");
 	kmem_cache_destroy(configfs_dir_cachep);
 	configfs_dir_cachep = NULL;
 }
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index c1e7ffb0dab6..5bcb499980d0 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -716,20 +716,17 @@ bool debugfs_initialized(void)
 }
 EXPORT_SYMBOL_GPL(debugfs_initialized);
 
-
-static struct kobject *debug_kobj;
-
 static int __init debugfs_init(void)
 {
 	int retval;
 
-	debug_kobj = kobject_create_and_add("debug", kernel_kobj);
-	if (!debug_kobj)
-		return -EINVAL;
+	retval = sysfs_create_empty_dir(kernel_kobj, "debug");
+	if (retval)
+		return retval;
 
 	retval = register_filesystem(&debug_fs_type);
 	if (retval)
-		kobject_put(debug_kobj);
+		sysfs_remove_empty_dir(kernel_kobj, "debug");
 	else
 		debugfs_registered = true;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 082ac1c97f39..475d9cfa59a9 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1238,7 +1238,6 @@ static void fuse_fs_cleanup(void)
 }
 
 static struct kobject *fuse_kobj;
-static struct kobject *connections_kobj;
 
 static int fuse_sysfs_init(void)
 {
@@ -1250,11 +1249,9 @@ static int fuse_sysfs_init(void)
 		goto out_err;
 	}
 
-	connections_kobj = kobject_create_and_add("connections", fuse_kobj);
-	if (!connections_kobj) {
-		err = -ENOMEM;
+	err = sysfs_create_empty_dir(fuse_kobj, "connections");
+	if (err)
 		goto out_fuse_unregister;
-	}
 
 	return 0;
 
@@ -1266,7 +1263,7 @@ static int fuse_sysfs_init(void)
 
 static void fuse_sysfs_cleanup(void)
 {
-	kobject_put(connections_kobj);
+	sysfs_remove_empty_dir(fuse_kobj, "connections");
 	kobject_put(fuse_kobj);
 }
 
diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
index dc43b5f29305..d1caeefd2d1b 100644
--- a/fs/pstore/inode.c
+++ b/fs/pstore/inode.c
@@ -461,22 +461,18 @@ static struct file_system_type pstore_fs_type = {
 	.kill_sb	= pstore_kill_sb,
 };
 
-static struct kobject *pstore_kobj;
-
 static int __init init_pstore_fs(void)
 {
-	int err = 0;
+	int err;
 
 	/* Create a convenient mount point for people to access pstore */
-	pstore_kobj = kobject_create_and_add("pstore", fs_kobj);
-	if (!pstore_kobj) {
-		err = -ENOMEM;
+	err = sysfs_create_empty_dir(fs_kobj, "pstore");
+	if (err)
 		goto out;
-	}
 
 	err = register_filesystem(&pstore_fs_type);
 	if (err < 0)
-		kobject_put(pstore_kobj);
+		sysfs_remove_empty_dir(fs_kobj, "pstore");
 
 out:
 	return err;
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index d92bdf3b079a..e887c881a4b3 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -631,14 +631,12 @@ bool tracefs_initialized(void)
 	return tracefs_registered;
 }
 
-static struct kobject *trace_kobj;
-
 static int __init tracefs_init(void)
 {
 	int retval;
 
-	trace_kobj = kobject_create_and_add("tracing", kernel_kobj);
-	if (!trace_kobj)
+	retval = sysfs_create_empty_dir(kernel_kobj, "tracing");
+	if (retval)
 		return -EINVAL;
 
 	retval = register_filesystem(&trace_fs_type);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 469dd547770c..816657b5ef16 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1924,8 +1924,6 @@ static struct file_system_type cgroup_fs_type = {
 	.kill_sb = cgroup_kill_sb,
 };
 
-static struct kobject *cgroup_kobj;
-
 /**
  * task_cgroup_path - cgroup path of a task in the first cgroup hierarchy
  * @task: target task
@@ -5044,13 +5042,13 @@ int __init cgroup_init(void)
 			ss->bind(init_css_set.subsys[ssid]);
 	}
 
-	cgroup_kobj = kobject_create_and_add("cgroup", fs_kobj);
-	if (!cgroup_kobj)
-		return -ENOMEM;
+	err = sysfs_create_empty_dir(fs_kobj, "cgroup");
+	if (err)
+		return err;
 
 	err = register_filesystem(&cgroup_fs_type);
 	if (err < 0) {
-		kobject_put(cgroup_kobj);
+		sysfs_remove_empty_dir(fs_kobj, "cgroup");
 		return err;
 	}
 
diff --git a/security/inode.c b/security/inode.c
index 91503b79c5f8..d7e5de5ffc59 100644
--- a/security/inode.c
+++ b/security/inode.c
@@ -215,19 +215,17 @@ void securityfs_remove(struct dentry *dentry)
 }
 EXPORT_SYMBOL_GPL(securityfs_remove);
 
-static struct kobject *security_kobj;
-
 static int __init securityfs_init(void)
 {
 	int retval;
 
-	security_kobj = kobject_create_and_add("security", kernel_kobj);
-	if (!security_kobj)
-		return -EINVAL;
+	retval = sysfs_create_empty_dir(kernel_kobj, "security");
+	if (retval)
+		return retval;
 
 	retval = register_filesystem(&fs_type);
 	if (retval)
-		kobject_put(security_kobj);
+		sysfs_remove_empty_dir(kernel_kobj, "security");
 	return retval;
 }
 
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index d2787cca1fcb..a3d882729a45 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -1853,7 +1853,6 @@ static struct file_system_type sel_fs_type = {
 };
 
 struct vfsmount *selinuxfs_mount;
-static struct kobject *selinuxfs_kobj;
 
 static int __init init_sel_fs(void)
 {
@@ -1862,13 +1861,13 @@ static int __init init_sel_fs(void)
 	if (!selinux_enabled)
 		return 0;
 
-	selinuxfs_kobj = kobject_create_and_add("selinux", fs_kobj);
-	if (!selinuxfs_kobj)
-		return -ENOMEM;
+	err = sysfs_create_empty_dir(fs_kobj, "selinux");
+	if (err)
+		return err;
 
 	err = register_filesystem(&sel_fs_type);
 	if (err) {
-		kobject_put(selinuxfs_kobj);
+		sysfs_remove_empty_dir(fs_kobj, "selinux");
 		return err;
 	}
 
@@ -1887,7 +1886,7 @@ __initcall(init_sel_fs);
 #ifdef CONFIG_SECURITY_SELINUX_DISABLE
 void exit_sel_fs(void)
 {
-	kobject_put(selinuxfs_kobj);
+	sysfs_remove_empty_dir(fs_kobj, "selinux");
 	kern_unmount(selinuxfs_mount);
 	unregister_filesystem(&sel_fs_type);
 }
diff --git a/security/smack/smackfs.c b/security/smack/smackfs.c
index d9682985349e..35079cc8c765 100644
--- a/security/smack/smackfs.c
+++ b/security/smack/smackfs.c
@@ -2241,16 +2241,16 @@ static const struct file_operations smk_revoke_subj_ops = {
 	.llseek		= generic_file_llseek,
 };
 
-static struct kset *smackfs_kset;
 /**
  * smk_init_sysfs - initialize /sys/fs/smackfs
  *
  */
 static int smk_init_sysfs(void)
 {
-	smackfs_kset = kset_create_and_add("smackfs", NULL, fs_kobj);
-	if (!smackfs_kset)
-		return -ENOMEM;
+	int err;
+	err = sysfs_create_empty_dir(fs_kobj, "smackfs");
+	if (err)
+		return err;
 	return 0;
 }
 
-- 
2.2.1


* [CFT][PATCH 10/10] mnt: Update fs_fully_visible to test for permanently empty directories
       [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                     ` (6 preceding siblings ...)
  2015-05-14 17:36   ` [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir Eric W. Biederman
@ 2015-05-14 17:37   ` Eric W. Biederman
  7 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-14 17:37 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman, Tejun Heo


fs_fully_visible attempts to make fresh mounts of proc and sysfs give
the mounter no more access to proc and sysfs than they could have
obtained by creating a bind mount.  One aspect of proc and sysfs that
makes this particularly tricky is that there are other filesystems that
typically mount on top of proc and sysfs.  As those filesystems are
mounted on empty directories, in practice it is safe to ignore them.
However, testing to ensure filesystems are mounted on empty directories
has not been something the in-kernel data structures have supported, so
the current test for an empty directory, which checks to see
if nlink <= 2, is a bit lacking.

proc and sysfs have recently been modified to use the new empty_dir
infrastructure to create all of their dedicated mount points.  Instead
of testing for S_ISDIR(inode->i_mode) && i_nlink <= 2 to see if a
directory is empty, test for is_empty_dir_inode(inode).  That small
change guarantees mounts found on proc and sysfs really are safe to
ignore, because the directories are not only empty but nothing can
ever be added to them.  This guarantees there is nothing to worry
about when mounting proc and sysfs.
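
To make the difference between the two tests concrete, here is a small
userspace sketch.  It assumes the traditional convention that a
directory's link count is 2 plus the number of subdirectories, so
regular files do not raise it.  struct model_inode and both predicates
are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inode model, just enough to compare the two tests. */
struct model_inode {
	bool is_dir;
	unsigned int nlink;		/* 2 + number of subdirectories */
	unsigned int nr_files;		/* regular files; not counted in nlink */
	bool permanently_empty;		/* set once at creation, never cleared */
};

/* Old heuristic: a directory with no subdirectories looks empty. */
static bool old_looks_empty(const struct model_inode *inode)
{
	return inode->is_dir && inode->nlink <= 2;
}

/* New test: only directories created as permanently empty qualify. */
static bool new_is_empty(const struct model_inode *inode)
{
	return inode->is_dir && inode->permanently_empty;
}
```

A directory holding only regular files passes the old heuristic but not
the new test, and an ordinary directory that happens to be empty today
could still gain entries later.  Only the explicit marker rules out both
cases.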

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 3ede0669b8d2..eccd925c6e82 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3220,9 +3220,8 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 			/* Only worry about locked mounts */
 			if (!(mnt->mnt.mnt_flags & MNT_LOCKED))
 				continue;
-			if (!S_ISDIR(inode->i_mode))
-				goto next;
-			if (inode->i_nlink > 2)
+			/* Is the directory permanently empty? */
+			if (!is_empty_dir_inode(inode))
 				goto next;
 		}
 		/* Preserve the locked attributes */
-- 
2.2.1


* Re: [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts
  2015-05-14 17:30 [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts Eric W. Biederman
                   ` (2 preceding siblings ...)
       [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-05-14 20:29 ` Greg Kroah-Hartman
  2015-05-14 21:10   ` Eric W. Biederman
  2015-05-16  2:05 ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Eric W. Biederman
  4 siblings, 1 reply; 85+ messages in thread
From: Greg Kroah-Hartman @ 2015-05-14 20:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-fsdevel, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Tejun Heo

On Thu, May 14, 2015 at 12:30:45PM -0500, Eric W. Biederman wrote:
> 
> The code is currently available at:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
> 
>    HEAD: a524faf520600968e58bbc732063fccf2fdf9199 mnt: Update fs_fully_visible to test for permanently empty directories
> 
> The problem:  Mounting a new instance of proc or sysfs can allow things
> that a bind mount of those filesystems would not.
> 
> That is the cases I am dealing with are:
>      unshare --user --net --mount ; mount -t sysfs ...
>      unshare --user --pid --mount ; mount -t proc ...
> 
> The big change is that this set of changes enforces the preservation of
> locked mount flags, from the existing mount to the current mount.  Which
> means that if proc was mounted read-only the current code will allow
> a new instance of proc to be mounted read-write, and this set of changes
> enforces that proc remain read-only.
> 
> The other gotcha is that the current code does not properly detect empty
> directories so to prevent things slipping through the cracks this set of
> changes annotates all mount points where nothing will be revealed if
> the filesystem mounted on top is removed.
> 
> Enforcing the administrators policy can actually matter in the real
> world as has been shown by the recent docker issue.
> 
> With this patchset I have two concerns:
> - The enforcement of mount flag preservation on proc and sysfs may break
>   things.  (I am especially worried about the implicit adding of nodev).

What do you mean by this?  What got added?

> - I missed a filesystem mountpoint on proc or sysfs which would make a
>   fresh copy unmountable for no good reason.
> 
> I don't want to break userspace if I can help it, and the code has been
> this way for a while so I figure there is time to find any pitfalls and
> address them before this code gets merged.
> 
> So if this works for you please give me your Tested-By
> 
> The well known mountpoints for pseudo filesystems that I could find are:
> /dev/ffs*/                 functionfs
> /dev/gadget/               gadgetfs
> /dev/mqueue                mqueue
> /dev/oprofile/             oprofilefs
> /dev/pts/                  devpts

/dev/shm gets a tmpfs, right?  Or do those not matter here?

> /dlm/                      ocfs2_dlmfs
> /ipath/                    ipathfs
> /proc/fs/nfsd/             nfsd
> /proc/openprom/            openpromfs
> /proc/sys/fs/binfmt_misc/  binfmt_misc
> /spu/                      spufs

> /sys/firmware/efi/efivars/ efivarfs
> /sys/fs/cgroup/            cgroup
> /sys/fs/fuse/connections/  fusectl

I thought fuse mounted a few more things in here, but I don't know for
sure.

> /sys/fs/pstore/            pstore
> /sys/fs/selinux/           selinuxfs
> /sys/fs/smackfs/           smackfs
> /sys/hypervisor/s390/      s390_hypfs
> /sys/kernel/config/        configfs
> /sys/kernel/debug/         debugfs
> /sys/kernel/security/      securityfs
> /sys/kernel/tracing/       tracefs

I think these are all correct for sysfs, I have a minor comment on the
sysfs patch I'll make in it.

thanks,

greg k-h


* Re: [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories.
       [not found]     ` <87fv6zhxkp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-05-14 20:31       ` Greg Kroah-Hartman
       [not found]         ` <20150514203131.GB16416-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Greg Kroah-Hartman @ 2015-05-14 20:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger,
	Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber,
	Eric Windisch, Tejun Heo

On Thu, May 14, 2015 at 12:35:02PM -0500, Eric W. Biederman wrote:
> 
> Add two functions sysfs_create_empty_dir and sysfs_remove_empty_dir
> that hang a permanently empty directory off of a kobject or remove
> a permanently empty directory hanging from a kobject.  Export
> these new functions so modular filesystems can use them.
> 
> As all permanently empty directories are, are names and used
> for mounting other filesystems this seems like the right abstraction.

That sentence doesn't make much sense, cut and paste?

> 
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  fs/sysfs/dir.c        | 34 ++++++++++++++++++++++++++++++++++
>  include/linux/sysfs.h | 16 ++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
> index 0b45ff42f374..8244741474d7 100644
> --- a/fs/sysfs/dir.c
> +++ b/fs/sysfs/dir.c
> @@ -121,3 +121,37 @@ int sysfs_move_dir_ns(struct kobject *kobj, struct kobject *new_parent_kobj,
>  
>  	return kernfs_rename_ns(kn, new_parent, kn->name, new_ns);
>  }
> +
> +/**
> + * sysfs_create_empty_dir - create an always empty directory
> + * @parent_kobj:  kobject that will contain this always empty directory
> + * @name: The name of the always empty directory to add
> + */
> +int sysfs_create_empty_dir(struct kobject *parent_kobj, const char *name)

As this really is just a mount point, how about we be explicit with
this and call the function:
	sysfs_create_mount_point()
	sysfs_remove_mount_point()
That makes more sense in the long run, otherwise if you just want to
create an empty directory in sysfs, you can do so without making an
"empty" kobject and some people might do that accidentally in the
future.  This makes it more obvious as to what is going on.

thanks,

greg k-h


* Re: [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts
  2015-05-14 20:29 ` [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts Greg Kroah-Hartman
@ 2015-05-14 21:10   ` Eric W. Biederman
       [not found]     ` <87oalmg90j.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-14 21:10 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Linux Containers, linux-fsdevel, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Tejun Heo

Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:

> On Thu, May 14, 2015 at 12:30:45PM -0500, Eric W. Biederman wrote:
>> 
>> The code is currently available at:
>> 
>>    git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>> 
>>    HEAD: a524faf520600968e58bbc732063fccf2fdf9199 mnt: Update fs_fully_visible to test for permanently empty directories
>> 
>> The problem:  Mounting a new instance of proc or sysfs can allow things
>> that a bind mount of those filesystems would not.
>> 
>> That is the cases I am dealing with are:
>>      unshare --user --net --mount ; mount -t sysfs ...
>>      unshare --user --pid --mount ; mount -t proc ...
>> 
>> The big change is that this set of changes enforces the preservation of
>> locked mount flags, from the existing mount to the current mount.  Which
>> means that if proc was mounted read-only the current code will allow
>> a new instance of proc to be mounted read-write, and this set of changes
>> enforces that proc remain read-only.
>> 
>> The other gotcha is that the current code does not properly detect empty
>> directories so to prevent things slipping through the cracks this set of
>> changes annotates all mount points where nothing will be revealed if
>> the filesystem mounted on top is removed.
>> 
>> Enforcing the administrators policy can actually matter in the real
>> world as has been shown by the recent docker issue.
>> 
>> With this patchset I have two concerns:
>> - The enforcement of mount flag preservation on proc and sysfs may break
>>   things.  (I am especially worried about the implicit adding of nodev).
>
> What do you mean by this?  What got added?

In a user namespace mounting a filesystem implicitly adds nodev.

When I started enforcing not clearing bits that root had set on a
filesystem in mount -o remount the implicit nodev wound up being
an issue that broke userspace for no good reason.  The fix was
to implicitly add nodev in remount as well.

Taking a second look at this, nodev is implicitly added before the
fs_fully_visible check, so even for applications that know how the
original proc was mounted (and don't see an implicit nodev) and that
don't add nodev (because they ''know'' the mount flags) this change
should not be a problem.  Hooray!  One less scary thing.
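
The ordering argument above can be modeled roughly as follows.  This is
a sketch under the assumption that the check rejects a request which
drops a locked flag; the series could equally force the locked flags on,
in which case only the ordering point carries over.  The MODEL_* bits and
both functions are hypothetical and do not match the kernel's MNT_*
values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for the model. */
#define MODEL_RDONLY	0x1u
#define MODEL_NODEV	0x2u

/* True if every flag locked on the existing mount is kept by the request. */
static bool model_flags_preserved(unsigned int locked, unsigned int requested)
{
	return (requested & locked) == locked;
}

/* Model of a fresh mount: the implicit nodev is applied before the check. */
static bool model_mount_allowed(unsigned int locked, unsigned int requested,
				bool in_userns)
{
	if (in_userns)
		requested |= MODEL_NODEV;
	return model_flags_preserved(locked, requested);
}
```

With the implicit flag applied first, a caller inside a user namespace
that never asks for nodev still satisfies a locked nodev, while dropping
a locked read-only flag is refused.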

>> - I missed a filesystem mountpoint on proc or sysfs which would make a
>>   fresh copy unmountable for no good reason.
>> 
>> I don't want to break userspace if I can help it, and the code has been
>> this way for a while so I figure there is time to find any pitfalls and
>> address them before this code gets merged.
>> 
>> So if this works for you please give me your Tested-By
>> 
>> The well known mountpoints for pseudo filesystems that I could find are:
>> /dev/ffs*/                 functionfs
>> /dev/gadget/               gadgetfs
>> /dev/mqueue                mqueue
>> /dev/oprofile/             oprofilefs
>> /dev/pts/                  devpts
>
> /dev/shm gets a tmpfs, right?  Or do those not matter here?

It does, but it doesn't matter in this context.   I was looking for
things that mounted themselves on proc or sysfs and I catalogued the
rest just to know they were not mounted there.

>> /dlm/                      ocfs2_dlmfs
>> /ipath/                    ipathfs
>> /proc/fs/nfsd/             nfsd
>> /proc/openprom/            openpromfs
>> /proc/sys/fs/binfmt_misc/  binfmt_misc
>> /spu/                      spufs
>
>> /sys/firmware/efi/efivars/ efivarfs
>> /sys/fs/cgroup/            cgroup
>> /sys/fs/fuse/connections/  fusectl
>
> I thought fuse mounted a few more things in here, but I don't know for
> sure.

There are definitely some fuse attributes under /sys/fs/fuse/ but
I don't see anything else in the code that could be creating a mount
point.

>> /sys/fs/pstore/            pstore
>> /sys/fs/selinux/           selinuxfs
>> /sys/fs/smackfs/           smackfs
>> /sys/hypervisor/s390/      s390_hypfs
>> /sys/kernel/config/        configfs
>> /sys/kernel/debug/         debugfs
>> /sys/kernel/security/      securityfs
>> /sys/kernel/tracing/       tracefs
>
> I think these are all correct for sysfs, I have a minor comment on the
> sysfs patch I'll make in it.

Good to hear and I will answer there as well.

Eric



* Re: [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories.
       [not found]         ` <20150514203131.GB16416-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
@ 2015-05-14 21:33           ` Eric W. Biederman
  0 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-14 21:33 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Linux Containers, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger,
	Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber,
	Eric Windisch, Tejun Heo

Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> writes:

> On Thu, May 14, 2015 at 12:35:02PM -0500, Eric W. Biederman wrote:
>> 
>> Add two functions sysfs_create_empty_dir and sysfs_remove_empty_dir
>> that hang a permanently empty directory off of a kobject or remove
>> a permanently empty directory hanging from a kobject.  Export
>> these new functions so modular filesystems can use them.
>> 
>> As all permanently empty directories are, are names and used
>> for mouting other filesystems this seems like the right abstraction.
>
> That sentence doesn't make much sense, cut and paste?

Probably one edit too many or too few depending on how you look at it.

What I meant is that since the only interesting thing about a
permanently empty directory is its name, treating them like sysfs files
rather than normal sysfs directories which require a kobject seems like
the right abstraction.

>> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>> ---
>>  fs/sysfs/dir.c        | 34 ++++++++++++++++++++++++++++++++++
>>  include/linux/sysfs.h | 16 ++++++++++++++++
>>  2 files changed, 50 insertions(+)
>> 
>> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
>> index 0b45ff42f374..8244741474d7 100644
>> --- a/fs/sysfs/dir.c
>> +++ b/fs/sysfs/dir.c
>> @@ -121,3 +121,37 @@ int sysfs_move_dir_ns(struct kobject *kobj, struct kobject *new_parent_kobj,
>>  
>>  	return kernfs_rename_ns(kn, new_parent, kn->name, new_ns);
>>  }
>> +
>> +/**
>> + * sysfs_create_empty_dir - create an always empty directory
>> + * @parent_kobj:  kobject that will contain this always empty directory
>> + * @name: The name of the always empty directory to add
>> + */
>> +int sysfs_create_empty_dir(struct kobject *parent_kobj, const char *name)
>
> As this really is just a mount point, how about we be explicit with
> this and call the function:
> 	sysfs_create_mount_point()
> 	sysfs_remove_mount_point()
> That makes more sense in the long run, otherwise if you just want to
> create an empty directory in sysfs, you can do so without making an
> "empty" kobject and some people might do that accidentally in the
> future.  This makes it more obvious as to what is going on.

Yeah.  That seems fairly reasonable.

My brain is on the edge between the functional description of
creating a permanently empty directory, and the usage based
description (creating a directory to mount filesystems on).

But I agree a name that makes it totally obvious we are creating a
directory to mount something on is going to be more usable and
comprehensible.

My head doesn't like sysfs_create_mount_point() as a mount point can be
a file.  But I will put it on the back burner and see if I can come up
with something better, and if not sysfs_create_mount_point it is.

Brainstorming:

sysfs_create_expected_mount_point()
sysfs_reserve_dir_for_mount()
sysfs_create_dir_mount_point()
sysfs_create_expected_mount_point()

Partly I think I would like to rename the proc, sysctl and
infrastructure bits as well (consistency and clarity are good).

Where I get stuck is how do I ask the question:
I see this directory is a mount point, is it a directory whose sole
purpose in life is to be a mount point?

In the context of that question I like my naming of empty_dir as it
conveys what I am interested in.

But I like the sysfs_create_mount_point for general use.  Maybe I won't
make my names consistent.

I don't know.  I am putting this naming question on the back burner for
a bit.

Eric


* Re: [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts
       [not found]     ` <87oalmg90j.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-05-15  6:26       ` Andy Lutomirski
       [not found]         ` <CALCETrU1yxcDfv4YV3wVpWMAdiOOsSUFOPUpFAN-mVA4M-OxdQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-05-15  6:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Linux Containers, Linux FS Devel, Linux API,
	Serge E. Hallyn, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Tejun Heo

On Thu, May 14, 2015 at 2:10 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> writes:
>
>> On Thu, May 14, 2015 at 12:30:45PM -0500, Eric W. Biederman wrote:
>>>
>>> The code is currently available at:
>>>
>>>    git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>>>
>>>    HEAD: a524faf520600968e58bbc732063fccf2fdf9199 mnt: Update fs_fully_visible to test for permanently empty directories
>>>
>>> The problem:  Mounting a new instance of proc of sysfs can allow things
>>> that a bind mount of those filesystems would not.
>>>
>>> That is the cases I am dealing with are:
>>>      unshare --user --net --mount ; mount -t sysfs ...
>>>      unshare --user --pid --mount ; mount -t proc ...
>>>
>>> The big change is that this set of changes enforces the preservation of
>>> locked mount flags, from the existing mount to the current mount.  Which
>>> means that if proc was mounted read-only the current current will allow
>>> a new instance of proc to be mounted read-write, and this set of changes
>>> enforces that proc remain read-only.
>>>
>>> The other gotcha is that the current code does not properly detect empty
>>> directories so to prevent things slipping through the cracks this set of
>>> changes annotates all mount points where nothing will be revealed if
>>> the filesystem mounted on top is removed.
>>>
>>> Enforcing the administrators policy can actually matter in the real
>>> world as has been shown by the recent docker issue.
>>>
>>> With this patchset I have two concerns:
>>> - The enforcement of mount flag preservation on proc and sysfs may break
>>>   things.  (I am especially worried about the implicit adding of nodev).
>>
>> What do you mean by this?  What got added?
>
> In a user namespace mounting a filesystem implicitly adds nodev.
>
> When I started enforcing not clearing bits that root had set on a
> filesystem in mount -o remount the implicit nodev wound up being
> an issue that broke userspace for no good reason.  The fix was
> to implicitly add nodev in remount as well.
>
> Taking a second look at this, nodev is implicitly added before the
> fs_fully_visible check, so even for applications that know how the
> original proc was mounted (and don't see an implicit nodev) and that
> don't add nodev (because they ''know'' the mount flags) this change
> should not be a problem.  Hooray!  One less scary thing.

Can we please just get rid of this implicit nodev thing once and for all?  If it
breaks some really weird /proc use case, then I think the right fix is to
stop enforcing the nodev lock for the proc fully visible check.  After
all, /proc doesn't contain useful device nodes anyway.

Other than that, the code here looks okay to me on brief inspection.

--Andy


* Re: [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts
       [not found]         ` <CALCETrU1yxcDfv4YV3wVpWMAdiOOsSUFOPUpFAN-mVA4M-OxdQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-15  6:55           ` Eric W. Biederman
  0 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-15  6:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Greg Kroah-Hartman, Linux Containers, Linux FS Devel, Linux API,
	Serge E. Hallyn, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Tejun Heo

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> Can we please just get rid of this implicit nodev thing once and for all?  If it
> breaks some really weird /proc use case, then I think the right fix is to
> stop enforcing the nodev lock for the proc fully visible check.  After
> all, /proc doesn't contain useful device nodes anyway.

On second look I don't think that will actually cause issues in this
case.

I actually have a fix for the implicit nodev weirdness in my development
queue but it requires figuring out how to add s_user_ns to superblocks.
My last round of testing told me I was doing that wrong.

But if the implicit nodev is actually a problem I will definitely delay
this until I have that change ready to go as well.

> Other than that, the code here looks okay to me on brief inspection.

At a practical level I am concerned that enforcing things like noexec
and nosuid from the original normal global proc might cause problems for
things like sandstorm, lxc, and possibly libvirt-lxc.  So I would really
appreciate it if people associated with those projects could test this and
tell me if I break things.

Other than my stupid refactor in my code for /proc/fs/nfsd that causes
the kernel to oops :(  Doh!

Eric


* [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
  2015-05-14 17:30 [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts Eric W. Biederman
                   ` (3 preceding siblings ...)
  2015-05-14 20:29 ` [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts Greg Kroah-Hartman
@ 2015-05-16  2:05 ` Eric W. Biederman
  2015-05-16  2:06   ` [CFT][PATCH 02/10] mnt: Modify fs_fully_visible to deal with mount attributes Eric W. Biederman
       [not found]   ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  4 siblings, 2 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-16  2:05 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Linux API, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages,
	Stéphane Graber, Eric Windisch, Greg Kroah-Hartman,
	Tejun Heo

The code is currently available at:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

   HEAD: 513d98ba1adfa9e3178b6fc3b2fa57a622283d32 mnt: Update fs_fully_visible to test for permanently empty directories

The problem:  Mounting a new instance of proc or sysfs can allow things
that a bind mount of those filesystems would not.

That is, the cases I am dealing with are:
     unshare --user --net --mount ; mount -t sysfs ...
     unshare --user --pid --mount ; mount -t proc ...

This set of changes enforces the preservation of locked mount flags,
from the existing mount to the new mount.  Which means that if proc
was mounted read-only the current code will allow a new instance of
proc to be mounted read-write, and this set of changes enforces that
proc remain read-only.

This set of changes also updates sysctl, proc and sysfs to explicitly
create the directories they expect to be mount points as mount points.
This makes the code a little clearer and guarantees that when
fs_fully_visible disregards something mounted on proc or sysfs it is
safe to do so, unlike the current code which can occasionally let
things fall through the cracks.

These changes to enforce the administrator's policy can actually matter
in the real world as has been shown by the recent docker issue.

With this patchset I have two concerns:
- The enforcement of not being able to mount proc or sysfs with fewer
  mount flags than the existing mount may break something.

- There may be a filesystem that is commonly mounted on proc or sysfs
  whose mount point I missed annotating.  That would make mounting
  a fresh copy of proc or sysfs impossible.

I don't want to break userspace if I can help it, and the code has been
this way for a while so I figure there is time to find any pitfalls and
address them before this code gets merged.  Folks from lxc, sandstorm,
libvirt-lxc (anyone who uses user namespaces in the least): a
confirmation that I have not broken your existing code would be
appreciated.

If this works for you please give me your Tested-By

Since the first version I have renamed the directory creation calls to
sysfs_create_mount_point and proc_create_mount_point (as suggested
by Greg KH) so that it is very clear what the code that creates those
mount points is doing.  I have also fixed a stupid bug that slipped into
the proc code when I refactored it.  I have also gone through and retested
everything so hopefully nothing has slipped past me.

The well known mountpoints for pseudo filesystems that I could find are:
/dev/ffs*/                 functionfs
/dev/gadget/               gadgetfs
/dev/mqueue                mqueue
/dev/oprofile/             oprofilefs
/dev/pts/                  devpts
/dev/shm/                  tmpfs
/dlm/                      ocfs2_dlmfs
/ipath/                    ipathfs
/proc/fs/nfsd/             nfsd
/proc/openprom/            openpromfs
/proc/sys/fs/binfmt_misc/  binfmt_misc
/spu/                      spufs
/sys/firmware/efi/efivars/ efivarfs
/sys/fs/cgroup/            cgroup
/sys/fs/fuse/connections/  fusectl
/sys/fs/pstore/            pstore
/sys/fs/selinux/           selinuxfs
/sys/fs/smackfs/           smackfs
/sys/hypervisor/s390/      s390_hypfs
/sys/kernel/config/        configfs
/sys/kernel/debug/         debugfs
/sys/kernel/security/      securityfs
/sys/kernel/tracing/       tracefs
/var/lib/ibmasm/           ibmasmfs
/var/lib/nfs/rpc_pipefs/   rpc_pipefs

Eric W. Biederman (10):
      mnt: Refactor the logic for mounting sysfs and proc in a user namespace
      mnt: Modify fs_fully_visible to deal with mount attributes
      vfs: Ignore unlocked mounts in fs_fully_visible
      fs: Add helper functions for permanently empty directories.
      sysctl: Allow creating permanently empty directories that serve as mountpoints.
      proc: Allow creating permanently empty directories that serve as mount points
      kernfs: Add support for always empty directories.
      sysfs: Add support for permanently empty directories to serve as mount points.
      sysfs: Create mountpoints with sysfs_create_mount_point
      mnt: Update fs_fully_visible to test for permanently empty directories


 arch/s390/hypfs/inode.c      | 12 ++----
 drivers/firmware/efi/efi.c   |  6 +--
 fs/configfs/mount.c          | 10 ++---
 fs/debugfs/inode.c           | 11 ++---
 fs/fuse/inode.c              |  9 ++---
 fs/kernfs/dir.c              | 38 +++++++++++++++++-
 fs/kernfs/inode.c            |  2 +
 fs/libfs.c                   | 96 ++++++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c               | 47 +++++++++++++++++++---
 fs/proc/generic.c            | 23 +++++++++++
 fs/proc/inode.c              |  4 ++
 fs/proc/internal.h           |  6 +++
 fs/proc/proc_sysctl.c        | 37 +++++++++++++++++
 fs/proc/root.c               |  9 ++---
 fs/pstore/inode.c            | 12 ++----
 fs/sysfs/dir.c               | 34 ++++++++++++++++
 fs/sysfs/mount.c             |  5 +--
 fs/tracefs/inode.c           |  6 +--
 include/linux/fs.h           |  4 +-
 include/linux/kernfs.h       |  3 ++
 include/linux/sysctl.h       |  3 ++
 include/linux/sysfs.h        | 16 ++++++++
 kernel/cgroup.c              | 10 ++---
 kernel/sysctl.c              |  8 +---
 security/inode.c             | 10 ++---
 security/selinux/selinuxfs.c | 11 +++--
 security/smack/smackfs.c     |  8 ++--
 27 files changed, 350 insertions(+), 90 deletions(-)

Eric


* [CFT][PATCH 01/10] mnt: Refactor the logic for mounting sysfs and proc in a user namespace
       [not found]   ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-05-16  2:06     ` Eric W. Biederman
  2015-05-16  2:07     ` [CFT][PATCH 03/10] vfs: Ignore unlocked mounts in fs_fully_visible Eric W. Biederman
                       ` (8 subsequent siblings)
  9 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-16  2:06 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman, Tejun Heo


Fresh mounts of proc and sysfs are a very special case that works very
much like a bind mount.  Unfortunately the current structure can not
preserve the MNT_LOCK... mount flags.  Therefore refactor the logic
into a form that can be modified to preserve those lock bits.

Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount
of the filesystem be fully visible in the current mount namespace,
before the filesystem may be mounted.

Move the logic for calling fs_fully_visible from proc and sysfs into
fs/namespace.c where it has greater access to mount namespace state.
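
A minimal userspace sketch of the gate described above, not the kernel
code itself.  It assumes the check sits in the branch taken only when
the mounter is outside the initial user namespace, and that
FS_USERNS_MOUNT is tested first in that branch; toy_new_mount_check and
its boolean parameters are hypothetical names used for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Same bit values as the fs.h hunk in this patch. */
#define FS_USERNS_MOUNT    8   /* Can be mounted by userns root */
#define FS_USERNS_VISIBLE  32  /* FS must already be visible */

/*
 * Hypothetical model of the permission decision for a new mount.
 * already_visible stands in for the result of fs_fully_visible().
 * Returns 0 if the mount may proceed, -EPERM otherwise.
 */
int toy_new_mount_check(int fs_flags, bool in_init_userns,
			bool already_visible)
{
	if (!in_init_userns) {
		/* Assumed: only flagged filesystems may be mounted here. */
		if (!(fs_flags & FS_USERNS_MOUNT))
			return -EPERM;
		/* The new gate: some mount must already be fully visible. */
		if ((fs_flags & FS_USERNS_VISIBLE) && !already_visible)
			return -EPERM;
	}
	return 0;
}
```

With both flags set, as proc and sysfs have after this patch, a request
from a user namespace succeeds only when already_visible is true; a
request from the initial user namespace is not affected by the new flag.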

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c     | 8 +++++++-
 fs/proc/root.c     | 5 +----
 fs/sysfs/mount.c   | 5 +----
 include/linux/fs.h | 2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 1b9e11167bae..8e7edaf60fe1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2332,6 +2332,8 @@ unlock:
 	return err;
 }
 
+static bool fs_fully_visible(struct file_system_type *fs_type);
+
 /*
  * create a new mount for userspace and request it to be added into the
  * namespace's tree
@@ -2363,6 +2365,10 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
 			flags |= MS_NODEV;
 			mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
 		}
+		if (type->fs_flags & FS_USERNS_VISIBLE) {
+			if (!fs_fully_visible(type))
+				return -EPERM;
+		}
 	}
 
 	mnt = vfs_kern_mount(type, flags, name, data);
@@ -3164,7 +3170,7 @@ bool current_chrooted(void)
 	return chrooted;
 }
 
-bool fs_fully_visible(struct file_system_type *type)
+static bool fs_fully_visible(struct file_system_type *type)
 {
 	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
 	struct mount *mnt;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index b7fa4bfe896a..64e1ab64bde6 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -112,9 +112,6 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 		ns = task_active_pid_ns(current);
 		options = data;
 
-		if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
-			return ERR_PTR(-EPERM);
-
 		/* Does the mounter have privilege over the pid namespace? */
 		if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
 			return ERR_PTR(-EPERM);
@@ -159,7 +156,7 @@ static struct file_system_type proc_fs_type = {
 	.name		= "proc",
 	.mount		= proc_mount,
 	.kill_sb	= proc_kill_sb,
-	.fs_flags	= FS_USERNS_MOUNT,
+	.fs_flags	= FS_USERNS_VISIBLE | FS_USERNS_MOUNT,
 };
 
 void __init proc_root_init(void)
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 8a49486bf30c..1c6ac6fcee9f 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -31,9 +31,6 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 	bool new_sb;
 
 	if (!(flags & MS_KERNMOUNT)) {
-		if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
-			return ERR_PTR(-EPERM);
-
 		if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
 			return ERR_PTR(-EPERM);
 	}
@@ -58,7 +55,7 @@ static struct file_system_type sysfs_fs_type = {
 	.name		= "sysfs",
 	.mount		= sysfs_mount,
 	.kill_sb	= sysfs_kill_sb,
-	.fs_flags	= FS_USERNS_MOUNT,
+	.fs_flags	= FS_USERNS_VISIBLE | FS_USERNS_MOUNT,
 };
 
 int __init sysfs_init(void)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 35ec87e490b1..2d24eeb8e59c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1897,6 +1897,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_USERNS_DEV_MOUNT	16 /* A userns mount does not imply MNT_NODEV */
+#define FS_USERNS_VISIBLE	32	/* FS must already be visible */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
@@ -1984,7 +1985,6 @@ extern int vfs_ustat(dev_t, struct kstatfs *);
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
 extern bool our_mnt(struct vfsmount *mnt);
-extern bool fs_fully_visible(struct file_system_type *);
 
 extern int current_umask(void);
 
-- 
2.2.1


* [CFT][PATCH 02/10] mnt: Modify fs_fully_visible to deal with mount attributes
  2015-05-16  2:05 ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Eric W. Biederman
@ 2015-05-16  2:06   ` Eric W. Biederman
       [not found]   ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  1 sibling, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-16  2:06 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Linux API, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Kenton Varda, Michael Kerrisk,
	Stéphane Graber, Eric Windisch, Greg Kroah-Hartman,
	Tejun Heo


Ignore an existing mount if its locked attributes are less permissive
than the new mount's attributes.

On success ensure the new mount locks all of the same attributes as
the old mount.
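
The two rules above can be sketched in plain userspace C.  This is an
illustration only: the flag values are defined locally for the sketch,
and toy_flags_allow is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>

/* Flag bits local to this sketch; they only need to be distinct. */
#define MNT_NOSUID         0x01
#define MNT_NODEV          0x02
#define MNT_NOEXEC         0x04
#define MNT_NOATIME        0x08
#define MNT_NODIRATIME     0x10
#define MNT_RELATIME       0x20
#define MNT_READONLY       0x40
#define MNT_ATIME_MASK     (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME)

#define MNT_LOCK_ATIME     0x040000
#define MNT_LOCK_NOEXEC    0x080000
#define MNT_LOCK_NOSUID    0x100000
#define MNT_LOCK_NODEV     0x200000
#define MNT_LOCK_READONLY  0x400000
#define MNT_LOCK_ALL       (MNT_LOCK_ATIME | MNT_LOCK_NOEXEC | \
			    MNT_LOCK_NOSUID | MNT_LOCK_NODEV | \
			    MNT_LOCK_READONLY)

/*
 * Hypothetical helper: may a new mount requesting *new_flags rely on an
 * existing mount carrying old_flags?  On success the existing mount's
 * lock bits are copied into *new_flags; on failure it is left untouched.
 */
bool toy_flags_allow(int old_flags, int *new_flags)
{
	int want = *new_flags;

	/* A locked attribute must also be present on the new mount. */
	if ((old_flags & MNT_LOCK_READONLY) && !(want & MNT_READONLY))
		return false;
	if ((old_flags & MNT_LOCK_NODEV) && !(want & MNT_NODEV))
		return false;
	if ((old_flags & MNT_LOCK_NOSUID) && !(want & MNT_NOSUID))
		return false;
	if ((old_flags & MNT_LOCK_NOEXEC) && !(want & MNT_NOEXEC))
		return false;
	/* A locked atime policy must match exactly. */
	if ((old_flags & MNT_LOCK_ATIME) &&
	    ((old_flags & MNT_ATIME_MASK) != (want & MNT_ATIME_MASK)))
		return false;

	/* Preserve the locked attributes on the new mount. */
	*new_flags |= old_flags & MNT_LOCK_ALL;
	return true;
}
```

For an existing mount with MNT_READONLY | MNT_LOCK_READONLY, a request
without MNT_READONLY is refused, while a request with MNT_READONLY is
accepted and comes back with MNT_LOCK_READONLY set as well.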

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 8e7edaf60fe1..fccee9924e8c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2332,7 +2332,7 @@ unlock:
 	return err;
 }
 
-static bool fs_fully_visible(struct file_system_type *fs_type);
+static bool fs_fully_visible(struct file_system_type *fs_type, int *new_mnt_flags);
 
 /*
  * create a new mount for userspace and request it to be added into the
@@ -2366,7 +2366,7 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
 			mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
 		}
 		if (type->fs_flags & FS_USERNS_VISIBLE) {
-			if (!fs_fully_visible(type))
+			if (!fs_fully_visible(type, &mnt_flags))
 				return -EPERM;
 		}
 	}
@@ -3170,9 +3170,10 @@ bool current_chrooted(void)
 	return chrooted;
 }
 
-static bool fs_fully_visible(struct file_system_type *type)
+static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 {
 	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+	int new_flags = *new_mnt_flags;
 	struct mount *mnt;
 	bool visible = false;
 
@@ -3191,6 +3192,25 @@ static bool fs_fully_visible(struct file_system_type *type)
 		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
 			continue;
 
+		/* Verify the mount flags are equal to or more permissive
+		 * than the proposed new mount.
+		 */
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) &&
+		    !(new_flags & MNT_READONLY))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
+		    !(new_flags & MNT_NODEV))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
+		    !(new_flags & MNT_NOSUID))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) &&
+		    !(new_flags & MNT_NOEXEC))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
+		    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
+			continue;
+
 		/* This mount is not fully visible if there are any child mounts
 		 * that cover anything except for empty directories.
 		 */
@@ -3201,6 +3221,12 @@ static bool fs_fully_visible(struct file_system_type *type)
 			if (inode->i_nlink > 2)
 				goto next;
 		}
+		/* Preserve the locked attributes */
+		*new_mnt_flags |= mnt->mnt.mnt_flags & (MNT_LOCK_READONLY | \
+							MNT_LOCK_NODEV    | \
+							MNT_LOCK_NOSUID   | \
+							MNT_LOCK_NOEXEC   | \
+							MNT_LOCK_ATIME);
 		visible = true;
 		goto found;
 	next:	;
-- 
2.2.1



* [CFT][PATCH 03/10] vfs: Ignore unlocked mounts in fs_fully_visible
       [not found]   ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-05-16  2:06     ` [CFT][PATCH 01/10] mnt: Refactor the logic for mounting sysfs and proc in a user namespace Eric W. Biederman
@ 2015-05-16  2:07     ` Eric W. Biederman
  2015-05-16  2:07     ` [CFT][PATCH 04/10] fs: Add helper functions for permanently empty directories Eric W. Biederman
                       ` (7 subsequent siblings)
  9 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-16  2:07 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman, Tejun Heo


Limit the mounts fs_fully_visible considers to locked mounts.
Unlocked mounts can always be unmounted so considering them adds hassle
but no security benefit.
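
A small userspace sketch of the resulting child-mount scan.  The struct
and function names are hypothetical, and the sketch assumes the intent
stated above, namely that it is the child mount's own locked state that
decides whether the child is considered.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one child mount and its mountpoint inode. */
struct toy_child {
	bool locked;            /* child mount cannot be unmounted */
	bool mountpoint_is_dir; /* mountpoint inode is a directory */
	unsigned int nlink;     /* link count of the mountpoint inode */
};

/*
 * Returns true if no locked child covers anything other than a
 * directory whose link count shows no subdirectories (nlink <= 2).
 */
bool toy_children_hide_nothing(const struct toy_child *children, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		const struct toy_child *child = &children[i];

		/* Only worry about locked mounts. */
		if (!child->locked)
			continue;
		if (!child->mountpoint_is_dir)
			return false;
		if (child->nlink > 2)
			return false;
	}
	return true;
}
```

An unlocked child mounted over a populated directory does not disqualify
the parent in this model, because the caller could unmount it anyway;
the same child with locked set does.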

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index fccee9924e8c..3ede0669b8d2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3211,11 +3211,15 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 		    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
 			continue;
 
-		/* This mount is not fully visible if there are any child mounts
-		 * that cover anything except for empty directories.
+		/* This mount is not fully visible if there are any
+		 * locked child mounts that cover anything except for
+		 * empty directories.
 		 */
 		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
 			struct inode *inode = child->mnt_mountpoint->d_inode;
+			/* Only worry about locked mounts */
+			if (!(child->mnt.mnt_flags & MNT_LOCKED))
+				continue;
 			if (!S_ISDIR(inode->i_mode))
 				goto next;
 			if (inode->i_nlink > 2)
-- 
2.2.1


* [CFT][PATCH 04/10] fs: Add helper functions for permanently empty directories.
       [not found]   ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-05-16  2:06     ` [CFT][PATCH 01/10] mnt: Refactor the logic for mounting sysfs and proc in a user namespace Eric W. Biederman
  2015-05-16  2:07     ` [CFT][PATCH 03/10] vfs: Ignore unlocked mounts in fs_fully_visible Eric W. Biederman
@ 2015-05-16  2:07     ` Eric W. Biederman
  2015-05-16  2:08     ` [CFT][PATCH 05/10] sysctl: Allow creating permanently empty directories that serve as mountpoints Eric W. Biederman
                       ` (6 subsequent siblings)
  9 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-16  2:07 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman, Tejun Heo


To ensure it is safe to mount proc and sysfs I need to check if
filesystems that are mounted on top of them are mounted on truly empty
directories.  Given that some directories can gain entries over time,
knowing that a directory is empty right now is insufficient.

Therefore add supporting infrastructure for permanently empty
directories that proc and sysfs can use when they create mount points
for filesystems and fs_fully_visible can use to test for permanently
empty directories to ensure that nothing will be gained by mounting a
fresh copy of proc or sysfs.
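
The idea of marking a directory as permanently empty by the operation
tables it points at, rather than by its current contents, can be
sketched in userspace C.  All names below are hypothetical stand-ins;
the real helpers are the ones added to fs/libfs.c in this patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an operations table. */
struct toy_ops {
	const char *name;
};

/* Hypothetical stand-in for the parts of an inode that matter here. */
struct toy_inode {
	unsigned int nlink;
	const struct toy_ops *i_op;
	const struct toy_ops *i_fop;
};

static const struct toy_ops toy_ordinary_ops = { "ordinary directory" };
static const struct toy_ops toy_empty_inode_ops = { "empty dir inode ops" };
static const struct toy_ops toy_empty_file_ops = { "empty dir file ops" };

/* An ordinary directory that merely happens to be empty right now. */
void toy_make_ordinary_dir(struct toy_inode *inode)
{
	inode->nlink = 2;
	inode->i_op = &toy_ordinary_ops;
	inode->i_fop = &toy_ordinary_ops;
}

/* Mark the directory as one that can never gain entries. */
void toy_make_empty_dir(struct toy_inode *inode)
{
	inode->nlink = 2;
	inode->i_op = &toy_empty_inode_ops;
	inode->i_fop = &toy_empty_file_ops;
}

/* Identity test on the tables, not a test of current contents. */
bool toy_is_empty_dir(const struct toy_inode *inode)
{
	return inode->i_fop == &toy_empty_file_ops &&
	       inode->i_op == &toy_empty_inode_ops;
}
```

Both kinds of directory have a link count of 2 in this model, yet only
the one set up by toy_make_empty_dir passes the test, which is the
distinction a link-count check alone cannot make.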

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/libfs.c         | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  2 ++
 2 files changed, 98 insertions(+)

diff --git a/fs/libfs.c b/fs/libfs.c
index cb1fb4b9b637..02813592e121 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1093,3 +1093,99 @@ simple_nosetlease(struct file *filp, long arg, struct file_lock **flp,
 	return -EINVAL;
 }
 EXPORT_SYMBOL(simple_nosetlease);
+
+
+/*
+ * Operations for a permanently empty directory.
+ */
+static struct dentry *empty_dir_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	return ERR_PTR(-ENOENT);
+}
+
+static int empty_dir_getattr(struct vfsmount *mnt, struct dentry *dentry,
+				 struct kstat *stat)
+{
+	struct inode *inode = d_inode(dentry);
+	generic_fillattr(inode, stat);
+	return 0;
+}
+
+static int empty_dir_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	return -EPERM;
+}
+
+static int empty_dir_setxattr(struct dentry *dentry, const char *name,
+			      const void *value, size_t size, int flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static ssize_t empty_dir_getxattr(struct dentry *dentry, const char *name,
+				  void *value, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+
+static int empty_dir_removexattr(struct dentry *dentry, const char *name)
+{
+	return -EOPNOTSUPP;
+}
+
+static ssize_t empty_dir_listxattr(struct dentry *dentry, char *list, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct inode_operations empty_dir_inode_operations = {
+	.lookup		= empty_dir_lookup,
+	.permission	= generic_permission,
+	.setattr	= empty_dir_setattr,
+	.getattr	= empty_dir_getattr,
+	.setxattr	= empty_dir_setxattr,
+	.getxattr	= empty_dir_getxattr,
+	.removexattr	= empty_dir_removexattr,
+	.listxattr	= empty_dir_listxattr,
+};
+
+static loff_t empty_dir_llseek(struct file *file, loff_t offset, int whence)
+{
+	/* An empty directory has two entries . and .. at offsets 0 and 1 */
+	return generic_file_llseek_size(file, offset, whence, 2, 2);
+}
+
+static int empty_dir_readdir(struct file *file, struct dir_context *ctx)
+{
+	dir_emit_dots(file, ctx);
+	return 0;
+}
+
+static const struct file_operations empty_dir_operations = {
+	.llseek		= empty_dir_llseek,
+	.read		= generic_read_dir,
+	.iterate	= empty_dir_readdir,
+	.fsync		= noop_fsync,
+};
+
+
+void make_empty_dir_inode(struct inode *inode)
+{
+	set_nlink(inode, 2);
+	inode->i_mode = S_IFDIR | S_IRUGO | S_IXUGO;
+	inode->i_uid = GLOBAL_ROOT_UID;
+	inode->i_gid = GLOBAL_ROOT_GID;
+	inode->i_rdev = 0;
+	inode->i_size = 2;
+	inode->i_blkbits = PAGE_SHIFT;
+	inode->i_blocks = 0;
+
+	inode->i_op = &empty_dir_inode_operations;
+	inode->i_fop = &empty_dir_operations;
+}
+
+bool is_empty_dir_inode(struct inode *inode)
+{
+	return (inode->i_fop == &empty_dir_operations) &&
+		(inode->i_op == &empty_dir_inode_operations);
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2d24eeb8e59c..571aab91bfc0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2780,6 +2780,8 @@ extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned in
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
 extern const struct file_operations simple_dir_operations;
 extern const struct inode_operations simple_dir_inode_operations;
+extern void make_empty_dir_inode(struct inode *inode);
+extern bool is_empty_dir_inode(struct inode *inode);
 struct tree_descr { char *name; const struct file_operations *ops; int mode; };
 struct dentry *d_alloc_name(struct dentry *, const char *);
 extern int simple_fill_super(struct super_block *, unsigned long, struct tree_descr *);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [CFT][PATCH 05/10] sysctl: Allow creating permanently empty directories that serve as mountpoints.
       [not found]   ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                       ` (2 preceding siblings ...)
  2015-05-16  2:07     ` [CFT][PATCH 04/10] fs: Add helper functions for permanently empty directories Eric W. Biederman
@ 2015-05-16  2:08     ` Eric W. Biederman
  2015-05-16  2:08     ` [CFT][PATCH 06/10] proc: Allow creating permanently empty directories that serve as mount points Eric W. Biederman
                       ` (5 subsequent siblings)
  9 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-16  2:08 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman, Tejun Heo


Add a magic sysctl table, sysctl_mount_point, that when used to
create a directory forces that directory to be permanently empty.

Update the code to use make_empty_dir_inode when accessing permanently
empty directories.

Update the code to not allow adding to permanently empty directories.

Update /proc/sys/fs/binfmt_misc to be a permanently empty directory.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
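
The sentinel-table idea described above can be sketched in user space.
This is only an illustration under stated assumptions, not the kernel
code: toy_table, toy_dir, toy_mount_point, toy_is_empty_dir and
toy_insert are hypothetical stand-ins for struct ctl_table, struct
ctl_dir, sysctl_mount_point, is_empty_dir and insert_header, and a plain
counter replaces the rb-tree of entries.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ctl_table; only the fields used here. */
struct toy_table {
	const char *procname;
	struct toy_table *child;
};

/* Hypothetical stand-in for struct ctl_dir. */
struct toy_dir {
	struct toy_table self;	/* plays the role of header.ctl_table[0] */
	int nr_entries;		/* plays the role of the rb-tree of entries */
};

/* The sentinel: its address, not its contents, carries the meaning. */
static struct toy_table toy_mount_point[] = { { NULL, NULL } };

static bool toy_is_empty_dir(const struct toy_dir *dir)
{
	return dir->self.child == toy_mount_point;
}

/* Mirrors the two checks the patch adds at the top of insert_header(). */
static int toy_insert(struct toy_dir *dir, struct toy_table *table)
{
	assert(dir && table);

	/* Nothing may ever be added to a permanently empty directory. */
	if (toy_is_empty_dir(dir))
		return -EROFS;

	/* Only a directory that is still empty may be marked. */
	if (table == toy_mount_point) {
		if (dir->nr_entries != 0)
			return -EINVAL;
		dir->self.child = toy_mount_point;
		return 0;
	}

	dir->nr_entries++;
	return 0;
}
```

Comparing against the address of one shared table means no new field is
needed in the directory structure, which is presumably why the patch can
reuse the existing child pointer.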
 fs/proc/proc_sysctl.c  | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/sysctl.h |  3 +++
 kernel/sysctl.c        |  8 +-------
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index fea2561d773b..fdda62e6115e 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -19,6 +19,28 @@ static const struct inode_operations proc_sys_inode_operations;
 static const struct file_operations proc_sys_dir_file_operations;
 static const struct inode_operations proc_sys_dir_operations;
 
+/* Support for permanently empty directories */
+
+struct ctl_table sysctl_mount_point[] = {
+	{ }
+};
+
+static bool is_empty_dir(struct ctl_table_header *head)
+{
+	return head->ctl_table[0].child == sysctl_mount_point;
+}
+
+static void set_empty_dir(struct ctl_dir *dir)
+{
+	dir->header.ctl_table[0].child = sysctl_mount_point;
+}
+
+static void clear_empty_dir(struct ctl_dir *dir)
+
+{
+	dir->header.ctl_table[0].child = NULL;
+}
+
 void proc_sys_poll_notify(struct ctl_table_poll *poll)
 {
 	if (!poll)
@@ -187,6 +209,17 @@ static int insert_header(struct ctl_dir *dir, struct ctl_table_header *header)
 	struct ctl_table *entry;
 	int err;
 
+	/* Is this a permanently empty directory? */
+	if (is_empty_dir(&dir->header))
+		return -EROFS;
+
+	/* Am I creating a permanently empty directory? */
+	if (header->ctl_table == sysctl_mount_point) {
+		if (!RB_EMPTY_ROOT(&dir->root))
+			return -EINVAL;
+		set_empty_dir(dir);
+	}
+
 	dir->header.nreg++;
 	header->parent = dir;
 	err = insert_links(header);
@@ -202,6 +235,8 @@ fail:
 	erase_header(header);
 	put_links(header);
 fail_links:
+	if (header->ctl_table == sysctl_mount_point)
+		clear_empty_dir(dir);
 	header->parent = NULL;
 	drop_sysctl_table(&dir->header);
 	return err;
@@ -419,6 +454,8 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 		inode->i_mode |= S_IFDIR;
 		inode->i_op = &proc_sys_dir_operations;
 		inode->i_fop = &proc_sys_dir_file_operations;
+		if (is_empty_dir(head))
+			make_empty_dir_inode(inode);
 	}
 out:
 	return inode;
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 795d5fea5697..fa7bc29925c9 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -188,6 +188,9 @@ struct ctl_table_header *register_sysctl_paths(const struct ctl_path *path,
 void unregister_sysctl_table(struct ctl_table_header * table);
 
 extern int sysctl_init(void);
+
+extern struct ctl_table sysctl_mount_point[];
+
 #else /* CONFIG_SYSCTL */
 static inline struct ctl_table_header *register_sysctl_table(struct ctl_table * table)
 {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2082b1a88fb9..c3eee4c6d6c1 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1531,12 +1531,6 @@ static struct ctl_table vm_table[] = {
 	{ }
 };
 
-#if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
-static struct ctl_table binfmt_misc_table[] = {
-	{ }
-};
-#endif
-
 static struct ctl_table fs_table[] = {
 	{
 		.procname	= "inode-nr",
@@ -1690,7 +1684,7 @@ static struct ctl_table fs_table[] = {
 	{
 		.procname	= "binfmt_misc",
 		.mode		= 0555,
-		.child		= binfmt_misc_table,
+		.child		= sysctl_mount_point,
 	},
 #endif
 	{
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [CFT][PATCH 06/10] proc: Allow creating permanently empty directories that serve as mount points
       [not found]   ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                       ` (3 preceding siblings ...)
  2015-05-16  2:08     ` [CFT][PATCH 05/10] sysctl: Allow creating permanently empty directories that serve as mountpoints Eric W. Biederman
@ 2015-05-16  2:08     ` Eric W. Biederman
  2015-05-16  2:09     ` [CFT][PATCH 07/10] kernfs: Add support for always empty directories Eric W. Biederman
                       ` (4 subsequent siblings)
  9 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-16  2:08 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman, Tejun Heo


Add a new function proc_create_mount_point that, when used, creates a
directory that cannot be added to.

Add a new function is_empty_pde to test if a proc directory entry is a
mount point.

Update the code to use make_empty_dir_inode when reporting
a permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Update /proc/openprom and /proc/fs/nfsd to be permanently empty directories.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
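
A minimal user-space sketch of the is_empty_pde test and of the refusal
to create entries below a mount point.  Everything prefixed toy_ or TOY_
is hypothetical: toy_pde stands in for struct proc_dir_entry, the file
type bits are written out as the octal values Linux uses, and
toy_make_dir only loosely corresponds to __proc_create plus
proc_create_mount_point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy file-type bits (the octal values Linux uses for S_IFMT and S_IFDIR). */
#define TOY_IFMT  0170000u
#define TOY_IFDIR 0040000u

/* Hypothetical stand-in for struct proc_dir_entry. */
struct toy_pde {
	unsigned int mode;
	const void *iops;	/* NULL on a directory marks a mount point */
	struct toy_pde *parent;
};

/* Any non-NULL address serves as "ordinary directory operations". */
static const int toy_dir_iops = 0;

/* Same shape as is_empty_pde(): a directory without inode operations. */
static bool toy_is_empty_pde(const struct toy_pde *pde)
{
	return (pde->mode & TOY_IFMT) == TOY_IFDIR && !pde->iops;
}

/*
 * Initialise ent as a directory below parent.  Returns false, leaving
 * ent untouched, when parent is a permanently empty directory.
 */
static bool toy_make_dir(struct toy_pde *ent, struct toy_pde *parent,
			 bool mount_point)
{
	assert(ent);

	if (parent && toy_is_empty_pde(parent))
		return false;

	ent->mode = TOY_IFDIR | 0555u;
	ent->iops = mount_point ? NULL : &toy_dir_iops;
	ent->parent = parent;
	return true;
}
```

With this model, creating "fs" as an ordinary directory and "fs/nfsd" as
a mount point succeeds, while any later attempt to create something
below "fs/nfsd" is rejected.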
 fs/proc/generic.c  | 23 +++++++++++++++++++++++
 fs/proc/inode.c    |  4 ++++
 fs/proc/internal.h |  6 ++++++
 fs/proc/root.c     |  4 ++--
 4 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index df6327a2b865..e5dee5c3188e 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -373,6 +373,10 @@ static struct proc_dir_entry *__proc_create(struct proc_dir_entry **parent,
 		WARN(1, "create '/proc/%s' by hand\n", qstr.name);
 		return NULL;
 	}
+	if (is_empty_pde(*parent)) {
+		WARN(1, "attempt to add to permanently empty directory\n");
+		return NULL;
+	}
 
 	ent = kzalloc(sizeof(struct proc_dir_entry) + qstr.len + 1, GFP_KERNEL);
 	if (!ent)
@@ -455,6 +459,25 @@ struct proc_dir_entry *proc_mkdir(const char *name,
 }
 EXPORT_SYMBOL(proc_mkdir);
 
+struct proc_dir_entry *proc_create_mount_point(const char *name)
+{
+	umode_t mode = S_IFDIR | S_IRUGO | S_IXUGO;
+	struct proc_dir_entry *ent, *parent = NULL;
+
+	ent = __proc_create(&parent, name, mode, 2);
+	if (ent) {
+		ent->data = NULL;
+		ent->proc_fops = NULL;
+		ent->proc_iops = NULL;
+		if (proc_register(parent, ent) < 0) {
+			kfree(ent);
+			parent->nlink--;
+			ent = NULL;
+		}
+	}
+	return ent;
+}
+
 struct proc_dir_entry *proc_create_data(const char *name, umode_t mode,
 					struct proc_dir_entry *parent,
 					const struct file_operations *proc_fops,
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 8272aaba1bb0..e3eb5524639f 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -423,6 +423,10 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 		inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
 		PROC_I(inode)->pde = de;
 
+		if (is_empty_pde(de)) {
+			make_empty_dir_inode(inode);
+			return inode;
+		}
 		if (de->mode) {
 			inode->i_mode = de->mode;
 			inode->i_uid = de->uid;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c835b94c0cd3..aa2781095bd1 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -191,6 +191,12 @@ static inline struct proc_dir_entry *pde_get(struct proc_dir_entry *pde)
 }
 extern void pde_put(struct proc_dir_entry *);
 
+static inline bool is_empty_pde(const struct proc_dir_entry *pde)
+{
+	return S_ISDIR(pde->mode) && !pde->proc_iops;
+}
+struct proc_dir_entry *proc_create_mount_point(const char *name);
+
 /*
  * inode.c
  */
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 64e1ab64bde6..68feb0f70e63 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -179,10 +179,10 @@ void __init proc_root_init(void)
 #endif
 	proc_mkdir("fs", NULL);
 	proc_mkdir("driver", NULL);
-	proc_mkdir("fs/nfsd", NULL); /* somewhere for the nfsd filesystem to be mounted */
+	proc_create_mount_point("fs/nfsd"); /* somewhere for the nfsd filesystem to be mounted */
 #if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
 	/* just give it a mountpoint */
-	proc_mkdir("openprom", NULL);
+	proc_create_mount_point("openprom");
 #endif
 	proc_tty_init();
 	proc_mkdir("bus", NULL);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [CFT][PATCH 07/10] kernfs: Add support for always empty directories.
       [not found]   ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                       ` (4 preceding siblings ...)
  2015-05-16  2:08     ` [CFT][PATCH 06/10] proc: Allow creating permanently empty directories that serve as mount points Eric W. Biederman
@ 2015-05-16  2:09     ` Eric W. Biederman
  2015-05-16  2:09     ` [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories to serve as mount points Eric W. Biederman
                       ` (3 subsequent siblings)
  9 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-16  2:09 UTC (permalink / raw)
  To: Linux Containers
  Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo


Add a new function kernfs_create_empty_dir that can be used to create a
directory that cannot be modified.

Update the code to use make_empty_dir_inode when reporting a
permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
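
The flag-based variant used for kernfs can be modelled the same way.  The
sketch below is a user-space toy, not kernfs itself: toy_node, TOY_DIR,
TOY_EMPTY_DIR, toy_add_one and toy_rename are hypothetical names, and a
child counter replaces the real tree of nodes.  It only illustrates that
both adding a node and renaming a node into a flagged parent fail with
-ENOENT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum toy_node_flag {
	TOY_DIR		= 0x0001,
	TOY_EMPTY_DIR	= 0x1000,	/* same bit the patch picks for KERNFS_EMPTY_DIR */
};

/* Hypothetical stand-in for struct kernfs_node. */
struct toy_node {
	unsigned int flags;
	struct toy_node *parent;
	unsigned int nr_children;
};

/* Link kn below parent unless parent is an always empty directory. */
static int toy_add_one(struct toy_node *kn, struct toy_node *parent)
{
	assert(kn && parent);

	if (parent->flags & TOY_EMPTY_DIR)
		return -ENOENT;

	kn->parent = parent;
	parent->nr_children++;
	return 0;
}

/* Move kn below new_parent; an always empty directory is never a valid target. */
static int toy_rename(struct toy_node *kn, struct toy_node *new_parent)
{
	assert(kn && new_parent);

	if (new_parent->flags & TOY_EMPTY_DIR)
		return -ENOENT;

	if (kn->parent)
		kn->parent->nr_children--;
	kn->parent = new_parent;
	new_parent->nr_children++;
	return 0;
}
```

Checking the flag on both paths matters: guarding only creation would
still let an existing node be moved into the directory afterwards.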
 fs/kernfs/dir.c        | 38 +++++++++++++++++++++++++++++++++++++-
 fs/kernfs/inode.c      |  2 ++
 include/linux/kernfs.h |  3 +++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index f131fc23ffc4..47dc636d80ed 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -585,6 +585,9 @@ int kernfs_add_one(struct kernfs_node *kn)
 		goto out_unlock;
 
 	ret = -ENOENT;
+	if (parent->flags & KERNFS_EMPTY_DIR)
+		goto out_unlock;
+
 	if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active(parent))
 		goto out_unlock;
 
@@ -776,6 +779,38 @@ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 	return ERR_PTR(rc);
 }
 
+/**
+ * kernfs_create_empty_dir - create an always empty directory
+ * @parent: parent in which to create a new directory
+ * @name: name of the new directory
+ *
+ * Returns the created node on success, ERR_PTR() value on failure.
+ */
+struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
+					    const char *name)
+{
+	struct kernfs_node *kn;
+	int rc;
+
+	/* allocate */
+	kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR, KERNFS_DIR);
+	if (!kn)
+		return ERR_PTR(-ENOMEM);
+
+	kn->flags |= KERNFS_EMPTY_DIR;
+	kn->dir.root = parent->dir.root;
+	kn->ns = NULL;
+	kn->priv = NULL;
+
+	/* link in */
+	rc = kernfs_add_one(kn);
+	if (!rc)
+		return kn;
+
+	kernfs_put(kn);
+	return ERR_PTR(rc);
+}
+
 static struct dentry *kernfs_iop_lookup(struct inode *dir,
 					struct dentry *dentry,
 					unsigned int flags)
@@ -1247,7 +1282,8 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 	mutex_lock(&kernfs_mutex);
 
 	error = -ENOENT;
-	if (!kernfs_active(kn) || !kernfs_active(new_parent))
+	if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
+	    (new_parent->flags & KERNFS_EMPTY_DIR))
 		goto out;
 
 	error = 0;
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 2da8493a380b..756dd56aaf60 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -296,6 +296,8 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
 	case KERNFS_DIR:
 		inode->i_op = &kernfs_dir_iops;
 		inode->i_fop = &kernfs_dir_fops;
+		if (kn->flags & KERNFS_EMPTY_DIR)
+			make_empty_dir_inode(inode);
 		break;
 	case KERNFS_FILE:
 		inode->i_size = kn->attr.size;
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 71ecdab1671b..29d1896c3ba5 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -45,6 +45,7 @@ enum kernfs_node_flag {
 	KERNFS_LOCKDEP		= 0x0100,
 	KERNFS_SUICIDAL		= 0x0400,
 	KERNFS_SUICIDED		= 0x0800,
+	KERNFS_EMPTY_DIR	= 0x1000,
 };
 
 /* @flags for kernfs_create_root() */
@@ -285,6 +286,8 @@ void kernfs_destroy_root(struct kernfs_root *root);
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 void *priv, const void *ns);
+struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
+					    const char *name);
 struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 const char *name,
 					 umode_t mode, loff_t size,
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories to serve as mount points.
       [not found]   ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                       ` (5 preceding siblings ...)
  2015-05-16  2:09     ` [CFT][PATCH 07/10] kernfs: Add support for always empty directories Eric W. Biederman
@ 2015-05-16  2:09     ` Eric W. Biederman
  2015-05-18 13:14       ` Greg Kroah-Hartman
  2015-05-16  2:10     ` [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_mount_point Eric W. Biederman
                       ` (2 subsequent siblings)
  9 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-16  2:09 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman, Tejun Heo


Add two functions sysfs_create_mount_point and sysfs_remove_mount_point
that hang a permanently empty directory off of a kobject or remove a
permanently empty directory hanging from a kobject.  Export these new
functions so modular filesystems can use them.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
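
sysfs_create_mount_point turns the pointer-or-error result of
kernfs_create_empty_dir into a plain int.  For readers unfamiliar with
that idiom, here is a user-space approximation.  The toy_ helpers are
hypothetical re-creations of ERR_PTR, PTR_ERR and IS_ERR (assuming, as
the kernel does, that the top 4095 addresses are never valid pointers),
and the two toy_create_* back ends merely simulate success and a
duplicate name.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno value from such a pointer. */
static long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True when ptr lies in the last TOY_MAX_ERRNO bytes of the address space. */
static bool toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)(-TOY_MAX_ERRNO);
}

/* Two fake back ends standing in for the directory creation call. */
static int toy_node;

static void *toy_create_ok(const char *name)
{
	(void)name;
	return &toy_node;
}

static void *toy_create_dup(const char *name)
{
	(void)name;
	return toy_err_ptr(-EEXIST);
}

/* Same control flow as the wrapper in the patch: pointer in, int out. */
static int toy_create_mount_point(void *(*create)(const char *),
				  const char *name)
{
	void *kn;

	assert(create && name);
	kn = create(name);
	if (toy_is_err(kn))
		return (int)toy_ptr_err(kn);
	return 0;
}
```

Callers therefore only ever see 0 or a negative errno, which is what
lets the converted init functions propagate the real error instead of a
hard-coded -ENOMEM or -EINVAL.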
 fs/sysfs/dir.c        | 34 ++++++++++++++++++++++++++++++++++
 include/linux/sysfs.h | 16 ++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 0b45ff42f374..94374e435025 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -121,3 +121,37 @@ int sysfs_move_dir_ns(struct kobject *kobj, struct kobject *new_parent_kobj,
 
 	return kernfs_rename_ns(kn, new_parent, kn->name, new_ns);
 }
+
+/**
+ * sysfs_create_mount_point - create an always empty directory
+ * @parent_kobj:  kobject that will contain this always empty directory
+ * @name: The name of the always empty directory to add
+ */
+int sysfs_create_mount_point(struct kobject *parent_kobj, const char *name)
+{
+	struct kernfs_node *kn, *parent = parent_kobj->sd;
+
+	kn = kernfs_create_empty_dir(parent, name);
+	if (IS_ERR(kn)) {
+		if (PTR_ERR(kn) == -EEXIST)
+			sysfs_warn_dup(parent, name);
+		return PTR_ERR(kn);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(sysfs_create_mount_point);
+
+/**
+ *	sysfs_remove_mount_point - remove an always empty directory.
+ *	@parent_kobj: kobject that will contain this always empty directory
+ *	@name: The name of the always empty directory to remove
+ *
+ */
+void sysfs_remove_mount_point(struct kobject *parent_kobj, const char *name)
+{
+	struct kernfs_node *parent = parent_kobj->sd;
+
+	kernfs_remove_by_name_ns(parent, name, NULL);
+}
+EXPORT_SYMBOL_GPL(sysfs_remove_mount_point);
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 99382c0df17e..3e7e41acc451 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -210,6 +210,10 @@ int __must_check sysfs_rename_dir_ns(struct kobject *kobj, const char *new_name,
 int __must_check sysfs_move_dir_ns(struct kobject *kobj,
 				   struct kobject *new_parent_kobj,
 				   const void *new_ns);
+int __must_check sysfs_create_mount_point(struct kobject *parent_kobj,
+					  const char *name);
+void sysfs_remove_mount_point(struct kobject *parent_kobj,
+			      const char *name);
 
 int __must_check sysfs_create_file_ns(struct kobject *kobj,
 				      const struct attribute *attr,
@@ -298,6 +302,18 @@ static inline int sysfs_move_dir_ns(struct kobject *kobj,
 	return 0;
 }
 
+static inline int sysfs_create_mount_point(struct kobject *parent_kobj,
+					   const char *name)
+{
+	return 0;
+}
+
+static inline void sysfs_remove_mount_point(struct kobject *parent_kobj,
+					    const char *name)
+{
+	/* Stub: nothing to remove. */
+}
+
 static inline int sysfs_create_file_ns(struct kobject *kobj,
 				       const struct attribute *attr,
 				       const void *ns)
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_mount_point
       [not found]   ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                       ` (6 preceding siblings ...)
  2015-05-16  2:09     ` [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories to serve as mount points Eric W. Biederman
@ 2015-05-16  2:10     ` Eric W. Biederman
  2015-05-18 13:14       ` Greg Kroah-Hartman
  2015-05-16  2:11     ` [CFT][PATCH 10/10] mnt: Update fs_fully_visible to test for permanently empty directories Eric W. Biederman
  2015-05-22 17:39     ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Eric W. Biederman
  9 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-16  2:10 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman, Tejun Heo


This allows for better documentation in the code and
it allows for a simpler and fully correct version of
fs_fully_visible to be written.

The mount points converted and their filesystems are:
/sys/hypervisor/s390/       s390_hypfs
/sys/kernel/config/         configfs
/sys/kernel/debug/          debugfs
/sys/firmware/efi/efivars/  efivarfs
/sys/fs/fuse/connections/   fusectl
/sys/fs/pstore/             pstore
/sys/kernel/tracing/        tracefs
/sys/fs/cgroup/             cgroup
/sys/kernel/security/       securityfs
/sys/fs/selinux/            selinuxfs
/sys/fs/smackfs/            smackfs

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
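
Most of the conversions below follow one pattern: create the mount
point, register the filesystem, and remove the mount point again if
registration fails.  A user-space sketch of that unwind order follows.
All toy_ names are hypothetical, the two booleans replace real kernel
state, and toy_register_result exists only so a failure can be
simulated.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static bool toy_mount_point_present;
static bool toy_fs_registered;
static int toy_register_result;	/* set non-zero to simulate a failure */

static int toy_create_mount_point(void)
{
	if (toy_mount_point_present)
		return -EEXIST;
	toy_mount_point_present = true;
	return 0;
}

static void toy_remove_mount_point(void)
{
	assert(toy_mount_point_present);
	toy_mount_point_present = false;
}

static int toy_register_filesystem(void)
{
	if (toy_register_result)
		return toy_register_result;
	toy_fs_registered = true;
	return 0;
}

/* Shape shared by the converted init functions. */
static int toy_fs_init(void)
{
	int err;

	err = toy_create_mount_point();
	if (err)
		return err;	/* propagate the real error code */

	err = toy_register_filesystem();
	if (err)
		toy_remove_mount_point();	/* undo step one on failure */
	return err;
}
```

Because removal is by parent and name, the caller no longer has to keep
a static kobject pointer around just to be able to undo the creation.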
 arch/s390/hypfs/inode.c      | 12 ++++--------
 drivers/firmware/efi/efi.c   |  6 ++----
 fs/configfs/mount.c          | 10 ++++------
 fs/debugfs/inode.c           | 11 ++++-------
 fs/fuse/inode.c              |  9 +++------
 fs/pstore/inode.c            | 12 ++++--------
 fs/tracefs/inode.c           |  6 ++----
 kernel/cgroup.c              | 10 ++++------
 security/inode.c             | 10 ++++------
 security/selinux/selinuxfs.c | 11 +++++------
 security/smack/smackfs.c     |  8 ++++----
 11 files changed, 40 insertions(+), 65 deletions(-)

diff --git a/arch/s390/hypfs/inode.c b/arch/s390/hypfs/inode.c
index d3f896a35b98..2eeb0a0f506d 100644
--- a/arch/s390/hypfs/inode.c
+++ b/arch/s390/hypfs/inode.c
@@ -456,8 +456,6 @@ static const struct super_operations hypfs_s_ops = {
 	.show_options	= hypfs_show_options,
 };
 
-static struct kobject *s390_kobj;
-
 static int __init hypfs_init(void)
 {
 	int rc;
@@ -481,18 +479,16 @@ static int __init hypfs_init(void)
 		rc = -ENODATA;
 		goto fail_hypfs_sprp_exit;
 	}
-	s390_kobj = kobject_create_and_add("s390", hypervisor_kobj);
-	if (!s390_kobj) {
-		rc = -ENOMEM;
+	rc = sysfs_create_mount_point(hypervisor_kobj, "s390");
+	if (rc)
 		goto fail_hypfs_diag0c_exit;
-	}
 	rc = register_filesystem(&hypfs_type);
 	if (rc)
 		goto fail_filesystem;
 	return 0;
 
 fail_filesystem:
-	kobject_put(s390_kobj);
+	sysfs_remove_mount_point(hypervisor_kobj, "s390");
 fail_hypfs_diag0c_exit:
 	hypfs_diag0c_exit();
 fail_hypfs_sprp_exit:
@@ -510,7 +506,7 @@ fail_dbfs_exit:
 static void __exit hypfs_exit(void)
 {
 	unregister_filesystem(&hypfs_type);
-	kobject_put(s390_kobj);
+	sysfs_remove_mount_point(hypervisor_kobj, "s390");
 	hypfs_diag0c_exit();
 	hypfs_sprp_exit();
 	hypfs_vm_exit();
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 3061bb8629dc..e14363d12690 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -65,7 +65,6 @@ static int __init parse_efi_cmdline(char *str)
 early_param("efi", parse_efi_cmdline);
 
 static struct kobject *efi_kobj;
-static struct kobject *efivars_kobj;
 
 /*
  * Let's not leave out systab information that snuck into
@@ -212,10 +211,9 @@ static int __init efisubsys_init(void)
 		goto err_remove_group;
 
 	/* and the standard mountpoint for efivarfs */
-	efivars_kobj = kobject_create_and_add("efivars", efi_kobj);
-	if (!efivars_kobj) {
+	error = sysfs_create_mount_point(efi_kobj, "efivars");
+	if (error) {
 		pr_err("efivars: Subsystem registration failed.\n");
-		error = -ENOMEM;
 		goto err_remove_group;
 	}
 
diff --git a/fs/configfs/mount.c b/fs/configfs/mount.c
index da94e41bdbf6..bca58da65e2b 100644
--- a/fs/configfs/mount.c
+++ b/fs/configfs/mount.c
@@ -129,8 +129,6 @@ void configfs_release_fs(void)
 }
 
 
-static struct kobject *config_kobj;
-
 static int __init configfs_init(void)
 {
 	int err = -ENOMEM;
@@ -141,8 +139,8 @@ static int __init configfs_init(void)
 	if (!configfs_dir_cachep)
 		goto out;
 
-	config_kobj = kobject_create_and_add("config", kernel_kobj);
-	if (!config_kobj)
+	err = sysfs_create_mount_point(kernel_kobj, "config");
+	if (err)
 		goto out2;
 
 	err = register_filesystem(&configfs_fs_type);
@@ -152,7 +150,7 @@ static int __init configfs_init(void)
 	return 0;
 out3:
 	pr_err("Unable to register filesystem!\n");
-	kobject_put(config_kobj);
+	sysfs_remove_mount_point(kernel_kobj, "config");
 out2:
 	kmem_cache_destroy(configfs_dir_cachep);
 	configfs_dir_cachep = NULL;
@@ -163,7 +161,7 @@ out:
 static void __exit configfs_exit(void)
 {
 	unregister_filesystem(&configfs_fs_type);
-	kobject_put(config_kobj);
+	sysfs_remove_mount_point(kernel_kobj, "config");
 	kmem_cache_destroy(configfs_dir_cachep);
 	configfs_dir_cachep = NULL;
 }
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index c1e7ffb0dab6..12756040ca20 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -716,20 +716,17 @@ bool debugfs_initialized(void)
 }
 EXPORT_SYMBOL_GPL(debugfs_initialized);
 
-
-static struct kobject *debug_kobj;
-
 static int __init debugfs_init(void)
 {
 	int retval;
 
-	debug_kobj = kobject_create_and_add("debug", kernel_kobj);
-	if (!debug_kobj)
-		return -EINVAL;
+	retval = sysfs_create_mount_point(kernel_kobj, "debug");
+	if (retval)
+		return retval;
 
 	retval = register_filesystem(&debug_fs_type);
 	if (retval)
-		kobject_put(debug_kobj);
+		sysfs_remove_mount_point(kernel_kobj, "debug");
 	else
 		debugfs_registered = true;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 082ac1c97f39..18dacf9ed8ff 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1238,7 +1238,6 @@ static void fuse_fs_cleanup(void)
 }
 
 static struct kobject *fuse_kobj;
-static struct kobject *connections_kobj;
 
 static int fuse_sysfs_init(void)
 {
@@ -1250,11 +1249,9 @@ static int fuse_sysfs_init(void)
 		goto out_err;
 	}
 
-	connections_kobj = kobject_create_and_add("connections", fuse_kobj);
-	if (!connections_kobj) {
-		err = -ENOMEM;
+	err = sysfs_create_mount_point(fuse_kobj, "connections");
+	if (err)
 		goto out_fuse_unregister;
-	}
 
 	return 0;
 
@@ -1266,7 +1263,7 @@ static int fuse_sysfs_init(void)
 
 static void fuse_sysfs_cleanup(void)
 {
-	kobject_put(connections_kobj);
+	sysfs_remove_mount_point(fuse_kobj, "connections");
 	kobject_put(fuse_kobj);
 }
 
diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
index dc43b5f29305..3adcc4669fac 100644
--- a/fs/pstore/inode.c
+++ b/fs/pstore/inode.c
@@ -461,22 +461,18 @@ static struct file_system_type pstore_fs_type = {
 	.kill_sb	= pstore_kill_sb,
 };
 
-static struct kobject *pstore_kobj;
-
 static int __init init_pstore_fs(void)
 {
-	int err = 0;
+	int err;
 
 	/* Create a convenient mount point for people to access pstore */
-	pstore_kobj = kobject_create_and_add("pstore", fs_kobj);
-	if (!pstore_kobj) {
-		err = -ENOMEM;
+	err = sysfs_create_mount_point(fs_kobj, "pstore");
+	if (err)
 		goto out;
-	}
 
 	err = register_filesystem(&pstore_fs_type);
 	if (err < 0)
-		kobject_put(pstore_kobj);
+		sysfs_remove_mount_point(fs_kobj, "pstore");
 
 out:
 	return err;
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index d92bdf3b079a..a43df11a163f 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -631,14 +631,12 @@ bool tracefs_initialized(void)
 	return tracefs_registered;
 }
 
-static struct kobject *trace_kobj;
-
 static int __init tracefs_init(void)
 {
 	int retval;
 
-	trace_kobj = kobject_create_and_add("tracing", kernel_kobj);
-	if (!trace_kobj)
+	retval = sysfs_create_mount_point(kernel_kobj, "tracing");
+	if (retval)
 		return -EINVAL;
 
 	retval = register_filesystem(&trace_fs_type);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 469dd547770c..e8a5491be756 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1924,8 +1924,6 @@ static struct file_system_type cgroup_fs_type = {
 	.kill_sb = cgroup_kill_sb,
 };
 
-static struct kobject *cgroup_kobj;
-
 /**
  * task_cgroup_path - cgroup path of a task in the first cgroup hierarchy
  * @task: target task
@@ -5044,13 +5042,13 @@ int __init cgroup_init(void)
 			ss->bind(init_css_set.subsys[ssid]);
 	}
 
-	cgroup_kobj = kobject_create_and_add("cgroup", fs_kobj);
-	if (!cgroup_kobj)
-		return -ENOMEM;
+	err = sysfs_create_mount_point(fs_kobj, "cgroup");
+	if (err)
+		return err;
 
 	err = register_filesystem(&cgroup_fs_type);
 	if (err < 0) {
-		kobject_put(cgroup_kobj);
+		sysfs_remove_mount_point(fs_kobj, "cgroup");
 		return err;
 	}
 
diff --git a/security/inode.c b/security/inode.c
index 91503b79c5f8..0e37e4fba8fa 100644
--- a/security/inode.c
+++ b/security/inode.c
@@ -215,19 +215,17 @@ void securityfs_remove(struct dentry *dentry)
 }
 EXPORT_SYMBOL_GPL(securityfs_remove);
 
-static struct kobject *security_kobj;
-
 static int __init securityfs_init(void)
 {
 	int retval;
 
-	security_kobj = kobject_create_and_add("security", kernel_kobj);
-	if (!security_kobj)
-		return -EINVAL;
+	retval = sysfs_create_mount_point(kernel_kobj, "security");
+	if (retval)
+		return retval;
 
 	retval = register_filesystem(&fs_type);
 	if (retval)
-		kobject_put(security_kobj);
+		sysfs_remove_mount_point(kernel_kobj, "security");
 	return retval;
 }
 
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index d2787cca1fcb..3d2201413028 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -1853,7 +1853,6 @@ static struct file_system_type sel_fs_type = {
 };
 
 struct vfsmount *selinuxfs_mount;
-static struct kobject *selinuxfs_kobj;
 
 static int __init init_sel_fs(void)
 {
@@ -1862,13 +1861,13 @@ static int __init init_sel_fs(void)
 	if (!selinux_enabled)
 		return 0;
 
-	selinuxfs_kobj = kobject_create_and_add("selinux", fs_kobj);
-	if (!selinuxfs_kobj)
-		return -ENOMEM;
+	err = sysfs_create_mount_point(fs_kobj, "selinux");
+	if (err)
+		return err;
 
 	err = register_filesystem(&sel_fs_type);
 	if (err) {
-		kobject_put(selinuxfs_kobj);
+		sysfs_remove_mount_point(fs_kobj, "selinux");
 		return err;
 	}
 
@@ -1887,7 +1886,7 @@ __initcall(init_sel_fs);
 #ifdef CONFIG_SECURITY_SELINUX_DISABLE
 void exit_sel_fs(void)
 {
-	kobject_put(selinuxfs_kobj);
+	sysfs_remove_mount_point(fs_kobj, "selinux");
 	kern_unmount(selinuxfs_mount);
 	unregister_filesystem(&sel_fs_type);
 }
diff --git a/security/smack/smackfs.c b/security/smack/smackfs.c
index d9682985349e..ac4cac7c661a 100644
--- a/security/smack/smackfs.c
+++ b/security/smack/smackfs.c
@@ -2241,16 +2241,16 @@ static const struct file_operations smk_revoke_subj_ops = {
 	.llseek		= generic_file_llseek,
 };
 
-static struct kset *smackfs_kset;
 /**
  * smk_init_sysfs - initialize /sys/fs/smackfs
  *
  */
 static int smk_init_sysfs(void)
 {
-	smackfs_kset = kset_create_and_add("smackfs", NULL, fs_kobj);
-	if (!smackfs_kset)
-		return -ENOMEM;
+	int err;
+	err = sysfs_create_mount_point(fs_kobj, "smackfs");
+	if (err)
+		return err;
 	return 0;
 }
 
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [CFT][PATCH 10/10] mnt: Update fs_fully_visible to test for permanently empty directories
       [not found]   ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                       ` (7 preceding siblings ...)
  2015-05-16  2:10     ` [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_mount_point Eric W. Biederman
@ 2015-05-16  2:11     ` Eric W. Biederman
  2015-05-22 17:39     ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Eric W. Biederman
  9 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-16  2:11 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman, Tejun Heo


fs_fully_visible attempts to make fresh mounts of proc and sysfs give
the mounter no more access to proc and sysfs than they could have
obtained by creating a bind mount.  One aspect of proc and sysfs that
makes this particularly tricky is that other filesystems are typically
mounted on top of proc and sysfs.  As those filesystems are mounted on
empty directories, in practice it is safe to ignore them.  However,
testing that filesystems are mounted on empty directories has not been
something the in-kernel data structures have supported, so the current
test for an empty directory, which checks whether nlink <= 2, is a bit
lacking.

proc and sysfs have recently been modified to use the new empty_dir
infrastructure to create all of their dedicated mount points.  Instead
of testing for S_ISDIR(inode->i_mode) && i_nlink <= 2 to see if a
directory is empty, test for is_empty_dir_inode(inode).  That small
change guarantees mounts found on proc and sysfs really are safe to
ignore, because the directories are not only empty but nothing can
ever be added to them.  This guarantees there is nothing to worry
about when mounting proc and sysfs.
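
To make the difference between the two tests concrete, here is a small
userspace model (not kernel code).  The struct and the helper names are
made up for illustration; only the nlink <= 2 check and the idea behind
is_empty_dir_inode() come from the description above.  It assumes the
traditional convention that a directory's link count is 2 plus the number
of its subdirectories, so regular files never show up in nlink.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a directory inode; not a kernel structure. */
struct model_dir {
	unsigned int nlink;      /* 2 + number of subdirectories */
	unsigned int nr_files;   /* non-directory entries, invisible to nlink */
	bool permanently_empty;  /* fixed at creation: nothing can ever be added */
};

/* Old heuristic: a link count of 2 or less is taken to mean "empty". */
static bool old_looks_empty(const struct model_dir *d)
{
	return d->nlink <= 2;
}

/* New test, modelled on the idea behind is_empty_dir_inode(). */
static bool new_is_permanently_empty(const struct model_dir *d)
{
	return d->permanently_empty;
}
```

In this model a directory that holds only regular files still passes the
old heuristic, while the new test accepts nothing but directories that
were created as permanently empty mount points.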

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 3ede0669b8d2..eccd925c6e82 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3220,9 +3220,8 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 			/* Only worry about locked mounts */
 			if (!(mnt->mnt.mnt_flags & MNT_LOCKED))
 				continue;
-			if (!S_ISDIR(inode->i_mode))
-				goto next;
-			if (inode->i_nlink > 2)
+			/* Is the directory permanently empty? */
+			if (!is_empty_dir_inode(inode))
 				goto next;
 		}
 		/* Preserve the locked attributes */
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories to serve as mount points.
  2015-05-16  2:09     ` [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories to serve as mount points Eric W. Biederman
@ 2015-05-18 13:14       ` Greg Kroah-Hartman
  0 siblings, 0 replies; 85+ messages in thread
From: Greg Kroah-Hartman @ 2015-05-18 13:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-fsdevel, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Tejun Heo

On Fri, May 15, 2015 at 09:09:53PM -0500, Eric W. Biederman wrote:
> 
> Add two functions sysfs_create_mount_point and sysfs_remove_mount_point
> that hang a permanently empty directory off of a kobject or remove a
> permanently empty directory hanging from a kobject.  Export these new
> functions so modular filesystems can use them.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/sysfs/dir.c        | 34 ++++++++++++++++++++++++++++++++++
>  include/linux/sysfs.h | 16 ++++++++++++++++
>  2 files changed, 50 insertions(+)

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_mount_point
  2015-05-16  2:10     ` [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_mount_point Eric W. Biederman
@ 2015-05-18 13:14       ` Greg Kroah-Hartman
  0 siblings, 0 replies; 85+ messages in thread
From: Greg Kroah-Hartman @ 2015-05-18 13:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-fsdevel, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Tejun Heo

On Fri, May 15, 2015 at 09:10:42PM -0500, Eric W. Biederman wrote:
> 
> This allows for better documentation in the code and
> it allows for a simpler and fully correct version of
> fs_fully_visible to be written.
> 
> The mount points converted and their filesystems are:
> /sys/hypervisor/s390/       s390_hypfs
> /sys/kernel/config/         configfs
> /sys/kernel/debug/          debugfs
> /sys/firmware/efi/efivars/  efivarfs
> /sys/fs/fuse/connections/   fusectl
> /sys/fs/pstore/             pstore
> /sys/kernel/tracing/        tracefs
> /sys/fs/cgroup/             cgroup
> /sys/kernel/security/       securityfs
> /sys/fs/selinux/            selinuxfs
> /sys/fs/smackfs/            smackfs
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]   ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                       ` (8 preceding siblings ...)
  2015-05-16  2:11     ` [CFT][PATCH 10/10] mnt: Update fs_fully_visible to test for permanently empty directories Eric W. Biederman
@ 2015-05-22 17:39     ` Eric W. Biederman
       [not found]       ` <87wq004im1.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  9 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-22 17:39 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman, Tejun Heo, Seth Forshee

I had hoped to get some Tested-By's on that patch series. 

Oh well.  The fundamentals seem sound, and my biggest concern, the
implicit nodev, does not apply, so I will put this patchset in linux-next
and aim at merging it in the next merge window.  Hopefully that will
leave enough time to catch problems.

Eric

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]       ` <87wq004im1.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-05-22 18:59         ` Andy Lutomirski
       [not found]           ` <CALCETrUhXBR5WQ6gXr9KzGc4=7tph7kzopY29Hug4g+FhOzEKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-05-28 14:08           ` Serge Hallyn
  0 siblings, 2 replies; 85+ messages in thread
From: Andy Lutomirski @ 2015-05-22 18:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, Linux FS Devel, Linux API, Serge E. Hallyn,
	Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages,
	Stéphane Graber, Eric Windisch, Greg Kroah-Hartman,
	Tejun Heo, Seth Forshee

On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> I had hoped to get some Tested-By's on that patch series.

Sorry, I've been totally swamped.

I suspect that Sandstorm is okay, but I haven't had a chance to test
it for real.  Sandstorm makes only limited use of proc and sysfs in
containers, but I'll see if I can test it for real this weekend.

>
> Oh well.  The fundamentals seem sound, and my biggest concern, the
> implicit nodev, does not apply, so I will put this patchset in linux-next
> and aim at merging it in the next merge window.  Hopefully that will
> leave enough time to catch problems.
>
> Eric
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]           ` <CALCETrUhXBR5WQ6gXr9KzGc4=7tph7kzopY29Hug4g+FhOzEKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-22 20:41             ` Eric W. Biederman
  0 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-22 20:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman,
	Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger,
	Linux FS Devel, Tejun Heo

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> I had hoped to get some Tested-By's on that patch series.
>
> Sorry, I've been totally swamped.
>
> I suspect that Sandstorm is okay, but I haven't had a chance to test
> it for real.  Sandstorm makes only limited use of proc and sysfs in
> containers, but I'll see if I can test it for real this weekend.

Thanks.

Eric

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
  2015-05-22 18:59         ` Andy Lutomirski
       [not found]           ` <CALCETrUhXBR5WQ6gXr9KzGc4=7tph7kzopY29Hug4g+FhOzEKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-28 14:08           ` Serge Hallyn
  2015-05-28 15:03             ` Eric W. Biederman
  2015-05-28 19:36             ` Richard Weinberger
  1 sibling, 2 replies; 85+ messages in thread
From: Serge Hallyn @ 2015-05-28 14:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Seth Forshee, Linux API, Linux Containers,
	Greg Kroah-Hartman, Kenton Varda, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo

Quoting Andy Lutomirski (luto@amacapital.net):
> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> > I had hoped to get some Tested-By's on that patch series.
> 
> Sorry, I've been totally swamped.
> 
> I suspect that Sandstorm is okay, but I haven't had a chance to test
> it for real.  Sandstorm makes only limited use of proc and sysfs in
> containers, but I'll see if I can test it for real this weekend.

Testing this with unprivileged containers, I get

lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted - error mounting sysfs on /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0

> > Oh well.  The fundamentals seem sound, and my biggest concern, the
> > implicit nodev, does not apply, so I will put this patchset in linux-next
> > and aim at merging it in the next merge window.  Hopefully that will
> > leave enough time to catch problems.
> >
> > Eric
> >
> 
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
  2015-05-28 14:08           ` Serge Hallyn
@ 2015-05-28 15:03             ` Eric W. Biederman
  2015-05-28 17:33               ` Andy Lutomirski
  2015-05-28 21:04               ` Serge E. Hallyn
  2015-05-28 19:36             ` Richard Weinberger
  1 sibling, 2 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-28 15:03 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Andy Lutomirski, Seth Forshee, Linux API, Linux Containers,
	Greg Kroah-Hartman, Kenton Varda, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo

Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> > I had hoped to get some Tested-By's on that patch series.
>> 
>> Sorry, I've been totally swamped.
>> 
>> I suspect that Sandstorm is okay, but I haven't had a chance to test
>> it for real.  Sandstorm makes only limited use of proc and sysfs in
>> containers, but I'll see if I can test it for real this weekend.
>
> Testing this with unprivileged containers, I get
>
> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
> - error mounting sysfs on
> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0

Grr..  I was afraid this would break something. :(

Looking at my system I see that sysfs is currently mounted
"nosuid,nodev,noexec".

Looking at the lxc-start code I don't see it including any of those
mount options.  In practice for sysfs I think those options are
meaningless (as there should be no devices and nothing executable in
sysfs), but I can understand the past concerns with chmod on virtual
filesystems that would incline people to use them.  So I think the
failure is reporting a legitimate security issue in the lxc userspace
code, where the unprivileged code is currently attempting to give
greater access to sysfs than was given by the original mount of sysfs.

As nosuid,nodev,noexec should not impair the operation of sysfs,
it looks like you can always specify those options and just
make this concern go away.

Something like the untested patch below I expect.

diff --git a/src/lxc/conf.c b/src/lxc/conf.c
index 9870455b3cae..d9ccd03afe68 100644
--- a/src/lxc/conf.c
+++ b/src/lxc/conf.c
@@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha
 		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger",                             "%r/proc/sysrq-trigger",        NULL,       MS_BIND,                        NULL },
 		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL,                                                "%r/proc/sysrq-trigger",        NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },
 		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW,    "proc",                                              "%r/proc",                      "proc",     MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
-		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    0,                              NULL },
-		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_RDONLY,                      NULL },
+		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
+		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL },
 		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
 		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "%r/sys",                                            "%r/sys",                       NULL,       MS_BIND,                        NULL },
 		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  NULL,                                                "%r/sys",                       NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },

Alternately you can read the flags off of the original mount of proc or sysfs.

diff --git a/src/lxc/conf.c b/src/lxc/conf.c
index 9870455b3cae..50ea49973e80 100644
--- a/src/lxc/conf.c
+++ b/src/lxc/conf.c
@@ -712,7 +712,9 @@ static unsigned long add_required_remount_flags(const char *s, const char *d,
        struct statvfs sb;
        unsigned long required_flags = 0;
 
-       if (!(flags & MS_REMOUNT))
+       if (!(flags & MS_REMOUNT) &&
+           (strcmp(s, "proc") != 0) &&
+           (strcmp(s, "sysfs") != 0))
                return flags;
 
        if (!s)
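
For readers who want to see the second idea in isolation, the following
is a rough standalone sketch of reading the flags off an existing mount
with statvfs(3) and translating them into the MS_* flags a fresh mount
would have to carry.  The function names are invented for this sketch and
are not the lxc functions; the ST_NODEV/ST_NOEXEC fallbacks assume the
Linux values, since glibc only exposes them with _GNU_SOURCE.

```c
#include <sys/mount.h>    /* MS_* mount flags */
#include <sys/statvfs.h>  /* statvfs(), ST_* flags */

/* glibc hides ST_NODEV and ST_NOEXEC unless _GNU_SOURCE is defined;
 * fall back to the Linux values so the sketch builds either way. */
#ifndef ST_NODEV
#define ST_NODEV 4
#endif
#ifndef ST_NOEXEC
#define ST_NOEXEC 8
#endif

/* Translate statvfs f_flag bits into the MS_* flags a new mount must keep. */
static unsigned long required_mount_flags(unsigned long f_flag)
{
	unsigned long required = 0;

	if (f_flag & ST_RDONLY)
		required |= MS_RDONLY;
	if (f_flag & ST_NOSUID)
		required |= MS_NOSUID;
	if (f_flag & ST_NODEV)
		required |= MS_NODEV;
	if (f_flag & ST_NOEXEC)
		required |= MS_NOEXEC;
	return required;
}

/* Look up the mount containing path; returns 0 on success, -1 on error. */
static int required_flags_for_path(const char *path, unsigned long *out)
{
	struct statvfs sb;

	if (statvfs(path, &sb) < 0)
		return -1;
	*out = required_mount_flags(sb.f_flag);
	return 0;
}
```

A caller would OR the result for the existing /sys or /proc into the
flags it passes to mount(2) for the new instance.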

Eric

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
  2015-05-28 15:03             ` Eric W. Biederman
@ 2015-05-28 17:33               ` Andy Lutomirski
       [not found]                 ` <CALCETrXXax28s9kMTQ-zDx0MttQWG4rg2y-oz3bSGiumSL=3sg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-05-28 21:04               ` Serge E. Hallyn
  1 sibling, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-05-28 17:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge Hallyn, Seth Forshee, Linux API, Linux Containers,
	Greg Kroah-Hartman, Kenton Varda, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo

On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>
>> Quoting Andy Lutomirski (luto@amacapital.net):
>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>> > I had hoped to get some Tested-By's on that patch series.
>>>
>>> Sorry, I've been totally swamped.
>>>
>>> I suspect that Sandstorm is okay, but I haven't had a chance to test
>>> it for real.  Sandstorm makes only limited use of proc and sysfs in
>>> containers, but I'll see if I can test it for real this weekend.
>>
>> Testing this with unprivileged containers, I get
>>
>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
>> - error mounting sysfs on
>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
>
> Grr..  I was afraid this would break something. :(
>
> Looking at my system I see that sysfs is currently mounted
> "nosuid,nodev,noexec"
>
> Looking at the lxc-start code I don't see it as including any of those
> mount options.  In practice for sysfs I think those options are
> meaningless (as there should be no devices and nothing executable in
> sysfs) but I can understand the past concerns with chmod on virtual
> filesystems that would incline people to use them, so I think the
> failure is reporting a legitimate security issue in the lxc userspace
> code where the unprivileged code is currently attempting to give
> greater access to sysfs than was given by the original mount of sysfs.
>
> As nosuid,nodev,noexec should not impair the operation of sysfs,
> it looks like you can always specify those options and just
> make this concern go away.

Linus is pretty strict about not breaking the ABI, and this definitely
counts as breaking the ABI.  There's an exception for security issues,
but is there really a security issue here?  That is, do we lose
anything important if we just drop the offending part of the patch
set?  As you've said, there shouldn't be sensitive device nodes,
executables, or setuid files in proc or sysfs in the first place.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                 ` <CALCETrXXax28s9kMTQ-zDx0MttQWG4rg2y-oz3bSGiumSL=3sg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-28 18:20                   ` Kenton Varda
       [not found]                     ` <CAOP=4wid+N_80iyPpiVMN96_fuHZZRGtYQ6AOPn-HFBj2H6Vgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Kenton Varda @ 2015-05-28 18:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Richard Weinberger, Greg Kroah-Hartman, Linux Containers,
	Serge Hallyn, Seth Forshee, Eric W. Biederman, Linux API,
	Linux FS Devel, Tejun Heo, Michael Kerrisk-manpages

On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:
>>
>>> Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
>>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
>>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>> > I had hoped to get some Tested-By's on that patch series.
>>>>
>>>> Sorry, I've been totally swamped.
>>>>
>>>> I suspect that Sandstorm is okay, but I haven't had a chance to test
>>>> it for real.  Sandstorm makes only limited use of proc and sysfs in
>>>> containers, but I'll see if I can test it for real this weekend.
>>>
>>> Testing this with unprivileged containers, I get
>>>
>>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
>>> - error mounting sysfs on
>>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
>>
>> Grr..  I was afraid this would break something. :(
>>
>> Looking at my system I see that sysfs is currently mounted
>> "nosuid,nodev,noexec"
>>
>> Looking at the lxc-start code I don't see it as including any of those
>> mount options.  In practice for sysfs I think those options are
>> meaningless (as there should be no devices and nothing executable in
>> sysfs) but I can understand the past concerns with chmod on virtual
>> filesystems that would incline people to use them, so I think the
>> failure is reporting a legitimate security issue in the lxc userspace
>> code where the unprivileged code is currently attempting to give
>> greater access to sysfs than was given by the original mount of sysfs.
>>
>> As nosuid,nodev,noexec should not impair the operation of sysfs,
>> it looks like you can always specify those options and just
>> make this concern go away.
>
> Linus is pretty strict about not breaking the ABI, and this definitely
> counts as breaking the ABI.  There's an exception for security issues,
> but is there really a security issue here?  That is, do we lose
> anything important if we just drop the offending part of the patch
> set?  As you've said, there shouldn't be sensitive device nodes,
> executables, or setuid files in proc or sysfs in the first place.

Speaking as a user of the mount() interfaces, I really think it would
be less confusing overall if mount() simply ignored the requested
flags when the caller doesn't have a choice. That is, in cases where
mount() currently fails with EPERM when not given, say, MS_NOSUID, it
should instead just pretend the caller actually set MS_NOSUID and go
ahead with a nosuid mount. Or put another way, the absence of
MS_NOSUID should not be interpreted as "remove the nosuid bit" but
rather "don't set the nosuid bit if not required".

Consider:

- This approach will actually cause lxc to have the correct behavior,
without any changes to lxc. I suspect that this generalizes: In the
vast majority of cases, when users have failed to set MS_NOSUID, it's
not because they are explicitly requesting that the flag be turned
off, but rather that they didn't know it mattered.

- If a user actually *does* expect not passing MS_NOSUID to remove the
nosuid bit, and they find instead that the nosuid bit is silently
kept, I don't think they'll be confused: it's pretty obvious in
context that this must be for security reasons.

- On the other hand, the current behavior *is* very confusing: mount()
returns EPERM because of rules the caller probably doesn't know
anything about. I've spent a fair amount of time frustrated by this
sort of thing.
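
The two behaviours being compared can be modelled in a few lines of
userspace C.  This is only a sketch of the policies as described above,
not the kernel's implementation; the function names and the
LOCKABLE_FLAGS mask are assumptions made for the example.

```c
#include <errno.h>
#include <sys/mount.h>

/* Flags that, in this model, an unprivileged mounter may not clear. */
#define LOCKABLE_FLAGS ((unsigned long)(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC))

/* Strict policy: refuse the mount when a locked flag is missing from the
 * request.  Returns 0 and stores the effective flags, or -EPERM. */
static int strict_policy(unsigned long locked, unsigned long requested,
			 unsigned long *effective)
{
	locked &= LOCKABLE_FLAGS;
	if ((requested & locked) != locked)
		return -EPERM;
	*effective = requested;
	return 0;
}

/* Implicit policy: read a missing flag as "not asked for" rather than
 * "please remove", and quietly add whatever is locked. */
static int implicit_policy(unsigned long locked, unsigned long requested,
			   unsigned long *effective)
{
	*effective = requested | (locked & LOCKABLE_FLAGS);
	return 0;
}
```

With locked = MS_NOSUID|MS_NODEV|MS_NOEXEC and a request of 0, the strict
policy returns -EPERM while the implicit policy succeeds with all three
flags set.  The cost of the implicit variant is that the caller has to
inspect the resulting mount to learn which flags actually took effect.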

-Kenton

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                     ` <CAOP=4wid+N_80iyPpiVMN96_fuHZZRGtYQ6AOPn-HFBj2H6Vgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-28 19:14                       ` Eric W. Biederman
       [not found]                         ` <87fv6gikfn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-05-29  0:35                         ` Andy Lutomirski
  0 siblings, 2 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-28 19:14 UTC (permalink / raw)
  To: Kenton Varda
  Cc: Andy Lutomirski, Serge Hallyn, Seth Forshee, Linux API,
	Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo

Kenton Varda <kenton-AuYgBwuPrUQTaNkGU808tA@public.gmane.org> writes:

> On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman
>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>> Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:
>>>
>>>> Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
>>>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
>>>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>>> > I had hoped to get some Tested-By's on that patch series.
>>>>>
>>>>> Sorry, I've been totally swamped.
>>>>>
>>>>> I suspect that Sandstorm is okay, but I haven't had a chance to test
>>>>> it for real.  Sandstorm makes only limited use of proc and sysfs in
>>>>> containers, but I'll see if I can test it for real this weekend.
>>>>
>>>> Testing this with unprivileged containers, I get
>>>>
>>>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
>>>> - error mounting sysfs on
>>>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
>>>
>>> Grr..  I was afraid this would break something. :(
>>>
>>> Looking at my system I see that sysfs is currently mounted
>>> "nosuid,nodev,noexec"
>>>
>>> Looking at the lxc-start code I don't see it as including any of those
>>> mount options.  In practice for sysfs I think those options are
>>> meaningless (as there should be no devices and nothing executable in
>>> sysfs) but I can understand the past concerns with chmod on virtual
>>> filesystems that would incline people to use them, so I think the
>>> failure is reporting a legitimate security issue in the lxc userspace
>>> code where the unprivileged code is currently attempting to give
>>> greater access to sysfs than was given by the original mount of sysfs.
>>>
>>> As nosuid,nodev,noexec should not impair the operation of sysfs,
>>> it looks like you can always specify those options and just
>>> make this concern go away.
>>
>> Linus is pretty strict about not breaking the ABI, and this definitely
>> counts as breaking the ABI.  There's an exception for security issues,
>> but is there really a security issue here?  That is, do we lose
>> anything important if we just drop the offending part of the patch
>> set?  As you've said, there shouldn't be sensitive device nodes,
>> executables, or setuid files in proc or sysfs in the first place.

We do need to enforce retaining the existing mount flags one way or
another.  Where this really matters is with MS_RDONLY.  We don't want
any old user to be able to mount /proc read-write when root mounted it
read-only.  There is a very real attack vector there.  That attack
almost works in docker containers today and is avoided simply because
docker mounts over a few files on proc.

Which leads to the second side of the reason for these changes.  I am
fixing a very small but long-standing ABI break.  That is, in some small
ways I broke some sandboxes, and when I realized they were broken I could
not think how to fix the code until now.

It is the goal that user namespaces don't introduce anything for people
to worry about security-wise beyond simply the ability to execute
more kernel code.  So at least when the kernel implementation is correct,
developers of existing applications simply do not need to care.  Sadly we
are not quite there yet.

> Speaking as a user of the mount() interfaces, I really think it would
> be less confusing overall if mount() simply ignored the requested
> flags when the caller doesn't have a choice. That is, in cases where
> mount() currently fails with EPERM when not given, say, MS_NOSUID, it
> should instead just pretend the caller actually set MS_NOSUID and go
> ahead with a nosuid mount. Or put another way, the absence of
> MS_NOSUID should not be interpreted as "remove the nosuid bit" but
> rather "don't set the nosuid bit if not required".

I am conflicted.  Implicits are nice but confusing.  If we can do
something reliable and robust and maintainable here that is truly worth
the cost I am all for it.

If I mount proc read-write I likely want to be able to write to proc
files, and I will be much happier if the mount fails than if a bazillion
syscalls later something else fails when it tries to write to proc.

> Consider:
>
> - This approach will actually cause lxc to have the correct behavior,
> without any changes to lxc. I suspect that this generalizes: In the
> vast majority of cases, when users have failed to set MS_NOSUID, it's
> not because they are explicitly requesting that the flag be turned
> off, but rather that they didn't know it mattered.
>
> - If a user actually *does* expect not passing MS_NOSUID to remove the
> nosuid bit, and they find instead that the nosuid bit is silently
> kept, I don't think they'll be confused: it's pretty obvious in
> context that this must be for security reasons.
>
> - On the other hand, the current behavior *is* very confusing: mount()
> returns EPERM because of rules the caller probably doesn't know
> anything about. I've spent a fair amount of time frustrated by this
> sort of thing.

My sympathies.  This all started with an "oh crap, we overlooked corner
case X and it actually matters", and the fixes were quite likely a little
bit hasty.  The only case where this all actually matters is a remount,
inside of a user namespace, of filesystems that were mounted outside of
that user namespace.  And mounting new instances of proc and sysfs winds
up being a weird instance of that nonsense.

But please someone test sandstorm with this patchset and tell me if it
bites you.  The impetus to find a way to avoid breaking slightly buggy
userspace is higher if it is more than unprivileged lxc that is broken.

Eric

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
  2015-05-28 14:08           ` Serge Hallyn
  2015-05-28 15:03             ` Eric W. Biederman
@ 2015-05-28 19:36             ` Richard Weinberger
       [not found]               ` <55676E32.3050006-/L3Ra7n9ekc@public.gmane.org>
  1 sibling, 1 reply; 85+ messages in thread
From: Richard Weinberger @ 2015-05-28 19:36 UTC (permalink / raw)
  To: Serge Hallyn, Andy Lutomirski
  Cc: Eric W. Biederman, Seth Forshee, Linux API, Linux Containers,
	Greg Kroah-Hartman, Kenton Varda, Michael Kerrisk-manpages,
	Linux FS Devel, Tejun Heo

Am 28.05.2015 um 16:08 schrieb Serge Hallyn:
> Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>> I had hoped to get some Tested-By's on that patch series.
>>
>> Sorry, I've been totally swamped.
>>
>> I suspect that Sandstorm is okay, but I haven't had a chance to test
>> it for real.  Sandstorm makes only limited use of proc and sysfs in
>> containers, but I'll see if I can test it for real this weekend.
> 
> Testing this with unprivileged containers, I get
> 
> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted - error mounting sysfs on /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
>

FWIW, it also breaks libvirt-lxc:
Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted

Thanks,
//richard

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]               ` <55676E32.3050006-/L3Ra7n9ekc@public.gmane.org>
@ 2015-05-28 19:57                 ` Eric W. Biederman
  2015-05-28 20:30                   ` Richard Weinberger
  0 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-28 19:57 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Kenton Varda, Greg Kroah-Hartman, Linux Containers, Serge Hallyn,
	Andy Lutomirski, Seth Forshee, Michael Kerrisk-manpages,
	Linux API, Linux FS Devel, Tejun Heo

Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:

> Am 28.05.2015 um 16:08 schrieb Serge Hallyn:
>> Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>> I had hoped to get some Tested-By's on that patch series.
>>>
>>> Sorry, I've been totally swamped.
>>>
>>> I suspect that Sandstorm is okay, but I haven't had a chance to test
>>> it for real.  Sandstorm makes only limited use of proc and sysfs in
>>> containers, but I'll see if I can test it for real this weekend.
>> 
>> Testing this with unprivileged containers, I get
>> 
>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted - error mounting sysfs on /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
>>
>
> FWIW, it breaks also libvirt-lxc:
> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted

Interesting.  I had not anticipated a failure there.  And it is failing
in remount?  Oh that is interesting.

That implies that there is some flag of the original mount of /proc that
the remount of /proc/sys is clearing.

The flags specified are currently rdonly,remount,bind so I expect there
are some other flags on proc that libvirt-lxc is clearing by accident
and we did not fail before because the kernel was not enforcing things.
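
For reference, a minimal sketch of how flags=1021 decodes to
rdonly,remount,bind, assuming libvirt prints the value in hexadecimal.
The helper describe_mount_flags() is hypothetical and written only for
this illustration; the MS_* constants are the usual ones from
<sys/mount.h> on Linux.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Hypothetical helper (not part of libvirt or the kernel): write the
 * names of the mount(2) flag bits set in `flags` into `buf`, comma
 * separated, and return `buf`.
 */
const char *describe_mount_flags(unsigned long flags, char *buf, size_t len)
{
    static const struct {
        unsigned long bit;
        const char *name;
    } table[] = {
        { MS_RDONLY,  "rdonly"  },
        { MS_NOSUID,  "nosuid"  },
        { MS_NODEV,   "nodev"   },
        { MS_NOEXEC,  "noexec"  },
        { MS_REMOUNT, "remount" },
        { MS_BIND,    "bind"    },
    };
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        int n;

        if (!(flags & table[i].bit))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     used ? "," : "", table[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* buffer too small: stop with what fits */
        used += (size_t)n;
    }
    return buf;
}
```

With 0x1021 as input, the bits set are MS_RDONLY (0x1), MS_REMOUNT
(0x20) and MS_BIND (0x1000), so none of nosuid, nodev or noexec is
being requested on the remount.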

What are the mount flags in a working libvirt-lxc?

Eric

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                         ` <87fv6gikfn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-05-28 20:12                           ` Kenton Varda
  2015-05-28 20:47                             ` Richard Weinberger
  2015-05-29  0:30                           ` Andy Lutomirski
  1 sibling, 1 reply; 85+ messages in thread
From: Kenton Varda @ 2015-05-28 20:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Serge Hallyn, Seth Forshee, Linux API,
	Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo

On Thu, May 28, 2015 at 12:14 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> But please someone test sandstorm with this patchset and tell me if it
> bites you.  The impetus to find a way to avoid breaking slightly buggy
> userspace is higher if it is more than unprivileged lxc that is broken.

One of these days I'm going to learn how to compile and test kernels
again (last time I did it was 1999). Unfortunately I don't think I
have time at the moment, but hopefully Andy can do it.

I note, though, that we only have three mount() calls in the sandstorm
codebase that seem like they could be affected:

run-bundle.c++:1264: KJ_SYSCALL(mount("proc", "proc", "proc",
MS_NOSUID | MS_NODEV | MS_NOEXEC, ""));
minibox.c++:251: KJ_SYSCALL(mount("proc", vpath.cStr(), "proc",
MS_NOSUID | MS_NODEV | MS_NOEXEC, ""),
supervisor.c++:921: KJ_SYSCALL(mount("/proc", "proc", nullptr, MS_BIND
| MS_REC, nullptr));

The first two seem like they should be fine since they set all the
flags (except readonly, which would be inappropriate for proc). I
guess my habit of setting every security flag I see came in handy. The
third case looks like it will be broken, BUT this line is in a
debug-only code path, so I don't care. Also we have the ability to
push any needed update within 24 hours, so we're generally in good
shape.

We never mount sysfs in Sandstorm.

> If I mount proc read-write I likely want to be able to write to proc
> files, and I will be much happier if the mount fails than if a bazillion
> syscalls later something else fails when it tries to write to proc.

I'm not sure that's true. Consider the broader context:
1) Your system's /proc is mounted read-only.
2) Now you're trying to mount a new proc in a new pid namespace, and
you do *not* specify MS_RDONLY.

What should we expect here? Let's back off a bit and state user intent:
1) The system administrator has set a system-wide policy that /proc
may only be read, not written.
2) You made a PID namespace and it needed its own proc.

It seems intuitive here that the administrator's policy should apply
in the namespace. Certainly everyone using the system and/or all
software on the system already needs to be aware of this policy, since
it's unusual and will break things. Running software on this system
outside of any container already has the problem that syscalls
randomly break, so why should it be surprising when this happens
inside the container as well? Why do we need to go out of our way to
break at mount() time?

-Kenton

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
  2015-05-28 19:57                 ` Eric W. Biederman
@ 2015-05-28 20:30                   ` Richard Weinberger
       [not found]                     ` <55677AEF.1090809-/L3Ra7n9ekc@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Richard Weinberger @ 2015-05-28 20:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge Hallyn, Andy Lutomirski, Seth Forshee, Linux API,
	Linux Containers, Greg Kroah-Hartman, Kenton Varda,
	Michael Kerrisk-manpages, Linux FS Devel, Tejun Heo

Am 28.05.2015 um 21:57 schrieb Eric W. Biederman:
>> FWIW, it breaks also libvirt-lxc:
>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted
> 
> Interesting.  I had not anticipated a failure there?  And it is failing
> in remount?  Oh that is interesting.
> 
> That implies that there is some flag of the original mount of /proc that
> the remount of /proc/sys is clearing.
> 
> The flags specified are currently rdonly,remount,bind so I expect there
> are some other flags on proc that libvirt-lxc is clearing by accident
> and we did not fail before because the kernel was not enforcing things.

Please see:
http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933
lxcContainerMountBasicFS()

and:
http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850
lxcBasicMounts

> What are the mount flags in a working libvirt-lxc?

See:
test1:~ # cat /proc/self/mountinfo
147 100 0:30 /srv/container/test1/rootfs / rw,relatime - btrfs /dev/sda2 rw,space_cache
149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw
151 150 0:3 /sys/net/ipv4 /proc/sys/net/ipv4 rw,nosuid,nodev,noexec,relatime - proc proc rw
152 150 0:3 /sys/net/ipv6 /proc/sys/net/ipv6 rw,nosuid,nodev,noexec,relatime - proc proc rw
153 147 0:57 / /sys ro,nodev,relatime - sysfs sysfs rw
154 149 0:53 /meminfo /proc/meminfo rw,nosuid,nodev,relatime - fuse libvirt rw,user_id=0,group_id=0,allow_other
155 153 0:58 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,size=64k,mode=755,uid=10000,gid=10000
156 155 0:22 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu,cpuacct
157 155 0:21 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset
158 155 0:23 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
159 155 0:24 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
160 155 0:25 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
161 155 0:27 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
162 155 0:26 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls,net_prio
163 155 0:28 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event
164 155 0:19 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
165 147 0:52 / /dev rw,nosuid,relatime - tmpfs devfs rw,size=64k,mode=755
166 165 0:51 / /dev/pts rw,nosuid,relatime - devpts devpts rw,gid=10005,mode=620,ptmxmode=666
167 165 0:51 /ptmx /dev/ptmx rw,nosuid,relatime - devpts devpts rw,gid=10005,mode=620,ptmxmode=666
101 165 0:55 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw,uid=10000,gid=10000
102 147 0:59 / /run rw,nosuid,nodev - tmpfs tmpfs rw,mode=755,uid=10000,gid=10000
103 165 0:54 / /dev/mqueue rw,nodev,relatime - mqueue mqueue rw
104 147 0:59 / /var/run rw,nosuid,nodev - tmpfs tmpfs rw,mode=755,uid=10000,gid=10000
105 147 0:59 /lock /var/lock rw,nosuid,nodev - tmpfs tmpfs rw,mode=755,uid=10000,gid=10000

If you need more info, please let me know. :-)

Thanks,
//richard

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
  2015-05-28 20:12                           ` Kenton Varda
@ 2015-05-28 20:47                             ` Richard Weinberger
  2015-05-28 21:07                               ` Kenton Varda
  0 siblings, 1 reply; 85+ messages in thread
From: Richard Weinberger @ 2015-05-28 20:47 UTC (permalink / raw)
  To: Kenton Varda, Eric W. Biederman
  Cc: Andy Lutomirski, Serge Hallyn, Seth Forshee, Linux API,
	Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages,
	Linux FS Devel, Tejun Heo

Am 28.05.2015 um 22:12 schrieb Kenton Varda:
> We never mount sysfs in Sandstorm.

sysfs is ABI and applications depend on it.
Even glibc is using sysfs. Currently it has
fallback paths but these may go away...

Thanks,
//richard

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
  2015-05-28 15:03             ` Eric W. Biederman
  2015-05-28 17:33               ` Andy Lutomirski
@ 2015-05-28 21:04               ` Serge E. Hallyn
       [not found]                 ` <20150528210438.GA14849-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  1 sibling, 1 reply; 85+ messages in thread
From: Serge E. Hallyn @ 2015-05-28 21:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge Hallyn, Richard Weinberger, Kenton Varda, Linux API,
	Linux Containers, Andy Lutomirski, Seth Forshee,
	Michael Kerrisk-manpages, Greg Kroah-Hartman, Linux FS Devel,
	Tejun Heo

On Thu, May 28, 2015 at 10:03:28AM -0500, Eric W. Biederman wrote:
> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> 
> > Quoting Andy Lutomirski (luto@amacapital.net):
> >> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
> >> <ebiederm@xmission.com> wrote:
> >> > I had hoped to get some Tested-By's on that patch series.
> >> 
> >> Sorry, I've been totally swamped.
> >> 
> >> I suspect that Sandstorm is okay, but I haven't had a chance to test
> >> it for real.  Sandstorm makes only limited use of proc and sysfs in
> >> containers, but I'll see if I can test it for real this weekend.
> >
> > Testing this with unprivileged containers, I get
> >
> > lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
> > - error mounting sysfs on
> > /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
> 
> Grr..  I was afraid this would break something. :(
> 
> Looking at my system I see that sysfs is currently mounted
> "nosuid,nodev,noexec"
> 
> Looking at the lxc-start code I don't see it as including any of those
> mount options.  In practice for sysfs I think those options are
> meaningless (as there should be no devices and nothing executable in
> sysfs) but I can understand the past concerns with chmod on virtual
> filesystems that would incline people to use them, so I think the
> failure is reporting a legitimate security issue in the lxc userspace
> code where the unprivileged code is currently attempting to give
> greater access to sysfs than was given by the original mount of sysfs.
> 
> As nosuid,nodev,noexec should not impair the operation of sysfs
> it looks like you can always specify those options and just
> make this concern go away.
> 
> Something like the untested patch below I expect.
> 
> diff --git a/src/lxc/conf.c b/src/lxc/conf.c
> index 9870455b3cae..d9ccd03afe68 100644
> --- a/src/lxc/conf.c
> +++ b/src/lxc/conf.c
> @@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha
>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger",                             "%r/proc/sysrq-trigger",        NULL,       MS_BIND,                        NULL },
>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL,                                                "%r/proc/sysrq-trigger",        NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },
>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW,    "proc",                                              "%r/proc",                      "proc",     MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
> -		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    0,                              NULL },
> -		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_RDONLY,                      NULL },
> +		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
> +		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL },
>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "%r/sys",                                            "%r/sys",                       NULL,       MS_BIND,                        NULL },
>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  NULL,                                                "%r/sys",                       NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },

fwiw - the first one works, the second one does not due to an apparent
inability to statvfs the origin.

> Alternately you can read the flags off of the original mount of proc or sysfs.
> 
> diff --git a/src/lxc/conf.c b/src/lxc/conf.c
> index 9870455b3cae..50ea49973e80 100644
> --- a/src/lxc/conf.c
> +++ b/src/lxc/conf.c
> @@ -712,7 +712,9 @@ static unsigned long add_required_remount_flags(const char *s, const char *d,
>         struct statvfs sb;
>         unsigned long required_flags = 0;
>  
> -       if (!(flags & MS_REMOUNT))
> +       if (!(flags & MS_REMOUNT) &&
> +           (strcmp(s, "proc") != 0) &&
> +           (strcmp(s, "sysfs") != 0))
>                 return flags;
>  
>         if (!s)
> 
> Eric
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
  2015-05-28 20:47                             ` Richard Weinberger
@ 2015-05-28 21:07                               ` Kenton Varda
       [not found]                                 ` <CAOP=4wiAA4SqvMn_rQJHOjg6M-75bi_G9Fx8ENgVnYdkT5WVQA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Kenton Varda @ 2015-05-28 21:07 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Eric W. Biederman, Andy Lutomirski, Serge Hallyn, Seth Forshee,
	Linux API, Linux Containers, Greg Kroah-Hartman,
	Michael Kerrisk-manpages, Linux FS Devel, Tejun Heo

On Thu, May 28, 2015 at 1:47 PM, Richard Weinberger <richard@nod.at> wrote:
> Am 28.05.2015 um 22:12 schrieb Kenton Varda:
>> We never mount sysfs in Sandstorm.
>
> sysfs is ABI and applications depend on it.
> Even glibc is using sysfs. Currently it has
> fallback paths but these may go away...

Off-topic, but Sandstorm isn't intended to provide a full Linux ABI.
It is intended to provide a secure sandbox that can run apps that have
been explicitly ported to Sandstorm. More background if you're interested:

https://github.com/sandstorm-io/sandstorm/wiki/Security-Practices-Overview#server-sandboxing
https://blog.sandstorm.io/news/2014-08-13-sandbox-security.html

-Kenton

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                                 ` <CAOP=4wiAA4SqvMn_rQJHOjg6M-75bi_G9Fx8ENgVnYdkT5WVQA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-28 21:12                                   ` Richard Weinberger
  0 siblings, 0 replies; 85+ messages in thread
From: Richard Weinberger @ 2015-05-28 21:12 UTC (permalink / raw)
  To: Kenton Varda
  Cc: Linux API, Linux Containers, Serge Hallyn, Andy Lutomirski,
	Seth Forshee, Eric W. Biederman, Greg Kroah-Hartman,
	Linux FS Devel, Tejun Heo, Michael Kerrisk-manpages

Am 28.05.2015 um 23:07 schrieb Kenton Varda:
> On Thu, May 28, 2015 at 1:47 PM, Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> wrote:
>> Am 28.05.2015 um 22:12 schrieb Kenton Varda:
>>> We never mount sysfs in Sandstorm.
>>
>> sysfs is ABI and applications depend on it.
>> Even glibc is using sysfs. Currently it has
>> fallback paths but these may go away...
> 
> Off-topic, but Sandstorm isn't intended to provide a full Linux ABI.
> It is intended to provide a secure sandbox that can run apps that have
> been explicitly ported to Sandstorm. More background if you're interested:

Ahh, the application needs to be Sandstorm aware.
I was missing that detail. Thanks for pointing that out!

Thanks,
//richard

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                     ` <55677AEF.1090809-/L3Ra7n9ekc@public.gmane.org>
@ 2015-05-28 21:32                       ` Eric W. Biederman
       [not found]                         ` <87iobcfkwx.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-28 21:32 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Kenton Varda, Greg Kroah-Hartman, Linux Containers, Serge Hallyn,
	Andy Lutomirski, Seth Forshee, Michael Kerrisk-manpages,
	Linux API, Linux FS Devel, Tejun Heo

Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:

> Am 28.05.2015 um 21:57 schrieb Eric W. Biederman:
>>> FWIW, it breaks also libvirt-lxc:
>>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted
>> 
>> Interesting.  I had not anticipated a failure there?  And it is failing
>> in remount?  Oh that is interesting.
>> 
>> That implies that there is some flag of the original mount of /proc that
>> the remount of /proc/sys is clearing.
>> 
>> The flags specified are currently rdonly,remount,bind so I expect there
>> are some other flags on proc that libvirt-lxc is clearing by accident
>> and we did not fail before because the kernel was not enforcing things.
>
> Please see:
> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933
> lxcContainerMountBasicFS()
>
> and:
> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850
> lxcBasicMounts
>
>> What are the mount flags in a working libvirt-lxc?
>
> See:
> test1:~ # cat /proc/self/mountinfo
> 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw

> If you need more info, please let me know. :-)

Oh interesting I had not realized libvirt-lxc had grown an unprivileged
mode using user namespaces.

This does appear to be a classic remount bug, where you are not
preserving the permissions.  It appears the fact that the code
failed to enforce locked permissions on the fresh mount of proc
was hiding this bug until now.

I expect what you actually want is the code below:

diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
index 9a9ae5c2aaf0..f008a7484bfe 100644
--- a/src/lxc/lxc_container.c
+++ b/src/lxc/lxc_container.c
@@ -850,7 +850,7 @@ typedef struct {
 
 static const virLXCBasicMountInfo lxcBasicMounts[] = {
     { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
-    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
+    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
     { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
     { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
     { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },

Or possibly just:

diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
index 9a9ae5c2aaf0..a60ccbd12bfc 100644
--- a/src/lxc/lxc_container.c
+++ b/src/lxc/lxc_container.c
@@ -850,7 +850,7 @@ typedef struct {
 
 static const virLXCBasicMountInfo lxcBasicMounts[] = {
     { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
-    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
+    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
     { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
     { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
     { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },

There is little point in making /proc/sys read-only in a user
namespace, as the permission checks are uid based and no-one should
have the global uid 0 in your container.  That makes mounting /proc/sys
read-only rather pointless.
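
For illustration, the general shape of a read-only bind remount that
does not clear flags could look like the sketch below. It is not
libvirt code: readonly_bind_remount_flags() and PRESERVED_FLAGS are
hypothetical names, and the exact set of flags the kernel treats as
locked is assumed here rather than taken from the patch set.

```c
#include <assert.h>
#include <sys/mount.h>

/* Assumed set of per-mount flags that a remount must not clear. */
#define PRESERVED_FLAGS (MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC)

/*
 * Hypothetical helper: build the flags for a read-only bind remount
 * from the flags currently set on the mount, so that nosuid, nodev
 * and noexec survive the remount.
 */
unsigned long readonly_bind_remount_flags(unsigned long current)
{
    /* `current` holds mount attributes, not operation flags. */
    assert((current & (MS_REMOUNT | MS_BIND)) == 0);

    return MS_REMOUNT | MS_BIND | MS_RDONLY | (current & PRESERVED_FLAGS);
}
```

For a /proc mounted nosuid,nodev,noexec this gives
rdonly,nosuid,nodev,noexec,remount,bind, whereas passing no current
flags gives 0x1021, the value in the error message above.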

Eric

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                 ` <20150528210438.GA14849-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2015-05-28 21:42                   ` Eric W. Biederman
  2015-05-28 21:52                     ` Serge E. Hallyn
  0 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-28 21:42 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Serge Hallyn, Richard Weinberger, Kenton Varda, Linux API,
	Linux Containers, Andy Lutomirski, Seth Forshee,
	Michael Kerrisk-manpages, Greg Kroah-Hartman, Linux FS Devel,
	Tejun Heo

"Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:

> On Thu, May 28, 2015 at 10:03:28AM -0500, Eric W. Biederman wrote:
>> Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:
>> 
>> > Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
>> >> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
>> >> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> >> > I had hoped to get some Tested-By's on that patch series.
>> >> 
>> >> Sorry, I've been totally swamped.
>> >> 
>> >> I suspect that Sandstorm is okay, but I haven't had a chance to test
>> >> it for real.  Sandstorm makes only limited use of proc and sysfs in
>> >> containers, but I'll see if I can test it for real this weekend.
>> >
>> > Testing this with unprivileged containers, I get
>> >
>> > lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
>> > - error mounting sysfs on
>> > /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
>> 
>> Grr..  I was afraid this would break something. :(
>> 
>> Looking at my system I see that sysfs is currently mounted
>> "nosuid,nodev,noexec"
>> 
>> Looking at the lxc-start code I don't see it as including any of those
>> mount options.  In practice for sysfs I think those options are
>> meaningless (as there should be no devices and nothing executable in
>> sysfs) but I can understand the past concerns with chmod on virtual
>> filesystems that would incline people to use them, so I think the
>> failure is reporting a legitimate security issue in the lxc userspace
>> code where the unprivileged code is currently attempting to give
>> greater access to sysfs than was given by the original mount of sysfs.
>> 
>> As nosuid,nodev,noexec should not impair the operation of sysfs
>> it looks like you can always specify those options and just
>> make this concern go away.
>> 
>> Something like the untested patch below I expect.
>> 
>> diff --git a/src/lxc/conf.c b/src/lxc/conf.c
>> index 9870455b3cae..d9ccd03afe68 100644
>> --- a/src/lxc/conf.c
>> +++ b/src/lxc/conf.c
>> @@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha
>>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger",                             "%r/proc/sysrq-trigger",        NULL,       MS_BIND,                        NULL },
>>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL,                                                "%r/proc/sysrq-trigger",        NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },
>>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW,    "proc",                                              "%r/proc",                      "proc",     MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
>> -		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    0,                              NULL },
>> -		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_RDONLY,                      NULL },
>> +		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
>> +		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL },
>>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
>>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "%r/sys",                                            "%r/sys",                       NULL,       MS_BIND,                        NULL },
>>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  NULL,                                                "%r/sys",                       NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },
>
> fwiw - the first one works, the second one does not due to an apparent
> inability to statvfs the origin.

Good to hear.  That confirms there are no other gotchas waiting in the
wings.

Apparently my second suggested patch is buggy due to an invalid source
string.  The source would need to be %r/proc or %r/sysfs to use statvfs
productively.


>> Alternately you can read the flags off of the original mount of proc or sysfs.
>> 
>> diff --git a/src/lxc/conf.c b/src/lxc/conf.c
>> index 9870455b3cae..50ea49973e80 100644
>> --- a/src/lxc/conf.c
>> +++ b/src/lxc/conf.c
>> @@ -712,7 +712,9 @@ static unsigned long add_required_remount_flags(const char *s, const char *d,
>>         struct statvfs sb;
>>         unsigned long required_flags = 0;
>>  
>> -       if (!(flags & MS_REMOUNT))
>> +       if (!(flags & MS_REMOUNT) &&
>> +           (strcmp(s, "proc") != 0) &&
>> +           (strcmp(s, "sysfs") != 0))
>>                 return flags;
>>  
>>         if (!s)
>> 
>> Eric

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                         ` <87iobcfkwx.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-05-28 21:46                           ` Richard Weinberger
       [not found]                             ` <55678CCA.80807-/L3Ra7n9ekc@public.gmane.org>
  2015-05-29  9:30                           ` Richard Weinberger
  1 sibling, 1 reply; 85+ messages in thread
From: Richard Weinberger @ 2015-05-28 21:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kenton Varda, Greg Kroah-Hartman, Linux Containers, Serge Hallyn,
	Andy Lutomirski, Seth Forshee, Michael Kerrisk-manpages,
	Linux API, Linux FS Devel, Tejun Heo

Am 28.05.2015 um 23:32 schrieb Eric W. Biederman:
> Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:
> 
>> Am 28.05.2015 um 21:57 schrieb Eric W. Biederman:
>>>> FWIW, it breaks also libvirt-lxc:
>>>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted
>>>
>>> Interesting.  I had not anticipated a failure there?  And it is failing
>>> in remount?  Oh that is interesting.
>>>
>>> That implies that there is some flag of the original mount of /proc that
>>> the remount of /proc/sys is clearing.
>>>
>>> The flags specified are currently rdonly,remount,bind so I expect there
>>> are some other flags on proc that libvirt-lxc is clearing by accident
>>> and we did not fail before because the kernel was not enforcing things.
>>
>> Please see:
>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933
>> lxcContainerMountBasicFS()
>>
>> and:
>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850
>> lxcBasicMounts
>>
>>> What are the mount flags in a working libvirt-lxc?
>>
>> See:
>> test1:~ # cat /proc/self/mountinfo
>> 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
>> 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw
> 
>> If you need more info, please let me know. :-)
> 
> Oh interesting I had not realized libvirt-lxc had grown an unprivileged
> mode using user namespaces.

Yep. It works quite well. I've migrated all my containers from lxc
to libvirt-lxc because libvirt-lxc had a working user-namespace
implementation before lxc.

> This does appear to be a classic remount bug, where you are not
> preserving the permissions.  It appears the fact that the code
> failed to enforce locked permissions on the fresh mount of proc
> was hiding this bug until now.
> 
> I expect what you actually want is the code below:
> 
> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> index 9a9ae5c2aaf0..f008a7484bfe 100644
> --- a/src/lxc/lxc_container.c
> +++ b/src/lxc/lxc_container.c
> @@ -850,7 +850,7 @@ typedef struct {
>  
>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> 
> Or possibly just:
> 
> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> index 9a9ae5c2aaf0..a60ccbd12bfc 100644
> --- a/src/lxc/lxc_container.c
> +++ b/src/lxc/lxc_container.c
> @@ -850,7 +850,7 @@ typedef struct {
>  
>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },

I'll test your diff tomorrow with a fresh brain.
I sent a similar patch to libvirt folks some time ago, looks like it got lost. ;-\

> As the there is little point in making /proc/sys read-only in a
> user-namespace, as the permission checks are uid based and no-one should
> have the global uid 0 in your container.  Making mounting /proc/sys
> read-only rather pointless.

Yeah, I've been ranting about that for ages...
libvirt-lxc contains a lot of cruft to make privileged containers
kind of secure. Some users still fear using the user-namespace.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
  2015-05-28 21:42                   ` Eric W. Biederman
@ 2015-05-28 21:52                     ` Serge E. Hallyn
  0 siblings, 0 replies; 85+ messages in thread
From: Serge E. Hallyn @ 2015-05-28 21:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Serge Hallyn, Richard Weinberger, Kenton Varda,
	Linux API, Linux Containers, Andy Lutomirski, Seth Forshee,
	Michael Kerrisk-manpages, Greg Kroah-Hartman, Linux FS Devel,
	Tejun Heo

On Thu, May 28, 2015 at 04:42:34PM -0500, Eric W. Biederman wrote:
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
> > On Thu, May 28, 2015 at 10:03:28AM -0500, Eric W. Biederman wrote:
> >> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> >> 
> >> > Quoting Andy Lutomirski (luto@amacapital.net):
> >> >> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
> >> >> <ebiederm@xmission.com> wrote:
> >> >> > I had hoped to get some Tested-By's on that patch series.
> >> >> 
> >> >> Sorry, I've been totally swamped.
> >> >> 
> >> >> I suspect that Sandstorm is okay, but I haven't had a chance to test
> >> >> it for real.  Sandstorm makes only limited use of proc and sysfs in
> >> >> containers, but I'll see if I can test it for real this weekend.
> >> >
> >> > Testing this with unprivileged containers, I get
> >> >
> >> > lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
> >> > - error mounting sysfs on
> >> > /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
> >> 
> >> Grr..  I was afraid this would break something. :(
> >> 
> >> Looking at my system I see that sysfs is currently mounted
> >> "nosuid,nodev,noexec"
> >> 
> >> Looking at the lxc-start code I don't see it as including any of those
> >> mount options.  In practice for sysfs I think those options are
> >> meaningless (as there should be no devices and nothing executable in
> >> sysfs) but I can understand the past concerns with chmod on virtual
> >> filesystems that would incline people to use them, so I think the
> >> failure is reporting a legitimate security issue in the lxc userspace
> >> code where the the unprivileged code is currently attempting to give
> >> greater access to sysfs than was given by the original mount of sysfs.
> >> 
> >> As nosuid,nodev,noexec should not impair the operation of sysfs
> >> operation it looks like you can always specify those options and just
> >> make this concern go away.
> >> 
> >> Something like the untested patch below I expect.
> >> 
> >> diff --git a/src/lxc/conf.c b/src/lxc/conf.c
> >> index 9870455b3cae..d9ccd03afe68 100644
> >> --- a/src/lxc/conf.c
> >> +++ b/src/lxc/conf.c
> >> @@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha
> >>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger",                             "%r/proc/sysrq-trigger",        NULL,       MS_BIND,                        NULL },
> >>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL,                                                "%r/proc/sysrq-trigger",        NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },
> >>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW,    "proc",                                              "%r/proc",                      "proc",     MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
> >> -		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    0,                              NULL },
> >> -		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_RDONLY,                      NULL },
> >> +		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
> >> +		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL },
> >>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
> >>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "%r/sys",                                            "%r/sys",                       NULL,       MS_BIND,                        NULL },
> >>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  NULL,                                                "%r/sys",                       NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },
> >
> > fwiw - the first one works, the second one does not due to an apparent
> > inability to statvfs the origin.
> 
> Good to hear.  That confirms there are no other gotchas waiting in the
> wings.
> 
> Apparently my second suggested patch is buggy due to an invalid source
> string.  The source would need to be %r/proc or %r/sysfs to use statvfs
> productively.

Right, in these cases they're only passing in "sysfs".  The first way
is more explicit anyway (though it may not help some people who have a
"lxc.mount.entry = sysfs sys sysfs ro 0 0" line in their configuration
instead, so maybe we'll have to go with the second after all, d'oh).
I'll have to look into it next week.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                         ` <87fv6gikfn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-05-28 20:12                           ` Kenton Varda
@ 2015-05-29  0:30                           ` Andy Lutomirski
  1 sibling, 0 replies; 85+ messages in thread
From: Andy Lutomirski @ 2015-05-29  0:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, Kenton Varda, Richard Weinberger, Linux Containers,
	Serge Hallyn, Linux FS Devel, Michael Kerrisk-manpages,
	Greg Kroah-Hartman, Tejun Heo, Linux API

On May 28, 2015 12:19 PM, "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> Kenton Varda <kenton-AuYgBwuPrUQTaNkGU808tA@public.gmane.org> writes:
>
> > On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> >> On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman
> >> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> >>> Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:
> >>>
> >>>> Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
> >>>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
> >>>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> >>>>> > I had hoped to get some Tested-By's on that patch series.
> >>>>>
> >>>>> Sorry, I've been totally swamped.
> >>>>>
> >>>>> I suspect that Sandstorm is okay, but I haven't had a chance to test
> >>>>> it for real.  Sandstorm makes only limited use of proc and sysfs in
> >>>>> containers, but I'll see if I can test it for real this weekend.
> >>>>
> >>>> Testing this with unprivileged containers, I get
> >>>>
> >>>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
> >>>> - error mounting sysfs on
> >>>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
> >>>
> >>> Grr..  I was afraid this would break something. :(
> >>>
> >>> Looking at my system I see that sysfs is currently mounted
> >>> "nosuid,nodev,noexec"
> >>>
> >>> Looking at the lxc-start code I don't see it as including any of those
> >>> mount options.  In practice for sysfs I think those options are
> >>> meaningless (as there should be no devices and nothing executable in
> >>> sysfs) but I can understand the past concerns with chmod on virtual
> >>> filesystems that would incline people to use them, so I think the
> >>> failure is reporting a legitimate security issue in the lxc userspace
> >>> code where the the unprivileged code is currently attempting to give
> >>> greater access to sysfs than was given by the original mount of sysfs.
> >>>
> >>> As nosuid,nodev,noexec should not impair the operation of sysfs
> >>> operation it looks like you can always specify those options and just
> >>> make this concern go away.
> >>
> >> Linus is pretty strict about not breaking the ABI, and this definitely
> >> counts as breaking the ABI.  There's an exception for security issues,
> >> but is there really a security issue here?  That is, do we lose
> >> anything important if we just drop the offending part of the patch
> >> set?  As you've said, there shouldn't be sensitive device nodes,
> >> executables, or setuid files in proc or sysfs in the first place.
>
> We do need to enforce retaining the existing mount flags one way or
> another.  Where this really matters is with MS_RDONLY.  We don't want
> any old user to be able to mount /proc read-write when root mounted it
> read-only.  There is a very real attack vector there.  That attack
> almost works in docker container today and is avoided simply because
> docker mounts over a few files on proc.

You could drop the nosuid, noexec, and nodev changes and keep just the ro
part.  The ro part is probably not an ABI break in the sense of something
that actually breaks real programs.

>
> Which leads to the second side of the reason for these changes.   I am
> fixing a very small but long standing ABI break.   That is in some small
> ways I broke some sandboxes and when I realized they were broken I could
> not imagine think how to fix the code until now.
>
> It is the goal that user namespaces don't introduce anything for people
> to worry about security wise more than simply the ability to execute
> more kernel code.  So at least when the kernel implementation is correct
> developers of existing applications simply do not need care.  Sadly we are
> not quite there yet.
>
> > Speaking as a user of the mount() interfaces, I really think it would
> > be less confusing overall if mount() simply ignored the requested
> > flags when the caller doesn't have a choice. That is, in cases where
> > mount() currently fails with EPERM when not given, say, MS_NOSUID, it
> > should instead just pretend the caller actually set MS_NOSUID and go
> > ahead with a nosuid mount. Or put another way, the absence of
> > MS_NOSUID should not be interpreted as "remove the nosuid bit" but
> > rather "don't set the nosuid bit if not required".
>
> I am conflicted.  Implicits are nice but confusing.  If we can do
> something reliable and robust and maintainable here that is truly worth
> the cost I am all for it.
>
> If I mount proc read-write I likely want to be able to write to proc
> files, and I will be much happier if the mount fails than if a bazillion
> syscalls later something else fails when it tries to write to proc.

I agree.  I don't like the implicit thing.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
  2015-05-28 19:14                       ` Eric W. Biederman
       [not found]                         ` <87fv6gikfn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-05-29  0:35                         ` Andy Lutomirski
       [not found]                           ` <CALCETrXO21Y7PR=pKqaqJb1YZArNyjAv7Z-J44O53FcfLM_0Tw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-05-29  0:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kenton Varda, Serge Hallyn, Seth Forshee, Linux API,
	Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo

[resend due to HTML. Sorry.]


On May 28, 2015 12:19 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>
> Kenton Varda <kenton@sandstorm.io> writes:
>
> > On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman
> >> <ebiederm@xmission.com> wrote:
> >>> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> >>>
> >>>> Quoting Andy Lutomirski (luto@amacapital.net):
> >>>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
> >>>>> <ebiederm@xmission.com> wrote:
> >>>>> > I had hoped to get some Tested-By's on that patch series.
> >>>>>
> >>>>> Sorry, I've been totally swamped.
> >>>>>
> >>>>> I suspect that Sandstorm is okay, but I haven't had a chance to test
> >>>>> it for real.  Sandstorm makes only limited use of proc and sysfs in
> >>>>> containers, but I'll see if I can test it for real this weekend.
> >>>>
> >>>> Testing this with unprivileged containers, I get
> >>>>
> >>>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
> >>>> - error mounting sysfs on
> >>>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
> >>>
> >>> Grr..  I was afraid this would break something. :(
> >>>
> >>> Looking at my system I see that sysfs is currently mounted
> >>> "nosuid,nodev,noexec"
> >>>
> >>> Looking at the lxc-start code I don't see it as including any of those
> >>> mount options.  In practice for sysfs I think those options are
> >>> meaningless (as there should be no devices and nothing executable in
> >>> sysfs) but I can understand the past concerns with chmod on virtual
> >>> filesystems that would incline people to use them, so I think the
> >>> failure is reporting a legitimate security issue in the lxc userspace
> >>> code where the the unprivileged code is currently attempting to give
> >>> greater access to sysfs than was given by the original mount of sysfs.
> >>>
> >>> As nosuid,nodev,noexec should not impair the operation of sysfs
> >>> operation it looks like you can always specify those options and just
> >>> make this concern go away.
> >>
> >> Linus is pretty strict about not breaking the ABI, and this definitely
> >> counts as breaking the ABI.  There's an exception for security issues,
> >> but is there really a security issue here?  That is, do we lose
> >> anything important if we just drop the offending part of the patch
> >> set?  As you've said, there shouldn't be sensitive device nodes,
> >> executables, or setuid files in proc or sysfs in the first place.
>
> We do need to enforce retaining the existing mount flags one way or
> another.  Where this really matters is with MS_RDONLY.  We don't want
> any old user to be able to mount /proc read-write when root mounted it
> read-only.  There is a very real attack vector there.  That attack
> almost works in docker container today and is avoided simply because
> docker mounts over a few files on proc.

You could drop the nosuid, noexec, and nodev changes and keep just the
ro part.  The ro part is probably not an ABI break in the sense of
something that actually breaks real programs.

>
> Which leads to the second side of the reason for these changes.   I am
> fixing a very small but long standing ABI break.   That is in some small
> ways I broke some sandboxes and when I realized they were broken I could
> not imagine think how to fix the code until now.
>
> It is the goal that user namespaces don't introduce anything for people
> to worry about security wise more than simply the ability to execute
> more kernel code.  So at least when the kernel implementation is correct
> developers of existing applications simply do not need care.  Sadly we are
> not quite there yet.
>
> > Speaking as a user of the mount() interfaces, I really think it would
> > be less confusing overall if mount() simply ignored the requested
> > flags when the caller doesn't have a choice. That is, in cases where
> > mount() currently fails with EPERM when not given, say, MS_NOSUID, it
> > should instead just pretend the caller actually set MS_NOSUID and go
> > ahead with a nosuid mount. Or put another way, the absence of
> > MS_NOSUID should not be interpreted as "remove the nosuid bit" but
> > rather "don't set the nosuid bit if not required".
>
> I am conflicted.  Implicits are nice but confusing.  If we can do
> something reliable and robust and maintainable here that is truly worth
> the cost I am all for it.
>
> If I mount proc read-write I likely want to be able to write to proc
> files, and I will be much happier if the mount fails than if a bazillion
> syscalls later something else fails when it tries to write to proc.

I agree.  I don't like the implicit thing.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                           ` <CALCETrXO21Y7PR=pKqaqJb1YZArNyjAv7Z-J44O53FcfLM_0Tw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-29  4:36                             ` Eric W. Biederman
       [not found]                               ` <87fv6g80g7.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-29  4:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kenton Varda, Serge Hallyn, Seth Forshee, Linux API,
	Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
> On May 28, 2015 12:19 PM, "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> Kenton Varda <kenton-AuYgBwuPrUQTaNkGU808tA@public.gmane.org> writes:
>>
>> We do need to enforce retaining the existing mount flags one way or
>> another.  Where this really matters is with MS_RDONLY.  We don't want
>> any old user to be able to mount /proc read-write when root mounted it
>> read-only.  There is a very real attack vector there.  That attack
>> almost works in docker container today and is avoided simply because
>> docker mounts over a few files on proc.
>
> You could drop the nosuid, noexec, and nodev changes and keep just the
> ro part.  The ro part is probably not an ABI break in the sense of
> something that actually breaks real programs.

As a change simply removing the code from the existing patches that
worries about nosuid, noexec, and the nodev flags is certainly doable.
It is the best proposal I have heard so far.

I remain unconvinced about ignoring those flags:
- There are clearly people who think it matters (or else proc and sysfs
  would not have those flags specified).

- There have been times when it actually has mattered.
  Aka when files like /proc/self/env could be chmodded and used for
  privilege escalation.

- The code in lxc and libvirt-lxc so far has been clearly buggy.
  * lxc only has problems with sysfs (in some configurations).
  * libvirt-lxc only has problems on a bind mount remount of
    proc after remounting proc properly.

So I am leaning towards enforcing all of the mount flags including
nosuid, noexec, and nodev.  Then when the next subtle bug in proc or
sysfs with respect to chmod shows up I will be able to sleep soundly at
night because the mount flags of those filesystems allow a mitigation,
and I did not sabotage the mitigation.

Plus contemplating code that just enforces a couple of mount flags but
not all of them feels wrong.

I don't think it is actually a maintainable position to just enforce a
couple of those flags.  If nothing else I would expect someone to look
at the code and to generate a bug fix to start enforcing the rest of the
flags.  Or perhaps in a few years' time something gets refactored and
the enforcing starts happening by virtue of using a new common function
that no-one realizes will be a problem.

Additionally if we don't enforce nosuid, noexec, and nodev people are
going to ask questions that will be hard to answer, when what is
truly desirable is to say that sysfs and proc are a little odd but they
don't allow anything that a bind mount won't.

I can be persuaded otherwise but right now I do think the kernel code
needs to enforce nosuid, noexec, and nodev as it is a security issue (if
only a defence in depth one), and a maintenance issue as I do not
believe that in the long term ignoring them is a maintainable or
explicable position.
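
For illustration only: a userspace model of the strict rule argued for
above, assuming that a fresh mount of proc or sysfs must carry every
restriction found on the already visible mount.  This is not the kernel
patch; the names and the exact set of lockable flags are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <sys/mount.h>

/* Flags this model treats as lockable; the real kernel set may differ. */
#define MODEL_LOCKABLE (MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC)

/*
 * Userspace model, not kernel code: a fresh mount may add restrictions
 * but must keep every restriction present on the existing mount.
 * Returns 0 when allowed, -EPERM otherwise.
 */
static int model_check_fresh_mount(unsigned long existing,
                                   unsigned long requested)
{
    unsigned long locked = existing & MODEL_LOCKABLE;

    return ((requested & locked) == locked) ? 0 : -EPERM;
}

/* The lxc case from this thread: sysfs is nosuid,nodev,noexec, flags are 0. */
static void model_example(void)
{
    unsigned long host = MS_NOSUID | MS_NODEV | MS_NOEXEC;

    assert(model_check_fresh_mount(host, 0) == -EPERM);
    assert(model_check_fresh_mount(host, host) == 0);
    assert(model_check_fresh_mount(host, host | MS_RDONLY) == 0);
}
```

Under this rule all four flags are treated alike, which is the
maintainability argument: there is no special case for MS_RDONLY.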

>> > Speaking as a user of the mount() interfaces, I really think it would
>> > be less confusing overall if mount() simply ignored the requested
>> > flags when the caller doesn't have a choice. That is, in cases where
>> > mount() currently fails with EPERM when not given, say, MS_NOSUID, it
>> > should instead just pretend the caller actually set MS_NOSUID and go
>> > ahead with a nosuid mount. Or put another way, the absence of
>> > MS_NOSUID should not be interpreted as "remove the nosuid bit" but
>> > rather "don't set the nosuid bit if not required".
>>
>> I am conflicted.  Implicits are nice but confusing.  If we can do
>> something reliable and robust and maintainable here that is truly worth
>> the cost I am all for it.
>>
>> If I mount proc read-write I likely want to be able to write to proc
>> files, and I will be much happier if the mount fails than if a bazillion
>> syscalls later something else fails when it tries to write to proc.
>
> I agree.  I don't like the implicit thing.

My memory of our last round of looking at this returns: for all
its warts, the existing mount API for remounting filesystems needs
the flags to have exactly the same meaning as at mount time.  There
are existing userspace applications that depend on that behavior.

Implicits for only the locked mount flags is a little different but
still ick.

Eric

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                               ` <87fv6g80g7.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-05-29  4:54                                 ` Kenton Varda
  2015-05-29 17:49                                 ` Andy Lutomirski
  1 sibling, 0 replies; 85+ messages in thread
From: Kenton Varda @ 2015-05-29  4:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Greg Kroah-Hartman, Linux Containers,
	Serge Hallyn, Andy Lutomirski, Seth Forshee,
	Michael Kerrisk-manpages, Linux API, Linux FS Devel, Tejun Heo

On Thu, May 28, 2015 at 9:36 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Implicits for only the locked mount flags is a little different but
> still ick.

FWIW, I only ever meant to advocate for this for locked flags, i.e.
cases where the only other option is to throw EPERM. Clearly when the
user has permission, the exact requested flags should be applied, or
all kinds of things break.

It seems to me that if we can fix the security issue without breaking
userspace, we should. Sometimes we end up with icky APIs to avoid
breaking userspace. (Though IMO implicitly preserving locked bits is
not icky at all.)
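
For illustration only: a userspace model of the implicit alternative
described above, assuming locked bits are simply ORed into whatever the
caller requested instead of failing with EPERM.  The names and the set of
lockable flags are assumptions, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mount.h>

/* Flags this model treats as lockable; the real kernel set may differ. */
#define MODEL_LOCKABLE (MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC)

/* Implicit preservation: the locked bits are always kept, never cleared. */
static unsigned long model_implicit_flags(unsigned long locked,
                                          unsigned long requested)
{
    return requested | (locked & MODEL_LOCKABLE);
}

/* True when the caller ends up with a more restricted mount than it asked for. */
static bool model_silently_restricted(unsigned long locked,
                                      unsigned long requested)
{
    return model_implicit_flags(locked, requested) != requested;
}

static void model_example(void)
{
    /* Caller asks for a read-write mount while MS_RDONLY is locked. */
    assert(model_implicit_flags(MS_RDONLY, 0) == MS_RDONLY);
    assert(model_silently_restricted(MS_RDONLY, 0));
}
```

The second helper captures the objection raised earlier in the thread: the
mount succeeds, and the mismatch only shows up later when a write fails.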

-Kenton

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                         ` <87iobcfkwx.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-05-28 21:46                           ` Richard Weinberger
@ 2015-05-29  9:30                           ` Richard Weinberger
       [not found]                             ` <556831CF.9040600-/L3Ra7n9ekc@public.gmane.org>
  2015-06-06 18:56                             ` Eric W. Biederman
  1 sibling, 2 replies; 85+ messages in thread
From: Richard Weinberger @ 2015-05-29  9:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kenton Varda, libvir-list-H+wXaHxf7aLQT0dZR+AlfA,
	Greg Kroah-Hartman, Linux Containers, Serge Hallyn,
	Andy Lutomirski, Seth Forshee, Michael Kerrisk-manpages,
	Linux API, Linux FS Devel, Tejun Heo, Cedric Bosdonnat

[CC'ing libvirt-lxc folks]

Am 28.05.2015 um 23:32 schrieb Eric W. Biederman:
> Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:
> 
>> Am 28.05.2015 um 21:57 schrieb Eric W. Biederman:
>>>> FWIW, it breaks also libvirt-lxc:
>>>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted
>>>
>>> Interesting.  I had not anticipated a failure there?  And it is failing
>>> in remount?  Oh that is interesting.
>>>
>>> That implies that there is some flag of the original mount of /proc that
>>> the remount of /proc/sys is clearing, and that previously 
>>>
>>> The flags specified are current rdonly,remount,bind so I expect there
>>> are some other flags on proc that libvirt-lxc is clearing by accident
>>> and we did not fail before because the kernel was not enforcing things.
>>
>> Please see:
>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933
>> lxcContainerMountBasicFS()
>>
>> and:
>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850
>> lxcBasicMounts
>>
>>> What are the mount flags in a working libvirt-lxc?
>>
>> See:
>> test1:~ # cat /proc/self/mountinfo
>> 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
>> 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw
> 
>> If you need more info, please let me know. :-)
> 
> Oh interesting I had not realized libvirt-lxc had grown an unprivileged
> mode using user namespaces.
> 
> This does appear to be a classic remount bug, where you are not
> preserving the permissions.  It appears the fact that the code
> failed to enforce locked permissions on the fresh mount of proc
> was hiding this bug until now.
> 
> I expect what you actually want is the code below:
> 
> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> index 9a9ae5c2aaf0..f008a7484bfe 100644
> --- a/src/lxc/lxc_container.c
> +++ b/src/lxc/lxc_container.c
> @@ -850,7 +850,7 @@ typedef struct {
>  
>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> 
> Or possibly just:
> 
> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> index 9a9ae5c2aaf0..a60ccbd12bfc 100644
> --- a/src/lxc/lxc_container.c
> +++ b/src/lxc/lxc_container.c
> @@ -850,7 +850,7 @@ typedef struct {
>  
>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> 
> As the there is little point in making /proc/sys read-only in a
> user-namespace, as the permission checks are uid based and no-one should
> have the global uid 0 in your container.  Making mounting /proc/sys
> read-only rather pointless.

Eric, using the patch below I was able to spawn a user-namespace enabled container
using libvirt-lxc. :-)

I had to:
1. Disable the read-only mount of /proc/sys which is anyway useless in the user-namespace case.
2. Disable the /proc/sys/net/ipv{4,6} bind mounts, this ugly hack is only needed for the non user-namespace case.
3. Remove MS_RDONLY from the sysfs mount (For the non user-namespace case we'd have to keep this, though).

Daniel, I'd take this as a chance to disable all the MS_RDONLY games if user namespaces are configured.
With Eric's fixes they hurt us. And as I wrote many times before, if root within the user-namespace
is able to do nasty things in /sys and /proc, that's a plain kernel bug which needs fixing. There is no
point in mounting these read-only, except for the case when no user-namespace is used.

diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
index 9a9ae5c..497e05f 100644
--- a/src/lxc/lxc_container.c
+++ b/src/lxc/lxc_container.c
@@ -850,10 +850,10 @@ typedef struct {

 static const virLXCBasicMountInfo lxcBasicMounts[] = {
     { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
-    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
-    { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
-    { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
-    { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
+    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
+    { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, true, false, true },
+    { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, true, false, true },
+    { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
     { "securityfs", "/sys/kernel/security", "securityfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, true, true, false },
 #if WITH_SELINUX
     { SELINUX_MOUNT, SELINUX_MOUNT, "selinuxfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, true, true, false },
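
For illustration only: a simplified model of a mount table with a
per-entry "skip when a user namespace is in use" switch, which is what
the first boolean column in the diff above appears to be.  The struct
layout, field names and helpers are assumptions and not the libvirt
definitions; the other two boolean columns are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mount.h>

/* Hypothetical, simplified row: only one of the boolean columns is modelled. */
struct model_mount {
    const char *src;
    const char *dst;
    const char *type;
    unsigned long flags;
    bool skip_in_userns;     /* assumed meaning of the first boolean column */
};

static const struct model_mount model_mounts[] = {
    { "proc",      "/proc",     "proc",  MS_NOSUID | MS_NOEXEC | MS_NODEV, false },
    { "/proc/sys", "/proc/sys", NULL,    MS_BIND | MS_RDONLY,              true  },
    { "sysfs",     "/sys",      "sysfs", MS_NOSUID | MS_NOEXEC | MS_NODEV, false },
};

/* An entry is mounted unless it is marked to be skipped and userns is on. */
static bool model_should_mount(const struct model_mount *m, bool userns_enabled)
{
    return !(userns_enabled && m->skip_in_userns);
}

/* Count the entries that would actually be mounted. */
static size_t model_count_mounts(bool userns_enabled)
{
    size_t i, n = 0;

    for (i = 0; i < sizeof(model_mounts) / sizeof(model_mounts[0]); i++)
        if (model_should_mount(&model_mounts[i], userns_enabled))
            n++;
    return n;
}
```

With the switch set, the read-only bind mount of /proc/sys is never
attempted in the user-namespace case, so the remount that hit EPERM does
not happen at all.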

Thanks,
//richard

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                             ` <556831CF.9040600-/L3Ra7n9ekc@public.gmane.org>
@ 2015-05-29 17:41                               ` Eric W. Biederman
  0 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-29 17:41 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Serge Hallyn, Andy Lutomirski, Seth Forshee, Linux API,
	Linux Containers, Greg Kroah-Hartman, Kenton Varda,
	Michael Kerrisk-manpages, Linux FS Devel, Tejun Heo,
	libvir-list@redhat.com, Daniel P. Berrange, Cedric Bosdonnat

Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:

> [CC'ing libvirt-lxc folks]
>
> Am 28.05.2015 um 23:32 schrieb Eric W. Biederman:
>> Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:
>> 
>>> Am 28.05.2015 um 21:57 schrieb Eric W. Biederman:
>>>>> FWIW, it breaks also libvirt-lxc:
>>>>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted
>>>>
>>>> Interesting.  I had not anticipated a failure there?  And it is failing
>>>> in remount?  Oh that is interesting.
>>>>
>>>> That implies that there is some flag of the original mount of /proc that
>>>> the remount of /proc/sys is clearing, and that previously 
>>>>
>>>> The flags specified are current rdonly,remount,bind so I expect there
>>>> are some other flags on proc that libvirt-lxc is clearing by accident
>>>> and we did not fail before because the kernel was not enforcing things.
>>>
>>> Please see:
>>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933
>>> lxcContainerMountBasicFS()
>>>
>>> and:
>>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850
>>> lxcBasicMounts
>>>
>>>> What are the mount flags in a working libvirt-lxc?
>>>
>>> See:
>>> test1:~ # cat /proc/self/mountinfo
>>> 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
>>> 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw
>> 
>>> If you need more info, please let me know. :-)
>> 
>> Oh interesting I had not realized libvirt-lxc had grown an unprivileged
>> mode using user namespaces.
>> 
>> This does appear to be a classic remount bug, where you are not
>> preserving the permissions.  It appears the fact that the code
>> failed to enforce locked permissions on the fresh mount of proc
>> was hiding this bug until now.
>> 
>> I expect what you actually want is the code below:
>> 
>> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
>> index 9a9ae5c2aaf0..f008a7484bfe 100644
>> --- a/src/lxc/lxc_container.c
>> +++ b/src/lxc/lxc_container.c
>> @@ -850,7 +850,7 @@ typedef struct {
>>  
>>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
>> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
>> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>> 
>> Or possibly just:
>> 
>> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
>> index 9a9ae5c2aaf0..a60ccbd12bfc 100644
>> --- a/src/lxc/lxc_container.c
>> +++ b/src/lxc/lxc_container.c
>> @@ -850,7 +850,7 @@ typedef struct {
>>  
>>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
>> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
>> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
>>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>> 
>> As there is little point in making /proc/sys read-only in a
>> user-namespace, as the permission checks are uid based and no-one should
>> have the global uid 0 in your container.  That makes mounting /proc/sys
>> read-only rather pointless.
>
> Eric, using the patch below I was able to spawn a user-namespace enabled container
> using libvirt-lxc. :-)

I am glad.  

I am trying to figure out which of these changes were necessary and
which were just nice to have.  That will inform the part of the
conversation asking whether there is a way to avoid breaking userspace
with this security fix.

> I had to:
> 1. Disable the read-only mount of /proc/sys which is anyway useless in
> the user-namespace case.

It is likely worth addressing the libvirt-lxc MS_REMOUNT code as it does
not preserve any mount flags, or even have the capability to try.

    if (bindOverReadonly &&
            mount(mnt_src, mnt->dst, NULL,
                  MS_BIND|MS_REMOUNT|MS_RDONLY, NULL) < 0) {
            virReportSystemError(errno,
                                 _("Failed to re-mount %s on %s flags=%x"),
                                 mnt_src, mnt->dst,
                                 MS_BIND|MS_REMOUNT|MS_RDONLY);
            goto cleanup;
     }

Aka the flags during remount are hard coded (which is buggy).
So I believe even without the use of user-namespaces this code does the
wrong thing.

Likely statvfs needs to be called to get the existing mount flags and
those should be applied during remount or possibly just the mount flags
from the virLXCBasicMountInfo entry should be added.
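
A minimal sketch of the statvfs idea, not taken from libvirt: the helper
names st_to_ms_flags() and remount_readonly_preserving() are hypothetical,
and the mapping assumes the ST_* bits that glibc reports in f_flag when
_GNU_SOURCE is defined.

```c
#define _GNU_SOURCE           /* for ST_NODEV, ST_NOEXEC, ST_NOATIME, ... */
#include <assert.h>
#include <stddef.h>
#include <sys/mount.h>
#include <sys/statvfs.h>

/* Hypothetical helper: translate statvfs f_flag bits into the MS_* flags
 * that a remount has to pass again in order to keep them set. */
unsigned long st_to_ms_flags(unsigned long f_flag)
{
    unsigned long ms = 0;

    if (f_flag & ST_RDONLY)     ms |= MS_RDONLY;
    if (f_flag & ST_NOSUID)     ms |= MS_NOSUID;
    if (f_flag & ST_NODEV)      ms |= MS_NODEV;
    if (f_flag & ST_NOEXEC)     ms |= MS_NOEXEC;
    if (f_flag & ST_NOATIME)    ms |= MS_NOATIME;
    if (f_flag & ST_NODIRATIME) ms |= MS_NODIRATIME;
    if (f_flag & ST_RELATIME)   ms |= MS_RELATIME;
    return ms;
}

/* Hypothetical helper: make an existing bind mount read-only while
 * keeping the flags it already carries.  Returns 0, or -1 with errno set. */
int remount_readonly_preserving(const char *dst)
{
    struct statvfs st;

    assert(dst != NULL);
    if (statvfs(dst, &st) < 0)
        return -1;

    /* Source and filesystem type are ignored for MS_REMOUNT. */
    return mount(NULL, dst, NULL,
                 MS_BIND | MS_REMOUNT | MS_RDONLY | st_to_ms_flags(st.f_flag),
                 NULL);
}
```

With something along these lines the remount no longer depends on a
hard-coded flag set, so flags inherited from the original mount of proc
would be passed back instead of being cleared.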

> 2. Disable the /proc/sys/net/ipv{4,6} bind mounts, this ugly hack is only needed for the non user-namespace case.

*Scratches my head*

Why was this necessary?  Those are just plain bind mounts which do not
need any remount-magic so they should have just worked and preserved
the existing mount flags.

I agree they are unnecessary in this context but I don't expect they
would have caused problems or were "wrong".

> 3. Remove MS_RDONLY from the sysfs mount (For the non user-namespace case we'd have to keep this, though).

Ok. I can see this as being necessary as well, and missed in the first
pass because the code did not get this far.

The code flow for sysfs appears to trigger the bindOverReadonly code as
MS_RDONLY is set.

Then the remount clears the other mount flags on sysfs.  Previously we
would not have enforced this, because sysfs in a new network namespace
is a fresh mount (and that is the bug my patchset fixes).

This does very much look like a bug in libvirt-lxc clearing flags it did
not intend to.
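
To make the flag loss concrete, here is a small sketch of the other option
mentioned above, carrying the table entry's flags into the remount.  The
helper names and RESTRICT_MASK are hypothetical and nothing here is
libvirt code; it only does the bit arithmetic.

```c
#include <assert.h>
#include <sys/mount.h>

/* The four per-mount restrictions discussed in this thread. */
#define RESTRICT_MASK (MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC)

/* Hypothetical helper: restrictions set by the initial mount() that a
 * later remount with remount_flags would clear. */
unsigned long flags_dropped_by_remount(unsigned long entry_flags,
                                       unsigned long remount_flags)
{
    return entry_flags & RESTRICT_MASK & ~remount_flags;
}

/* Hypothetical helper: remount flags that carry the entry's own
 * restrictions along instead of a hard-coded set. */
unsigned long remount_flags_for_entry(unsigned long entry_flags)
{
    /* Table entries describe first-time mounts, never remounts. */
    assert((entry_flags & MS_REMOUNT) == 0);
    return MS_BIND | MS_REMOUNT | MS_RDONLY | (entry_flags & RESTRICT_MASK);
}
```

For the sysfs entry (MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY) the hard-coded
MS_BIND|MS_REMOUNT|MS_RDONLY drops nosuid, noexec and nodev, while the
flags computed by remount_flags_for_entry() drop nothing.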

> Daniel, I'd take this as a chance to disable all the MS_RDONLY games if user-namespaces are configured.
> With Eric's fixes they hurt us. And as I wrote many times before if root within the user-namespace
> is able to do nasty things in /sys and /proc that's a plain kernel bug which needs fixing. There is no
> point in mounting these read-only, except for the case when no user-namespace is used.
>
> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> index 9a9ae5c..497e05f 100644
> --- a/src/lxc/lxc_container.c
> +++ b/src/lxc/lxc_container.c
> @@ -850,10 +850,10 @@ typedef struct {
>
>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> -    { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
> -    { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
> -    { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
> +    { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, true, false, true },
> +    { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, true, false, true },
> +    { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
>      { "securityfs", "/sys/kernel/security", "securityfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, true, true, false },
>  #if WITH_SELINUX
>      { SELINUX_MOUNT, SELINUX_MOUNT, "selinuxfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, true, true, false },
>
> Thanks,
> //richard

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                               ` <87fv6g80g7.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-05-29  4:54                                 ` Kenton Varda
@ 2015-05-29 17:49                                 ` Andy Lutomirski
  2015-06-03 21:13                                   ` Eric W. Biederman
  1 sibling, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-05-29 17:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Seth Forshee, Greg Kroah-Hartman,
	Linux Containers, Serge Hallyn, Kenton Varda,
	Michael Kerrisk-manpages, Linux API, Linux FS Devel, Tejun Heo

On Thu, May 28, 2015 at 9:36 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>> On May 28, 2015 12:19 PM, "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>> Kenton Varda <kenton-AuYgBwuPrUQTaNkGU808tA@public.gmane.org> writes:
>>>
>>> We do need to enforce retaining the existing mount flags one way or
>>> another.  Where this really matters is with MS_RDONLY.  We don't want
>>> any old user to be able to mount /proc read-write when root mounted it
>>> read-only.  There is a very real attack vector there.  That attack
>>> almost works in docker container today and is avoided simply because
>>> docker mounts over a few files on proc.
>>
>> You could drop the nosuid, noexec, and nodev changes and keep just the
>> ro part.  The ro part is probably not an ABI break in the sense of
>> something that actually breaks real programs.
>
> As a change simply removing the code from the existing patches that
> worries about nosuid, noexec, and the nodev flags is certainly doable.
> It is the best proposal I have heard so far.
>
> I remain unconvinced about ignoring those flags:
> - There are clearly people who think it matters (or else proc and sysfs
>   would not have those flags specified).
>
> - There have been times when it actually has mattered.
>   Aka when files like /proc/self/env could be chmodded and used for
>   privilege escalation.
>
> - The code in lxc and libvirt-lxc so far has been clearly buggy.
>   * lxc only has problems with sysfs (in some configurations).
>   * libvirt-lxc only has problems on a bind mount remount of
>     proc after remounting proc properly.
>
> So I am leaning towards enforcing all of the mount flags including
> nosuid, noexec, and nodev.  Then when the next subtle bug in proc or
> sysfs with respect to chmod shows up I will be able to sleep soundly at
> night because the mount flags of those filesystems allow a mitigation,
> and I did not sabotage the mitigation.

One option would be to break the nosuid, nodev, and noexec parts into
their own patch and then avoid tagging that patch for -stable if at
all possible.  It would be nice to avoid another -stable ABI break if
at all possible.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
  2015-05-29 17:49                                 ` Andy Lutomirski
@ 2015-06-03 21:13                                   ` Eric W. Biederman
       [not found]                                     ` <87k2vkebri.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-06-03 21:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kenton Varda, Serge Hallyn, Seth Forshee, Linux API,
	Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo

Andy Lutomirski <luto@amacapital.net> writes:

> One option would be to break the nosuid, nodev, and noexec parts into
> their own patch and then avoid tagging that patch for -stable if at
> all possible.  It would be nice to avoid another -stable ABI break if
> at all possible.

So I don't think we actually have anything that could be called an ABI
break in the whole mess, but it is definitely a behavioral change that
is a regression for lxc and libvirt-lxc that prevents them from starting.

nodev does not actually matter because of the implicit silliness that
is being added right now.

We do want those programs fixed and after those programs are fixed we
can safely begin failing mount when those attributes are being cleared
in a fresh mount.

So it looks to me like the best thing to do is to print a warning
whenever lxc or libvirt-lxc gets it wrong, which should ensure the
authors are sufficiently pestered that in a kernel release or 3 we can
begin enforcing those attributes.  Especially as the discussion on the
fix for those applications has already begun.
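
The warn-now, enforce-later policy described here can be modelled in
userspace roughly as follows.  This is an illustration only, not kernel
code: fresh_mount_allowed() and WARN_ONLY_MASK are hypothetical, MS_*
values stand in for the kernel's internal MNT_LOCK_* bits, and the real
check walks the mounts in the namespace rather than taking one flag word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/mount.h>

/* Flags that are only warned about during the transition period. */
#define WARN_ONLY_MASK (MS_NOSUID | MS_NODEV | MS_NOEXEC)

static_assert((WARN_ONLY_MASK & MS_RDONLY) == 0,
              "read-only must stay strictly enforced");

/* Hypothetical model: may a fresh mount requesting new_flags proceed
 * when the existing mount has locked_flags locked?  enforce_all selects
 * the later, stricter behaviour. */
bool fresh_mount_allowed(unsigned long locked_flags, unsigned long new_flags,
                         bool enforce_all)
{
    unsigned long cleared = locked_flags & ~new_flags;

    /* Clearing a locked read-only flag is always refused. */
    if (cleared & MS_RDONLY)
        return false;

    if (cleared & WARN_ONLY_MASK) {
        if (enforce_all)
            return false;
        fprintf(stderr, "warning: mount clears%s%s%s\n",
                (cleared & MS_NODEV) ? " nodev" : "",
                (cleared & MS_NOSUID) ? " nosuid" : "",
                (cleared & MS_NOEXEC) ? " noexec" : "");
    }
    return true;
}
```

Flipping enforce_all from false to true corresponds to the point, a few
releases later, at which clearing these flags would start to fail.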

And if folks would double check the patch I am going to post in a moment
to ensure that lxc and libvirt-lxc continue to start I would appreciate it.

Eric




^ permalink raw reply	[flat|nested] 85+ messages in thread

* [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible
       [not found]                                     ` <87k2vkebri.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-06-03 21:15                                       ` Eric W. Biederman
       [not found]                                         ` <87eglseboh.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-06-04  5:19                                       ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Greg Kroah-Hartman
  1 sibling, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-06-03 21:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Richard Weinberger, Seth Forshee, Greg Kroah-Hartman,
	Linux Containers, Serge Hallyn, Kenton Varda,
	Michael Kerrisk-manpages, Linux API, Linux FS Devel, Tejun Heo


Not allowing programs to clear nosuid, nodev, and noexec on new mounts
of sysfs or proc will cause lxc and libvirt-lxc to fail to start (a
regression).  There are no device nodes or executables on sysfs or
proc today which means clearing these flags is harmless today.

Instead of failing the fresh mounts of sysfs and proc emit a warning
when these flags are improperly cleared.  We only reach this point
because lxc and libvirt-lxc clear mount flags they had not intended
to.

In a couple of kernel releases when lxc and libvirt-lxc have been
fixed we can start failing fresh mounts of proc and sysfs that clear
nosuid, nodev and noexec.  Userspace clearly means to enforce those
attributes and historically they have avoided bugs.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index eccd925c6e82..eaa49b628d28 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3198,6 +3198,7 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 		if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) &&
 		    !(new_flags & MNT_READONLY))
 			continue;
+#if 0		/* Avoid unnecessary regressions */
 		if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
 		    !(new_flags & MNT_NODEV))
 			continue;
@@ -3207,6 +3208,7 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 		if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) &&
 		    !(new_flags & MNT_NOEXEC))
 			continue;
+#endif
 		if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
 		    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
 			continue;
@@ -3226,10 +3228,35 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 		}
 		/* Preserve the locked attributes */
 		*new_mnt_flags |= mnt->mnt.mnt_flags & (MNT_LOCK_READONLY | \
+						/* Avoid unnecessary regressions \
 							MNT_LOCK_NODEV    | \
 							MNT_LOCK_NOSUID   | \
 							MNT_LOCK_NOEXEC   | \
+						 */ \
 							MNT_LOCK_ATIME);
+		/* For now, warn about the "harmless" but invalid mnt flags */
+		{
+			bool nodev = false, nosuid = false, noexec = false;
+			if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
+			    !(new_flags & MNT_NODEV))
+				nodev = true;
+			if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
+			    !(new_flags & MNT_NOSUID))
+				nosuid = true;
+			if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) &&
+			    !(new_flags & MNT_NOEXEC))
+				noexec = true;
+
+			if ((nodev || nosuid || noexec) && printk_ratelimit()) {
+				printk(KERN_INFO
+				       "warning: process `%s' clears %s%s%sin mount of %s\n",
+				       current->comm,
+				       nodev ? "nodev ":"",
+				       nosuid ? "nosuid ":"",
+				       noexec ? "noexec ":"",
+				       type->name);
+			}
+		}
 		visible = true;
 		goto found;
 	next:	;
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible (take 2)
       [not found]                                         ` <87eglseboh.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-06-04  4:35                                           ` Eric W. Biederman
       [not found]                                             ` <874mmodral.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-06-05  0:46                                           ` [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible Andy Lutomirski
  1 sibling, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-06-04  4:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kenton Varda, Serge Hallyn, Seth Forshee, Linux API,
	Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo


Not allowing programs to clear nosuid and noexec on new mounts of
sysfs or proc will cause lxc and libvirt-lxc to fail to start (a
regression).  There are no executable files on sysfs or proc today
which means clearing these flags is harmless today.

Instead of failing the fresh mounts of sysfs and proc emit a warning
when these flags are improperly cleared.  We only reach this point
because lxc and libvirt-lxc clear mount flags they had not intended
to.

In a couple of kernel releases when lxc and libvirt-lxc have been
fixed we can start failing fresh mounts of proc and sysfs that clear
nosuid and noexec.  Userspace clearly means to enforce those
attributes and enforcing these attributes has historically avoided
bugs in the setattr implementations of proc and sysfs.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---

Now with warning on problematic remounts as well.
nodev is also ignored because it is not currently problematic.

 fs/namespace.c        | 33 +++++++++++++++++++++++++++++++++
 include/linux/mount.h |  5 +++++
 2 files changed, 38 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index eccd925c6e82..3c3f8172c734 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2162,6 +2162,18 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
 	    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK))) {
 		return -EPERM;
 	}
+	if ((mnt->mnt.mnt_flags & MNT_WARN_NOSUID) &&
+	    !(mnt_flags & MNT_NOSUID) && printk_ratelimit()) {
+		printk(KERN_INFO
+		       "warning: process `%s' clears nosuid in remount of %s\n",
+		       current->comm, sb->s_type->name);
+	}
+	if ((mnt->mnt.mnt_flags & MNT_WARN_NOEXEC) &&
+	    !(mnt_flags & MNT_NOEXEC) && printk_ratelimit()) {
+		printk(KERN_INFO
+		       "warning: process `%s' clears noexec in remount of %s\n",
+		       current->comm, sb->s_type->name);
+	}
 
 	err = security_sb_remount(sb, data);
 	if (err)
@@ -3201,12 +3213,14 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 		if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
 		    !(new_flags & MNT_NODEV))
 			continue;
+#if 0		/* Avoid unnecessary regressions */
 		if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
 		    !(new_flags & MNT_NOSUID))
 			continue;
 		if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) &&
 		    !(new_flags & MNT_NOEXEC))
 			continue;
+#endif
 		if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
 		    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
 			continue;
@@ -3227,9 +3241,28 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 		/* Preserve the locked attributes */
 		*new_mnt_flags |= mnt->mnt.mnt_flags & (MNT_LOCK_READONLY | \
 							MNT_LOCK_NODEV    | \
+						/* Avoid unnecessary regressions \
 							MNT_LOCK_NOSUID   | \
 							MNT_LOCK_NOEXEC   | \
+						 */ \
 							MNT_LOCK_ATIME);
+		/* For now, warn about the "harmless" but invalid mnt flags */
+		if (mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) {
+			*new_mnt_flags |= MNT_WARN_NOSUID;
+			if (!(new_flags & MNT_NOSUID) && printk_ratelimit()) {
+				printk(KERN_INFO
+				       "warning: process `%s' clears nosuid in mount of %s\n",
+				       current->comm, type->name);
+			}
+		}
+		if (mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) {
+			*new_mnt_flags |= MNT_WARN_NOEXEC;
+			if (!(new_flags & MNT_NOEXEC) && printk_ratelimit()) {
+				printk(KERN_INFO
+				       "warning: process `%s' clears noexec in mount of %s\n",
+				       current->comm, type->name);
+			}
+		}
 		visible = true;
 		goto found;
 	next:	;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index f822c3c11377..a9ac188413fd 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -52,6 +52,11 @@ struct mnt_namespace;
 
 #define MNT_INTERNAL	0x4000
 
+/* These warning options should be removed in a few kernel releases
+ * once userspace has been fixed.
+ */
+#define MNT_WARN_NOSUID		0x010000
+#define MNT_WARN_NOEXEC		0x020000
 #define MNT_LOCK_ATIME		0x040000
 #define MNT_LOCK_NOEXEC		0x080000
 #define MNT_LOCK_NOSUID		0x100000
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                                     ` <87k2vkebri.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-06-03 21:15                                       ` [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible Eric W. Biederman
@ 2015-06-04  5:19                                       ` Greg Kroah-Hartman
  2015-06-04  6:27                                         ` Eric W. Biederman
  1 sibling, 1 reply; 85+ messages in thread
From: Greg Kroah-Hartman @ 2015-06-04  5:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, Linux API, Linux Containers, Serge Hallyn,
	Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo

On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
> 
> > One option would be to break the nosuid, nodev, and noexec parts into
> > their own patch and then avoid tagging that patch for -stable if at
> > all possible.  It would be nice to avoid another -stable ABI break if
> > at all possible.
> 
> So I don't think we actually have anything that could be called an ABI
> break in the whole mess, but it is definitely a behavioral change that
> is a regression for lxc and libvirt-lxc that prevents them from starting.
> 
> nodev does not actually matter because of the implicit silliness that
> is being added right now.
> 
> We do want those programs fixed and after those programs are fixed we
> can safely begin failing mount when those attributes are being cleared
> in a fresh mount.
> 
> So it looks to me like the best thing to do is to print a warning
> whenever lxc or libvirt-lxc gets it wrong, which should ensure the
> authors are sufficiently pestered that in a kernel release or 3 we can
> begin enforcing those attributes.  Especially as the discussion on the
> fix for those applications has already begun.

"pestering" never works, look at some of the SCSI drivers for examples
of how a distro will just patch out the "warning this driver is using an
old api and needs to be fixed" messages.

You can't break stuff like this, people will get upset :(

greg k-h

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible (take 2)
       [not found]                                             ` <874mmodral.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-06-04  5:20                                               ` Greg Kroah-Hartman
  0 siblings, 0 replies; 85+ messages in thread
From: Greg Kroah-Hartman @ 2015-06-04  5:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Kenton Varda, Serge Hallyn, Seth Forshee,
	Linux API, Linux Containers, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo

On Wed, Jun 03, 2015 at 11:35:30PM -0500, Eric W. Biederman wrote:
> 
> Not allowing programs to clear nosuid and noexec on new mounts of
> sysfs or proc will cause lxc and libvirt-lxc to fail to start (a
> regression).  There are no executable files on sysfs or proc today
> which means clearing these flags is harmless today.
> 
> Instead of failing the fresh mounts of sysfs and proc emit a warning
> when these flags are improperly cleared.  We only reach this point
> because lxc and libvirt-lxc clear mount flags they had not intended
> to.
> 
> In a couple of kernel releases when lxc and libvirt-lxc have been
> fixed we can start failing fresh mounts of proc and sysfs that clear
> nosuid and noexec.  Userspace clearly means to enforce those
> attributes and enforcing these attributes has historically avoided
> bugs in the setattr implementations of proc and sysfs.
> 
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
> 
> Now with warning on problematic remounts as well.
> nodev is also ignored because it is not currently problematic.
> 
>  fs/namespace.c        | 33 +++++++++++++++++++++++++++++++++
>  include/linux/mount.h |  5 +++++
>  2 files changed, 38 insertions(+)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index eccd925c6e82..3c3f8172c734 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2162,6 +2162,18 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
>  	    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK))) {
>  		return -EPERM;
>  	}
> +	if ((mnt->mnt.mnt_flags & MNT_WARN_NOSUID) &&
> +	    !(mnt_flags & MNT_NOSUID) && printk_ratelimit()) {
> +		printk(KERN_INFO
> +		       "warning: process `%s' clears nosuid in remount of %s\n",
> +		       current->comm, sb->s_type->name);
> +	}
> +	if ((mnt->mnt.mnt_flags & MNT_WARN_NOEXEC) &&
> +	    !(mnt_flags & MNT_NOEXEC) && printk_ratelimit()) {
> +		printk(KERN_INFO
> +		       "warning: process `%s' clears noexec in remount of %s\n",
> +		       current->comm, sb->s_type->name);
> +	}
>  
>  	err = security_sb_remount(sb, data);
>  	if (err)
> @@ -3201,12 +3213,14 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
>  		if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
>  		    !(new_flags & MNT_NODEV))
>  			continue;
> +#if 0		/* Avoid unnecessary regressions */
>  		if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
>  		    !(new_flags & MNT_NOSUID))
>  			continue;
>  		if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) &&
>  		    !(new_flags & MNT_NOEXEC))
>  			continue;
> +#endif
>  		if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
>  		    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
>  			continue;
> @@ -3227,9 +3241,28 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
>  		/* Preserve the locked attributes */
>  		*new_mnt_flags |= mnt->mnt.mnt_flags & (MNT_LOCK_READONLY | \
>  							MNT_LOCK_NODEV    | \
> +						/* Avoid unnecessary regressions \
>  							MNT_LOCK_NOSUID   | \
>  							MNT_LOCK_NOEXEC   | \
> +						 */ \
>  							MNT_LOCK_ATIME);
> +		/* For now, warn about the "harmless" but invalid mnt flags */
> +		if (mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) {
> +			*new_mnt_flags |= MNT_WARN_NOSUID;
> +			if (!(new_flags & MNT_NOSUID) && printk_ratelimit()) {
> +				printk(KERN_INFO
> +				       "warning: process `%s' clears nosuid in mount of %s\n",
> +				       current->comm, type->name);
> +			}
> +		}
> +		if (mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) {
> +			*new_mnt_flags |= MNT_WARN_NOEXEC;
> +			if (!(new_flags & MNT_NOEXEC) && printk_ratelimit()) {
> +				printk(KERN_INFO
> +				       "warning: process `%s' clears noexec in mount of %s\n",
> +				       current->comm, type->name);
> +			}
> +		}

Adding this to a stable kernel is not going to be ok, sorry.  We can't
start being noisy in system logs for things that were working just fine.

greg k-h

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
  2015-06-04  5:19                                       ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Greg Kroah-Hartman
@ 2015-06-04  6:27                                         ` Eric W. Biederman
       [not found]                                           ` <87h9qo6la9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-06-04  6:27 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Andy Lutomirski, Kenton Varda, Serge Hallyn, Seth Forshee,
	Linux API, Linux Containers, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo

Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:

> On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>> 
>> > One option would be to break the nosuid, nodev, and noexec parts into
>> > their own patch and then avoid tagging that patch for -stable if at
>> > all possible.  It would be nice to avoid another -stable ABI break if
>> > at all possible.
>> 
>> So I don't think we actually have anything that could be called an ABI
>> break in the whole mess, but it is definitely a behavioral change that
>> is a regression for lxc and libvirt-lxc that prevents them from starting.
>> 
>> nodev does not actually matter because of the implicit silliness that
>> is being added right now.
>> 
>> We do want those programs fixed and after those programs are fixed we
>> can safely begin failing mount when those attributes are being cleared
>> in a fresh mount.
>> 
>> So it looks to me like the best thing to do is to print a warning
>> whenever lxc or libvirt-lxc gets it wrong, which should ensure the
>> authors are sufficiently pestered that in a kernel release or 3 we can
>> begin enforcing those attributes.  Especially as the discussion on the
>> fix for those applications has already begun.
>
> "pestering" never works, look at some of the SCSI drivers for examples
> of how a distro will just patch out the "warning this driver is using an
> old api and needs to be fixed" messages.

> You can't break stuff like this, people will get upset :(

A) To the best of my knowledge there are two programs on the face of the
   planet where this matters. (lxc and libvirt-lxc)

B) The code in those two programs is buggy.  That is, the code in those
   two programs does not do what the authors intended, so fixing
   those programs is something that should be done regardless of what
   I do in the kernel.  I have already reached out to the developers of
   those programs.  The pestering in the kernel is a form of reminder,
   not the primary source of communication.

C) These bugs really are security holes.  Currently they do not appear
   exploitable (thank goodness) but they are security holes.

   Since they are not currently exploitable it does make sense
   to give people a little time to get their act together.

   The bugs are larger than the case that is being hit here,
   this is just where they are noticed.

D) Letting people know that there is a problem as part of a larger
   effort has actually worked for me.  Distros have stopped enabling
   the sysctl system call.

E) Given that I have not audited sysfs and proc closely in recent years
   I may actually be wrong.  Those bugs may actually be exploitable.
   All it takes is chmod to be supported on one file that can be made
   executable.  That bug has existed in the past and I don't doubt
   someone will overlook something and we will see the bug again in the
   future.

So it is my best judgment that I should disable the code that stops
containers from starting and just make it a warning (for now).
Then in a release or so I start failing these operations instead of
warning.

This is the fairest and most reasonable approach I can see.

The only other choice I can see is to say: I don't care, it is a security
issue, I am breaking your sloppy insecure code.

Am I being too nice with these security bugs?

Eric

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                                           ` <87h9qo6la9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-06-04  7:34                                             ` Eric W. Biederman
  2015-06-16 12:23                                             ` Daniel P. Berrange
  1 sibling, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-06-04  7:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Seth Forshee, Linux API, Linux Containers, Serge Hallyn,
	Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo

ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:

> Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> writes:
>
>> On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote:
>>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>> 
>>> > One option would be to break the nosuid, nodev, and noexec parts into
>>> > their own patch and then avoid tagging that patch for -stable if at
>>> > all possible.  It would be nice to avoid another -stable ABI break if
>>> > at all possible.
>>> 
>>> So I don't think we actually have anything that could be called an ABI
>>> break in the whole mess, but it is definitely a behavioral change that
>>> is a regression for lxc and libvirt-lxc that prevents them from starting.
>>> 
>>> nodev does not actually matter because of the implicit silliness that
>>> is being added right now.
>>> 
>>> We do want those programs fixed and after those programs are fixed we
>>> can safely begin failing mount when those attributes are being cleared
>>> in a fresh mount.
>>> 
>>> So it looks to me like the best thing to do is to print a warning
>>> whenever lxc or libvirt-lxc gets it wrong, which should ensure the
>>> authors are sufficiently pestered that in a kernel release or 3 we can
>>> begin enforcing those attributes.  Especially as the discussion on the
>>> fix for those applications has already begun.
>>
>> "pestering" never works, look at some of the SCSI drivers for examples
>> of how a distro will just patch out the "warning this driver is using an
>> old api and needs to be fixed" messages.
>
>> You can't break stuff like this, people will get upset :(
>
> A) To the best of my knowledge there are two programs on the face of the
>    planet where this matters. (lxc and libvirt-lxc)
>
> B) The code in those two programs is buggy.  That is the code in those
>    two programs does not do what the authors intended.  That is fixing
>    those programs is something that should be done regardless of what
>    I do in the kernel.  I have already reached out to the developers of
>    those programs.  The pestering in the kernel is a form of reminder,
>    not the primary source of communication.
>
> C) These bugs really are security holes.  Currently they do not appear
>    exploitable (thank goodness) but they are security holes.
>
>    Since they are not currently exploitable it does make sense
>    to give people a little time to get their act together.
>
>    The bugs are larger than the case that is being hit here,
>    this is just where they are noticed.
>
> D) Letting people know that there is a problem as part of a larger
>    effort has actually worked for me.  Distros have stopped enabling
>    the sysctl system call.
>
> E) Given that I have not audited sysfs and proc closely in recent years
>    I may actually be wrong.  Those bugs may actually be exploitable.
>    All it takes is chmod to be supported on one file that can be made
>    executable.  That bug has existed in the past and I don't doubt
>    someone will overlook something and we will see the bug again in the
>    future.
>
> So it is my best judgment that I disable the code that stops
> containers from starting and just making it a warning (for now).
> Then in a release or so I start failing these operations instead of
> warning.
>
> This is the most fair and reasonable I can see to be.
>
> The only other choice I can see is to say I don't care it is a security
> issue I am breaking your sloppy insecure code.
>
> Am I being too nice with these security bugs?

Thinking about it a little more.  There is a possibility that sometime
in the future someone will deliberately add a suid executable as a
file in proc or sysfs and have a good reason for doing so.

Some sysadmin or sandbox builder with special requirements may then
disable suid and exec on proc because in their sandbox (not linux in
general) having access to that executable is a bad thing.  At which
point we have an exploitable security issue if nosuid and noexec are
not enforced.

Or in other words I am not smarter than the bad guys.  This is a
security issue.  I can not ignore nosuid and noexec indefinitely.
I have to make those cases fail at some point.  At that point
current unfixed versions of lxc and libvirt-lxc will break.

A warning is the nicest I can imagine being.

Eric

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible
       [not found]                                         ` <87eglseboh.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-06-04  4:35                                           ` [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible (take 2) Eric W. Biederman
@ 2015-06-05  0:46                                           ` Andy Lutomirski
       [not found]                                             ` <CALCETrWwtFaiaYGLoq4EPkrgcq9nEA2GseVfP3iBkbYZ8NfGPg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-06-05  0:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kenton Varda, Serge Hallyn, Seth Forshee, Linux API,
	Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo

On Wed, Jun 3, 2015 at 2:15 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> Not allowing programs to clear nosuid, nodev, and noexec on new mounts
> of sysfs or proc will cause lxc and libvirt-lxc to fail to start (a
> regression).  There are no device nodes or executables on sysfs or
> proc today which means clearing these flags is harmless today.
>
> Instead of failing the fresh mounts of sysfs and proc, emit a warning
> when these flags are improperly cleared.  We only reach this point
> because lxc and libvirt-lxc clear mount flags they had not
> intended to.
>
> In a couple of kernel releases when lxc and libvirt-lxc have been
> fixed we can start failing fresh mounts of proc and sysfs that clear
> nosuid, nodev and noexec.  Userspace clearly means to enforce those
> attributes and historically they have avoided bugs.

At the very least, I think this should be folded in so that the ABI
doesn't break in the middle of the series.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
  2015-05-29  9:30                           ` Richard Weinberger
       [not found]                             ` <556831CF.9040600-/L3Ra7n9ekc@public.gmane.org>
@ 2015-06-06 18:56                             ` Eric W. Biederman
       [not found]                               ` <87mw0c1x8p.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  1 sibling, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-06-06 18:56 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Serge Hallyn, Andy Lutomirski, Seth Forshee, Linux API,
	Linux Containers, Greg Kroah-Hartman, Kenton Varda,
	Michael Kerrisk-manpages, Linux FS Devel, Tejun Heo, libvir-list,
	Daniel P. Berrange, Cedric Bosdonnat

Richard Weinberger <richard@nod.at> writes:

> [CC'ing libvirt-lxc folks]
>
> Am 28.05.2015 um 23:32 schrieb Eric W. Biederman:
>> Richard Weinberger <richard@nod.at> writes:
>> 
>>> Am 28.05.2015 um 21:57 schrieb Eric W. Biederman:
>>>>> FWIW, it breaks also libvirt-lxc:
>>>>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted
>>>>
>>>> Interesting.  I had not anticipated a failure there?  And it is failing
>>>> in remount?  Oh that is interesting.
>>>>
>>>> That implies that there is some flag of the original mount of /proc that
>>>> the remount of /proc/sys is clearing, and that previously 
>>>>
>>>> The flags specified are current rdonly,remount,bind so I expect there
>>>> are some other flags on proc that libvirt-lxc is clearing by accident
>>>> and we did not fail before because the kernel was not enforcing things.
>>>
>>> Please see:
>>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933
>>> lxcContainerMountBasicFS()
>>>
>>> and:
>>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850
>>> lxcBasicMounts
>>>
>>>> What are the mount flags in a working libvirt-lxc?
>>>
>>> See:
>>> test1:~ # cat /proc/self/mountinfo
>>> 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
>>> 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw
>> 
>>> If you need more info, please let me know. :-)
>> 
>> Oh interesting I had not realized libvirt-lxc had grown an unprivileged
>> mode using user namespaces.
>> 
>> This does appear to be a classic remount bug, where you are not
>> preserving the permissions.  It appears the fact that the code
>> failed to enforce locked permissions on the fresh mount of proc
>> was hiding this bug until now.
>> 
>> I expect what you actually want is the code below:
>> 
>> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
>> index 9a9ae5c2aaf0..f008a7484bfe 100644
>> --- a/src/lxc/lxc_container.c
>> +++ b/src/lxc/lxc_container.c
>> @@ -850,7 +850,7 @@ typedef struct {
>>  
>>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
>> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
>> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>> 
>> Or possibly just:
>> 
>> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
>> index 9a9ae5c2aaf0..a60ccbd12bfc 100644
>> --- a/src/lxc/lxc_container.c
>> +++ b/src/lxc/lxc_container.c
>> @@ -850,7 +850,7 @@ typedef struct {
>>  
>>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
>> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
>> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
>>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>> 
>> As the there is little point in making /proc/sys read-only in a
>> user-namespace, as the permission checks are uid based and no-one should
>> have the global uid 0 in your container.  Making mounting /proc/sys
>> read-only rather pointless.
>
> Eric, using the patch below I was able to spawn a user-namespace enabled container
> using libvirt-lxc. :-)
>
> I had to:
> 1. Disable the read-only mount of /proc/sys which is anyway useless in the user-namespace case.
> 2. Disable the /proc/sys/net/ipv{4,6} bind mounts, this ugly hack is only needed for the non user-namespace case.
> 3. Remove MS_RDONLY from the sysfs mount (For the non user-namespace case we'd have to keep this, though).
>
> Daniel, I'd take this as a chance to disable all the MS_RDONLY games if user-namespace are configured.
> With Eric's fixes they hurt us. And as I wrote many times before if root within the user-namespace
> is able to do nasty things in /sys and /proc that's a plain kernel bug which needs fixing. There is no
> point in mounting these read-only. Except for the case then no user-namespace is used.
>

For clarity the patch below appears to be the minimal change needed to
fix this security issue.

AKA add mnt_mflags in when remounting something read-only.

/proc/sys needed to be updated so it had the proper flags to be added
back in.

I hope this helps.

diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
index 9a9ae5c2aaf0..11e9514e0761 100644
--- a/src/lxc/lxc_container.c
+++ b/src/lxc/lxc_container.c
@@ -850,7 +850,7 @@ typedef struct {
 
 static const virLXCBasicMountInfo lxcBasicMounts[] = {
     { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
-    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
+    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
     { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
     { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
     { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
@@ -1030,7 +1030,7 @@ static int lxcContainerMountBasicFS(bool userns_enabled,
 
         if (bindOverReadonly &&
             mount(mnt_src, mnt->dst, NULL,
-                  MS_BIND|MS_REMOUNT|MS_RDONLY, NULL) < 0) {
+                  MS_BIND|MS_REMOUNT|mnt_mflags|MS_RDONLY, NULL) < 0) {
             virReportSystemError(errno,
                                  _("Failed to re-mount %s on %s flags=%x"),
                                  mnt_src, mnt->dst,


Eric

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible
       [not found]                                             ` <CALCETrWwtFaiaYGLoq4EPkrgcq9nEA2GseVfP3iBkbYZ8NfGPg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-06-06 19:14                                               ` Eric W. Biederman
  0 siblings, 0 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-06-06 19:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kenton Varda, Serge Hallyn, Seth Forshee, Linux API,
	Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages,
	Richard Weinberger, Linux FS Devel, Tejun Heo

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> On Wed, Jun 3, 2015 at 2:15 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>
>> Not allowing programs to clear nosuid, nodev, and noexec on new mounts
>> of sysfs or proc will cause lxc and libvirt-lxc to fail to start (a
>> regression).  There are no device nodes or executables on sysfs or
>> proc today which means clearing these flags is harmless today.
>>
>> Instead of failing the fresh mounts of sysfs and proc, emit a warning
>> when these flags are improperly cleared.  We only reach this point
>> because lxc and libvirt-lxc clear mount flags they had not
>> intended to.
>>
>> In a couple of kernel releases when lxc and libvirt-lxc have been
>> fixed we can start failing fresh mounts of proc and sysfs that clear
>> nosuid, nodev and noexec.  Userspace clearly means to enforce those
>> attributes and historically they have avoided bugs.
>
> At the very least, I think this should be folded in so that the ABI
> doesn't break in the middle of the series.

Nothing in any of these patches has ever broken the ABI.  The bits have
always been interpreted with the same meaning.

I have been going back and forth on exactly the best way to handle this
because I don't like breaking working executables even for valid
reasons.

I think I have finally reached my personal peace on this issue.

Not requiring the presence of nosuid and noexec on a fresh mount of proc
and sysfs if the original mount has nosuid or noexec is a security issue
as what proc and sysfs implement in the future can not be known.
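
The rule in the paragraph above can be sketched in userspace C as
follows.  This is an illustration under stated assumptions, not the
kernel's implementation: the function names and the LOCKED_CANDIDATES
mask are hypothetical, and nodev is left out because, as noted earlier
in the thread, it is being added implicitly.

```c
/* Sketch only: may a fresh mount use flags that differ from an existing one? */
#include <assert.h>
#include <sys/mount.h>

/* Restrictions that a fresh mount of proc or sysfs should not drop. */
#define LOCKED_CANDIDATES (MS_RDONLY | MS_NOSUID | MS_NOEXEC)

/* Restrictive flags set on the existing mount but missing from the request. */
unsigned long cleared_flags(unsigned long existing, unsigned long requested)
{
    return existing & ~requested & LOCKED_CANDIDATES;
}

/* Returns 1 if the fresh mount keeps every restriction, 0 otherwise. */
int fresh_mount_allowed(unsigned long existing, unsigned long requested)
{
    /* This check is about new mounts; a remount request is out of scope. */
    assert((requested & MS_REMOUNT) == 0);
    return cleared_flags(existing, requested) == 0;
}
```

A caller could use cleared_flags() to report which attribute was
dropped, and fresh_mount_allowed() to decide between warning and
failing.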

The one possible way to remedy this is to implicitly add nosuid and
noexec as appropriate.  Unfortunately that would break the ABI, as it
changes the interpretation of the bits in the userspace interface, and
the day proc or sysfs changes and we honestly and truly want to enable
suid executables on proc or sysfs we would not be able to. :( So
implicitly adding attributes is out.

As the current implementations of proc and sysfs are known I agree
it does not make sense to backport the enforcement of nosuid and
noexec.  So I have split the patch.  See my for-testing branch
and shortly my for-next branch.

It only takes two- or three-line patches in the affected userspace
executables, and a 5 minute test.  So a warning printk does not actually
make sense.

If the authors of lxc and libvirt-lxc have not taken the time to fix
their code by the time this code lands in a stable release (in 2 months
or so) no amount of other warnings are going to be enough.

Eric

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                                           ` <87h9qo6la9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-06-04  7:34                                             ` Eric W. Biederman
@ 2015-06-16 12:23                                             ` Daniel P. Berrange
  1 sibling, 0 replies; 85+ messages in thread
From: Daniel P. Berrange @ 2015-06-16 12:23 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Seth Forshee, Linux API, Linux Containers,
	Serge Hallyn, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel,
	Tejun Heo

On Thu, Jun 04, 2015 at 01:27:10AM -0500, Eric W. Biederman wrote:
> Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> writes:
> 
> > On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote:
> >> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
> >> 
> >> > One option would be to break the nosuid, nodev, and noexec parts into
> >> > their own patch and then avoid tagging that patch for -stable if at
> >> > all possible.  It would be nice to avoid another -stable ABI break if
> >> > at all possible.
> >> 
> >> So I don't think we actually have anything that could be called an ABI
> >> break in the whole mess, but it is definitely a behavioral change that
> >> is a regression for lxc and libvirt-lxc that prevents them from starting.
> >> 
> >> nodev does not actually matter because of the implicit silliness that
> >> is being added right now.
> >> 
> >> We do want those programs fixed and after those programs are fixed we
> >> can safely begin failing mount when those attributes are being cleared
> >> in a fresh mount.
> >> 
> >> So it looks to me like the best thing to do is to print a warning
> >> whenever lxc or libvirt-lxc gets it wrong, which should ensure the
> >> authors are sufficiently pestered that in a kernel release or 3 we can
> >> begin enforcing those attributes.  Especially as the discussion on the
> >> fix for those applications has already begun.
> >
> > "pestering" never works, look at some of the SCSI drivers for examples
> > of how a distro will just patch out the "warning this driver is using an
> > old api and needs to be fixed" messages.
> 
> > You can't break stuff like this, people will get upset :(
> 
> A) To the best of my knowledge there are two programs on the face of the
>    planet where this matters. (lxc and libvirt-lxc)
> 
> B) The code in those two programs is buggy.  That is the code in those
>    two programs does not do what the authors intended.  That is fixing
>    those programs is something that should be done regardless of what
>    I do in the kernel.  I have already reached out to the developers of
>    those programs.  The pestering in the kernel is a form of reminder,
>    not the primary source of communication.
> 
> C) These bugs really are security holes.  Currently they do not appear
>    exploitable (thank goodness) but they are security holes.
> 
>    Since they are not currently exploitable it does make sense
>    to give people a little time to get their act together.
> 
>    The bugs are larger than the case that is being hit here,
>    this is just where they are noticed.
> 
> D) Letting people know that there is a problem as part of a larger
>    effort has actually worked for me.  Distros have stopped enabling
>    the sysctl system call.
> 
> E) Given that I have not audited sysfs and proc closely in recent years
>    I may actually be wrong.  Those bugs may actually be exploitable.
>    All it takes is chmod to be supported on one file that can be made
>    executable.  That bug has existed in the past and I don't doubt
>    someone will overlook something and we will see the bug again in the
>    future.
> 
> So it is my best judgment that I disable the code that stops
> containers from starting and just making it a warning (for now).
> Then in a release or so I start failing these operations instead of
> warning.
> 
> This is the most fair and reasonable I can see to be.

While I generally like & support the kernel standard that userspace must
never be broken, as libvirt LXC maintainer I think what Eric proposes is
acceptable from the libvirt POV.

We'll get the fix into libvirt LXC in this month's release and backport
it to our stable branches. So as long as there are a few months/releases
grace period between this being a kernel warning and it turning into a
hard error, libvirt users will have the fix already, or at least have it
easily available to them.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                             ` <55678CCA.80807-/L3Ra7n9ekc@public.gmane.org>
@ 2015-06-16 12:30                               ` Daniel P. Berrange
  0 siblings, 0 replies; 85+ messages in thread
From: Daniel P. Berrange @ 2015-06-16 12:30 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Eric W. Biederman, Kenton Varda, Greg Kroah-Hartman,
	Linux Containers, Serge Hallyn, Andy Lutomirski, Seth Forshee,
	Michael Kerrisk-manpages, Linux API, Linux FS Devel, Tejun Heo

On Thu, May 28, 2015 at 11:46:50PM +0200, Richard Weinberger wrote:
> Am 28.05.2015 um 23:32 schrieb Eric W. Biederman:
> > Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:
> > 
> >> Am 28.05.2015 um 21:57 schrieb Eric W. Biederman:
> >>>> FWIW, it breaks also libvirt-lxc:
> >>>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted
> >>>
> >>> Interesting.  I had not anticipated a failure there?  And it is failing
> >>> in remount?  Oh that is interesting.
> >>>
> >>> That implies that there is some flag of the original mount of /proc that
> >>> the remount of /proc/sys is clearing, and that previously 
> >>>
> >>> The flags specified are current rdonly,remount,bind so I expect there
> >>> are some other flags on proc that libvirt-lxc is clearing by accident
> >>> and we did not fail before because the kernel was not enforcing things.
> >>
> >> Please see:
> >> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933
> >> lxcContainerMountBasicFS()
> >>
> >> and:
> >> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850
> >> lxcBasicMounts
> >>
> >>> What are the mount flags in a working libvirt-lxc?
> >>
> >> See:
> >> test1:~ # cat /proc/self/mountinfo
> >> 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> >> 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw
> > 
> >> If you need more info, please let me know. :-)
> > 
> > Oh interesting I had not realized libvirt-lxc had grown an unprivileged
> > mode using user namespaces.
> 
> Yep. It works quite well. I've migrated all my containers from lxc
> to libvirt-lxc because libvirt-lxc had a working user-namespace
> implementation before lxc.
> 
> > This does appear to be a classic remount bug, where you are not
> > preserving the permissions.  It appears the fact that the code
> > failed to enforce locked permissions on the fresh mount of proc
> > was hiding this bug until now.
> > 
> > I expect what you actually want is the code below:
> > 
> > diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> > index 9a9ae5c2aaf0..f008a7484bfe 100644
> > --- a/src/lxc/lxc_container.c
> > +++ b/src/lxc/lxc_container.c
> > @@ -850,7 +850,7 @@ typedef struct {
> >  
> >  static const virLXCBasicMountInfo lxcBasicMounts[] = {
> >      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> > -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> > +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> >      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
> >      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
> >      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> > 
> > Or possibly just:
> > 
> > diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> > index 9a9ae5c2aaf0..a60ccbd12bfc 100644
> > --- a/src/lxc/lxc_container.c
> > +++ b/src/lxc/lxc_container.c
> > @@ -850,7 +850,7 @@ typedef struct {
> >  
> >  static const virLXCBasicMountInfo lxcBasicMounts[] = {
> >      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> > -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> > +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
> >      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
> >      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
> >      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> 
> I'll test your diff tomorrow with a fresh brain.
> I sent a similar patch to libvirt folks some time ago, looks like it got lost. ;-\
> 
> > As the there is little point in making /proc/sys read-only in a
> > user-namespace, as the permission checks are uid based and no-one should
> > have the global uid 0 in your container.  Making mounting /proc/sys
> > read-only rather pointless.
> 
> Yeah, I've been ranting about that for ages...
> libvirt-lxc contains a lot of cruft to make privileged container
> kind of secure. Some users still fear using the user-namespace.

Yes, we've discussed this before and I'd like to simplify this. The
thing that has been stopping me tackling it has been figuring out a
way to ensure we don't change semantics for existing deployed users.
ie when RHEL-7 rebases to newer libvirt, I don't want existing
containers to suddenly change their setup, because although the
existing setup is sub-optimal, some apps / users might be relying
on its behaviour in ways I can't predict.

I do believe I have figured out a way to allow backwards compatibility
now though, so we should be able to have another stab at simplifying
and removing this cruft for newly deployed containers.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                               ` <87mw0c1x8p.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-06-16 12:31                                 ` Daniel P. Berrange
       [not found]                                   ` <20150616123148.GB18689-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Daniel P. Berrange @ 2015-06-16 12:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Serge Hallyn, Andy Lutomirski, Seth Forshee,
	Linux API, Linux Containers, Greg Kroah-Hartman, Kenton Varda,
	Michael Kerrisk-manpages, Linux FS Devel, Tejun Heo,
	libvir-list-H+wXaHxf7aLQT0dZR+AlfA, Cedric Bosdonnat

On Sat, Jun 06, 2015 at 01:56:54PM -0500, Eric W. Biederman wrote:
> Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:
> 
> > [CC'ing libvirt-lxc folks]
> >
> > Am 28.05.2015 um 23:32 schrieb Eric W. Biederman:
> >> Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:
> >> 
> >>> Am 28.05.2015 um 21:57 schrieb Eric W. Biederman:
> >>>>> FWIW, it breaks also libvirt-lxc:
> >>>>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted
> >>>>
> >>>> Interesting.  I had not anticipated a failure there?  And it is failing
> >>>> in remount?  Oh that is interesting.
> >>>>
> >>>> That implies that there is some flag of the original mount of /proc that
> >>>> the remount of /proc/sys is clearing, and that previously 
> >>>>
> >>>> The flags specified are current rdonly,remount,bind so I expect there
> >>>> are some other flags on proc that libvirt-lxc is clearing by accident
> >>>> and we did not fail before because the kernel was not enforcing things.
> >>>
> >>> Please see:
> >>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933
> >>> lxcContainerMountBasicFS()
> >>>
> >>> and:
> >>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850
> >>> lxcBasicMounts
> >>>
> >>>> What are the mount flags in a working libvirt-lxc?
> >>>
> >>> See:
> >>> test1:~ # cat /proc/self/mountinfo
> >>> 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> >>> 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw
> >> 
> >>> If you need more info, please let me know. :-)
> >> 
> >> Oh interesting I had not realized libvirt-lxc had grown an unprivileged
> >> mode using user namespaces.
> >> 
> >> This does appear to be a classic remount bug, where you are not
> >> preserving the permissions.  It appears the fact that the code
> >> failed to enforce locked permissions on the fresh mount of proc
> >> was hiding this bug until now.
> >> 
> >> I expect what you actually want is the code below:
> >> 
> >> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> >> index 9a9ae5c2aaf0..f008a7484bfe 100644
> >> --- a/src/lxc/lxc_container.c
> >> +++ b/src/lxc/lxc_container.c
> >> @@ -850,7 +850,7 @@ typedef struct {
> >>  
> >>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
> >>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> >> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> >> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> >>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
> >>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
> >>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> >> 
> >> Or possibly just:
> >> 
> >> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> >> index 9a9ae5c2aaf0..a60ccbd12bfc 100644
> >> --- a/src/lxc/lxc_container.c
> >> +++ b/src/lxc/lxc_container.c
> >> @@ -850,7 +850,7 @@ typedef struct {
> >>  
> >>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
> >>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> >> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> >> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
> >>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
> >>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
> >>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> >> 
> >> As the there is little point in making /proc/sys read-only in a
> >> user-namespace, as the permission checks are uid based and no-one should
> >> have the global uid 0 in your container.  Making mounting /proc/sys
> >> read-only rather pointless.
> >
> > Eric, using the patch below I was able to spawn a user-namespace enabled container
> > using libvirt-lxc. :-)
> >
> > I had to:
> > 1. Disable the read-only mount of /proc/sys which is anyway useless in the user-namespace case.
> > 2. Disable the /proc/sys/net/ipv{4,6} bind mounts, this ugly hack is only needed for the non user-namespace case.
> > 3. Remove MS_RDONLY from the sysfs mount (For the non user-namespace case we'd have to keep this, though).
> >
> > Daniel, I'd take this as a chance to disable all the MS_RDONLY games if user-namespace are configured.
> > With Eric's fixes they hurt us. And as I wrote many times before if root within the user-namespace
> > is able to do nasty things in /sys and /proc that's a plain kernel bug which needs fixing. There is no
> > point in mounting these read-only. Except for the case then no user-namespace is used.
> >
> 
> For clarity the patch below appears to be the minimal change needed to
> fix this security issue.
> 
> AKA add mnt_mflags in when remounting something read-only.
> 
> /proc/sys needed to be updated so it had the proper flags to be added
> back in.
> 
> I hope this helps.
> 
> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> index 9a9ae5c2aaf0..11e9514e0761 100644
> --- a/src/lxc/lxc_container.c
> +++ b/src/lxc/lxc_container.c
> @@ -850,7 +850,7 @@ typedef struct {
>  
>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> @@ -1030,7 +1030,7 @@ static int lxcContainerMountBasicFS(bool userns_enabled,
>  
>          if (bindOverReadonly &&
>              mount(mnt_src, mnt->dst, NULL,
> -                  MS_BIND|MS_REMOUNT|MS_RDONLY, NULL) < 0) {
> +                  MS_BIND|MS_REMOUNT|mnt_mflags|MS_RDONLY, NULL) < 0) {
>              virReportSystemError(errno,
>                                   _("Failed to re-mount %s on %s flags=%x"),
>                                   mnt_src, mnt->dst,

Thanks Richard / Eric for the suggested patches. I'll apply Eric's
simplified patch to libvirt now, and backport it to our stable
libvirt branches.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
       [not found]                                   ` <20150616123148.GB18689-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-06-16 12:46                                     ` Richard Weinberger
  0 siblings, 0 replies; 85+ messages in thread
From: Richard Weinberger @ 2015-06-16 12:46 UTC (permalink / raw)
  To: Daniel P. Berrange, Eric W. Biederman
  Cc: Serge Hallyn, Andy Lutomirski, Seth Forshee, Linux API,
	Linux Containers, Greg Kroah-Hartman, Kenton Varda,
	Michael Kerrisk-manpages, Linux FS Devel, Tejun Heo,
	libvir-list-H+wXaHxf7aLQT0dZR+AlfA, Cedric Bosdonnat

Am 16.06.2015 um 14:31 schrieb Daniel P. Berrange:
> Thanks Richard / Eric for the suggested patches. I'll apply Eric's
> simplified patch to libvirt now, and backport it to our stable
> libvirt branches.

Thank you Daniel!

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir
       [not found]     ` <878ucrhxi9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-11 18:44       ` Tejun Heo
  2015-08-11 18:57         ` Eric W. Biederman
  0 siblings, 1 reply; 85+ messages in thread
From: Tejun Heo @ 2015-08-11 18:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger,
	Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber,
	Eric Windisch, Greg Kroah-Hartman

On Thu, May 14, 2015 at 12:36:30PM -0500, Eric W. Biederman wrote:
> 
> This allows for better documentation in the code and
> it allows for a simpler and fully correct version of
> fs_fully_visible to be written.
> 
> The mount points converted and their filesystems are:
> /sys/hypervisor/s390/       s390_hypfs
> /sys/kernel/config/         configfs
> /sys/kernel/debug/          debugfs
> /sys/firmware/efi/efivars/  efivarfs
> /sys/fs/fuse/connections/   fusectl
> /sys/fs/pstore/             pstore
> /sys/kernel/tracing/        tracefs
> /sys/fs/cgroup/             cgroup
> /sys/kernel/security/       securityfs
> /sys/fs/selinux/            selinuxfs
> /sys/fs/smackfs/            smackfs
> 
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

So, this somehow ends up confusing upstart on centos6 based systems
making it fail to mount tmpfs on /sys/fs/cgroup.  It also skips sunrpc
and other mounts are different too.  No idea why at this point.  Can
we please revert this from -stable until we know what's going on?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir
  2015-08-11 18:44       ` Tejun Heo
@ 2015-08-11 18:57         ` Eric W. Biederman
  2015-08-11 19:21           ` Andy Lutomirski
       [not found]           ` <877fp1hcuj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 2 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-08-11 18:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linux Containers, linux-fsdevel, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman

Tejun Heo <tj@kernel.org> writes:

> On Thu, May 14, 2015 at 12:36:30PM -0500, Eric W. Biederman wrote:
>> 
>> This allows for better documentation in the code and
>> it allows for a simpler and fully correct version of
>> fs_fully_visible to be written.
>> 
>> The mount points converted and their filesystems are:
>> /sys/hypervisor/s390/       s390_hypfs
>> /sys/kernel/config/         configfs
>> /sys/kernel/debug/          debugfs
>> /sys/firmware/efi/efivars/  efivarfs
>> /sys/fs/fuse/connections/   fusectl
>> /sys/fs/pstore/             pstore
>> /sys/kernel/tracing/        tracefs
>> /sys/fs/cgroup/             cgroup
>> /sys/kernel/security/       securityfs
>> /sys/fs/selinux/            selinuxfs
>> /sys/fs/smackfs/            smackfs
>> 
>> Cc: stable@vger.kernel.org
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> So, this somehow ends up confusing upstart on centos6 based systems
> making it fail to mount tmpfs on /sys/fs/cgroup.  It also skips sunrpc
> and other mounts are different too.  No idea why at this point.  Can
> we please revert this from -stable until we know what's going on?

*Boggle*

The only time this should prevent anything is when in a container when
you are not global root.  And then only mounting sysfs should be
affected.

The only difference in executed code really should be setting an extra
flag on the kernfs inode.  The kernfs changes will also refuse to add
entries to these directories (but these directories are empty).

If this is causing problems I don't have a problem with a revert, but
reverts take a minute, and this seems to be the first report of this
kind.  Can we take a minute and attempt to get a coherent explanation?

From what little information you have given above, it sounds like something
shifted when you rebuilt the kernel and now a memory stomp is
hitting something else.  It should be a matter of moments to debug this
issue (once a test environment is set up) and see what is wrong, and then
we can act intelligently.  Tracing a single system call is not difficult.

If there really is some weird issue I want to know what it is.

Eric


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir
  2015-08-11 18:57         ` Eric W. Biederman
@ 2015-08-11 19:21           ` Andy Lutomirski
       [not found]             ` <CALCETrXE=fKa3XkEEo6y2=ZNtsuBfX=kaoyDwiP0C2BwqKJWjw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]           ` <877fp1hcuj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  1 sibling, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-08-11 19:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Tejun Heo, Linux Containers, Linux FS Devel, Linux API,
	Serge E. Hallyn, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman

On Tue, Aug 11, 2015 at 11:57 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Tejun Heo <tj@kernel.org> writes:
>
>> On Thu, May 14, 2015 at 12:36:30PM -0500, Eric W. Biederman wrote:
>>>
>>> This allows for better documentation in the code and
>>> it allows for a simpler and fully correct version of
>>> fs_fully_visible to be written.
>>>
>>> The mount points converted and their filesystems are:
>>> /sys/hypervisor/s390/       s390_hypfs
>>> /sys/kernel/config/         configfs
>>> /sys/kernel/debug/          debugfs
>>> /sys/firmware/efi/efivars/  efivarfs
>>> /sys/fs/fuse/connections/   fusectl
>>> /sys/fs/pstore/             pstore
>>> /sys/kernel/tracing/        tracefs
>>> /sys/fs/cgroup/             cgroup
>>> /sys/kernel/security/       securityfs
>>> /sys/fs/selinux/            selinuxfs
>>> /sys/fs/smackfs/            smackfs
>>>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>
>> So, this somehow ends up confusing upstart on centos6 based systems
>> making it fail to mount tmpfs on /sys/fs/cgroup.  It also skips sunrpc
>> and other mounts are different too.  No idea why at this point.  Can
>> we please revert this from -stable until we know what's going on?
>
> *Boggle*
>
> The only time this should prevent anything is when in a container when
> you are not global root.  And then only mounting sysfs should be
> affected.

Before:

open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK,
0666) = -1 EACCES (Permission denied)


After:

open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK,
0666) = -1 ENOENT (No such file or directory)

Something broke.  I don't know whether CentOS cares about that change,
but there could be other odd side effects.
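
A minimal sketch in C of how the two traces above could be compared on a given kernel. The helper name try_create is hypothetical; it issues the same open(2) call as in the traces and hands back the resulting errno so the caller can tell EACCES from ENOENT.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: attempt to create `path` with the flags shown
 * in the traces.  Returns 0 if the file could be opened, otherwise
 * the errno value reported by open(2).
 */
int try_create(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_NOCTTY | O_NONBLOCK, 0666);
    if (fd < 0)
        return errno;

    close(fd);
    return 0;
}
```

Going by the traces, try_create("/sys/kernel/debug/asdf") would presumably return EACCES before the change and ENOENT after it, when run the same way as the original test.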

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir
       [not found]           ` <877fp1hcuj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-11 20:11             ` Tejun Heo
       [not found]               ` <CAOS58YOHU8SFv4UXeBRr4t88UU=DXQCPg2HU_dMBmgM7WBB1zQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Tejun Heo @ 2015-08-11 20:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski,
	Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger,
	LINUXFS-ML

Hey,

On Tue, Aug 11, 2015 at 2:57 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> So, this somehow ends up confusing upstart on centos6 based systems
>> making it fail to mount tmpfs on /sys/fs/cgroup.  It also skips sunrpc
>> and other mounts are different too.  No idea why at this point.  Can
>> we please revert this from -stable until we know what's going on?
>
> *Boggle*
>
> The only time this should prevent anything is when in a container when
> you are not global root.  And then only mounting sysfs should be
> affected.

This is just plain boot. No namespace involved.

> The only difference in executed code really should be setting an extra
> flag on the kernfs, inode.  The kernfs changes will also refuse to add
> entries to these directories (but these directories are empty).

Why do we have this in -stable then? Is this part of a larger fix?

> If this is causing problems I don't have a problem with a revert but
> reverts take a minute, and this seems to be the first report of this
> kind.  Can we take a minute and attempt to get a coherent explanation.
>
> From what little information you given above it sounds like something
> shifted and when you rebuilt the kernel and now a memory stomp is
> hitting something else.  It should be a matter of moments to debug this

I don't think it's a random memory stomping thing. I reverted the
commit from two different kernels and the result was always
consistent.

> issue (once a test environment is setup), and see what is wrong and then
> we can act intelligently.  Tracing a single system call is not difficult.

I'm already out today so it'll have to wait till tomorrow.

> If there really is some weird issue I want to know what it is.

Sure, but you wanna do that in -stable?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir
       [not found]               ` <CAOS58YOHU8SFv4UXeBRr4t88UU=DXQCPg2HU_dMBmgM7WBB1zQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-08-12  0:37                 ` Eric W. Biederman
       [not found]                   ` <87fv3pe3zn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-08-12  0:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linux Containers, LINUXFS-ML, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman

Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:

> Hey,
>
> On Tue, Aug 11, 2015 at 2:57 PM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>> So, this somehow ends up confusing upstart on centos6 based systems
>>> making it fail to mount tmpfs on /sys/fs/cgroup.  It also skips sunrpc
>>> and other mounts are different too.  No idea why at this point.  Can
>>> we please revert this from -stable until we know what's going on?
>>
>> *Boggle*
>>
>> The only time this should prevent anything is when in a container when
>> you are not global root.  And then only mounting sysfs should be
>> affected.
>
> This is just plain boot. No namespace involved.
>
>> The only difference in executed code really should be setting an extra
>> flag on the kernfs, inode.  The kernfs changes will also refuse to add
>> entries to these directories (but these directories are empty).
>
> Why do we have this in -stable then? Is this part of a larger fix?

It is. This patch is part of the prep work to prevent unprivileged users
from mounting sysfs (using user namespace permissions) when they should
not be allowed to.

>> If this is causing problems I don't have a problem with a revert but
>> reverts take a minute, and this seems to be the first report of this
>> kind.  Can we take a minute and attempt to get a coherent explanation.
>>
>> It should be a matter of moments to debug this
>> issue (once a test environment is setup), and see what is wrong and then
>> we can act intelligently.  Tracing a single system call is not difficult.
>
> I'm already out today so it'll have to wait till tomorrow.
>
>> If there really is some weird issue I want to know what it is.
>
> Sure, but you wanna do that in -stable?

Before fixing anything I want a bug report that is clear enough
to be reproducible.

I just went and attempted to reproduce this on a RHEL6 workstation
(aka my work laptop), using today's 4.2.0-rc6+ aka
edf15b4d4b01b565cb5f4fd2e2d08940b9f92e2f, and all of the mounts in
/proc/self/mounts are the same between 4.2.0-rc6 and the RHEL6 stock
2.6.32-504.30.3.el6.x86_64, including the cgroups mounted on /cgroup.

Which means that I don't have any reason to believe that normal CentOS 6
is broken.

Which -stable kernel are you having problems with?  Perhaps it was
a broken backport?

Is it possible this is a local CentOS 6 hack that is breaking?
Perhaps a patch you apply on top of your -stable kernel?

Certainly with cgroups expected to be mounted at /sys/fs/cgroup there
has clearly been at least one change from the stock configuration.

I think it is a little less serious if stock CentOS 6 doesn't have
problems.  Unless it is a conflict of kernel patches I definitely think
whatever it is needs to be fixed.

Eric

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir
       [not found]             ` <CALCETrXE=fKa3XkEEo6y2=ZNtsuBfX=kaoyDwiP0C2BwqKJWjw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-08-12  0:58               ` Eric W. Biederman
       [not found]                 ` <87mvxxcogp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-08-12  0:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tejun Heo, Linux Containers, Linux FS Devel, Linux API,
	Serge E. Hallyn, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> On Tue, Aug 11, 2015 at 11:57 AM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>
>> *Boggle*
>>
>> The only time this should prevent anything is when in a container when
>> you are not global root.  And then only mounting sysfs should be
>> affected.
>
> Before:
>
> open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK,
> 0666) = -1 EACCES (Permission denied)
>
>
> After:
>
> open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK,
> 0666) = -1 ENOENT (No such file or directory)
>
> Something broke.  I don't know whether CentOS cares about that change,
> but there could be other odd side effects.

Thanks for pointing this out.  I don't know if broke is quite the right
word for a change in error codes on lookup failure, but I agree it is a
difference that could have affected something.

Empty proc dirs actually return -ENOENT in this situation, so it is a
little fuzzy which is the best behavior to use.

Creating a negative dentry and then letting vfs_create fail may be
the better way to go.

Negative dentries are weird enough that I would prefer not to have code
that creates negative dentries.  They could easily be a lurking trap
for some filesystems dentry operations.

The patch below is enough to change the error code if someone who can
reproduce this wants to try this.

Eric

diff --git a/fs/libfs.c b/fs/libfs.c
index 102edfd39000..3a452a485cbf 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1109,7 +1109,7 @@ EXPORT_SYMBOL(simple_symlink_inode_operations);
  */
 static struct dentry *empty_dir_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
 {
-       return ERR_PTR(-ENOENT);
+       return NULL;
 }
 
 static int empty_dir_getattr(struct vfsmount *mnt, struct dentry *dentry,

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir
       [not found]                   ` <87fv3pe3zn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-12  3:58                     ` Eric W. Biederman
       [not found]                       ` <87a8txb1k8.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-08-12  3:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski,
	Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger,
	LINUXFS-ML

ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:

> I just went and attempted to reproduce this, and on RHEL6 workstation
> (aka my work laptop), using the todays 4.2.0-rc6+ aka
> edf15b4d4b01b565cb5f4fd2e2d08940b9f92e2f and all of the mounts in
> /proc/self/mounts are the same between 4.2.0-rc6 and the RHEL6 stock
> 2.6.32-504.30.3.el6.x86_64, including the cgroups mounted on /cgroup.

I built a few more kernels just to see if this was some weird backport
thing. The kernels 3.10.86, 3.14.58, 3.18.20, and 4.1.5 all boot and
mount their cgroup filesystems just fine.  Granted I kept having to
smack the memory cgroup into being compiled in as the config options
kept changing but otherwise I have not seen any problems.

So I am very surprised you are having problems.

Eric

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir
       [not found]                       ` <87a8txb1k8.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-12  4:04                         ` Eric W. Biederman
       [not found]                           ` <871tf9b19v.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-08-12  4:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski,
	Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger,
	LINUXFS-ML

ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:

> ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:
>
>> I just went and attempted to reproduce this, and on RHEL6 workstation
>> (aka my work laptop), using the todays 4.2.0-rc6+ aka
>> edf15b4d4b01b565cb5f4fd2e2d08940b9f92e2f and all of the mounts in
>> /proc/self/mounts are the same between 4.2.0-rc6 and the RHEL6 stock
>> 2.6.32-504.30.3.el6.x86_64, including the cgroups mounted on /cgroup.
>
> I built a few more kernels just to see if this was some weird backport
> thing. The kernels 3.10.86, 3.14.58, 3.18.20, and 4.1.5 all boot and
> mount their cgroup filesystems just fine.  Granted I kept having to
> smack the memory cgroup into being compiled in as the config options
> kept changing but otherwise I have not seen any problems.
>
> So I am very surprised you are having problems.

Although I guess I could have saved myself some time by noticing that
4.1.5 was the only one of the kernels with the change backported into
it.  *Shrug*

I don't see the problem and I don't know where to look to see why you
are having problems.

Eric

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir
       [not found]                           ` <871tf9b19v.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-12 19:15                             ` Tejun Heo
       [not found]                               ` <20150812191515.GA4496-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Tejun Heo @ 2015-08-12 19:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, LINUXFS-ML, Linux API, Serge E. Hallyn,
	Andy Lutomirski, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman

Hello, Eric.

On Tue, Aug 11, 2015 at 11:04:28PM -0500, Eric W. Biederman wrote:
> ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:
> 
> > ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:
> >
> >> I just went and attempted to reproduce this, and on RHEL6 workstation
> >> (aka my work laptop), using the todays 4.2.0-rc6+ aka
> >> edf15b4d4b01b565cb5f4fd2e2d08940b9f92e2f and all of the mounts in
> >> /proc/self/mounts are the same between 4.2.0-rc6 and the RHEL6 stock
> >> 2.6.32-504.30.3.el6.x86_64, including the cgroups mounted on /cgroup.
> >
> > I built a few more kernels just to see if this was some weird backport
> > thing. The kernels 3.10.86, 3.14.58, 3.18.20, and 4.1.5 all boot and
> > mount their cgroup filesystems just fine.  Granted I kept having to
> > smack the memory cgroup into being compiled in as the config options
> > kept changing but otherwise I have not seen any problems.
> >
> > So I am very surprised you are having problems.
> 
> Although I guess I could have saved myself some time by noticing that
> 4.1.5 was the only one of the kernels with the change backported into
> it.  *Shrug*
> 
> I don't see the problem and I don't know where to look to see why you
> are having problems.

lol, this wasn't upstart but an internal tool which sets up a custom
cgroup hierarchy and the problem was the size of the directory inode
reported by stat(2).  It's kinda hilarious but that's what the tool
was depending on to tell whether tmpfs is mounted on /sys/fs/cgroup or
not.  A kernfs directory reports zero as its inode size while tmpfs
reports some non-zero number, so the tool did stat(2) on
/sys/fs/cgroup and mounted tmpfs iff size is zero to avoid mounting
tmpfs multiple times.  Now, make_empty_dir_inode() sets i_size to 2
and the tool thinks that tmpfs is already mounted there.

It's an icky behavior but it'd be better to maintain the original
behavior.  We should be able to set size to zero for empty dirs,
right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir
       [not found]                 ` <87mvxxcogp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-12 20:00                   ` Tejun Heo
  2015-08-12 20:27                     ` Eric W. Biederman
  0 siblings, 1 reply; 85+ messages in thread
From: Tejun Heo @ 2015-08-12 20:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Linux Containers, Linux FS Devel, Linux API,
	Serge E. Hallyn, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman

On Tue, Aug 11, 2015 at 07:58:14PM -0500, Eric W. Biederman wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
> 
> > On Tue, Aug 11, 2015 at 11:57 AM, Eric W. Biederman
> > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> >>
> >> *Boggle*
> >>
> >> The only time this should prevent anything is when in a container when
> >> you are not global root.  And then only mounting sysfs should be
> >> affected.
> >
> > Before:
> >
> > open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK,
> > 0666) = -1 EACCES (Permission denied)
> >
> >
> > After:
> >
> > open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK,
> > 0666) = -1 ENOENT (No such file or directory)
> >
> > Something broke.  I don't know whether CentOS cares about that change,
> > but there could be other odd side effects.
> 
> Thanks for pointing this out.  I don't know if broke is quite the right
> word for a change in error codes on lookup failure, but I agree it is a
> difference that could have affected something.
> 
> The behavior of empty proc dirs actually return -ENOENT in this
> situation and so it is a little fuzzy about which is the best behavior
> to use.
> 
> Creating a negative dentry and then letting vfs_create fail may be
> the better way to go.
> 
> Negative dentries are weird enough that I would prefer not to have code
> that creates negative dentries.  They could easily be a lurking trap
> for some filesystems dentry operations.
> 
> The patch below is enough to change the error code if someone who can
> reproduce this wants to try this.
> 
> Eric
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 102edfd39000..3a452a485cbf 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -1109,7 +1109,7 @@ EXPORT_SYMBOL(simple_symlink_inode_operations);
>   */
>  static struct dentry *empty_dir_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
>  {
> -       return ERR_PTR(-ENOENT);
> +       return NULL;

And let's please restore this too.  Sentiments about negative dentries
aside, it's outright wrong to report -ENOENT on creat.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH] fs: Set the size of empty dirs to 0.
       [not found]                               ` <20150812191515.GA4496-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2015-08-12 20:07                                 ` Eric W. Biederman
       [not found]                                   ` <87mvxw46fc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-08-12 20:07 UTC (permalink / raw)
  To: Linux Containers
  Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger, LINUXFS-ML,
	Tejun Heo


Before the make_empty_dir_inode calls were introduced into proc, sysfs,
and sysctl, those directories reported an i_size of 0 when stat'ed.
make_empty_dir_inode started reporting an i_size of 2.  At least one
userspace application depended on stat returning i_size of 0.  So modify
make_empty_dir_inode to cause an i_size of 0 to be reported for these
directories.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Reproted-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---

I have tested this and will queue this up shortly.

 fs/libfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 102edfd39000..c7cbfb092e94 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1185,7 +1185,7 @@ void make_empty_dir_inode(struct inode *inode)
 	inode->i_uid = GLOBAL_ROOT_UID;
 	inode->i_gid = GLOBAL_ROOT_GID;
 	inode->i_rdev = 0;
-	inode->i_size = 2;
+	inode->i_size = 0;
 	inode->i_blkbits = PAGE_SHIFT;
 	inode->i_blocks = 0;
 
-- 
2.2.1


* Re: [PATCH] fs: Set the size of empty dirs to 0.
       [not found]                                   ` <87mvxw46fc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-12 20:18                                     ` Tejun Heo
  0 siblings, 0 replies; 85+ messages in thread
From: Tejun Heo @ 2015-08-12 20:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski,
	Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger,
	LINUXFS-ML

On Wed, Aug 12, 2015 at 03:07:19PM -0500, Eric W. Biederman wrote:
> 
> Before the make_empty_dir_inode calls were introduced into proc, sysfs,
> and sysctl, those directories reported an i_size of 0 when stat'ed.
> make_empty_dir_inode started reporting an i_size of 2.  At least one
> userspace application depended on stat returning i_size of 0.  So modify
> make_empty_dir_inode to cause an i_size of 0 to be reported for these
> directories.
> 
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Reproted-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
    ^^^
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

Acked-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Thanks.

-- 
tejun


* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir
  2015-08-12 20:00                   ` Tejun Heo
@ 2015-08-12 20:27                     ` Eric W. Biederman
       [not found]                       ` <87r3n82qxd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 85+ messages in thread
From: Eric W. Biederman @ 2015-08-12 20:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andy Lutomirski, Linux Containers, Linux FS Devel, Linux API,
	Serge E. Hallyn, Richard Weinberger, Kenton Varda,
	Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch,
	Greg Kroah-Hartman

Tejun Heo <tj@kernel.org> writes:

> On Tue, Aug 11, 2015 at 07:58:14PM -0500, Eric W. Biederman wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>> 
>> > On Tue, Aug 11, 2015 at 11:57 AM, Eric W. Biederman
>> > <ebiederm@xmission.com> wrote:
>> >>
>> >> *Boggle*
>> >>
>> >> The only time this should prevent anything is when in a container when
>> >> you are not global root.  And then only mounting sysfs should be
>> >> affected.
>> >
>> > Before:
>> >
>> > open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK,
>> > 0666) = -1 EACCES (Permission denied)
>> >
>> >
>> > After:
>> >
>> > open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK,
>> > 0666) = -1 ENOENT (No such file or directory)
>> >
>> > Something broke.  I don't know whether CentOS cares about that change,
>> > but there could be other odd side effects.
>> 
>> Thanks for pointing this out.  I don't know if broke is quite the right
>> word for a change in error codes on lookup failure, but I agree it is a
>> difference that could have affected something.
>> 
>> Empty proc dirs actually return -ENOENT in this situation, so it is
>> a little fuzzy which behavior is the best one to use.
>> 
>> Creating a negative dentry and then letting vfs_create fail may be
>> the better way to go.
>> 
>> Negative dentries are weird enough that I would prefer not to have code
>> that creates negative dentries.  They could easily be a lurking trap
>> for some filesystem's dentry operations.
>> 
>> The patch below is enough to change the error code if someone who can
>> reproduce this wants to try this.
>> 
>> Eric
>> 
>> diff --git a/fs/libfs.c b/fs/libfs.c
>> index 102edfd39000..3a452a485cbf 100644
>> --- a/fs/libfs.c
>> +++ b/fs/libfs.c
>> @@ -1109,7 +1109,7 @@ EXPORT_SYMBOL(simple_symlink_inode_operations);
>>   */
>>  static struct dentry *empty_dir_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
>>  {
>> -       return ERR_PTR(-ENOENT);
>> +       return NULL;
>
> And let's please restore this too.  Sentiments about negative dentries
> aside, it's outright wrong to report -ENOENT on creat.

proc has always reported -ENOENT. sysfs is the odd one out.

I am not completely certain that the trivial patch above does not
introduce a leak, a NULL pointer dereference, or something else nasty
when the code is hit.

So far it seems that no one cares.  And since the change is brittle I am
not inclined to mess with it this week, as I have other demands on my
limited review bandwidth right now.

Eric
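
[Editorial sketch, not kernel code.]  The two candidate return values of
empty_dir_lookup() can be modelled in userspace to show why the error
code changes.  ERR_PTR(), PTR_ERR() and IS_ERR() below are simplified
re-implementations of the kernel helpers of the same name; model_create()
and the two lookup functions are hypothetical names invented here, and
the assumption that the create step fails with -EACCES stands in for
"letting vfs_create fail" as described in the thread:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace copies of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

typedef void *(*lookup_fn)(void);

/* Models the lookup before the proposed diff: fail the lookup outright. */
void *lookup_enoent(void) { return ERR_PTR(-ENOENT); }

/* Models the lookup after the proposed diff: NULL means "name absent". */
void *lookup_negative(void) { return NULL; }

/*
 * Toy model of open(O_CREAT) on a permanently empty directory.
 * Assumption: when lookup reports the name as absent, the create step
 * runs and is refused with -EACCES because the directory cannot hold
 * new entries.
 */
int model_create(lookup_fn lookup)
{
	void *res = lookup();

	assert(lookup != NULL);
	if (IS_ERR(res))
		return (int)PTR_ERR(res);	/* lookup error is returned as-is */
	return -EACCES;				/* create attempted and refused */
}
```

In this model, model_create(lookup_enoent) yields -ENOENT and
model_create(lookup_negative) yields -EACCES, matching the "After" and
"Before" strace output quoted above.  It does not model dentry
lifetimes, which is where the leak or NULL dereference concerns would
arise.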


* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir
       [not found]                       ` <87r3n82qxd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-08-12 21:05                         ` Tejun Heo
  0 siblings, 0 replies; 85+ messages in thread
From: Tejun Heo @ 2015-08-12 21:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski,
	Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger,
	Linux FS Devel

Hello,

On Wed, Aug 12, 2015 at 03:27:26PM -0500, Eric W. Biederman wrote:
> proc has always reported -ENOENT. sysfs is the odd one out.

Hmm... open(2) is clear about failure modes and ENOENT doesn't fit the
bill here.  Maintaining the behavior for proc for backward
compatibility is fine but I don't think it's appropriate to change
behaviors on other filesystems which were behaving correctly
especially through changes which got routed through -stable.

       ENOENT O_CREAT is not set and the named file does not exist.  Or, a
              directory component in pathname does not exist or is a
              dangling symbolic link.

       ENOENT pathname refers to a nonexistent directory, O_TMPFILE and one
              of O_WRONLY or O_RDWR were specified in flags, but this kernel
              version does not provide the O_TMPFILE functionality.
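
[Editorial sketch, not from the original message.]  The error code under
discussion can be checked from userspace by repeating the open(2) call
from the strace output quoted earlier in the thread.  The helper name
create_errno() is hypothetical; the flags and mode are copied from that
strace line:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to create `path` with the same flags as the
 * strace output in this thread.  Returns 0 on success (the file is
 * created and closed) or the errno value reported by open(2).
 */
int create_errno(const char *path)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_NOCTTY | O_NONBLOCK, 0666);
	if (fd >= 0) {
		close(fd);
		return 0;
	}
	return errno;
}
```

A missing directory component is the documented ENOENT case, so
create_errno("/no-such-dir/asdf") returns ENOENT on any kernel.  For
"/sys/kernel/debug/asdf" on an unmounted debugfs mount point, the result
(EACCES or ENOENT) depends on which behavior the running kernel has.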

> I am not completely certain that the trivial patch above does not
> introduce a leak, a NULL pointer dereference, or something else nasty
> when the code is hit.
> 
> So far it seems that no one cares.  And since the change is brittle I am
> not inclined to mess with it this week, as I have other demands on my
> limited review bandwidth right now.

Sure, it isn't "today" urgent but let's please restore the original
behavior before the new behavior gets too widespread.

Thanks.

-- 
tejun


end of thread, other threads:[~2015-08-12 21:05 UTC | newest]

Thread overview: 85+ messages
2015-05-14 17:30 [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts Eric W. Biederman
2015-05-14 17:33 ` [CFT][PATCH 04/10] fs: Add helper functions for permanently empty directories Eric W. Biederman
2015-05-14 17:33 ` [CFT][PATCH 05/10] sysctl: Allow creating " Eric W. Biederman
     [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-05-14 17:31   ` [CFT][PATCH 01/10] mnt: Refactor the logic for mounting sysfs and proc in a user namespace Eric W. Biederman
2015-05-14 17:32   ` [CFT][PATCH 02/10] mnt: Modify fs_fully_visible to deal with mount attributes Eric W. Biederman
2015-05-14 17:32   ` [CFT][PATCH 03/10] vfs: Ignore unlocked mounts in fs_fully_visible Eric W. Biederman
2015-05-14 17:34   ` [CFT][PATCH 06/10] proc: Allow creating permanently empty directories Eric W. Biederman
2015-05-14 17:34   ` [CFT][PATCH 07/10] kernfs: Add support for always " Eric W. Biederman
2015-05-14 17:35   ` [CFT][PATCH 08/10] sysfs: Add support for permanently " Eric W. Biederman
     [not found]     ` <87fv6zhxkp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-05-14 20:31       ` Greg Kroah-Hartman
     [not found]         ` <20150514203131.GB16416-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2015-05-14 21:33           ` Eric W. Biederman
2015-05-14 17:36   ` [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir Eric W. Biederman
     [not found]     ` <878ucrhxi9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-11 18:44       ` Tejun Heo
2015-08-11 18:57         ` Eric W. Biederman
2015-08-11 19:21           ` Andy Lutomirski
     [not found]             ` <CALCETrXE=fKa3XkEEo6y2=ZNtsuBfX=kaoyDwiP0C2BwqKJWjw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-08-12  0:58               ` Eric W. Biederman
     [not found]                 ` <87mvxxcogp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-12 20:00                   ` Tejun Heo
2015-08-12 20:27                     ` Eric W. Biederman
     [not found]                       ` <87r3n82qxd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-12 21:05                         ` Tejun Heo
     [not found]           ` <877fp1hcuj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-11 20:11             ` Tejun Heo
     [not found]               ` <CAOS58YOHU8SFv4UXeBRr4t88UU=DXQCPg2HU_dMBmgM7WBB1zQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-08-12  0:37                 ` Eric W. Biederman
     [not found]                   ` <87fv3pe3zn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-12  3:58                     ` Eric W. Biederman
     [not found]                       ` <87a8txb1k8.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-12  4:04                         ` Eric W. Biederman
     [not found]                           ` <871tf9b19v.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-12 19:15                             ` Tejun Heo
     [not found]                               ` <20150812191515.GA4496-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-08-12 20:07                                 ` [PATCH] fs: Set the size of empty dirs to 0 Eric W. Biederman
     [not found]                                   ` <87mvxw46fc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-08-12 20:18                                     ` Tejun Heo
2015-05-14 17:37   ` [CFT][PATCH 10/10] mnt: Update fs_fully_visible to test for permanently empty directories Eric W. Biederman
2015-05-14 20:29 ` [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts Greg Kroah-Hartman
2015-05-14 21:10   ` Eric W. Biederman
     [not found]     ` <87oalmg90j.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-05-15  6:26       ` Andy Lutomirski
     [not found]         ` <CALCETrU1yxcDfv4YV3wVpWMAdiOOsSUFOPUpFAN-mVA4M-OxdQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-15  6:55           ` Eric W. Biederman
2015-05-16  2:05 ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Eric W. Biederman
2015-05-16  2:06   ` [CFT][PATCH 02/10] mnt: Modify fs_fully_visible to deal with mount attributes Eric W. Biederman
     [not found]   ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-05-16  2:06     ` [CFT][PATCH 01/10] mnt: Refactor the logic for mounting sysfs and proc in a user namespace Eric W. Biederman
2015-05-16  2:07     ` [CFT][PATCH 03/10] vfs: Ignore unlocked mounts in fs_fully_visible Eric W. Biederman
2015-05-16  2:07     ` [CFT][PATCH 04/10] fs: Add helper functions for permanently empty directories Eric W. Biederman
2015-05-16  2:08     ` [CFT][PATCH 05/10] sysctl: Allow creating permanently empty directories that serve as mountpoints Eric W. Biederman
2015-05-16  2:08     ` [CFT][PATCH 06/10] proc: Allow creating permanently empty directories that serve as mount points Eric W. Biederman
2015-05-16  2:09     ` [CFT][PATCH 07/10] kernfs: Add support for always empty directories Eric W. Biederman
2015-05-16  2:09     ` [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories to serve as mount points Eric W. Biederman
2015-05-18 13:14       ` Greg Kroah-Hartman
2015-05-16  2:10     ` [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_mount_point Eric W. Biederman
2015-05-18 13:14       ` Greg Kroah-Hartman
2015-05-16  2:11     ` [CFT][PATCH 10/10] mnt: Update fs_fully_visible to test for permanently empty directories Eric W. Biederman
2015-05-22 17:39     ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Eric W. Biederman
     [not found]       ` <87wq004im1.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-05-22 18:59         ` Andy Lutomirski
     [not found]           ` <CALCETrUhXBR5WQ6gXr9KzGc4=7tph7kzopY29Hug4g+FhOzEKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-22 20:41             ` Eric W. Biederman
2015-05-28 14:08           ` Serge Hallyn
2015-05-28 15:03             ` Eric W. Biederman
2015-05-28 17:33               ` Andy Lutomirski
     [not found]                 ` <CALCETrXXax28s9kMTQ-zDx0MttQWG4rg2y-oz3bSGiumSL=3sg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-28 18:20                   ` Kenton Varda
     [not found]                     ` <CAOP=4wid+N_80iyPpiVMN96_fuHZZRGtYQ6AOPn-HFBj2H6Vgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-28 19:14                       ` Eric W. Biederman
     [not found]                         ` <87fv6gikfn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-05-28 20:12                           ` Kenton Varda
2015-05-28 20:47                             ` Richard Weinberger
2015-05-28 21:07                               ` Kenton Varda
     [not found]                                 ` <CAOP=4wiAA4SqvMn_rQJHOjg6M-75bi_G9Fx8ENgVnYdkT5WVQA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-28 21:12                                   ` Richard Weinberger
2015-05-29  0:30                           ` Andy Lutomirski
2015-05-29  0:35                         ` Andy Lutomirski
     [not found]                           ` <CALCETrXO21Y7PR=pKqaqJb1YZArNyjAv7Z-J44O53FcfLM_0Tw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-29  4:36                             ` Eric W. Biederman
     [not found]                               ` <87fv6g80g7.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-05-29  4:54                                 ` Kenton Varda
2015-05-29 17:49                                 ` Andy Lutomirski
2015-06-03 21:13                                   ` Eric W. Biederman
     [not found]                                     ` <87k2vkebri.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-06-03 21:15                                       ` [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible Eric W. Biederman
     [not found]                                         ` <87eglseboh.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-06-04  4:35                                           ` [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible (take 2) Eric W. Biederman
     [not found]                                             ` <874mmodral.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-06-04  5:20                                               ` Greg Kroah-Hartman
2015-06-05  0:46                                           ` [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible Andy Lutomirski
     [not found]                                             ` <CALCETrWwtFaiaYGLoq4EPkrgcq9nEA2GseVfP3iBkbYZ8NfGPg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-06-06 19:14                                               ` Eric W. Biederman
2015-06-04  5:19                                       ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Greg Kroah-Hartman
2015-06-04  6:27                                         ` Eric W. Biederman
     [not found]                                           ` <87h9qo6la9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-06-04  7:34                                             ` Eric W. Biederman
2015-06-16 12:23                                             ` Daniel P. Berrange
2015-05-28 21:04               ` Serge E. Hallyn
     [not found]                 ` <20150528210438.GA14849-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2015-05-28 21:42                   ` Eric W. Biederman
2015-05-28 21:52                     ` Serge E. Hallyn
2015-05-28 19:36             ` Richard Weinberger
     [not found]               ` <55676E32.3050006-/L3Ra7n9ekc@public.gmane.org>
2015-05-28 19:57                 ` Eric W. Biederman
2015-05-28 20:30                   ` Richard Weinberger
     [not found]                     ` <55677AEF.1090809-/L3Ra7n9ekc@public.gmane.org>
2015-05-28 21:32                       ` Eric W. Biederman
     [not found]                         ` <87iobcfkwx.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-05-28 21:46                           ` Richard Weinberger
     [not found]                             ` <55678CCA.80807-/L3Ra7n9ekc@public.gmane.org>
2015-06-16 12:30                               ` Daniel P. Berrange
2015-05-29  9:30                           ` Richard Weinberger
     [not found]                             ` <556831CF.9040600-/L3Ra7n9ekc@public.gmane.org>
2015-05-29 17:41                               ` Eric W. Biederman
2015-06-06 18:56                             ` Eric W. Biederman
     [not found]                               ` <87mw0c1x8p.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-06-16 12:31                                 ` Daniel P. Berrange
     [not found]                                   ` <20150616123148.GB18689-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-06-16 12:46                                     ` Richard Weinberger
