* [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts
@ 2015-05-14 17:30 Eric W. Biederman
  2015-05-14 17:33 ` [CFT][PATCH 04/10] fs: Add helper functions for permanently empty directories Eric W. Biederman
                   ` (4 more replies)
  0 siblings, 5 replies; 85+ messages in thread
From: Eric W. Biederman @ 2015-05-14 17:30 UTC
To: Linux Containers
Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel,
	Tejun Heo


The code is currently available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

  HEAD: a524faf520600968e58bbc732063fccf2fdf9199 mnt: Update fs_fully_visible to test for permanently empty directories

The problem: Mounting a new instance of proc or sysfs can allow things
that a bind mount of those filesystems would not.  The cases I am
dealing with are:

  unshare --user --net --mount ; mount -t sysfs ...
  unshare --user --pid --mount ; mount -t proc ...

The big change is that this set of changes enforces the preservation of
locked mount flags, from the existing mount to the current mount.  This
means that if proc was mounted read-only, the current code will allow a
new instance of proc to be mounted read-write, whereas this set of
changes enforces that proc remains read-only.

The other gotcha is that the current code does not properly detect
empty directories, so to prevent things slipping through the cracks
this set of changes annotates all mount points where nothing will be
revealed if the filesystem mounted on top is removed.

Enforcing the administrator's policy can actually matter in the real
world, as has been shown by the recent docker issue.

With this patchset I have two concerns:
- The enforcement of mount flag preservation on proc and sysfs may break
  things.  (I am especially worried about the implicit adding of nodev).
- I missed a filesystem mountpoint on proc or sysfs which would make
  a fresh copy unmountable for no good reason.

I don't want to break userspace if I can help it, and the code has been
this way for a while, so I figure there is time to find any pitfalls and
address them before this code gets merged.

So if this works for you please give me your Tested-By.

The well known mountpoints for pseudo filesystems that I could find are:

  /dev/ffs*/                  functionfs
  /dev/gadget/                gadgetfs
  /dev/mqueue                 mqueue
  /dev/oprofile/              oprofilefs
  /dev/pts/                   devpts
  /dlm/                       ocfs2_dlmfs
  /ipath/                     ipathfs
  /proc/fs/nfsd/              nfsd
  /proc/openprom/             openpromfs
  /proc/sys/fs/binfmt_misc/   binfmt_misc
  /spu/                       spufs
  /sys/firmware/efi/efivars/  efivarfs
  /sys/fs/cgroup/             cgroup
  /sys/fs/fuse/connections/   fusectl
  /sys/fs/pstore/             pstore
  /sys/fs/selinux/            selinuxfs
  /sys/fs/smackfs/            smackfs
  /sys/hypervisor/s390/       s390_hypfs
  /sys/kernel/config/         configfs
  /sys/kernel/debug/          debugfs
  /sys/kernel/security/       securityfs
  /sys/kernel/tracing/        tracefs
  /var/lib/ibmasm/            ibmasmfs
  /var/lib/nfs/rpc_pipefs/    rpc_pipefs

Eric W. Biederman (10):
      mnt: Refactor the logic for mounting sysfs and proc in a user namespace
      mnt: Modify fs_fully_visible to deal with mount attributes
      vfs: Ignore unlocked mounts in fs_fully_visible
      fs: Add helper functions for permanently empty directories.
      sysctl: Allow creating permanently empty directories.
      proc: Allow creating permanently empty directories.
      kernfs: Add support for always empty directories.
      sysfs: Add support for permanently empty directories.
      sysfs: Create mountpoints with sysfs_create_empty_dir
      mnt: Update fs_fully_visible to test for permanently empty directories

 arch/s390/hypfs/inode.c      | 12 ++----
 drivers/firmware/efi/efi.c   |  6 +--
 fs/configfs/mount.c          | 10 ++---
 fs/debugfs/inode.c           | 11 ++---
 fs/fuse/inode.c              |  9 ++---
 fs/kernfs/dir.c              | 38 +++++++++++++++++-
 fs/kernfs/inode.c            |  2 +
 fs/libfs.c                   | 96 ++++++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c               | 47 +++++++++++++++++++---
 fs/proc/generic.c            | 23 +++++++++++
 fs/proc/inode.c              |  3 ++
 fs/proc/internal.h           |  1 +
 fs/proc/proc_sysctl.c        | 37 +++++++++++++++++
 fs/proc/root.c               |  9 ++---
 fs/pstore/inode.c            | 12 ++----
 fs/sysfs/dir.c               | 34 ++++++++++++++++
 fs/sysfs/mount.c             |  5 +--
 fs/tracefs/inode.c           |  6 +--
 include/linux/fs.h           |  4 +-
 include/linux/kernfs.h       |  3 ++
 include/linux/sysctl.h       |  3 ++
 include/linux/sysfs.h        | 16 ++++++++
 kernel/cgroup.c              | 10 ++---
 kernel/sysctl.c              |  8 +---
 security/inode.c             | 10 ++---
 security/selinux/selinuxfs.c | 11 +++--
 security/smack/smackfs.c     |  8 ++--
 27 files changed, 344 insertions(+), 90 deletions(-)

* [CFT][PATCH 04/10] fs: Add helper functions for permanently empty directories.
From: Eric W. Biederman @ 2015-05-14 17:33 UTC
To: Linux Containers
Cc: linux-fsdevel, Linux API, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages,
	Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo


To ensure it is safe to mount proc and sysfs I need to check if
filesystems that are mounted on top of them are mounted on truly empty
directories.  Given that some directories can gain entries over time,
knowing that a directory is empty right now is insufficient.

Therefore add supporting infrastructure for permanently empty
directories that proc and sysfs can use when they create mount points
for filesystems, and that fs_fully_visible can use to test for
permanently empty directories to ensure that nothing will be gained by
mounting a fresh copy of proc or sysfs.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/libfs.c         | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  2 ++
 2 files changed, 98 insertions(+)

diff --git a/fs/libfs.c b/fs/libfs.c
index cb1fb4b9b637..02813592e121 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1093,3 +1093,99 @@ simple_nosetlease(struct file *filp, long arg, struct file_lock **flp,
 	return -EINVAL;
 }
 EXPORT_SYMBOL(simple_nosetlease);
+
+
+/*
+ * Operations for a permanently empty directory.
+ */
+static struct dentry *empty_dir_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	return ERR_PTR(-ENOENT);
+}
+
+static int empty_dir_getattr(struct vfsmount *mnt, struct dentry *dentry,
+				 struct kstat *stat)
+{
+	struct inode *inode = d_inode(dentry);
+	generic_fillattr(inode, stat);
+	return 0;
+}
+
+static int empty_dir_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	return -EPERM;
+}
+
+static int empty_dir_setxattr(struct dentry *dentry, const char *name,
+			      const void *value, size_t size, int flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static ssize_t empty_dir_getxattr(struct dentry *dentry, const char *name,
+				  void *value, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+
+static int empty_dir_removexattr(struct dentry *dentry, const char *name)
+{
+	return -EOPNOTSUPP;
+}
+
+static ssize_t empty_dir_listxattr(struct dentry *dentry, char *list, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct inode_operations empty_dir_inode_operations = {
+	.lookup		= empty_dir_lookup,
+	.permission	= generic_permission,
+	.setattr	= empty_dir_setattr,
+	.getattr	= empty_dir_getattr,
+	.setxattr	= empty_dir_setxattr,
+	.getxattr	= empty_dir_getxattr,
+	.removexattr	= empty_dir_removexattr,
+	.listxattr	= empty_dir_listxattr,
+};
+
+static loff_t empty_dir_llseek(struct file *file, loff_t offset, int whence)
+{
+	/* An empty directory has two entries . and .. at offsets 0 and 1 */
+	return generic_file_llseek_size(file, offset, whence, 2, 2);
+}
+
+static int empty_dir_readdir(struct file *file, struct dir_context *ctx)
+{
+	dir_emit_dots(file, ctx);
+	return 0;
+}
+
+static const struct file_operations empty_dir_operations = {
+	.llseek		= empty_dir_llseek,
+	.read		= generic_read_dir,
+	.iterate	= empty_dir_readdir,
+	.fsync		= noop_fsync,
+};
+
+
+void make_empty_dir_inode(struct inode *inode)
+{
+	set_nlink(inode, 2);
+	inode->i_mode = S_IFDIR | S_IRUGO | S_IXUGO;
+	inode->i_uid = GLOBAL_ROOT_UID;
+	inode->i_gid = GLOBAL_ROOT_GID;
+	inode->i_rdev = 0;
+	inode->i_size = 2;
+	inode->i_blkbits = PAGE_SHIFT;
+	inode->i_blocks = 0;
+
+	inode->i_op = &empty_dir_inode_operations;
+	inode->i_fop = &empty_dir_operations;
+}
+
+bool is_empty_dir_inode(struct inode *inode)
+{
+	return (inode->i_fop == &empty_dir_operations) &&
+		(inode->i_op == &empty_dir_inode_operations);
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2d24eeb8e59c..571aab91bfc0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2780,6 +2780,8 @@ extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned in
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
 extern const struct file_operations simple_dir_operations;
 extern const struct inode_operations simple_dir_inode_operations;
+extern void make_empty_dir_inode(struct inode *inode);
+extern bool is_empty_dir_inode(struct inode *inode);
 struct tree_descr { char *name; const struct file_operations *ops; int mode; };
 struct dentry *d_alloc_name(struct dentry *, const char *);
 extern int simple_fill_super(struct super_block *, unsigned long, struct tree_descr *);
-- 
2.2.1

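The distinction this patch draws between "empty right now" and "permanently empty" can be illustrated with a small userspace model. This is only a sketch and not kernel code: the `demo_*` types and functions are hypothetical stand-ins for `struct inode`, the operation tables, `make_empty_dir_inode` and `is_empty_dir_inode`. It shows why comparing the operation-table pointers gives a stronger guarantee than looking at the link count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's operation tables. */
struct demo_inode_ops { const char *name; };
struct demo_file_ops { const char *name; };

/* Hypothetical, heavily reduced stand-in for struct inode. */
struct demo_inode {
	unsigned int nlink;
	const struct demo_inode_ops *i_op;
	const struct demo_file_ops *i_fop;
};

/* One unique pair of tables marks a directory as permanently empty. */
static const struct demo_inode_ops demo_empty_dir_iops = { "empty-dir-iops" };
static const struct demo_file_ops demo_empty_dir_fops = { "empty-dir-fops" };

/* An ordinary directory uses some other pair of tables. */
static const struct demo_inode_ops demo_plain_dir_iops = { "plain-dir-iops" };
static const struct demo_file_ops demo_plain_dir_fops = { "plain-dir-fops" };

/* A normal directory that merely has no entries yet. */
void demo_make_plain_dir(struct demo_inode *inode)
{
	inode->nlink = 2;
	inode->i_op = &demo_plain_dir_iops;
	inode->i_fop = &demo_plain_dir_fops;
}

/* Models make_empty_dir_inode(): install the special tables. */
void demo_make_empty_dir(struct demo_inode *inode)
{
	inode->nlink = 2;
	inode->i_op = &demo_empty_dir_iops;
	inode->i_fop = &demo_empty_dir_fops;
}

/* Models is_empty_dir_inode(): identity of the tables is the marker. */
bool demo_is_empty_dir(const struct demo_inode *inode)
{
	return inode->i_fop == &demo_empty_dir_fops &&
	       inode->i_op == &demo_empty_dir_iops;
}

/* The weaker test: true today, but says nothing about tomorrow. */
bool demo_looks_empty_now(const struct demo_inode *inode)
{
	return inode->nlink <= 2;
}

/* Both directories look empty now; only one is guaranteed to stay so. */
void demo_empty_dir_example(void)
{
	struct demo_inode plain, empty;

	demo_make_plain_dir(&plain);
	demo_make_empty_dir(&empty);

	assert(demo_looks_empty_now(&plain));
	assert(demo_looks_empty_now(&empty));
	assert(!demo_is_empty_dir(&plain));
	assert(demo_is_empty_dir(&empty));
}
```

Because the marker is the address of a table whose lookup always fails, no separate flag is needed in the inode, at the cost of requiring both pointers to match.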
* [CFT][PATCH 05/10] sysctl: Allow creating permanently empty directories.
From: Eric W. Biederman @ 2015-05-14 17:33 UTC
To: Linux Containers
Cc: linux-fsdevel, Linux API, Serge E. Hallyn, Andy Lutomirski,
	Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages,
	Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo


Add a magic sysctl table permanently_empty_table that when used to
create a directory forces that directory to be permanently empty.

Update the code to use make_empty_dir_inode when accessing permanently
empty directories.

Update the code to not allow adding to permanently empty directories.

Update /proc/sys/fs/binfmt_misc to be a permanently empty directory.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/proc_sysctl.c  | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/sysctl.h |  3 +++
 kernel/sysctl.c        |  8 +-------
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index fea2561d773b..f9ade2caf438 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -19,6 +19,28 @@ static const struct inode_operations proc_sys_inode_operations;
 static const struct file_operations proc_sys_dir_file_operations;
 static const struct inode_operations proc_sys_dir_operations;
 
+/* Support for permanently empty directories */
+
+struct ctl_table permanently_empty_table[] = {
+	{ }
+};
+
+static bool is_empty_dir(struct ctl_table_header *head)
+{
+	return head->ctl_table[0].child == permanently_empty_table;
+}
+
+static void set_empty_dir(struct ctl_dir *dir)
+{
+	dir->header.ctl_table[0].child = permanently_empty_table;
+}
+
+static void clear_empty_dir(struct ctl_dir *dir)
+
+{
+	dir->header.ctl_table[0].child = NULL;
+}
+
 void proc_sys_poll_notify(struct ctl_table_poll *poll)
 {
 	if (!poll)
@@ -187,6 +209,17 @@ static int insert_header(struct ctl_dir *dir, struct ctl_table_header *header)
 	struct ctl_table *entry;
 	int err;
 
+	/* Is this a permanently empty directory? */
+	if (is_empty_dir(&dir->header))
+		return -EROFS;
+
+	/* Am I creating a permanently empty directory? */
+	if (header->ctl_table == permanently_empty_table) {
+		if (!RB_EMPTY_ROOT(&dir->root))
+			return -EINVAL;
+		set_empty_dir(dir);
+	}
+
 	dir->header.nreg++;
 	header->parent = dir;
 	err = insert_links(header);
@@ -202,6 +235,8 @@ fail:
 	erase_header(header);
 	put_links(header);
 fail_links:
+	if (header->ctl_table == permanently_empty_table)
+		clear_empty_dir(dir);
 	header->parent = NULL;
 	drop_sysctl_table(&dir->header);
 	return err;
@@ -419,6 +454,8 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 		inode->i_mode |= S_IFDIR;
 		inode->i_op = &proc_sys_dir_operations;
 		inode->i_fop = &proc_sys_dir_file_operations;
+		if (is_empty_dir(head))
+			make_empty_dir_inode(inode);
 	}
 out:
 	return inode;
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 795d5fea5697..71fd81994a82 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -188,6 +188,9 @@ struct ctl_table_header *register_sysctl_paths(const struct ctl_path *path,
 void unregister_sysctl_table(struct ctl_table_header * table);
 
 extern int sysctl_init(void);
+
+extern struct ctl_table permanently_empty_table[];
+
 #else /* CONFIG_SYSCTL */
 static inline struct ctl_table_header *register_sysctl_table(struct ctl_table * table)
 {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2082b1a88fb9..92f41a43875e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1531,12 +1531,6 @@ static struct ctl_table vm_table[] = {
 	{ }
 };
 
-#if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
-static struct ctl_table binfmt_misc_table[] = {
-	{ }
-};
-#endif
-
 static struct ctl_table fs_table[] = {
 	{
 		.procname	= "inode-nr",
@@ -1690,7 +1684,7 @@ static struct ctl_table fs_table[] = {
 	{
 		.procname	= "binfmt_misc",
 		.mode		= 0555,
-		.child		= binfmt_misc_table,
+		.child		= permanently_empty_table,
 	},
 #endif
 	{
-- 
2.2.1

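The sentinel-table technique used by this patch can be sketched in userspace. This is a simplified model under stated assumptions, not the sysctl code: `struct demo_dir`, `struct demo_table` and the `demo_*` functions are hypothetical, and registration is reduced to a counter. What the sketch keeps is the rule set of `insert_header`: nothing may be added to a permanently empty directory (`-EROFS`), and a directory that already has entries cannot be declared permanently empty (`-EINVAL`).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a sysctl table entry. */
struct demo_table { const char *procname; };

/* Hypothetical stand-in for a sysctl directory. */
struct demo_dir {
	const struct demo_table *child;	/* sentinel when permanently empty */
	size_t nr_entries;		/* how many tables were registered */
};

/* Sentinel: its address, not its contents, carries the meaning. */
const struct demo_table demo_permanently_empty_table[] = { { NULL } };

bool demo_dir_is_empty_forever(const struct demo_dir *dir)
{
	return dir->child == demo_permanently_empty_table;
}

/* Models the two checks added at the top of insert_header(). */
int demo_insert(struct demo_dir *dir, const struct demo_table *table)
{
	/* Is this a permanently empty directory? Then refuse any addition. */
	if (demo_dir_is_empty_forever(dir))
		return -EROFS;

	/* Marking a directory permanently empty requires it to be empty. */
	if (table == demo_permanently_empty_table) {
		if (dir->nr_entries != 0)
			return -EINVAL;
		dir->child = demo_permanently_empty_table;
		return 0;
	}

	dir->nr_entries++;
	return 0;
}

void demo_sysctl_example(void)
{
	const struct demo_table some_table[] = { { "some-knob" } };
	struct demo_dir binfmt = { NULL, 0 };
	struct demo_dir busy = { NULL, 0 };

	/* A fresh directory can be turned into a mount point placeholder. */
	assert(demo_insert(&binfmt, demo_permanently_empty_table) == 0);
	assert(demo_insert(&binfmt, some_table) == -EROFS);

	/* A directory with entries cannot become permanently empty. */
	assert(demo_insert(&busy, some_table) == 0);
	assert(demo_insert(&busy, demo_permanently_empty_table) == -EINVAL);
}
```

Reusing a table address as the marker means existing callers only have to point `.child` at the sentinel, which is what the `binfmt_misc` entry in `fs_table` does in the hunk above.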
* [CFT][PATCH 01/10] mnt: Refactor the logic for mounting sysfs and proc in a user namespace
From: Eric W. Biederman @ 2015-05-14 17:31 UTC
To: Linux Containers
Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel,
	Tejun Heo


Fresh mounts of proc and sysfs are a very special case that works very
much like a bind mount.  Unfortunately the current structure can not
preserve the MNT_LOCK... mount flags.  Therefore refactor the logic
into a form that can be modified to preserve those lock bits.

Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount of
the filesystem be fully visible in the current mount namespace, before
the filesystem may be mounted.

Move the logic for calling fs_fully_visible from proc and sysfs into
fs/namespace.c where it has greater access to mount namespace state.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c     | 8 +++++++-
 fs/proc/root.c     | 5 +----
 fs/sysfs/mount.c   | 5 +----
 include/linux/fs.h | 2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 1b9e11167bae..8e7edaf60fe1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2332,6 +2332,8 @@ unlock:
 	return err;
 }
 
+static bool fs_fully_visible(struct file_system_type *fs_type);
+
 /*
  * create a new mount for userspace and request it to be added into the
  * namespace's tree
@@ -2363,6 +2365,10 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
 			flags |= MS_NODEV;
 			mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
 		}
+		if (type->fs_flags & FS_USERNS_VISIBLE) {
+			if (!fs_fully_visible(type))
+				return -EPERM;
+		}
 	}
 
 	mnt = vfs_kern_mount(type, flags, name, data);
@@ -3164,7 +3170,7 @@ bool current_chrooted(void)
 	return chrooted;
 }
 
-bool fs_fully_visible(struct file_system_type *type)
+static bool fs_fully_visible(struct file_system_type *type)
 {
 	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
 	struct mount *mnt;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index b7fa4bfe896a..64e1ab64bde6 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -112,9 +112,6 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 		ns = task_active_pid_ns(current);
 		options = data;
 
-		if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
-			return ERR_PTR(-EPERM);
-
 		/* Does the mounter have privilege over the pid namespace? */
 		if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
 			return ERR_PTR(-EPERM);
@@ -159,7 +156,7 @@ static struct file_system_type proc_fs_type = {
 	.name		= "proc",
 	.mount		= proc_mount,
 	.kill_sb	= proc_kill_sb,
-	.fs_flags	= FS_USERNS_MOUNT,
+	.fs_flags	= FS_USERNS_VISIBLE | FS_USERNS_MOUNT,
 };
 
 void __init proc_root_init(void)
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 8a49486bf30c..1c6ac6fcee9f 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -31,9 +31,6 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 	bool new_sb;
 
 	if (!(flags & MS_KERNMOUNT)) {
-		if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
-			return ERR_PTR(-EPERM);
-
 		if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
 			return ERR_PTR(-EPERM);
 	}
@@ -58,7 +55,7 @@ static struct file_system_type sysfs_fs_type = {
 	.name		= "sysfs",
 	.mount		= sysfs_mount,
 	.kill_sb	= sysfs_kill_sb,
-	.fs_flags	= FS_USERNS_MOUNT,
+	.fs_flags	= FS_USERNS_VISIBLE | FS_USERNS_MOUNT,
 };
 
 int __init sysfs_init(void)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 35ec87e490b1..2d24eeb8e59c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1897,6 +1897,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_USERNS_DEV_MOUNT	16 /* A userns mount does not imply MNT_NODEV */
+#define FS_USERNS_VISIBLE	32	/* FS must already be visible */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
@@ -1984,7 +1985,6 @@ extern int vfs_ustat(dev_t, struct kstatfs *);
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
 extern bool our_mnt(struct vfsmount *mnt);
-extern bool fs_fully_visible(struct file_system_type *);
 
 extern int current_umask(void);
 
-- 
2.2.1

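The effect of moving the check behind a filesystem flag can be modelled with a few lines of userspace C. This is a sketch under assumptions: the hunk does not show the condition guarding the block in do_new_mount, so the model assumes it is the branch taken for mounts requested from a non-initial user namespace. The names `demo_fs_type` and `demo_may_mount` are hypothetical; the flag values 8 and 32 are the ones visible in the include/linux/fs.h hunk.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Values as shown in the include/linux/fs.h hunk of this patch. */
#define DEMO_FS_USERNS_MOUNT	8u	/* can be mounted by userns root */
#define DEMO_FS_USERNS_VISIBLE	32u	/* fs must already be visible */

/* Hypothetical, reduced stand-in for struct file_system_type. */
struct demo_fs_type {
	const char *name;
	unsigned int fs_flags;
};

/*
 * Models the policy decision only.  in_user_ns is an assumption about
 * the guarding condition; fully_visible stands in for the result of
 * fs_fully_visible().
 */
int demo_may_mount(const struct demo_fs_type *type, bool in_user_ns,
		   bool fully_visible)
{
	if (!in_user_ns)
		return 0;
	if (!(type->fs_flags & DEMO_FS_USERNS_MOUNT))
		return -EPERM;
	if ((type->fs_flags & DEMO_FS_USERNS_VISIBLE) && !fully_visible)
		return -EPERM;
	return 0;
}

void demo_may_mount_example(void)
{
	const struct demo_fs_type proc = {
		"proc", DEMO_FS_USERNS_VISIBLE | DEMO_FS_USERNS_MOUNT
	};
	const struct demo_fs_type other = { "other", DEMO_FS_USERNS_MOUNT };

	/* proc needs an existing fully visible instance. */
	assert(demo_may_mount(&proc, true, false) == -EPERM);
	assert(demo_may_mount(&proc, true, true) == 0);

	/* A filesystem without the new flag is not affected. */
	assert(demo_may_mount(&other, true, false) == 0);
}
```

Keeping the decision in one place is what later lets fs_fully_visible adjust the new mount's flags, which a per-filesystem mount callback could not do.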
* [CFT][PATCH 02/10] mnt: Modify fs_fully_visible to deal with mount attributes
From: Eric W. Biederman @ 2015-05-14 17:32 UTC
To: Linux Containers
Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel,
	Tejun Heo


Ignore an existing mount if its locked attributes are less permissive
than the new mount's attributes.

On success ensure the new mount locks all of the same attributes as the
old mount.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 8e7edaf60fe1..fccee9924e8c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2332,7 +2332,7 @@ unlock:
 	return err;
 }
 
-static bool fs_fully_visible(struct file_system_type *fs_type);
+static bool fs_fully_visible(struct file_system_type *fs_type, int *new_mnt_flags);
 
 /*
  * create a new mount for userspace and request it to be added into the
@@ -2366,7 +2366,7 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
 			mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
 		}
 		if (type->fs_flags & FS_USERNS_VISIBLE) {
-			if (!fs_fully_visible(type))
+			if (!fs_fully_visible(type, &mnt_flags))
 				return -EPERM;
 		}
 	}
@@ -3170,9 +3170,10 @@ bool current_chrooted(void)
 	return chrooted;
 }
 
-static bool fs_fully_visible(struct file_system_type *type)
+static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 {
 	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+	int new_flags = *new_mnt_flags;
 	struct mount *mnt;
 	bool visible = false;
 
@@ -3191,6 +3192,25 @@ static bool fs_fully_visible(struct file_system_type *type)
 		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
 			continue;
 
+		/* Verify the mount flags are equal to or more permissive
+		 * than the proposed new mount.
+		 */
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) &&
+		    !(new_flags & MNT_READONLY))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
+		    !(new_flags & MNT_NODEV))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
+		    !(new_flags & MNT_NOSUID))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) &&
+		    !(new_flags & MNT_NOEXEC))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
+		    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
+			continue;
+
 		/* This mount is not fully visible if there are any child mounts
 		 * that cover anything except for empty directories.
 		 */
@@ -3201,6 +3221,12 @@ static bool fs_fully_visible(struct file_system_type *type)
 			if (inode->i_nlink > 2)
 				goto next;
 		}
+		/* Preserve the locked attributes */
+		*new_mnt_flags |= mnt->mnt.mnt_flags & (MNT_LOCK_READONLY | \
+							MNT_LOCK_NODEV    | \
+							MNT_LOCK_NOSUID   | \
+							MNT_LOCK_NOEXEC   | \
+							MNT_LOCK_ATIME);
 		visible = true;
 		goto found;
 	next:	;
-- 
2.2.1

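The flag comparison added here can be tried out in isolation with a small userspace sketch. It is a model under stated assumptions: the `DEMO_*` bit values are made up for illustration (the real ones live in include/linux/mount.h), `demo_inherit_locks` is a hypothetical name, and the atime case is left out to keep the example short. The sketch mirrors the two steps of the patch: an existing mount only vouches for the new mount when every attribute it has locked is also set on the new mount, and on success the lock bits are copied over.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values only; not the kernel's actual values. */
#define DEMO_NOSUID		0x0001u
#define DEMO_NODEV		0x0002u
#define DEMO_NOEXEC		0x0004u
#define DEMO_READONLY		0x0008u
#define DEMO_LOCK_NOSUID	0x0100u
#define DEMO_LOCK_NODEV		0x0200u
#define DEMO_LOCK_NOEXEC	0x0400u
#define DEMO_LOCK_READONLY	0x0800u

#define DEMO_LOCK_MASK (DEMO_LOCK_NOSUID | DEMO_LOCK_NODEV | \
			DEMO_LOCK_NOEXEC | DEMO_LOCK_READONLY)

/* Pairs each lock bit with the attribute it protects. */
struct demo_lock_rule {
	unsigned int lock_bit;
	unsigned int attr_bit;
};

static const struct demo_lock_rule demo_rules[] = {
	{ DEMO_LOCK_READONLY,	DEMO_READONLY },
	{ DEMO_LOCK_NODEV,	DEMO_NODEV },
	{ DEMO_LOCK_NOSUID,	DEMO_NOSUID },
	{ DEMO_LOCK_NOEXEC,	DEMO_NOEXEC },
};

/*
 * Returns false when the existing mount has locked an attribute that
 * the proposed new mount lacks.  On success the locks of the existing
 * mount are carried over into *new_flags.
 */
bool demo_inherit_locks(unsigned int old_flags, unsigned int *new_flags)
{
	size_t i;

	for (i = 0; i < sizeof(demo_rules) / sizeof(demo_rules[0]); i++) {
		if ((old_flags & demo_rules[i].lock_bit) &&
		    !(*new_flags & demo_rules[i].attr_bit))
			return false;
	}

	/* Preserve the locked attributes */
	*new_flags |= old_flags & DEMO_LOCK_MASK;
	return true;
}

void demo_inherit_locks_example(void)
{
	unsigned int old_ro = DEMO_READONLY | DEMO_LOCK_READONLY;
	unsigned int rw = 0;
	unsigned int ro = DEMO_READONLY;

	/* A read-write proc cannot be justified by a locked read-only one. */
	assert(!demo_inherit_locks(old_ro, &rw));

	/* A read-only request succeeds and stays locked read-only. */
	assert(demo_inherit_locks(old_ro, &ro));
	assert(ro & DEMO_LOCK_READONLY);
}
```

An unlocked attribute on the existing mount places no constraint on the new one, which matches the patch only testing the `MNT_LOCK_*` bits.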
* [CFT][PATCH 03/10] vfs: Ignore unlocked mounts in fs_fully_visible
From: Eric W. Biederman @ 2015-05-14 17:32 UTC
To: Linux Containers
Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel,
	Tejun Heo


Limit the mounts fs_fully_visible considers to locked mounts.
Unlocked mounts can always be unmounted, so considering them adds
hassle but no security benefit.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index fccee9924e8c..3ede0669b8d2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3211,11 +3211,15 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 		    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
 			continue;
 
-		/* This mount is not fully visible if there are any child mounts
-		 * that cover anything except for empty directories.
+		/* This mount is not fully visible if there are any
+		 * locked child mounts that cover anything except for
+		 * empty directories.
 		 */
 		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
 			struct inode *inode = child->mnt_mountpoint->d_inode;
+			/* Only worry about locked mounts */
+			if (!(child->mnt.mnt_flags & MNT_LOCKED))
+				continue;
 			if (!S_ISDIR(inode->i_mode))
 				goto next;
 			if (inode->i_nlink > 2)
-- 
2.2.1

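The child-mount scan can be modelled as a loop over a plain array. This is a userspace sketch with hypothetical names (`struct demo_child`, `demo_children_hide_nothing`), and it makes one assumption explicit: the locked state that matters is that of each child mount, which is how the comment "Only worry about locked mounts" inside the loop reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one child mount and its mountpoint. */
struct demo_child {
	bool locked;			/* child cannot be unmounted */
	bool mountpoint_is_dir;		/* mountpoint inode is a directory */
	unsigned int mountpoint_nlink;	/* link count of that directory */
};

/*
 * Returns true when no locked child mount covers anything other than
 * a directory that currently has no subdirectories.
 */
bool demo_children_hide_nothing(const struct demo_child *children, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		const struct demo_child *c = &children[i];

		/* Only worry about locked mounts */
		if (!c->locked)
			continue;
		if (!c->mountpoint_is_dir)
			return false;
		if (c->mountpoint_nlink > 2)
			return false;
	}
	return true;
}

void demo_children_example(void)
{
	/* An unlocked mount over a populated directory is ignored. */
	const struct demo_child unlocked_cover[] = { { false, true, 5 } };
	/* The same mount, once locked, hides content for good. */
	const struct demo_child locked_cover[] = { { true, true, 5 } };

	assert(demo_children_hide_nothing(unlocked_cover, 1));
	assert(!demo_children_hide_nothing(locked_cover, 1));
}
```

The link-count test is the "empty right now" heuristic; the permanently empty directory helpers elsewhere in the series exist to replace it with a stronger check.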
* [CFT][PATCH 06/10] proc: Allow creating permanently empty directories.
From: Eric W. Biederman @ 2015-05-14 17:34 UTC
To: Linux Containers
Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel,
	Tejun Heo


Add a new function proc_mk_empty_dir that creates a directory that
cannot be added to.

Update the code to use make_empty_dir_inode when reporting a permanently
empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Update /proc/openprom and /proc/fs/nfsd to be permanently empty
directories.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/generic.c  | 23 +++++++++++++++++++++++
 fs/proc/inode.c    |  3 +++
 fs/proc/internal.h |  1 +
 fs/proc/root.c     |  4 ++--
 4 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index df6327a2b865..e235c1544b22 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -373,6 +373,10 @@ static struct proc_dir_entry *__proc_create(struct proc_dir_entry **parent,
 		WARN(1, "create '/proc/%s' by hand\n", qstr.name);
 		return NULL;
 	}
+	if (S_ISDIR((*parent)->mode) && ((*parent)->proc_fops == NULL)) {
+		WARN(1, "attempt to add to permanently empty directory");
+		return NULL;
+	}
 
 	ent = kzalloc(sizeof(struct proc_dir_entry) + qstr.len + 1, GFP_KERNEL);
 	if (!ent)
@@ -455,6 +459,25 @@ struct proc_dir_entry *proc_mkdir(const char *name,
 }
 EXPORT_SYMBOL(proc_mkdir);
 
+struct proc_dir_entry *proc_mk_empty_dir(const char *name)
+{
+	umode_t mode = S_IFDIR | S_IRUGO | S_IXUGO;
+	struct proc_dir_entry *ent, *parent = NULL;
+
+	ent = __proc_create(&parent, name, mode, 2);
+	if (ent) {
+		ent->data = NULL;
+		ent->proc_fops = NULL;
+		ent->proc_iops = NULL;
+		if (proc_register(parent, ent) < 0) {
+			kfree(ent);
+			parent->nlink--;
+			ent = NULL;
+		}
+	}
+	return ent;
+}
+
 struct proc_dir_entry *proc_create_data(const char *name, umode_t mode,
 					struct proc_dir_entry *parent,
 					const struct file_operations *proc_fops,
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 8272aaba1bb0..b957ec618bda 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -445,6 +445,9 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 					inode->i_fop = &proc_reg_file_ops;
 			} else {
 				inode->i_fop = de->proc_fops;
+				if (S_ISDIR(inode->i_mode) &&
+				    (de->proc_fops == NULL))
+					make_empty_dir_inode(inode);
 			}
 		}
 	} else
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c835b94c0cd3..6bc2e7a12912 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -190,6 +190,7 @@ static inline struct proc_dir_entry *pde_get(struct proc_dir_entry *pde)
 	return pde;
 }
 extern void pde_put(struct proc_dir_entry *);
+struct proc_dir_entry *proc_mk_empty_dir(const char *name);
 
 /*
  * inode.c
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 64e1ab64bde6..b031fc3991c3 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -179,10 +179,10 @@ void __init proc_root_init(void)
 #endif
 	proc_mkdir("fs", NULL);
 	proc_mkdir("driver", NULL);
-	proc_mkdir("fs/nfsd", NULL); /* somewhere for the nfsd filesystem to be mounted */
+	proc_mk_empty_dir("fs/nfsd"); /* somewhere for the nfsd filesystem to be mounted */
 #if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
 	/* just give it a mountpoint */
-	proc_mkdir("openprom", NULL);
+	proc_mk_empty_dir("openprom");
 #endif
 	proc_tty_init();
 	proc_mkdir("bus", NULL);
-- 
2.2.1

* [CFT][PATCH 07/10] kernfs: Add support for always empty directories.
From: Eric W. Biederman @ 2015-05-14 17:34 UTC
To: Linux Containers
Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel,
	Tejun Heo


Add a new function kernfs_create_empty_dir that can be used to create
a directory that cannot be modified.

Update the code to use make_empty_dir_inode when reporting a
permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/kernfs/dir.c        | 38 +++++++++++++++++++++++++++++++++++++-
 fs/kernfs/inode.c      |  2 ++
 include/linux/kernfs.h |  3 +++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index f131fc23ffc4..8643e70536f8 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -585,6 +585,9 @@ int kernfs_add_one(struct kernfs_node *kn)
 		goto out_unlock;
 
 	ret = -ENOENT;
+	if (parent->flags & KERNFS_EMPTY_DIR)
+		goto out_unlock;
+
 	if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active(parent))
 		goto out_unlock;
 
@@ -776,6 +779,38 @@ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 	return ERR_PTR(rc);
 }
 
+/**
+ * kernfs_create_empty_dir - create an always empty directory
+ * @parent: parent in which to create a new directory
+ * @name: name of the new directory
+ *
+ * Returns the created node on success, ERR_PTR() value on failure.
+ */
+struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
+					    const char *name, void *priv)
+{
+	struct kernfs_node *kn;
+	int rc;
+
+	/* allocate */
+	kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR, KERNFS_DIR);
+	if (!kn)
+		return ERR_PTR(-ENOMEM);
+
+	kn->flags |= KERNFS_EMPTY_DIR;
+	kn->dir.root = parent->dir.root;
+	kn->ns = NULL;
+	kn->priv = priv;
+
+	/* link in */
+	rc = kernfs_add_one(kn);
+	if (!rc)
+		return kn;
+
+	kernfs_put(kn);
+	return ERR_PTR(rc);
+}
+
 static struct dentry *kernfs_iop_lookup(struct inode *dir,
 					struct dentry *dentry,
 					unsigned int flags)
@@ -1247,7 +1282,8 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 	mutex_lock(&kernfs_mutex);
 
 	error = -ENOENT;
-	if (!kernfs_active(kn) || !kernfs_active(new_parent))
+	if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
+	    (new_parent->flags & KERNFS_EMPTY_DIR))
 		goto out;
 
 	error = 0;
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 2da8493a380b..756dd56aaf60 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -296,6 +296,8 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
 	case KERNFS_DIR:
 		inode->i_op = &kernfs_dir_iops;
 		inode->i_fop = &kernfs_dir_fops;
+		if (kn->flags & KERNFS_EMPTY_DIR)
+			make_empty_dir_inode(inode);
 		break;
 	case KERNFS_FILE:
 		inode->i_size = kn->attr.size;
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 71ecdab1671b..4b479a0b3d61 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -45,6 +45,7 @@ enum kernfs_node_flag {
 	KERNFS_LOCKDEP		= 0x0100,
 	KERNFS_SUICIDAL		= 0x0400,
 	KERNFS_SUICIDED		= 0x0800,
+	KERNFS_EMPTY_DIR	= 0x1000,
 };
 
 /* @flags for kernfs_create_root() */
@@ -285,6 +286,8 @@ void kernfs_destroy_root(struct kernfs_root *root);
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 void *priv, const void *ns);
+struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
+					    const char *name, void *priv);
 struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 const char *name,
 					 umode_t mode, loff_t size,
-- 
2.2.1

* [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories. [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> ` (4 preceding siblings ...) 2015-05-14 17:34 ` [CFT][PATCH 07/10] kernfs: Add support for always " Eric W. Biederman @ 2015-05-14 17:35 ` Eric W. Biederman [not found] ` <87fv6zhxkp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-05-14 17:36 ` [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir Eric W. Biederman 2015-05-14 17:37 ` [CFT][PATCH 10/10] mnt: Update fs_fully_visible to test for permanently empty directories Eric W. Biederman 7 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-05-14 17:35 UTC (permalink / raw) To: Linux Containers Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo Add two functions sysfs_create_empty_dir and sysfs_remove_empty_dir that hang a permanently empty directory off of a kobject or remove a permanently emptpy directory hanging from a kobject. Export these new functions so modular filesystems can use them. As all permanently empty directories are, are names and used for mouting other filesystems this seems like the right abstraction. Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: "Eric W. 
Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/sysfs/dir.c | 34 ++++++++++++++++++++++++++++++++++ include/linux/sysfs.h | 16 ++++++++++++++++ 2 files changed, 50 insertions(+) diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c index 0b45ff42f374..8244741474d7 100644 --- a/fs/sysfs/dir.c +++ b/fs/sysfs/dir.c @@ -121,3 +121,37 @@ int sysfs_move_dir_ns(struct kobject *kobj, struct kobject *new_parent_kobj, return kernfs_rename_ns(kn, new_parent, kn->name, new_ns); } + +/** + * sysfs_create_empty_dir - create an always empty directory + * @parent_kobj: kobject that will contain this always empty directory + * @name: The name of the always empty directory to add + */ +int sysfs_create_empty_dir(struct kobject *parent_kobj, const char *name) +{ + struct kernfs_node *kn, *parent = parent_kobj->sd; + + kn = kernfs_create_empty_dir(parent, name, NULL); + if (IS_ERR(kn)) { + if (PTR_ERR(kn) == -EEXIST) + sysfs_warn_dup(parent, name); + return PTR_ERR(kn); + } + + return 0; +} +EXPORT_SYMBOL_GPL(sysfs_create_empty_dir); + +/** + * sysfs_remove_empty_dir - remove an always empty directory. 
+ * @parent_kobj: kobject that will contain this always empty directory + * @name: The name of the always empty directory to remove + * + */ +void sysfs_remove_empty_dir(struct kobject *parent_kobj, const char *name) +{ + struct kernfs_node *parent = parent_kobj->sd; + + kernfs_remove_by_name_ns(parent, name, NULL); +} +EXPORT_SYMBOL_GPL(sysfs_remove_empty_dir); diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h index 99382c0df17e..e156d419de75 100644 --- a/include/linux/sysfs.h +++ b/include/linux/sysfs.h @@ -210,6 +210,10 @@ int __must_check sysfs_rename_dir_ns(struct kobject *kobj, const char *new_name, int __must_check sysfs_move_dir_ns(struct kobject *kobj, struct kobject *new_parent_kobj, const void *new_ns); +int __must_check sysfs_create_empty_dir(struct kobject *parent_kobj, + const char *name); +void sysfs_remove_empty_dir(struct kobject *parent_kobj, + const char *name); int __must_check sysfs_create_file_ns(struct kobject *kobj, const struct attribute *attr, @@ -298,6 +302,18 @@ static inline int sysfs_move_dir_ns(struct kobject *kobj, return 0; } +static inline int sysfs_create_empty_dir(struct kobject *parent_kobj, + const char *name) +{ + return 0; +} + +static inline void sysfs_remove_empty_dir(struct kobject *parent_kobj, + const char *name) +{ + return 0; +} + static inline int sysfs_create_file_ns(struct kobject *kobj, const struct attribute *attr, const void *ns) -- 2.2.1 ^ permalink raw reply related [flat|nested] 85+ messages in thread
[parent not found: <87fv6zhxkp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories. [not found] ` <87fv6zhxkp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-05-14 20:31 ` Greg Kroah-Hartman [not found] ` <20150514203131.GB16416-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Greg Kroah-Hartman @ 2015-05-14 20:31 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux Containers, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Tejun Heo On Thu, May 14, 2015 at 12:35:02PM -0500, Eric W. Biederman wrote: > > Add two functions sysfs_create_empty_dir and sysfs_remove_empty_dir > that hang a permanently empty directory off of a kobject or remove > a permanently emptpy directory hanging from a kobject. Export > these new functions so modular filesystems can use them. > > As all permanently empty directories are, are names and used > for mouting other filesystems this seems like the right abstraction. That sentence doesn't make much sense, cut and paste? > > Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Signed-off-by: "Eric W. 
Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> > --- > fs/sysfs/dir.c | 34 ++++++++++++++++++++++++++++++++++ > include/linux/sysfs.h | 16 ++++++++++++++++ > 2 files changed, 50 insertions(+) > > diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c > index 0b45ff42f374..8244741474d7 100644 > --- a/fs/sysfs/dir.c > +++ b/fs/sysfs/dir.c > @@ -121,3 +121,37 @@ int sysfs_move_dir_ns(struct kobject *kobj, struct kobject *new_parent_kobj, > > return kernfs_rename_ns(kn, new_parent, kn->name, new_ns); > } > + > +/** > + * sysfs_create_empty_dir - create an always empty directory > + * @parent_kobj: kobject that will contain this always empty directory > + * @name: The name of the always empty directory to add > + */ > +int sysfs_create_empty_dir(struct kobject *parent_kobj, const char *name) As this really is just a mount point, how about we be explicit with this and call the function: sysfs_create_mount_point() sysfs_remove_mount_point() That makes more sense in the long run, otherwise if you just want to create an empty directory in sysfs, you can do so without making an "empty" kobject and some people might do that accidentally in the future. This makes it more obvious as to what is going on. thanks, greg k-h ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <20150514203131.GB16416-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>]
* Re: [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories. [not found] ` <20150514203131.GB16416-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> @ 2015-05-14 21:33 ` Eric W. Biederman 0 siblings, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-14 21:33 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Linux Containers, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Tejun Heo Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> writes: > On Thu, May 14, 2015 at 12:35:02PM -0500, Eric W. Biederman wrote: >> >> Add two functions sysfs_create_empty_dir and sysfs_remove_empty_dir >> that hang a permanently empty directory off of a kobject or remove >> a permanently emptpy directory hanging from a kobject. Export >> these new functions so modular filesystems can use them. >> >> As all permanently empty directories are, are names and used >> for mouting other filesystems this seems like the right abstraction. > > That sentence doesn't make much sense, cut and paste? Probably one edit too many or too few depending on how you look at it. What I meant is that since the only interesting thing about a permanently empty directory is its name, treating them like sysfs files rather than normal sysfs directories which require a kobject seems like the right abstraction. >> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Signed-off-by: "Eric W. 
Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> >> --- >> fs/sysfs/dir.c | 34 ++++++++++++++++++++++++++++++++++ >> include/linux/sysfs.h | 16 ++++++++++++++++ >> 2 files changed, 50 insertions(+) >> >> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c >> index 0b45ff42f374..8244741474d7 100644 >> --- a/fs/sysfs/dir.c >> +++ b/fs/sysfs/dir.c >> @@ -121,3 +121,37 @@ int sysfs_move_dir_ns(struct kobject *kobj, struct kobject *new_parent_kobj, >> >> return kernfs_rename_ns(kn, new_parent, kn->name, new_ns); >> } >> + >> +/** >> + * sysfs_create_empty_dir - create an always empty directory >> + * @parent_kobj: kobject that will contain this always empty directory >> + * @name: The name of the always empty directory to add >> + */ >> +int sysfs_create_empty_dir(struct kobject *parent_kobj, const char *name) > > As this really is just a mount point, how about we be explicit with > this and call the function: > sysfs_create_mount_point() > sysfs_remove_mount_point() > That makes more sense in the long run, otherwise if you just want to > create an empty directory in sysfs, you can do so without making an > "empty" kobject and some people might do that accidentally in the > future. This makes it more obvious as to what is going on. Yeah. That seems fairly reasonable. My brain is on the edge between the functional description of creating a permanently empty directory, and the usage based description (creating a directory to mount filesystems on). But I agree a name that makes it totally obvious we are creating a directory to mount something on is going to be more usable and comprehensible. My head doesn't like sysfs_create_mount_point() as a mount point can be a file. But I will put it on the back burner and see if I can come up with something better, and if not sysfs_create_mount_point it is. 
Brainstorming: sysfs_create_expected_mount_point() sysfs_reserve_dir_for_mount() sysfs_create_dir_mount_point() sysfs_create_expected_mount_point() Partly I think I would like to rename the proc, sysctl and infrastructure bit as well (consistency and clarity is good). Where I get stuck is how do I ask the question: I see this directory is a mount point, is it a directory whose sole purpose in life is to be a mount point? In the context of that question I like my naming of empty_dir as it conveys what I am interested in. But I like the sysfs_create_mount_point for general use. Maybe I won't make my names consistent. I don't know. I am putting this naming question on the back burner for a bit. Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
* [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> ` (5 preceding siblings ...) 2015-05-14 17:35 ` [CFT][PATCH 08/10] sysfs: Add support for permanently " Eric W. Biederman @ 2015-05-14 17:36 ` Eric W. Biederman [not found] ` <878ucrhxi9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-05-14 17:37 ` [CFT][PATCH 10/10] mnt: Update fs_fully_visible to test for permanently empty directories Eric W. Biederman 7 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-05-14 17:36 UTC (permalink / raw) To: Linux Containers Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo This allows for better documentation in the code and it allows for a simpler and fully correct version of fs_fully_visible to be written. The mount points converted and their filesystems are: /sys/hypervisor/s390/ s390_hypfs /sys/kernel/config/ configfs /sys/kernel/debug/ debugfs /sys/firmware/efi/efivars/ efivarfs /sys/fs/fuse/connections/ fusectl /sys/fs/pstore/ pstore /sys/kernel/tracing/ tracefs /sys/fs/cgroup/ cgroup /sys/kernel/security/ securityfs /sys/fs/selinux/ selinuxfs /sys/fs/smackfs/ smackfs Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: "Eric W. 
Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- arch/s390/hypfs/inode.c | 12 ++++-------- drivers/firmware/efi/efi.c | 6 ++---- fs/configfs/mount.c | 10 ++++------ fs/debugfs/inode.c | 11 ++++------- fs/fuse/inode.c | 9 +++------ fs/pstore/inode.c | 12 ++++-------- fs/tracefs/inode.c | 6 ++---- kernel/cgroup.c | 10 ++++------ security/inode.c | 10 ++++------ security/selinux/selinuxfs.c | 11 +++++------ security/smack/smackfs.c | 8 ++++---- 11 files changed, 40 insertions(+), 65 deletions(-) diff --git a/arch/s390/hypfs/inode.c b/arch/s390/hypfs/inode.c index d3f896a35b98..d943d36076cc 100644 --- a/arch/s390/hypfs/inode.c +++ b/arch/s390/hypfs/inode.c @@ -456,8 +456,6 @@ static const struct super_operations hypfs_s_ops = { .show_options = hypfs_show_options, }; -static struct kobject *s390_kobj; - static int __init hypfs_init(void) { int rc; @@ -481,18 +479,16 @@ static int __init hypfs_init(void) rc = -ENODATA; goto fail_hypfs_sprp_exit; } - s390_kobj = kobject_create_and_add("s390", hypervisor_kobj); - if (!s390_kobj) { - rc = -ENOMEM; + rc = sysfs_create_empty_dir(hypervisor_kobj, "s390"); + if (rc) goto fail_hypfs_diag0c_exit; - } rc = register_filesystem(&hypfs_type); if (rc) goto fail_filesystem; return 0; fail_filesystem: - kobject_put(s390_kobj); + sysfs_remove_empty_dir(hypervisor_kobj, "s390"); fail_hypfs_diag0c_exit: hypfs_diag0c_exit(); fail_hypfs_sprp_exit: @@ -510,7 +506,7 @@ fail_dbfs_exit: static void __exit hypfs_exit(void) { unregister_filesystem(&hypfs_type); - kobject_put(s390_kobj); + sysfs_remove_empty_dir(hypervisor_kobj, "s390"); hypfs_diag0c_exit(); hypfs_sprp_exit(); hypfs_vm_exit(); diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index 3061bb8629dc..98523650efd9 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -65,7 +65,6 @@ static int __init parse_efi_cmdline(char *str) early_param("efi", parse_efi_cmdline); static struct kobject *efi_kobj; -static struct kobject 
*efivars_kobj; /* * Let's not leave out systab information that snuck into @@ -212,10 +211,9 @@ static int __init efisubsys_init(void) goto err_remove_group; /* and the standard mountpoint for efivarfs */ - efivars_kobj = kobject_create_and_add("efivars", efi_kobj); - if (!efivars_kobj) { + error = sysfs_create_empty_dir(efi_kobj, "efivars"); + if (error) { pr_err("efivars: Subsystem registration failed.\n"); - error = -ENOMEM; goto err_remove_group; } diff --git a/fs/configfs/mount.c b/fs/configfs/mount.c index da94e41bdbf6..b4d1580a6602 100644 --- a/fs/configfs/mount.c +++ b/fs/configfs/mount.c @@ -129,8 +129,6 @@ void configfs_release_fs(void) } -static struct kobject *config_kobj; - static int __init configfs_init(void) { int err = -ENOMEM; @@ -141,8 +139,8 @@ static int __init configfs_init(void) if (!configfs_dir_cachep) goto out; - config_kobj = kobject_create_and_add("config", kernel_kobj); - if (!config_kobj) + err = sysfs_create_empty_dir(kernel_kobj, "config"); + if (err) goto out2; err = register_filesystem(&configfs_fs_type); @@ -152,7 +150,7 @@ static int __init configfs_init(void) return 0; out3: pr_err("Unable to register filesystem!\n"); - kobject_put(config_kobj); + sysfs_remove_empty_dir(kernel_kobj, "config"); out2: kmem_cache_destroy(configfs_dir_cachep); configfs_dir_cachep = NULL; @@ -163,7 +161,7 @@ out: static void __exit configfs_exit(void) { unregister_filesystem(&configfs_fs_type); - kobject_put(config_kobj); + sysfs_remove_empty_dir(kernel_kobj, "config"); kmem_cache_destroy(configfs_dir_cachep); configfs_dir_cachep = NULL; } diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c index c1e7ffb0dab6..5bcb499980d0 100644 --- a/fs/debugfs/inode.c +++ b/fs/debugfs/inode.c @@ -716,20 +716,17 @@ bool debugfs_initialized(void) } EXPORT_SYMBOL_GPL(debugfs_initialized); - -static struct kobject *debug_kobj; - static int __init debugfs_init(void) { int retval; - debug_kobj = kobject_create_and_add("debug", kernel_kobj); - if (!debug_kobj) - return 
-EINVAL; + retval = sysfs_create_empty_dir(kernel_kobj, "debug"); + if (retval) + return retval; retval = register_filesystem(&debug_fs_type); if (retval) - kobject_put(debug_kobj); + sysfs_remove_empty_dir(kernel_kobj, "debug"); else debugfs_registered = true; diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 082ac1c97f39..475d9cfa59a9 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -1238,7 +1238,6 @@ static void fuse_fs_cleanup(void) } static struct kobject *fuse_kobj; -static struct kobject *connections_kobj; static int fuse_sysfs_init(void) { @@ -1250,11 +1249,9 @@ static int fuse_sysfs_init(void) goto out_err; } - connections_kobj = kobject_create_and_add("connections", fuse_kobj); - if (!connections_kobj) { - err = -ENOMEM; + err = sysfs_create_empty_dir(fuse_kobj, "connections"); + if (err) goto out_fuse_unregister; - } return 0; @@ -1266,7 +1263,7 @@ static int fuse_sysfs_init(void) static void fuse_sysfs_cleanup(void) { - kobject_put(connections_kobj); + sysfs_remove_empty_dir(fuse_kobj, "connections"); kobject_put(fuse_kobj); } diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c index dc43b5f29305..d1caeefd2d1b 100644 --- a/fs/pstore/inode.c +++ b/fs/pstore/inode.c @@ -461,22 +461,18 @@ static struct file_system_type pstore_fs_type = { .kill_sb = pstore_kill_sb, }; -static struct kobject *pstore_kobj; - static int __init init_pstore_fs(void) { - int err = 0; + int err; /* Create a convenient mount point for people to access pstore */ - pstore_kobj = kobject_create_and_add("pstore", fs_kobj); - if (!pstore_kobj) { - err = -ENOMEM; + err = sysfs_create_empty_dir(fs_kobj, "pstore"); + if (err) goto out; - } err = register_filesystem(&pstore_fs_type); if (err < 0) - kobject_put(pstore_kobj); + sysfs_remove_empty_dir(fs_kobj, "pstore"); out: return err; diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index d92bdf3b079a..e887c881a4b3 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -631,14 +631,12 @@ bool tracefs_initialized(void) return 
tracefs_registered; } -static struct kobject *trace_kobj; - static int __init tracefs_init(void) { int retval; - trace_kobj = kobject_create_and_add("tracing", kernel_kobj); - if (!trace_kobj) + retval = sysfs_create_empty_dir(kernel_kobj, "tracing"); + if (retval) return -EINVAL; retval = register_filesystem(&trace_fs_type); diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 469dd547770c..816657b5ef16 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1924,8 +1924,6 @@ static struct file_system_type cgroup_fs_type = { .kill_sb = cgroup_kill_sb, }; -static struct kobject *cgroup_kobj; - /** * task_cgroup_path - cgroup path of a task in the first cgroup hierarchy * @task: target task @@ -5044,13 +5042,13 @@ int __init cgroup_init(void) ss->bind(init_css_set.subsys[ssid]); } - cgroup_kobj = kobject_create_and_add("cgroup", fs_kobj); - if (!cgroup_kobj) - return -ENOMEM; + err = sysfs_create_empty_dir(fs_kobj, "cgroup"); + if (err) + return err; err = register_filesystem(&cgroup_fs_type); if (err < 0) { - kobject_put(cgroup_kobj); + sysfs_remove_empty_dir(fs_kobj, "cgroup"); return err; } diff --git a/security/inode.c b/security/inode.c index 91503b79c5f8..d7e5de5ffc59 100644 --- a/security/inode.c +++ b/security/inode.c @@ -215,19 +215,17 @@ void securityfs_remove(struct dentry *dentry) } EXPORT_SYMBOL_GPL(securityfs_remove); -static struct kobject *security_kobj; - static int __init securityfs_init(void) { int retval; - security_kobj = kobject_create_and_add("security", kernel_kobj); - if (!security_kobj) - return -EINVAL; + retval = sysfs_create_empty_dir(kernel_kobj, "security"); + if (retval) + return retval; retval = register_filesystem(&fs_type); if (retval) - kobject_put(security_kobj); + sysfs_remove_empty_dir(kernel_kobj, "security"); return retval; } diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c index d2787cca1fcb..a3d882729a45 100644 --- a/security/selinux/selinuxfs.c +++ b/security/selinux/selinuxfs.c @@ -1853,7 +1853,6 @@ 
static struct file_system_type sel_fs_type = { }; struct vfsmount *selinuxfs_mount; -static struct kobject *selinuxfs_kobj; static int __init init_sel_fs(void) { @@ -1862,13 +1861,13 @@ static int __init init_sel_fs(void) if (!selinux_enabled) return 0; - selinuxfs_kobj = kobject_create_and_add("selinux", fs_kobj); - if (!selinuxfs_kobj) - return -ENOMEM; + err = sysfs_create_empty_dir(fs_kobj, "selinux"); + if (err) + return err; err = register_filesystem(&sel_fs_type); if (err) { - kobject_put(selinuxfs_kobj); + sysfs_remove_empty_dir(fs_kobj, "selinux"); return err; } @@ -1887,7 +1886,7 @@ __initcall(init_sel_fs); #ifdef CONFIG_SECURITY_SELINUX_DISABLE void exit_sel_fs(void) { - kobject_put(selinuxfs_kobj); + sysfs_remove_empty_dir(fs_kobj, "selinux"); kern_unmount(selinuxfs_mount); unregister_filesystem(&sel_fs_type); } diff --git a/security/smack/smackfs.c b/security/smack/smackfs.c index d9682985349e..35079cc8c765 100644 --- a/security/smack/smackfs.c +++ b/security/smack/smackfs.c @@ -2241,16 +2241,16 @@ static const struct file_operations smk_revoke_subj_ops = { .llseek = generic_file_llseek, }; -static struct kset *smackfs_kset; /** * smk_init_sysfs - initialize /sys/fs/smackfs * */ static int smk_init_sysfs(void) { - smackfs_kset = kset_create_and_add("smackfs", NULL, fs_kobj); - if (!smackfs_kset) - return -ENOMEM; + int err; + err = sysfs_create_empty_dir(fs_kobj, "smackfs"); + if (err) + return err; return 0; } -- 2.2.1 ^ permalink raw reply related [flat|nested] 85+ messages in thread
[parent not found: <878ucrhxi9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir [not found] ` <878ucrhxi9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-08-11 18:44 ` Tejun Heo 2015-08-11 18:57 ` Eric W. Biederman 0 siblings, 1 reply; 85+ messages in thread From: Tejun Heo @ 2015-08-11 18:44 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux Containers, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman On Thu, May 14, 2015 at 12:36:30PM -0500, Eric W. Biederman wrote: > > This allows for better documentation in the code and > it allows for a simpler and fully correct version of > fs_fully_visible to be written. > > The mount points converted and their filesystems are: > /sys/hypervisor/s390/ s390_hypfs > /sys/kernel/config/ configfs > /sys/kernel/debug/ debugfs > /sys/firmware/efi/efivars/ efivarfs > /sys/fs/fuse/connections/ fusectl > /sys/fs/pstore/ pstore > /sys/kernel/tracing/ tracefs > /sys/fs/cgroup/ cgroup > /sys/kernel/security/ securityfs > /sys/fs/selinux/ selinuxfs > /sys/fs/smackfs/ smackfs > > Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> So, this somehow ends up confusing upstart on centos6 based systems making it fail to mount tmpfs on /sys/fs/cgroup. It also skips sunrpc and other mounts are different too. No idea why at this point. Can we please revert this from -stable until we know what's going on? Thanks. -- tejun ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir 2015-08-11 18:44 ` Tejun Heo @ 2015-08-11 18:57 ` Eric W. Biederman 2015-08-11 19:21 ` Andy Lutomirski [not found] ` <877fp1hcuj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 0 siblings, 2 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-08-11 18:57 UTC (permalink / raw) To: Tejun Heo Cc: Linux Containers, linux-fsdevel, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman Tejun Heo <tj@kernel.org> writes: > On Thu, May 14, 2015 at 12:36:30PM -0500, Eric W. Biederman wrote: >> >> This allows for better documentation in the code and >> it allows for a simpler and fully correct version of >> fs_fully_visible to be written. >> >> The mount points converted and their filesystems are: >> /sys/hypervisor/s390/ s390_hypfs >> /sys/kernel/config/ configfs >> /sys/kernel/debug/ debugfs >> /sys/firmware/efi/efivars/ efivarfs >> /sys/fs/fuse/connections/ fusectl >> /sys/fs/pstore/ pstore >> /sys/kernel/tracing/ tracefs >> /sys/fs/cgroup/ cgroup >> /sys/kernel/security/ securityfs >> /sys/fs/selinux/ selinuxfs >> /sys/fs/smackfs/ smackfs >> >> Cc: stable@vger.kernel.org >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > > So, this somehow ends up confusing upstart on centos6 based systems > making it fail to mount tmpfs on /sys/fs/cgroup. It also skips sunrpc > and other mounts are different too. No idea why at this point. Can > we please revert this from -stable until we know what's going on? *Boggle* The only time this should prevent anything is when in a container when you are not global root. And then only mounting sysfs should be affected. The only difference in executed code really should be setting an extra flag on the kernfs inode. The kernfs changes will also refuse to add entries to these directories (but these directories are empty). 
If this is causing problems I don't have a problem with a revert but reverts take a minute, and this seems to be the first report of this kind. Can we take a minute and attempt to get a coherent explanation? From what little information you have given above it sounds like something shifted when you rebuilt the kernel and now a memory stomp is hitting something else. It should be a matter of moments to debug this issue (once a test environment is set up), and see what is wrong and then we can act intelligently. Tracing a single system call is not difficult. If there really is some weird issue I want to know what it is. Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir 2015-08-11 18:57 ` Eric W. Biederman @ 2015-08-11 19:21 ` Andy Lutomirski [not found] ` <CALCETrXE=fKa3XkEEo6y2=ZNtsuBfX=kaoyDwiP0C2BwqKJWjw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> [not found] ` <877fp1hcuj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 1 sibling, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-08-11 19:21 UTC (permalink / raw) To: Eric W. Biederman Cc: Tejun Heo, Linux Containers, Linux FS Devel, Linux API, Serge E. Hallyn, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman On Tue, Aug 11, 2015 at 11:57 AM, Eric W. Biederman <ebiederm@xmission.com> wrote: > Tejun Heo <tj@kernel.org> writes: > >> On Thu, May 14, 2015 at 12:36:30PM -0500, Eric W. Biederman wrote: >>> >>> This allows for better documentation in the code and >>> it allows for a simpler and fully correct version of >>> fs_fully_visible to be written. >>> >>> The mount points converted and their filesystems are: >>> /sys/hypervisor/s390/ s390_hypfs >>> /sys/kernel/config/ configfs >>> /sys/kernel/debug/ debugfs >>> /sys/firmware/efi/efivars/ efivarfs >>> /sys/fs/fuse/connections/ fusectl >>> /sys/fs/pstore/ pstore >>> /sys/kernel/tracing/ tracefs >>> /sys/fs/cgroup/ cgroup >>> /sys/kernel/security/ securityfs >>> /sys/fs/selinux/ selinuxfs >>> /sys/fs/smackfs/ smackfs >>> >>> Cc: stable@vger.kernel.org >>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> >> >> So, this somehow ends up confusing upstart on centos6 based systems >> making it fail to mount tmpfs on /sys/fs/cgroup. It also skips sunrpc >> and other mounts are different too. No idea why at this point. Can >> we please revert this from -stable until we know what's going on? > > *Boggle* > > The only time this should prevent anything is when in a container when > you are not global root. And then only mounting sysfs should be > affected. 
Before: open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = -1 EACCES (Permission denied) After: open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = -1 ENOENT (No such file or directory) Something broke. I don't know whether CentOS cares about that change, but there could be other odd side effects. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
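The before/after difference reported here can be checked with a small userspace probe. The following is a minimal sketch, assuming a Linux system; the helper name create_errno is made up for illustration, and the flags are copied from the trace above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to create `path` with the same open(2) flags as in the trace.
 * Returns 0 on success (the file then exists) or the errno value
 * reported by open(2) on failure.
 */
static int create_errno(const char *path)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_NOCTTY | O_NONBLOCK, 0666);

	if (fd < 0)
		return errno;

	close(fd);
	return 0;
}
```

Comparing the result of create_errno("/sys/kernel/debug/asdf") against EACCES and ENOENT would distinguish the two behaviours shown in the trace. The value seen on a given machine also depends on the caller's privileges and on whether a filesystem is mounted on that directory, so treat any single result with care.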
[parent not found: <CALCETrXE=fKa3XkEEo6y2=ZNtsuBfX=kaoyDwiP0C2BwqKJWjw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir [not found] ` <CALCETrXE=fKa3XkEEo6y2=ZNtsuBfX=kaoyDwiP0C2BwqKJWjw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-08-12 0:58 ` Eric W. Biederman [not found] ` <87mvxxcogp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-08-12 0:58 UTC (permalink / raw) To: Andy Lutomirski Cc: Tejun Heo, Linux Containers, Linux FS Devel, Linux API, Serge E. Hallyn, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes: > On Tue, Aug 11, 2015 at 11:57 AM, Eric W. Biederman > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >> >> *Boggle* >> >> The only time this should prevent anything is when in a container when >> you are not global root. And then only mounting sysfs should be >> affected. > > Before: > > open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, > 0666) = -1 EACCES (Permission denied) > > > After: > > open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, > 0666) = -1 ENOENT (No such file or directory) > > Something broke. I don't know whether CentOS cares about that change, > but there could be other odd side effects. Thanks for pointing this out. I don't know if broke is quite the right word for a change in error codes on lookup failure, but I agree it is a difference that could have affected something. The behavior of empty proc dirs actually return -ENOENT in this situation and so it is a little fuzzy about which is the best behavior to use. Creating a negative dentry and then letting vfs_create fail may be the better way to go. Negative dentries are weird enough that I would prefer not to have code that creates negative dentries. They could easily be a lurking trap for some filesystem's dentry operations. 
The patch below is enough to change the error code if someone who can reproduce this wants to try this. Eric diff --git a/fs/libfs.c b/fs/libfs.c index 102edfd39000..3a452a485cbf 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -1109,7 +1109,7 @@ EXPORT_SYMBOL(simple_symlink_inode_operations); */ static struct dentry *empty_dir_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags) { - return ERR_PTR(-ENOENT); + return NULL; } static int empty_dir_getattr(struct vfsmount *mnt, struct dentry *dentry, ^ permalink raw reply related [flat|nested] 85+ messages in thread
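Why a one-line change to the lookup return value alters what open(O_CREAT) reports can be sketched with a toy model. This is a simplified illustration and not the kernel's actual code path: the enum and the function model_open_creat are hypothetical, and the assumption that the later create step fails with -EACCES is taken from the "Before" trace earlier in the thread rather than from the VFS source.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes of a directory's lookup step for a missing name. */
enum toy_lookup {
	TOY_LOOKUP_ENOENT,   /* models "return ERR_PTR(-ENOENT);" */
	TOY_LOOKUP_NEGATIVE  /* models "return NULL;" (negative dentry) */
};

/*
 * Model open(O_CREAT) of a missing name in a directory that cannot
 * hold any entries. Returns a negative errno value in both cases.
 */
static int model_open_creat(enum toy_lookup lookup)
{
	/* A lookup error ends the operation before creation is attempted. */
	if (lookup == TOY_LOOKUP_ENOENT)
		return -ENOENT;

	/*
	 * With a negative entry the create step runs and is the one that
	 * fails. -EACCES is assumed here to match the reported trace.
	 */
	return -EACCES;
}
```

Under these assumptions the first variant reports ENOENT to a caller that asked to create a file, while the second lets the create step report the failure.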
[parent not found: <87mvxxcogp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir [not found] ` <87mvxxcogp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-08-12 20:00 ` Tejun Heo 2015-08-12 20:27 ` Eric W. Biederman 0 siblings, 1 reply; 85+ messages in thread From: Tejun Heo @ 2015-08-12 20:00 UTC (permalink / raw) To: Eric W. Biederman Cc: Andy Lutomirski, Linux Containers, Linux FS Devel, Linux API, Serge E. Hallyn, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman On Tue, Aug 11, 2015 at 07:58:14PM -0500, Eric W. Biederman wrote: > Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes: > > > On Tue, Aug 11, 2015 at 11:57 AM, Eric W. Biederman > > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: > >> > >> *Boggle* > >> > >> The only time this should prevent anything is when in a container when > >> you are not global root. And then only mounting sysfs should be > >> affected. > > > > Before: > > > > open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, > > 0666) = -1 EACCES (Permission denied) > > > > > > After: > > > > open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, > > 0666) = -1 ENOENT (No such file or directory) > > > > Something broke. I don't know whether CentOS cares about that change, > > but there could be other odd side effects. > > Thanks for pointing this out. I don't know if broke is quite the right > word for a change in error codes on lookup failure, but I agree it is a > difference that could have affected something. > > The behavior of empty proc dirs actually return -ENOENT in this > situation and so it is a little fuzzy about which is the best behavior > to use. > > Creating a negative dentry and then letting vfs_create fail may be > the better way to go. > > Negative dentries are weird enough that I would prefer not to have code > that creates negative dentries. 
They could easily be a lurking trap > for some filesystems dentry operations. > > The patch below is enough to change the error code if someone who can > reproduce this wants to try this. > > Eric > > diff --gdiff --git a/fs/libfs.c b/fs/libfs.c > index 102edfd39000..3a452a485cbf 100644 > --- a/fs/libfs.c > +++ b/fs/libfs.c > @@ -1109,7 +1109,7 @@ EXPORT_SYMBOL(simple_symlink_inode_operations); > */ > static struct dentry *empty_dir_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags) > { > - return ERR_PTR(-ENOENT); > + return NULL; And let's please restore this too. Sentiments about negative dentries aside, It's outright wrong to report -ENOENT on creat. Thanks. -- tejun ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir 2015-08-12 20:00 ` Tejun Heo @ 2015-08-12 20:27 ` Eric W. Biederman [not found] ` <87r3n82qxd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-08-12 20:27 UTC (permalink / raw) To: Tejun Heo Cc: Andy Lutomirski, Linux Containers, Linux FS Devel, Linux API, Serge E. Hallyn, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman Tejun Heo <tj@kernel.org> writes: > On Tue, Aug 11, 2015 at 07:58:14PM -0500, Eric W. Biederman wrote: >> Andy Lutomirski <luto@amacapital.net> writes: >> >> > On Tue, Aug 11, 2015 at 11:57 AM, Eric W. Biederman >> > <ebiederm@xmission.com> wrote: >> >> >> >> *Boggle* >> >> >> >> The only time this should prevent anything is when in a container when >> >> you are not global root. And then only mounting sysfs should be >> >> affected. >> > >> > Before: >> > >> > open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, >> > 0666) = -1 EACCES (Permission denied) >> > >> > >> > After: >> > >> > open("/sys/kernel/debug/asdf", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, >> > 0666) = -1 ENOENT (No such file or directory) >> > >> > Something broke. I don't know whether CentOS cares about that change, >> > but there could be other odd side effects. >> >> Thanks for pointing this out. I don't know if broke is quite the right >> word for a change in error codes on lookup failure, but I agree it is a >> difference that could have affected something. >> >> The behavior of empty proc dirs actually return -ENOENT in this >> situation and so it is a little fuzzy about which is the best behavior >> to use. >> >> Creativing a negative dentry and and then letting vfs_create fail may be >> the better way to go. >> >> Negative dentries are weird enough that I would prefer not to have code >> that creates negative dentries. 
They could easily be a lurking trap >> for some filesystems dentry operations. >> >> The patch below is enough to change the error code if someone who can >> reproduce this wants to try this. >> >> Eric >> >> diff --gdiff --git a/fs/libfs.c b/fs/libfs.c >> index 102edfd39000..3a452a485cbf 100644 >> --- a/fs/libfs.c >> +++ b/fs/libfs.c >> @@ -1109,7 +1109,7 @@ EXPORT_SYMBOL(simple_symlink_inode_operations); >> */ >> static struct dentry *empty_dir_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags) >> { >> - return ERR_PTR(-ENOENT); >> + return NULL; > > And let's please restore this too. Sentiments about negative dentries > aside, It's outright wrong to report -ENOENT on creat. proc has always reported -ENOENT. sysfs is the odd one out. I am not completely certain that trivial patch above, does not introduce a leak, a NULL pointer dereference or something else nasty when the code is hit. So far it seems that no one cares. And since the change is brittle I am not inclined to mess with it this week, as I have other demands on my limited review bandwidth right now. Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <87r3n82qxd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir [not found] ` <87r3n82qxd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-08-12 21:05 ` Tejun Heo 0 siblings, 0 replies; 85+ messages in thread From: Tejun Heo @ 2015-08-12 21:05 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel

Hello,

On Wed, Aug 12, 2015 at 03:27:26PM -0500, Eric W. Biederman wrote:
> proc has always reported -ENOENT. sysfs is the odd one out.

Hmm... open(2) is clear about failure modes and ENOENT doesn't fit the bill here. Maintaining the behavior for proc for backward compatibility is fine but I don't think it's appropriate to change behaviors on other filesystems which were behaving correctly especially through changes which got routed through -stable.

  ENOENT O_CREAT is not set and the named file does not exist. Or, a directory component in pathname does not exist or is a dangling symbolic link.

  ENOENT pathname refers to a nonexistent directory, O_TMPFILE and one of O_WRONLY or O_RDWR were specified in flags, but this kernel version does not provide the O_TMPFILE functionality.

> I am not completely certain that trivial patch above, does not introduce
> a leak, a NULL pointer dereference or something else nasty when the code
> is hit.
>
> So far it seems that no one cares. And since the change is brittle I am
> not inclined to mess with it this week, as I have other demands on my
> limited review bandwidth right now.

Sure, it isn't "today" urgent but let's please restore the original behavior before the new behavior gets too widespread.

Thanks.

-- 
tejun

^ permalink raw reply [flat|nested] 85+ messages in thread
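The error-code disagreement above can be checked from userspace. What follows is only a minimal sketch, assuming a Linux system; creat_errno() is a hypothetical helper name, not something from the patch set. It repeats the open(2) call shown in the strace output quoted in this thread and reports the resulting errno.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to create `path` with the same flags as the
 * strace output quoted in the thread.
 * Returns 0 if the open succeeded (the file is then left in place),
 * otherwise the errno value reported by open(2).
 */
int creat_errno(const char *path)
{
	int fd;

	assert(path != NULL);

	fd = open(path, O_WRONLY | O_CREAT | O_NOCTTY | O_NONBLOCK, 0666);
	if (fd >= 0) {
		close(fd);
		return 0;
	}
	return errno;
}
```

According to the strace output quoted in this thread, creat_errno("/sys/kernel/debug/asdf") yielded EACCES before the change and ENOENT after it. A missing parent directory is the case where open(2) documents ENOENT even with O_CREAT set, which is why the same code for an existing but empty directory is being questioned.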
[parent not found: <877fp1hcuj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir [not found] ` <877fp1hcuj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-08-11 20:11 ` Tejun Heo [not found] ` <CAOS58YOHU8SFv4UXeBRr4t88UU=DXQCPg2HU_dMBmgM7WBB1zQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Tejun Heo @ 2015-08-11 20:11 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, LINUXFS-ML Hey, On Tue, Aug 11, 2015 at 2:57 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >> So, this somehow ends up confusing upstart on centos6 based systems >> making it fail to mount tmpfs on /sys/fs/cgroup. It also skips sunrpc >> and other mounts are different too. No idea why at this point. Can >> we please revert this from -stable until we know what's going on? > > *Boggle* > > The only time this should prevent anything is when in a container when > you are not global root. And then only mounting sysfs should be > affected. This is just plain boot. No namespace involved. > The only difference in executed code really should be setting an extra > flag on the kernfs, inode. The kernfs changes will also refuse to add > entries to these directories (but these directories are empty). Why do we have this in -stable then? Is this part of a larger fix? > If this is causing problems I don't have a problem with a revert but > reverts take a minute, and this seems to be the first report of this > kind. Can we take a minute and attempt to get a coherent explanation. > > From what little information you given above it sounds like something > shifted and when you rebuilt the kernel and now a memory stomp is > hitting something else. It should be a matter of moments to debug this I don't think it's a random memory stomping thing. 
I reverted the commit from two different kernels and the result was always consistent. > issue (once a test environment is setup), and see what is wrong and then > we can act intelligently. Tracing a single system call is not difficult. I'm already out today so it'll have to wait till tomorrow. > If there really is some weird issue I want to know what it is. Sure, but you wanna do that in -stable? Thanks. -- tejun ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <CAOS58YOHU8SFv4UXeBRr4t88UU=DXQCPg2HU_dMBmgM7WBB1zQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir [not found] ` <CAOS58YOHU8SFv4UXeBRr4t88UU=DXQCPg2HU_dMBmgM7WBB1zQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-08-12 0:37 ` Eric W. Biederman [not found] ` <87fv3pe3zn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-08-12 0:37 UTC (permalink / raw) To: Tejun Heo Cc: Linux Containers, LINUXFS-ML, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes: > Hey, > > On Tue, Aug 11, 2015 at 2:57 PM, Eric W. Biederman > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >>> So, this somehow ends up confusing upstart on centos6 based systems >>> making it fail to mount tmpfs on /sys/fs/cgroup. It also skips sunrpc >>> and other mounts are different too. No idea why at this point. Can >>> we please revert this from -stable until we know what's going on? >> >> *Boggle* >> >> The only time this should prevent anything is when in a container when >> you are not global root. And then only mounting sysfs should be >> affected. > > This is just plain boot. No namespace involved. > >> The only difference in executed code really should be setting an extra >> flag on the kernfs, inode. The kernfs changes will also refuse to add >> entries to these directories (but these directories are empty). > > Why do we have this in -stable then? Is this part of a larger fix? It is. This patch is part of the prep work to prevent unprivileged users from mounting sysfs (using user namespace permissions) when they should not be allowed to. >> If this is causing problems I don't have a problem with a revert but >> reverts take a minute, and this seems to be the first report of this >> kind. Can we take a minute and attempt to get a coherent explanation.
>> >> It should be a matter of moments to debug this >> issue (once a test environment is setup), and see what is wrong and then >> we can act intelligently. Tracing a single system call is not difficult. > > I'm already out today so it'll have to wait till tomorrow. > >> If there really is some weird issue I want to know what it is. > > Sure, but you wanna do that in -stable? Before fixing anything I want a bug report that is clear enough to be reproducible. I just went and attempted to reproduce this, and on RHEL6 workstation (aka my work laptop), using the todays 4.2.0-rc6+ aka edf15b4d4b01b565cb5f4fd2e2d08940b9f92e2f and all of the mounts in /proc/self/mounts are the same between 4.2.0-rc6 and the RHEL6 stock 2.6.32-504.30.3.el6.x86_64, including the cgroups mounted on /cgroup. Which means that I don't have any reason to believe that normal CentOS 6 is broken. Which -stable kernel are you having problems with? Perhaps it was a broken backport? Is it possible this is a local CentOS 6 hack that is breaking? Perhaps a patch you apply on top of your -stable kernel? Certainly with cgroups expected to be mounted at /sys/fs/cgroup there has clearly been at least one change from the stock configuration. I think it is a little less serious if stock CentOS 6 doesn't have problems. Unless it is a conflict of kernel patches I definitely think whatever it is needs to be fixed. Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <87fv3pe3zn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir [not found] ` <87fv3pe3zn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-08-12 3:58 ` Eric W. Biederman [not found] ` <87a8txb1k8.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-08-12 3:58 UTC (permalink / raw) To: Tejun Heo Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, LINUXFS-ML ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: > I just went and attempted to reproduce this, and on RHEL6 workstation > (aka my work laptop), using the todays 4.2.0-rc6+ aka > edf15b4d4b01b565cb5f4fd2e2d08940b9f92e2f and all of the mounts in > /proc/self/mounts are the same between 4.2.0-rc6 and the RHEL6 stock > 2.6.32-504.30.3.el6.x86_64, including the cgroups mounted on /cgroup. I built a few more kernels just to see if this was some weird backport thing. The kernels 3.10.86, 3.14.58, 3.18.20, and 4.1.5 all boot and mount their cgroup filesystems just fine. Granted I kept having to smack the memory cgroup into being compiled in as the config options kept changing but otherwise I have not seen any problems. So I am very surprised you are having problems. Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <87a8txb1k8.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir [not found] ` <87a8txb1k8.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-08-12 4:04 ` Eric W. Biederman [not found] ` <871tf9b19v.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-08-12 4:04 UTC (permalink / raw) To: Tejun Heo Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, LINUXFS-ML ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: > ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: > >> I just went and attempted to reproduce this, and on RHEL6 workstation >> (aka my work laptop), using the todays 4.2.0-rc6+ aka >> edf15b4d4b01b565cb5f4fd2e2d08940b9f92e2f and all of the mounts in >> /proc/self/mounts are the same between 4.2.0-rc6 and the RHEL6 stock >> 2.6.32-504.30.3.el6.x86_64, including the cgroups mounted on /cgroup. > > I built a few more kernels just to see if this was some weird backport > thing. The kernels 3.10.86, 3.14.58, 3.18.20, and 4.1.5 all boot and > mount their cgroup filesystems just fine. Granted I kept having to > smack the memory cgroup into being compiled in as the config options > kept changing but otherwise I have not seen any problems. > > So I am very surprised you are having problems. Although I guess I could have saved myself some time by noticing that 4.1.5 was the only one of the kernels with the change backported into it. *Shrug* I don't see the problem and I don't know where to look to see why you are having problems. Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <871tf9b19v.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir [not found] ` <871tf9b19v.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-08-12 19:15 ` Tejun Heo [not found] ` <20150812191515.GA4496-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Tejun Heo @ 2015-08-12 19:15 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux Containers, LINUXFS-ML, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman Hello, Eric. On Tue, Aug 11, 2015 at 11:04:28PM -0500, Eric W. Biederman wrote: > ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: > > > ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: > > > >> I just went and attempted to reproduce this, and on RHEL6 workstation > >> (aka my work laptop), using the todays 4.2.0-rc6+ aka > >> edf15b4d4b01b565cb5f4fd2e2d08940b9f92e2f and all of the mounts in > >> /proc/self/mounts are the same between 4.2.0-rc6 and the RHEL6 stock > >> 2.6.32-504.30.3.el6.x86_64, including the cgroups mounted on /cgroup. > > > > I built a few more kernels just to see if this was some weird backport > > thing. The kernels 3.10.86, 3.14.58, 3.18.20, and 4.1.5 all boot and > > mount their cgroup filesystems just fine. Granted I kept having to > > smack the memory cgroup into being compiled in as the config options > > kept changing but otherwise I have not seen any problems. > > > > So I am very surprised you are having problems. > > Although I guess I could have saved myself some time by noticing that > 4.1.5 was the only one of the kernels with the change backported into > it. *Shrug* > > I don't see the problem and I don't know where to look to see why you > are having problems. 
lol, this wasn't upstart but an internal tool which sets up a custom cgroup hierarchy and the problem was the size of the directory inode reported by stat(2). It's kinda hilarious but that's what the tool was depending on to tell whether tmpfs is mounted on /sys/fs/cgroup or not. A kernfs directory reports zero as its inode size while tmpfs reports some non-zero number, so the tool did stat(2) on /sys/fs/cgroup and mounted tmpfs iff size is zero to avoid mounting tmpfs multiple times. Now, make_empty_dir_inode() sets i_size to 2 and the tool thinks that tmpfs is already mounted there. It's an icky behavior but it'd be better to maintain the original behavior. We should be able to set size to zero for empty dirs, right? Thanks. -- tejun ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <20150812191515.GA4496-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]
* [PATCH] fs: Set the size of empty dirs to 0. [not found] ` <20150812191515.GA4496-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2015-08-12 20:07 ` Eric W. Biederman [not found] ` <87mvxw46fc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-08-12 20:07 UTC (permalink / raw) To: Linux Containers Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, LINUXFS-ML, Tejun Heo

Before the make_empty_dir_inode calls were introduced into proc, sysfs, and sysctl those directories when stat'ed reported an i_size of 0. make_empty_dir_inode started reporting an i_size of 2. At least one userspace application depended on stat returning i_size of 0. So modify make_empty_dir_inode to cause an i_size of 0 to be reported for these directories.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Reproted-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---

I have tested this and will queue this up shortly.

 fs/libfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 102edfd39000..c7cbfb092e94 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1185,7 +1185,7 @@ void make_empty_dir_inode(struct inode *inode)
 	inode->i_uid = GLOBAL_ROOT_UID;
 	inode->i_gid = GLOBAL_ROOT_GID;
 	inode->i_rdev = 0;
-	inode->i_size = 2;
+	inode->i_size = 0;
 	inode->i_blkbits = PAGE_SHIFT;
 	inode->i_blocks = 0;
 
-- 
2.2.1

^ permalink raw reply related [flat|nested] 85+ messages in thread
[parent not found: <87mvxw46fc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [PATCH] fs: Set the size of empty dirs to 0. [not found] ` <87mvxw46fc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-08-12 20:18 ` Tejun Heo 0 siblings, 0 replies; 85+ messages in thread From: Tejun Heo @ 2015-08-12 20:18 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, LINUXFS-ML On Wed, Aug 12, 2015 at 03:07:19PM -0500, Eric W. Biederman wrote: > > Before the make_empty_dir_inode calls were introduce into proc, sysfs, > and sysctl those directories when stated reported an i_size of 0. > make_empty_dir_inode started reporting an i_size of 2. At least one > userspace application depended on stat returning i_size of 0. So modify > make_empty_dir_inode to cause an i_size of 0 to be reported for these > directories. > > Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Reproted-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ^^^ > Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> Acked-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Thanks. -- tejun ^ permalink raw reply [flat|nested] 85+ messages in thread
* [CFT][PATCH 10/10] mnt: Update fs_fully_visible to test for permanently empty directories [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> ` (6 preceding siblings ...) 2015-05-14 17:36 ` [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_empty_dir Eric W. Biederman @ 2015-05-14 17:37 ` Eric W. Biederman 7 siblings, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-14 17:37 UTC (permalink / raw) To: Linux Containers Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo

fs_fully_visible attempts to make fresh mounts of proc and sysfs give the mounter no more access to proc and sysfs than they could have by creating a bind mount. One aspect of proc and sysfs that makes this particularly tricky is that there are other filesystems that typically mount on top of proc and sysfs. As those filesystems are mounted on empty directories, in practice it is safe to ignore them. However, testing to ensure filesystems are mounted on empty directories has not been something the in-kernel data structures have supported, so the current test for an empty directory, which checks to see if nlink <= 2, is a bit lacking.

proc and sysfs have recently been modified to use the new empty_dir infrastructure to create all of their dedicated mount points. Instead of testing for S_ISDIR(inode->i_mode) && i_nlink <= 2 to see if a directory is empty, test for is_empty_dir_inode(inode). That small change guarantees mounts found on proc and sysfs really are safe to ignore, because the directories are not only empty but nothing can ever be added to them. This guarantees there is nothing to worry about when mounting proc and sysfs.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 3ede0669b8d2..eccd925c6e82 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3220,9 +3220,8 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 			/* Only worry about locked mounts */
 			if (!(mnt->mnt.mnt_flags & MNT_LOCKED))
 				continue;
-			if (!S_ISDIR(inode->i_mode))
-				goto next;
-			if (inode->i_nlink > 2)
+			/* Is the directory permanetly empty? */
+			if (!is_empty_dir_inode(inode))
 				goto next;
 		}
 		/* Preserve the locked attributes */
-- 
2.2.1

^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts 2015-05-14 17:30 [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts Eric W. Biederman ` (2 preceding siblings ...) [not found] ` <87pp63jcca.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-05-14 20:29 ` Greg Kroah-Hartman 2015-05-14 21:10 ` Eric W. Biederman 2015-05-16 2:05 ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Eric W. Biederman 4 siblings, 1 reply; 85+ messages in thread From: Greg Kroah-Hartman @ 2015-05-14 20:29 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux Containers, linux-fsdevel, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Tejun Heo On Thu, May 14, 2015 at 12:30:45PM -0500, Eric W. Biederman wrote: > > The code is currently available at: > > git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing > > HEAD: a524faf520600968e58bbc732063fccf2fdf9199 mnt: Update fs_fully_visible to test for permanently empty directories > > The problem: Mounting a new instance of proc of sysfs can allow things > that a bind mount of those filesystems would not. > > That is the cases I am dealing with are: > unshare --user --net --mount ; mount -t sysfs ... > unshare --user --pid --mount ; mount -t proc ... > > The big change is that this set of changes enforces the preservation of > locked mount flags, from the existing mount to the current mount. Which > means that if proc was mounted read-only the current current will allow > a new instance of proc to be mounted read-write, and this set of changes > enforces that proc remain read-only. 
> > The other gotcha is that the current code does not properly detect empty > directories so to prevent things slipping through the cracks this set of > changes annotates all mount points where nothing will be revealed if > the filesystem mounted on top is removed. > > Enforcing the administrators policy can actually matter in the real > world as has been shown by the recent docker issue. > > With this patchset I have two concerns: > - The enforcement of mount flag preservation on proc and sysfs may break > things. (I am especially worried about the implicit adding of nodev). What do you mean by this? What got added? > - I missed a filesystem mountpoint on proc or sysfs which would make a > fresh copy unmountable for no good reason. > > I don't want to break userspace if I can help it, and the code has been > this way for a while so I figure there is time to find any pitfalls and > address them before this code gets merged. > > So if this works for you please give me your Tested-By > > The well known mountpoints for pseudo filesystems that I could find are: > /dev/ffs*/ functionfs > /dev/gadget/ gadgetfs > /dev/mqueue mqueue > /dev/oprofile/ oprofilefs > /dev/pts/ devpts /dev/shm gets a tmpfs, right? Or do those not matter here? > /dlm/ ocfs2_dlmfs > /ipath/ ipathfs > /proc/fs/nfsd/ nfsd > /proc/openprom/ openpromfs > /proc/sys/fs/binfmt_misc/ binfmt_misc > /spu/ spufs > /sys/firmware/efi/efivars/ efivarfs > /sys/fs/cgroup/ cgroup > /sys/fs/fuse/connections/ fusectl I thought fuse mounted a few more things in here, but I don't know for sure. > /sys/fs/pstore/ pstore > /sys/fs/selinux/ selinuxfs > /sys/fs/smackfs/ smackfs > /sys/hypervisor/s390/ s390_hypfs > /sys/kernel/config/ configfs > /sys/kernel/debug/ debugfs > /sys/kernel/security/ securityfs > /sys/kernel/tracing/ tracefs I think these are all correct for sysfs, I have a minor comment on the sysfs patch I'll make in it. thanks, greg k-h ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts 2015-05-14 20:29 ` [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts Greg Kroah-Hartman @ 2015-05-14 21:10 ` Eric W. Biederman [not found] ` <87oalmg90j.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-05-14 21:10 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Linux Containers, linux-fsdevel, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Tejun Heo Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes: > On Thu, May 14, 2015 at 12:30:45PM -0500, Eric W. Biederman wrote: >> >> The code is currently available at: >> >> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing >> >> HEAD: a524faf520600968e58bbc732063fccf2fdf9199 mnt: Update fs_fully_visible to test for permanently empty directories >> >> The problem: Mounting a new instance of proc of sysfs can allow things >> that a bind mount of those filesystems would not. >> >> That is the cases I am dealing with are: >> unshare --user --net --mount ; mount -t sysfs ... >> unshare --user --pid --mount ; mount -t proc ... >> >> The big change is that this set of changes enforces the preservation of >> locked mount flags, from the existing mount to the current mount. Which >> means that if proc was mounted read-only the current current will allow >> a new instance of proc to be mounted read-write, and this set of changes >> enforces that proc remain read-only. >> >> The other gotcha is that the current code does not properly detect empty >> directories so to prevent things slipping through the cracks this set of >> changes annotates all mount points where nothing will be revealed if >> the filesystem mounted on top is removed. 
>> >> Enforcing the administrators policy can actually matter in the real >> world as has been shown by the recent docker issue. >> >> With this patchset I have two concerns: >> - The enforcement of mount flag preservation on proc and sysfs may break >> things. (I am especially worried about the implicit adding of nodev). > > What do you mean by this? What got added? In a user namespace mounting a filesystem implicitly adds nodev. When I started enforcing not clearing bits that root had set on a filesystem in mount -o remount the implicit nodev wound up being an issue that broke userspace for no good reason. The fix was to implicitly add nodev in remount as well. Taking a second look at this nodev is implicitly added before the fs_fully_visible check so even for applications that are know how the original proc was mounted (and don't see an implicit nodev) and that don't add nodev (because they ''know'' the mount flags) this change should not be a problem. Hooray! One less scary thing. >> - I missed a filesystem mountpoint on proc or sysfs which would make a >> fresh copy unmountable for no good reason. >> >> I don't want to break userspace if I can help it, and the code has been >> this way for a while so I figure there is time to find any pitfalls and >> address them before this code gets merged. >> >> So if this works for you please give me your Tested-By >> >> The well known mountpoints for pseudo filesystems that I could find are: >> /dev/ffs*/ functionfs >> /dev/gadget/ gadgetfs >> /dev/mqueue mqueue >> /dev/oprofile/ oprofilefs >> /dev/pts/ devpts > > /dev/shm gets a tmpfs, right? Or do those not matter here? It does, but it doesn't matter in this context. I was looking for things that mounted themselves on proc or sysfs and I catalogued the rest just to know they were not mounted there. 
>> /dlm/ ocfs2_dlmfs >> /ipath/ ipathfs >> /proc/fs/nfsd/ nfsd >> /proc/openprom/ openpromfs >> /proc/sys/fs/binfmt_misc/ binfmt_misc >> /spu/ spufs > >> /sys/firmware/efi/efivars/ efivarfs >> /sys/fs/cgroup/ cgroup >> /sys/fs/fuse/connections/ fusectl > > I thought fuse mounted a few more things in here, but I don't know for > sure. There are definitely some fuse attributes under /sys/fs/fuse/ but I don't see anything else in the code that could be creating a mount point. >> /sys/fs/pstore/ pstore >> /sys/fs/selinux/ selinuxfs >> /sys/fs/smackfs/ smackfs >> /sys/hypervisor/s390/ s390_hypfs >> /sys/kernel/config/ configfs >> /sys/kernel/debug/ debugfs >> /sys/kernel/security/ securityfs >> /sys/kernel/tracing/ tracefs > > I think these are all correct for sysfs, I have a minor comment on the > sysfs patch I'll make in it. Good to hear and I will answer there as well. Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
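The flag-preservation rule and the implicit nodev discussed above can be modeled in a few lines. This is only a sketch of the reasoning: the MODEL_* bits and both function names are hypothetical and deliberately not the kernel's MNT_* values, and the real check lives in fs_fully_visible. The model assumes, as the message above says, that nodev is added implicitly inside a user namespace before the visibility check runs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for illustration; not the kernel's MNT_* values. */
enum {
	MODEL_RDONLY = 1u << 0,
	MODEL_NODEV  = 1u << 1,
	MODEL_NOSUID = 1u << 2,
	MODEL_NOEXEC = 1u << 3,
	MODEL_ALL    = MODEL_RDONLY | MODEL_NODEV | MODEL_NOSUID | MODEL_NOEXEC,
};

/* Flags the new mount ends up with: nodev is implied in a user namespace. */
unsigned int effective_flags(unsigned int requested, bool in_user_ns)
{
	return in_user_ns ? (requested | MODEL_NODEV) : requested;
}

/*
 * The rule being enforced, modeled: every flag locked on the existing
 * mount must still be set on the new mount.
 */
bool preserves_locked(unsigned int locked, unsigned int requested,
		      bool in_user_ns)
{
	unsigned int flags = effective_flags(requested, in_user_ns);

	assert((locked & ~(unsigned int)MODEL_ALL) == 0);
	return (flags & locked) == locked;
}
```

In this model a read-only proc cannot be remounted read-write (a locked MODEL_RDONLY with no read-only request fails), while a caller in a user namespace that never asks for nodev still passes a locked MODEL_NODEV check, which matches the "one less scary thing" conclusion above.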
[parent not found: <87oalmg90j.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts [not found] ` <87oalmg90j.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-05-15 6:26 ` Andy Lutomirski [not found] ` <CALCETrU1yxcDfv4YV3wVpWMAdiOOsSUFOPUpFAN-mVA4M-OxdQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-05-15 6:26 UTC (permalink / raw) To: Eric W. Biederman Cc: Greg Kroah-Hartman, Linux Containers, Linux FS Devel, Linux API, Serge E. Hallyn, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Tejun Heo

On Thu, May 14, 2015 at 2:10 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> writes:
>
>> On Thu, May 14, 2015 at 12:30:45PM -0500, Eric W. Biederman wrote:
>>>
>>> The code is currently available at:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>>>
>>> HEAD: a524faf520600968e58bbc732063fccf2fdf9199 mnt: Update fs_fully_visible to test for permanently empty directories
>>>
>>> The problem: Mounting a new instance of proc of sysfs can allow things
>>> that a bind mount of those filesystems would not.
>>>
>>> That is the cases I am dealing with are:
>>> unshare --user --net --mount ; mount -t sysfs ...
>>> unshare --user --pid --mount ; mount -t proc ...
>>>
>>> The big change is that this set of changes enforces the preservation of
>>> locked mount flags, from the existing mount to the current mount. Which
>>> means that if proc was mounted read-only the current current will allow
>>> a new instance of proc to be mounted read-write, and this set of changes
>>> enforces that proc remain read-only.
>>>
>>> The other gotcha is that the current code does not properly detect empty
>>> directories so to prevent things slipping through the cracks this set of
>>> changes annotates all mount points where nothing will be revealed if
>>> the filesystem mounted on top is removed.
>>>
>>> Enforcing the administrators policy can actually matter in the real
>>> world as has been shown by the recent docker issue.
>>>
>>> With this patchset I have two concerns:
>>> - The enforcement of mount flag preservation on proc and sysfs may break
>>>   things.  (I am especially worried about the implicit adding of nodev).
>>
>> What do you mean by this?  What got added?
>
> In a user namespace mounting a filesystem implicitly adds nodev.
>
> When I started enforcing not clearing bits that root had set on a
> filesystem in mount -o remount the implicit nodev wound up being
> an issue that broke userspace for no good reason.  The fix was
> to implicitly add nodev in remount as well.
>
> Taking a second look at this nodev is implicitly added before the
> fs_fully_visible check so even for applications that are know how the
> original proc was mounted (and don't see an implicit nodev) and that
> don't add nodev (because they ''know'' the mount flags) this change
> should not be a problem.  Hooray!  One less scary thing.

Can we please just get rid of this implicit nodev thing once and for
all?  If it breaks some really weird /proc use case, then I think the
right fix is to stop enforcing the nodev lock for the proc fully
visible check.  After all, /proc doesn't contain useful device nodes
anyway.

Other than that, the code here looks okay to me on brief inspection.

--Andy

^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <CALCETrU1yxcDfv4YV3wVpWMAdiOOsSUFOPUpFAN-mVA4M-OxdQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts [not found] ` <CALCETrU1yxcDfv4YV3wVpWMAdiOOsSUFOPUpFAN-mVA4M-OxdQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-15 6:55 ` Eric W. Biederman 0 siblings, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-15 6:55 UTC (permalink / raw) To: Andy Lutomirski Cc: Greg Kroah-Hartman, Linux Containers, Linux FS Devel, Linux API, Serge E. Hallyn, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Tejun Heo

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> Can we please just get rid of this implicit nodev thing once and for all?  If it
> breaks some really weird /proc use case, then I think the right fix is to
> stop enforcing the nodev lock for the proc fully visible check.  After
> all, /proc doesn't contain useful device nodes anyway.

On second look I don't think that will actually cause issues in this
case.

I actually have a fix for the implicit nodev weirdness in my development
queue but it requires figuring out how to add s_user_ns to superblocks.
My last round of testing told me I was doing that wrong.  But if the
implicit nodev is actually a problem I will definitely delay this until
I have that change ready to go as well.

> Other than that, the code here looks okay to me on brief inspection.

At a practical level I am concerned that enforcing things like noexec
and nosuid from the original normal global proc might cause problems for
things like sandstorm, lxc, and possibly libvirt-lxc.  So I would really
appreciate it if people associated with those projects could test this
and tell me if I break things.

Other than my stupid refactor in my code for /proc/fs/nfsd that causes
the kernel to oops :(  Doh!

Eric

^ permalink raw reply [flat|nested] 85+ messages in thread
* [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) 2015-05-14 17:30 [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts Eric W. Biederman ` (3 preceding siblings ...) 2015-05-14 20:29 ` [CFT][PATCH 0/10] Making new mounts of proc and sysfs as safe as bind mounts Greg Kroah-Hartman @ 2015-05-16 2:05 ` Eric W. Biederman 2015-05-16 2:06 ` [CFT][PATCH 02/10] mnt: Modify fs_fully_visible to deal with mount attributes Eric W. Biederman [not found] ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 4 siblings, 2 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-16 2:05 UTC (permalink / raw) To: Linux Containers Cc: linux-fsdevel, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo

The code is currently available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

  HEAD: 513d98ba1adfa9e3178b6fc3b2fa57a622283d32 mnt: Update fs_fully_visible to test for permanently empty directories

The problem: Mounting a new instance of proc or sysfs can allow things
that a bind mount of those filesystems would not.

That is, the cases I am dealing with are:
  unshare --user --net --mount ; mount -t sysfs ...
  unshare --user --pid --mount ; mount -t proc ...

This set of changes enforces the preservation of locked mount flags,
from the existing mount to the current mount.  Which means that if proc
was mounted read-only the current code will allow a new instance of proc
to be mounted read-write, and this set of changes enforces that proc
remain read-only.

This set of changes also updates sysctl, proc and sysfs to explicitly
create the directories they expect to be mount points as mount points.
This makes the code a little clearer and makes it so that when
fs_fully_visible disregards something mounted on proc or sysfs it is
guaranteed to be safe, unlike the current code which can occasionally
let things fall through the cracks.

These changes to enforce the administrator's policy can actually matter
in the real world as has been shown by the recent docker issue.

With this patchset I have two concerns:
- The enforcement of not being able to mount proc or sysfs with fewer
  mount flags than the existing mount may break something.
- That there is a filesystem that commonly mounts on proc or sysfs and
  I missed annotating its mount point.  That would make mounting a
  fresh copy of proc or sysfs impossible.

I don't want to break userspace if I can help it, and the code has been
this way for a while so I figure there is time to find any pitfalls and
address them before this code gets merged.

Folks from lxc, sandstorm, libvirt-lxc (anyone who uses user namespaces
in the least): a confirmation that I have not broken your existing code
would be appreciated.

If this works for you please give me your Tested-By.

Since the first version I have renamed the directory creation calls to
sysfs_create_mount_point and proc_create_mount_point (as suggested by
Greg KH) so that it is very clear what the code that creates those mount
points is doing.  I have also fixed a stupid bug that slipped into the
proc code when I refactored it.  I have also gone through and retested
everything so hopefully nothing has slipped past me.
The well known mountpoints for pseudo filesystems that I could find are:

/dev/ffs*/                   functionfs
/dev/gadget/                 gadgetfs
/dev/mqueue                  mqueue
/dev/oprofile/               oprofilefs
/dev/pts/                    devpts
/dev/shm/                    tmpfs
/dlm/                        ocfs2_dlmfs
/ipath/                      ipathfs
/proc/fs/nfsd/               nfsd
/proc/openprom/              openpromfs
/proc/sys/fs/binfmt_misc/    binfmt_misc
/spu/                        spufs
/sys/firmware/efi/efivars/   efivarfs
/sys/fs/cgroup/              cgroup
/sys/fs/fuse/connections/    fusectl
/sys/fs/pstore/              pstore
/sys/fs/selinux/             selinuxfs
/sys/fs/smackfs/             smackfs
/sys/hypervisor/s390/        s390_hypfs
/sys/kernel/config/          configfs
/sys/kernel/debug/           debugfs
/sys/kernel/security/        securityfs
/sys/kernel/tracing/         tracefs
/var/lib/ibmasm/             ibmasmfs
/var/lib/nfs/rpc_pipefs/     rpc_pipefs

Eric W. Biederman (10):
  mnt: Refactor the logic for mounting sysfs and proc in a user namespace
  mnt: Modify fs_fully_visible to deal with mount attributes
  vfs: Ignore unlocked mounts in fs_fully_visible
  fs: Add helper functions for permanently empty directories.
  sysctl: Allow creating permanently empty directories that serve as mountpoints.
  proc: Allow creating permanently empty directories that serve as mount points
  kernfs: Add support for always empty directories.
  sysfs: Add support for permanently empty directories to serve as mount points.
  sysfs: Create mountpoints with sysfs_create_mount_point
  mnt: Update fs_fully_visible to test for permanently empty directories

 arch/s390/hypfs/inode.c      | 12 ++----
 drivers/firmware/efi/efi.c   |  6 +--
 fs/configfs/mount.c          | 10 ++---
 fs/debugfs/inode.c           | 11 ++---
 fs/fuse/inode.c              |  9 ++---
 fs/kernfs/dir.c              | 38 +++++++++++++++++-
 fs/kernfs/inode.c            |  2 +
 fs/libfs.c                   | 96 ++++++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c               | 47 +++++++++++++++++++---
 fs/proc/generic.c            | 23 +++++++++++
 fs/proc/inode.c              |  4 ++
 fs/proc/internal.h           |  6 +++
 fs/proc/proc_sysctl.c        | 37 +++++++++++++++++
 fs/proc/root.c               |  9 ++---
 fs/pstore/inode.c            | 12 ++----
 fs/sysfs/dir.c               | 34 ++++++++++++++++
 fs/sysfs/mount.c             |  5 +--
 fs/tracefs/inode.c           |  6 +--
 include/linux/fs.h           |  4 +-
 include/linux/kernfs.h       |  3 ++
 include/linux/sysctl.h       |  3 ++
 include/linux/sysfs.h        | 16 ++++++++
 kernel/cgroup.c              | 10 ++---
 kernel/sysctl.c              |  8 +---
 security/inode.c             | 10 ++---
 security/selinux/selinuxfs.c | 11 +++--
 security/smack/smackfs.c     |  8 ++--
 27 files changed, 350 insertions(+), 90 deletions(-)

Eric

^ permalink raw reply [flat|nested] 85+ messages in thread
* [CFT][PATCH 02/10] mnt: Modify fs_fully_visible to deal with mount attributes 2015-05-16 2:05 ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Eric W. Biederman @ 2015-05-16 2:06 ` Eric W. Biederman [not found] ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 1 sibling, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-16 2:06 UTC (permalink / raw) To: Linux Containers Cc: linux-fsdevel, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo

Ignore an existing mount if its locked attributes are less permissive
than the new mount's attributes.  On success ensure the new mount locks
all of the same attributes as the old mount.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 8e7edaf60fe1..fccee9924e8c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2332,7 +2332,7 @@ unlock:
 	return err;
 }
 
-static bool fs_fully_visible(struct file_system_type *fs_type);
+static bool fs_fully_visible(struct file_system_type *fs_type, int *new_mnt_flags);
 
 /*
  * create a new mount for userspace and request it to be added into the
@@ -2366,7 +2366,7 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
 			mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
 		}
 		if (type->fs_flags & FS_USERNS_VISIBLE) {
-			if (!fs_fully_visible(type))
+			if (!fs_fully_visible(type, &mnt_flags))
 				return -EPERM;
 		}
 	}
@@ -3170,9 +3170,10 @@ bool current_chrooted(void)
 	return chrooted;
 }
 
-static bool fs_fully_visible(struct file_system_type *type)
+static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 {
 	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+	int new_flags = *new_mnt_flags;
 	struct mount *mnt;
 	bool visible = false;
 
@@ -3191,6 +3192,25 @@ static bool fs_fully_visible(struct file_system_type *type)
 		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
 			continue;
 
+		/* Verify the mount flags are equal to or more permissive
+		 * than the proposed new mount.
+		 */
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) &&
+		    !(new_flags & MNT_READONLY))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
+		    !(new_flags & MNT_NODEV))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
+		    !(new_flags & MNT_NOSUID))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) &&
+		    !(new_flags & MNT_NOEXEC))
+			continue;
+		if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
+		    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
+			continue;
+
 		/* This mount is not fully visible if there are any child mounts
 		 * that cover anything except for empty directories.
 		 */
@@ -3201,6 +3221,12 @@ static bool fs_fully_visible(struct file_system_type *type)
 			if (inode->i_nlink > 2)
 				goto next;
 		}
+		/* Preserve the locked attributes */
+		*new_mnt_flags |= mnt->mnt.mnt_flags & (MNT_LOCK_READONLY | \
+							MNT_LOCK_NODEV    | \
+							MNT_LOCK_NOSUID   | \
+							MNT_LOCK_NOEXEC   | \
+							MNT_LOCK_ATIME);
 		visible = true;
 		goto found;
 	next:	;
-- 
2.2.1

^ permalink raw reply related [flat|nested] 85+ messages in thread
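As a rough user-space illustration of the rule this patch adds (a sketch, not the kernel code), the block below models the two steps: an existing mount only counts if the new mount keeps every restriction that is locked on it, and on success the new mount inherits those lock bits. The DEMO_* names and their values are invented for this example and do not match the kernel's MNT_* constants; the atime comparison is left out for brevity.

```c
#include <stdbool.h>

/* Illustrative flag values for this sketch only; the kernel's real
 * MNT_* constants are defined elsewhere and have different values. */
enum {
	DEMO_READONLY      = 1 << 0,
	DEMO_NODEV         = 1 << 1,
	DEMO_NOSUID        = 1 << 2,
	DEMO_NOEXEC        = 1 << 3,
	DEMO_LOCK_READONLY = 1 << 8,
	DEMO_LOCK_NODEV    = 1 << 9,
	DEMO_LOCK_NOSUID   = 1 << 10,
	DEMO_LOCK_NOEXEC   = 1 << 11,
};

#define DEMO_LOCK_MASK (DEMO_LOCK_READONLY | DEMO_LOCK_NODEV | \
			DEMO_LOCK_NOSUID | DEMO_LOCK_NOEXEC)

/* True if the new mount keeps every restriction locked on the old one. */
static bool demo_flags_acceptable(int old_flags, int new_flags)
{
	if ((old_flags & DEMO_LOCK_READONLY) && !(new_flags & DEMO_READONLY))
		return false;
	if ((old_flags & DEMO_LOCK_NODEV) && !(new_flags & DEMO_NODEV))
		return false;
	if ((old_flags & DEMO_LOCK_NOSUID) && !(new_flags & DEMO_NOSUID))
		return false;
	if ((old_flags & DEMO_LOCK_NOEXEC) && !(new_flags & DEMO_NOEXEC))
		return false;
	return true;
}

/* Returns the new flags with the old mount's lock bits added, or -1 if
 * the old mount has to be skipped because the new one is too permissive. */
static int demo_inherit_locks(int old_flags, int new_flags)
{
	if (!demo_flags_acceptable(old_flags, new_flags))
		return -1;
	return new_flags | (old_flags & DEMO_LOCK_MASK);
}
```

With this model, a locked read-only mount rejects a read-write request, while a read-only request succeeds and comes back with the read-only lock bit set as well.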
[parent not found: <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* [CFT][PATCH 01/10] mnt: Refactor the logic for mounting sysfs and proc in a user namespace [not found] ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-05-16 2:06 ` Eric W. Biederman 2015-05-16 2:07 ` [CFT][PATCH 03/10] vfs: Ignore unlocked mounts in fs_fully_visible Eric W. Biederman ` (8 subsequent siblings) 9 siblings, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-16 2:06 UTC (permalink / raw) To: Linux Containers Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo Fresh mounts of proc and sysfs are a very special case that works very much like a bind mount. Unfortunately the current structure can not preserve the MNT_LOCK... mount flags. Therefore refactor the logic into a form that can be modified to preserve those lock bits. Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount of the filesystem be fully visible in the current mount namespace, before the filesystem may be mounted. Move the logic for calling fs_fully_visible from proc and sysfs into fs/namespace.c where it has greater access to mount namespace state. Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: "Eric W. 
Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/namespace.c | 8 +++++++- fs/proc/root.c | 5 +---- fs/sysfs/mount.c | 5 +---- include/linux/fs.h | 2 +- 4 files changed, 10 insertions(+), 10 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 1b9e11167bae..8e7edaf60fe1 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2332,6 +2332,8 @@ unlock: return err; } +static bool fs_fully_visible(struct file_system_type *fs_type); + /* * create a new mount for userspace and request it to be added into the * namespace's tree @@ -2363,6 +2365,10 @@ static int do_new_mount(struct path *path, const char *fstype, int flags, flags |= MS_NODEV; mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV; } + if (type->fs_flags & FS_USERNS_VISIBLE) { + if (!fs_fully_visible(type)) + return -EPERM; + } } mnt = vfs_kern_mount(type, flags, name, data); @@ -3164,7 +3170,7 @@ bool current_chrooted(void) return chrooted; } -bool fs_fully_visible(struct file_system_type *type) +static bool fs_fully_visible(struct file_system_type *type) { struct mnt_namespace *ns = current->nsproxy->mnt_ns; struct mount *mnt; diff --git a/fs/proc/root.c b/fs/proc/root.c index b7fa4bfe896a..64e1ab64bde6 100644 --- a/fs/proc/root.c +++ b/fs/proc/root.c @@ -112,9 +112,6 @@ static struct dentry *proc_mount(struct file_system_type *fs_type, ns = task_active_pid_ns(current); options = data; - if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type)) - return ERR_PTR(-EPERM); - /* Does the mounter have privilege over the pid namespace? 
*/ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) return ERR_PTR(-EPERM); @@ -159,7 +156,7 @@ static struct file_system_type proc_fs_type = { .name = "proc", .mount = proc_mount, .kill_sb = proc_kill_sb, - .fs_flags = FS_USERNS_MOUNT, + .fs_flags = FS_USERNS_VISIBLE | FS_USERNS_MOUNT, }; void __init proc_root_init(void) diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c index 8a49486bf30c..1c6ac6fcee9f 100644 --- a/fs/sysfs/mount.c +++ b/fs/sysfs/mount.c @@ -31,9 +31,6 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type, bool new_sb; if (!(flags & MS_KERNMOUNT)) { - if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type)) - return ERR_PTR(-EPERM); - if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET)) return ERR_PTR(-EPERM); } @@ -58,7 +55,7 @@ static struct file_system_type sysfs_fs_type = { .name = "sysfs", .mount = sysfs_mount, .kill_sb = sysfs_kill_sb, - .fs_flags = FS_USERNS_MOUNT, + .fs_flags = FS_USERNS_VISIBLE | FS_USERNS_MOUNT, }; int __init sysfs_init(void) diff --git a/include/linux/fs.h b/include/linux/fs.h index 35ec87e490b1..2d24eeb8e59c 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1897,6 +1897,7 @@ struct file_system_type { #define FS_HAS_SUBTYPE 4 #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */ #define FS_USERNS_DEV_MOUNT 16 /* A userns mount does not imply MNT_NODEV */ +#define FS_USERNS_VISIBLE 32 /* FS must already be visible */ #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */ struct dentry *(*mount) (struct file_system_type *, int, const char *, void *); @@ -1984,7 +1985,6 @@ extern int vfs_ustat(dev_t, struct kstatfs *); extern int freeze_super(struct super_block *super); extern int thaw_super(struct super_block *super); extern bool our_mnt(struct vfsmount *mnt); -extern bool fs_fully_visible(struct file_system_type *); extern int current_umask(void); -- 2.2.1 ^ permalink raw reply related [flat|nested] 85+ messages in thread
* [CFT][PATCH 03/10] vfs: Ignore unlocked mounts in fs_fully_visible [not found] ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-05-16 2:06 ` [CFT][PATCH 01/10] mnt: Refactor the logic for mounting sysfs and proc in a user namespace Eric W. Biederman @ 2015-05-16 2:07 ` Eric W. Biederman 2015-05-16 2:07 ` [CFT][PATCH 04/10] fs: Add helper functions for permanently empty directories Eric W. Biederman ` (7 subsequent siblings) 9 siblings, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-16 2:07 UTC (permalink / raw) To: Linux Containers Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo

Limit the mounts fs_fully_visible considers to locked mounts.
Unlocked mounts can always be unmounted so considering them adds hassle
but no security benefit.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index fccee9924e8c..3ede0669b8d2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3211,11 +3211,15 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 		    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
 			continue;
 
-		/* This mount is not fully visible if there are any child mounts
-		 * that cover anything except for empty directories.
+		/* This mount is not fully visible if there are any
+		 * locked child mounts that cover anything except for
+		 * empty directories.
 		 */
 		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
 			struct inode *inode = child->mnt_mountpoint->d_inode;
+			/* Only worry about locked mounts */
+			if (!(child->mnt.mnt_flags & MNT_LOCKED))
+				continue;
 			if (!S_ISDIR(inode->i_mode))
 				goto next;
 			if (inode->i_nlink > 2)
-- 
2.2.1

^ permalink raw reply related [flat|nested] 85+ messages in thread
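The intent of the child loop can be pictured with a small user-space model. This is only a sketch under stated assumptions: struct demo_child and demo_children_hide_nothing are hypothetical names, and the three fields stand in for the child's MNT_LOCKED flag, S_ISDIR() on the mountpoint inode, and its i_nlink.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a child mount and the directory it covers. */
struct demo_child {
	bool locked;        /* models MNT_LOCKED on the child mount */
	bool is_dir;        /* mountpoint inode is a directory */
	unsigned int nlink; /* link count of the mountpoint inode */
};

/* True if no locked child hides anything other than an empty directory.
 * Unlocked children are skipped because they can simply be unmounted. */
static bool demo_children_hide_nothing(const struct demo_child *c, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!c[i].locked)
			continue;
		if (!c[i].is_dir || c[i].nlink > 2)
			return false;
	}
	return true;
}
```

An unlocked child covering a regular file does not make the parent fail the check, while a locked child covering a directory with subdirectories (link count above 2) does.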
* [CFT][PATCH 04/10] fs: Add helper functions for permanently empty directories. [not found] ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-05-16 2:06 ` [CFT][PATCH 01/10] mnt: Refactor the logic for mounting sysfs and proc in a user namespace Eric W. Biederman 2015-05-16 2:07 ` [CFT][PATCH 03/10] vfs: Ignore unlocked mounts in fs_fully_visible Eric W. Biederman @ 2015-05-16 2:07 ` Eric W. Biederman 2015-05-16 2:08 ` [CFT][PATCH 05/10] sysctl: Allow creating permanently empty directories that serve as mountpoints Eric W. Biederman ` (6 subsequent siblings) 9 siblings, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-16 2:07 UTC (permalink / raw) To: Linux Containers Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo

To ensure it is safe to mount proc and sysfs I need to check if
filesystems that are mounted on top of them are mounted on truly empty
directories.  Given that some directories can gain entries over time,
knowing that a directory is empty right now is insufficient.

Therefore add supporting infrastructure for permanently empty
directories that proc and sysfs can use when they create mount points
for filesystems and that fs_fully_visible can use to test for
permanently empty directories to ensure that nothing will be gained by
mounting a fresh copy of proc or sysfs.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W.
Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/libfs.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ include/linux/fs.h | 2 ++ 2 files changed, 98 insertions(+) diff --git a/fs/libfs.c b/fs/libfs.c index cb1fb4b9b637..02813592e121 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -1093,3 +1093,99 @@ simple_nosetlease(struct file *filp, long arg, struct file_lock **flp, return -EINVAL; } EXPORT_SYMBOL(simple_nosetlease); + + +/* + * Operations for a permanently empty directory. + */ +static struct dentry *empty_dir_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags) +{ + return ERR_PTR(-ENOENT); +} + +static int empty_dir_getattr(struct vfsmount *mnt, struct dentry *dentry, + struct kstat *stat) +{ + struct inode *inode = d_inode(dentry); + generic_fillattr(inode, stat); + return 0; +} + +static int empty_dir_setattr(struct dentry *dentry, struct iattr *attr) +{ + return -EPERM; +} + +static int empty_dir_setxattr(struct dentry *dentry, const char *name, + const void *value, size_t size, int flags) +{ + return -EOPNOTSUPP; +} + +static ssize_t empty_dir_getxattr(struct dentry *dentry, const char *name, + void *value, size_t size) +{ + return -EOPNOTSUPP; +} + +static int empty_dir_removexattr(struct dentry *dentry, const char *name) +{ + return -EOPNOTSUPP; +} + +static ssize_t empty_dir_listxattr(struct dentry *dentry, char *list, size_t size) +{ + return -EOPNOTSUPP; +} + +static const struct inode_operations empty_dir_inode_operations = { + .lookup = empty_dir_lookup, + .permission = generic_permission, + .setattr = empty_dir_setattr, + .getattr = empty_dir_getattr, + .setxattr = empty_dir_setxattr, + .getxattr = empty_dir_getxattr, + .removexattr = empty_dir_removexattr, + .listxattr = empty_dir_listxattr, +}; + +static loff_t empty_dir_llseek(struct file *file, loff_t offset, int whence) +{ + /* An empty directory has two entries . and .. 
at offsets 0 and 1 */ + return generic_file_llseek_size(file, offset, whence, 2, 2); +} + +static int empty_dir_readdir(struct file *file, struct dir_context *ctx) +{ + dir_emit_dots(file, ctx); + return 0; +} + +static const struct file_operations empty_dir_operations = { + .llseek = empty_dir_llseek, + .read = generic_read_dir, + .iterate = empty_dir_readdir, + .fsync = noop_fsync, +}; + + +void make_empty_dir_inode(struct inode *inode) +{ + set_nlink(inode, 2); + inode->i_mode = S_IFDIR | S_IRUGO | S_IXUGO; + inode->i_uid = GLOBAL_ROOT_UID; + inode->i_gid = GLOBAL_ROOT_GID; + inode->i_rdev = 0; + inode->i_size = 2; + inode->i_blkbits = PAGE_SHIFT; + inode->i_blocks = 0; + + inode->i_op = &empty_dir_inode_operations; + inode->i_fop = &empty_dir_operations; +} + +bool is_empty_dir_inode(struct inode *inode) +{ + return (inode->i_fop == &empty_dir_operations) && + (inode->i_op == &empty_dir_inode_operations); +} diff --git a/include/linux/fs.h b/include/linux/fs.h index 2d24eeb8e59c..571aab91bfc0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2780,6 +2780,8 @@ extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned in extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *); extern const struct file_operations simple_dir_operations; extern const struct inode_operations simple_dir_inode_operations; +extern void make_empty_dir_inode(struct inode *inode); +extern bool is_empty_dir_inode(struct inode *inode); struct tree_descr { char *name; const struct file_operations *ops; int mode; }; struct dentry *d_alloc_name(struct dentry *, const char *); extern int simple_fill_super(struct super_block *, unsigned long, struct tree_descr *); -- 2.2.1 ^ permalink raw reply related [flat|nested] 85+ messages in thread
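The marking scheme in this patch relies on pointer identity: a directory counts as permanently empty exactly when both of its operation tables are the private empty-directory tables. A minimal user-space sketch of that idea follows; struct demo_ops, struct demo_inode and the demo_* functions are hypothetical stand-ins, not kernel types.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's operation tables and inode. */
struct demo_ops { int unused; };

struct demo_inode {
	const struct demo_ops *i_op;
	const struct demo_ops *i_fop;
	unsigned int nlink;
};

/* Private tables: nothing outside this file can point an inode at them,
 * so their addresses act as an unforgeable "permanently empty" tag. */
static const struct demo_ops demo_empty_dir_iops = { 0 };
static const struct demo_ops demo_empty_dir_fops = { 0 };

static void demo_make_empty_dir(struct demo_inode *inode)
{
	inode->nlink = 2; /* only "." and ".." */
	inode->i_op = &demo_empty_dir_iops;
	inode->i_fop = &demo_empty_dir_fops;
}

static bool demo_is_empty_dir(const struct demo_inode *inode)
{
	return inode->i_op == &demo_empty_dir_iops &&
	       inode->i_fop == &demo_empty_dir_fops;
}
```

A directory that merely happens to have a link count of 2 is not recognized; only one that went through demo_make_empty_dir is, which mirrors the point that being empty right now is not enough.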
* [CFT][PATCH 05/10] sysctl: Allow creating permanently empty directories that serve as mountpoints. [not found] ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> ` (2 preceding siblings ...) 2015-05-16 2:07 ` [CFT][PATCH 04/10] fs: Add helper functions for permanently empty directories Eric W. Biederman @ 2015-05-16 2:08 ` Eric W. Biederman 2015-05-16 2:08 ` [CFT][PATCH 06/10] proc: Allow creating permanently empty directories that serve as mount points Eric W. Biederman ` (5 subsequent siblings) 9 siblings, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-16 2:08 UTC (permalink / raw) To: Linux Containers Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo Add a magic sysctl table sysctl_mount_point that when used to create a directory forces that directory to be permanently empty. Update the code to use make_empty_dir_inode when accessing permanently empty directories. Update the code to not allow adding to permanently empty directories. Update /proc/sys/fs/binfmt_misc to be a permanently empty directory. Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: "Eric W. 
Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/proc/proc_sysctl.c | 37 +++++++++++++++++++++++++++++++++++++ include/linux/sysctl.h | 3 +++ kernel/sysctl.c | 8 +------- 3 files changed, 41 insertions(+), 7 deletions(-) diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index fea2561d773b..fdda62e6115e 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -19,6 +19,28 @@ static const struct inode_operations proc_sys_inode_operations; static const struct file_operations proc_sys_dir_file_operations; static const struct inode_operations proc_sys_dir_operations; +/* Support for permanently empty directories */ + +struct ctl_table sysctl_mount_point[] = { + { } +}; + +static bool is_empty_dir(struct ctl_table_header *head) +{ + return head->ctl_table[0].child == sysctl_mount_point; +} + +static void set_empty_dir(struct ctl_dir *dir) +{ + dir->header.ctl_table[0].child = sysctl_mount_point; +} + +static void clear_empty_dir(struct ctl_dir *dir) + +{ + dir->header.ctl_table[0].child = NULL; +} + void proc_sys_poll_notify(struct ctl_table_poll *poll) { if (!poll) @@ -187,6 +209,17 @@ static int insert_header(struct ctl_dir *dir, struct ctl_table_header *header) struct ctl_table *entry; int err; + /* Is this a permanently empty directory? */ + if (is_empty_dir(&dir->header)) + return -EROFS; + + /* Am I creating a permanently empty directory? 
*/ + if (header->ctl_table == sysctl_mount_point) { + if (!RB_EMPTY_ROOT(&dir->root)) + return -EINVAL; + set_empty_dir(dir); + } + dir->header.nreg++; header->parent = dir; err = insert_links(header); @@ -202,6 +235,8 @@ fail: erase_header(header); put_links(header); fail_links: + if (header->ctl_table == sysctl_mount_point) + clear_empty_dir(dir); header->parent = NULL; drop_sysctl_table(&dir->header); return err; @@ -419,6 +454,8 @@ static struct inode *proc_sys_make_inode(struct super_block *sb, inode->i_mode |= S_IFDIR; inode->i_op = &proc_sys_dir_operations; inode->i_fop = &proc_sys_dir_file_operations; + if (is_empty_dir(head)) + make_empty_dir_inode(inode); } out: return inode; diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h index 795d5fea5697..fa7bc29925c9 100644 --- a/include/linux/sysctl.h +++ b/include/linux/sysctl.h @@ -188,6 +188,9 @@ struct ctl_table_header *register_sysctl_paths(const struct ctl_path *path, void unregister_sysctl_table(struct ctl_table_header * table); extern int sysctl_init(void); + +extern struct ctl_table sysctl_mount_point[]; + #else /* CONFIG_SYSCTL */ static inline struct ctl_table_header *register_sysctl_table(struct ctl_table * table) { diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 2082b1a88fb9..c3eee4c6d6c1 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1531,12 +1531,6 @@ static struct ctl_table vm_table[] = { { } }; -#if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE) -static struct ctl_table binfmt_misc_table[] = { - { } -}; -#endif - static struct ctl_table fs_table[] = { { .procname = "inode-nr", @@ -1690,7 +1684,7 @@ static struct ctl_table fs_table[] = { { .procname = "binfmt_misc", .mode = 0555, - .child = binfmt_misc_table, + .child = sysctl_mount_point, }, #endif { -- 2.2.1 ^ permalink raw reply related [flat|nested] 85+ messages in thread
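The two new checks in insert_header can be summarized with a toy model: once a directory has been marked as a mount point nothing more may be registered in it, and a directory that already has entries cannot be turned into a mount point. The sketch below is a simplification under stated assumptions; struct demo_dir and demo_insert are hypothetical, and the real code tracks entries in an rbtree rather than a counter.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a sysctl directory. */
struct demo_dir {
	bool permanently_empty; /* set once the sentinel table is registered */
	int nr_entries;         /* ordinary entries currently registered */
};

/* Register one table in dir.  is_mount_point plays the role of
 * "header->ctl_table == sysctl_mount_point" in the patch. */
static int demo_insert(struct demo_dir *dir, bool is_mount_point)
{
	if (dir->permanently_empty)
		return -EROFS;          /* nothing may be added any more */
	if (is_mount_point) {
		if (dir->nr_entries != 0)
			return -EINVAL; /* directory is not empty */
		dir->permanently_empty = true;
		return 0;
	}
	dir->nr_entries++;
	return 0;
}
```

The error codes follow the patch: -EROFS for adding to a permanently empty directory and -EINVAL for trying to mark a populated one.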
* [CFT][PATCH 06/10] proc: Allow creating permanently empty directories that serve as mount points [not found] ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> ` (3 preceding siblings ...) 2015-05-16 2:08 ` [CFT][PATCH 05/10] sysctl: Allow creating permanently empty directories that serve as mountpoints Eric W. Biederman @ 2015-05-16 2:08 ` Eric W. Biederman 2015-05-16 2:09 ` [CFT][PATCH 07/10] kernfs: Add support for always empty directories Eric W. Biederman ` (4 subsequent siblings) 9 siblings, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-16 2:08 UTC (permalink / raw) To: Linux Containers Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo

Add a new function proc_create_mount_point that creates a directory
that cannot be added to.

Add a new function is_empty_pde to test if a directory entry is a mount
point.

Update the code to use make_empty_dir_inode when reporting a permanently
empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Update /proc/openprom and /proc/fs/nfsd to be permanently empty
directories.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W.
Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/proc/generic.c | 23 +++++++++++++++++++++++ fs/proc/inode.c | 4 ++++ fs/proc/internal.h | 6 ++++++ fs/proc/root.c | 4 ++-- 4 files changed, 35 insertions(+), 2 deletions(-) diff --git a/fs/proc/generic.c b/fs/proc/generic.c index df6327a2b865..e5dee5c3188e 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -373,6 +373,10 @@ static struct proc_dir_entry *__proc_create(struct proc_dir_entry **parent, WARN(1, "create '/proc/%s' by hand\n", qstr.name); return NULL; } + if (is_empty_pde(*parent)) { + WARN(1, "attempt to add to permanently empty directory"); + return NULL; + } ent = kzalloc(sizeof(struct proc_dir_entry) + qstr.len + 1, GFP_KERNEL); if (!ent) @@ -455,6 +459,25 @@ struct proc_dir_entry *proc_mkdir(const char *name, } EXPORT_SYMBOL(proc_mkdir); +struct proc_dir_entry *proc_create_mount_point(const char *name) +{ + umode_t mode = S_IFDIR | S_IRUGO | S_IXUGO; + struct proc_dir_entry *ent, *parent = NULL; + + ent = __proc_create(&parent, name, mode, 2); + if (ent) { + ent->data = NULL; + ent->proc_fops = NULL; + ent->proc_iops = NULL; + if (proc_register(parent, ent) < 0) { + kfree(ent); + parent->nlink--; + ent = NULL; + } + } + return ent; +} + struct proc_dir_entry *proc_create_data(const char *name, umode_t mode, struct proc_dir_entry *parent, const struct file_operations *proc_fops, diff --git a/fs/proc/inode.c b/fs/proc/inode.c index 8272aaba1bb0..e3eb5524639f 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -423,6 +423,10 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de) inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME; PROC_I(inode)->pde = de; + if (is_empty_pde(de)) { + make_empty_dir_inode(inode); + return inode; + } if (de->mode) { inode->i_mode = de->mode; inode->i_uid = de->uid; diff --git a/fs/proc/internal.h b/fs/proc/internal.h index c835b94c0cd3..aa2781095bd1 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h 
@@ -191,6 +191,12 @@ static inline struct proc_dir_entry *pde_get(struct proc_dir_entry *pde) } extern void pde_put(struct proc_dir_entry *); +static inline bool is_empty_pde(const struct proc_dir_entry *pde) +{ + return S_ISDIR(pde->mode) && !pde->proc_iops; +} +struct proc_dir_entry *proc_create_mount_point(const char *name); + /* * inode.c */ diff --git a/fs/proc/root.c b/fs/proc/root.c index 64e1ab64bde6..68feb0f70e63 100644 --- a/fs/proc/root.c +++ b/fs/proc/root.c @@ -179,10 +179,10 @@ void __init proc_root_init(void) #endif proc_mkdir("fs", NULL); proc_mkdir("driver", NULL); - proc_mkdir("fs/nfsd", NULL); /* somewhere for the nfsd filesystem to be mounted */ + proc_create_mount_point("fs/nfsd"); /* somewhere for the nfsd filesystem to be mounted */ #if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE) /* just give it a mountpoint */ - proc_mkdir("openprom", NULL); + proc_create_mount_point("openprom"); #endif proc_tty_init(); proc_mkdir("bus", NULL); -- 2.2.1 ^ permalink raw reply related [flat|nested] 85+ messages in thread
* [CFT][PATCH 07/10] kernfs: Add support for always empty directories. [not found] ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> ` (4 preceding siblings ...) 2015-05-16 2:08 ` [CFT][PATCH 06/10] proc: Allow creating permanently empty directories that serve as mount points Eric W. Biederman @ 2015-05-16 2:09 ` Eric W. Biederman 2015-05-16 2:09 ` [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories to serve as mount points Eric W. Biederman ` (3 subsequent siblings) 9 siblings, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-16 2:09 UTC (permalink / raw) To: Linux Containers Cc: Linux API, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo Add a new function kernfs_create_empty_dir that can be used to create a directory that can not be modified. Update the code to use make_empty_dir_inode when reporting a permanently empty directory to the vfs. Update the code to not allow adding to permanently empty directories. Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: "Eric W. 
Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/kernfs/dir.c | 38 +++++++++++++++++++++++++++++++++++++- fs/kernfs/inode.c | 2 ++ include/linux/kernfs.h | 3 +++ 3 files changed, 42 insertions(+), 1 deletion(-) diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index f131fc23ffc4..47dc636d80ed 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -585,6 +585,9 @@ int kernfs_add_one(struct kernfs_node *kn) goto out_unlock; ret = -ENOENT; + if (parent->flags & KERNFS_EMPTY_DIR) + goto out_unlock; + if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active(parent)) goto out_unlock; @@ -776,6 +779,38 @@ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent, return ERR_PTR(rc); } +/** + * kernfs_create_empty_dir - create an always empty directory + * @parent: parent in which to create a new directory + * @name: name of the new directory + * + * Returns the created node on success, ERR_PTR() value on failure. + */ +struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent, + const char *name) +{ + struct kernfs_node *kn; + int rc; + + /* allocate */ + kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR, KERNFS_DIR); + if (!kn) + return ERR_PTR(-ENOMEM); + + kn->flags |= KERNFS_EMPTY_DIR; + kn->dir.root = parent->dir.root; + kn->ns = NULL; + kn->priv = NULL; + + /* link in */ + rc = kernfs_add_one(kn); + if (!rc) + return kn; + + kernfs_put(kn); + return ERR_PTR(rc); +} + static struct dentry *kernfs_iop_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags) @@ -1247,7 +1282,8 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent, mutex_lock(&kernfs_mutex); error = -ENOENT; - if (!kernfs_active(kn) || !kernfs_active(new_parent)) + if (!kernfs_active(kn) || !kernfs_active(new_parent) || + (new_parent->flags & KERNFS_EMPTY_DIR)) goto out; error = 0; diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c index 2da8493a380b..756dd56aaf60 100644 --- a/fs/kernfs/inode.c +++ b/fs/kernfs/inode.c 
@@ -296,6 +296,8 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode) case KERNFS_DIR: inode->i_op = &kernfs_dir_iops; inode->i_fop = &kernfs_dir_fops; + if (kn->flags & KERNFS_EMPTY_DIR) + make_empty_dir_inode(inode); break; case KERNFS_FILE: inode->i_size = kn->attr.size; diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h index 71ecdab1671b..29d1896c3ba5 100644 --- a/include/linux/kernfs.h +++ b/include/linux/kernfs.h @@ -45,6 +45,7 @@ enum kernfs_node_flag { KERNFS_LOCKDEP = 0x0100, KERNFS_SUICIDAL = 0x0400, KERNFS_SUICIDED = 0x0800, + KERNFS_EMPTY_DIR = 0x1000, }; /* @flags for kernfs_create_root() */ @@ -285,6 +286,8 @@ void kernfs_destroy_root(struct kernfs_root *root); struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent, const char *name, umode_t mode, void *priv, const void *ns); +struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent, + const char *name); struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, const char *name, umode_t mode, loff_t size, -- 2.2.1 ^ permalink raw reply related [flat|nested] 85+ messages in thread
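The rule the kernfs patch adds can be sketched as a small userspace model. This is not kernel code: model_node, MODEL_EMPTY_DIR, model_init_empty_dir and model_add_one are names made up for the sketch. Only the idea is taken from the patch: once the parent carries the empty-directory flag, every later attempt to link a child under it fails with -ENOENT, as the new check in kernfs_add_one does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for KERNFS_EMPTY_DIR; the value mirrors the patch. */
#define MODEL_EMPTY_DIR 0x1000u

/* Hypothetical, heavily reduced stand-in for struct kernfs_node. */
struct model_node {
	unsigned int flags;
	struct model_node *parent;
	unsigned int nr_children;
};

/* Set up a directory that must stay empty for its whole lifetime. */
static void model_init_empty_dir(struct model_node *dir,
				 struct model_node *parent)
{
	dir->flags = MODEL_EMPTY_DIR;
	dir->parent = parent;
	dir->nr_children = 0;
}

/* Link kn under its parent, refusing if the parent is permanently empty. */
static int model_add_one(struct model_node *kn)
{
	struct model_node *parent = kn->parent;

	assert(parent != NULL);
	if (parent->flags & MODEL_EMPTY_DIR)
		return -ENOENT;

	parent->nr_children++;
	return 0;
}
```

In this model the empty directory itself can still be linked under an ordinary parent, which is what kernfs_create_empty_dir relies on; it is only additions below it that are rejected.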
* [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories to serve as mount points. [not found] ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> ` (5 preceding siblings ...) 2015-05-16 2:09 ` [CFT][PATCH 07/10] kernfs: Add support for always empty directories Eric W. Biederman @ 2015-05-16 2:09 ` Eric W. Biederman 2015-05-18 13:14 ` Greg Kroah-Hartman 2015-05-16 2:10 ` [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_mount_point Eric W. Biederman ` (2 subsequent siblings) 9 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-05-16 2:09 UTC (permalink / raw) To: Linux Containers Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo Add two functions sysfs_create_mount_point and sysfs_remove_mount_point that hang a permanently empty directory off of a kobject or remove a permanently empty directory hanging from a kobject. Export these new functions so modular filesystems can use them. Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: "Eric W. 
Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/sysfs/dir.c | 34 ++++++++++++++++++++++++++++++++++ include/linux/sysfs.h | 16 ++++++++++++++++ 2 files changed, 50 insertions(+) diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c index 0b45ff42f374..94374e435025 100644 --- a/fs/sysfs/dir.c +++ b/fs/sysfs/dir.c @@ -121,3 +121,37 @@ int sysfs_move_dir_ns(struct kobject *kobj, struct kobject *new_parent_kobj, return kernfs_rename_ns(kn, new_parent, kn->name, new_ns); } + +/** + * sysfs_create_mount_point - create an always empty directory + * @parent_kobj: kobject that will contain this always empty directory + * @name: The name of the always empty directory to add + */ +int sysfs_create_mount_point(struct kobject *parent_kobj, const char *name) +{ + struct kernfs_node *kn, *parent = parent_kobj->sd; + + kn = kernfs_create_empty_dir(parent, name); + if (IS_ERR(kn)) { + if (PTR_ERR(kn) == -EEXIST) + sysfs_warn_dup(parent, name); + return PTR_ERR(kn); + } + + return 0; +} +EXPORT_SYMBOL_GPL(sysfs_create_mount_point); + +/** + * sysfs_remove_mount_point - remove an always empty directory. 
+ * @parent_kobj: kobject that will contain this always empty directory + * @name: The name of the always empty directory to remove + * + */ +void sysfs_remove_mount_point(struct kobject *parent_kobj, const char *name) +{ + struct kernfs_node *parent = parent_kobj->sd; + + kernfs_remove_by_name_ns(parent, name, NULL); +} +EXPORT_SYMBOL_GPL(sysfs_remove_mount_point); diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h index 99382c0df17e..3e7e41acc451 100644 --- a/include/linux/sysfs.h +++ b/include/linux/sysfs.h @@ -210,6 +210,10 @@ int __must_check sysfs_rename_dir_ns(struct kobject *kobj, const char *new_name, int __must_check sysfs_move_dir_ns(struct kobject *kobj, struct kobject *new_parent_kobj, const void *new_ns); +int __must_check sysfs_create_mount_point(struct kobject *parent_kobj, + const char *name); +void sysfs_remove_mount_point(struct kobject *parent_kobj, + const char *name); int __must_check sysfs_create_file_ns(struct kobject *kobj, const struct attribute *attr, @@ -298,6 +302,18 @@ static inline int sysfs_move_dir_ns(struct kobject *kobj, return 0; } +static inline int sysfs_create_mount_point(struct kobject *parent_kobj, + const char *name) +{ + return 0; +} + +static inline void sysfs_remove_mount_point(struct kobject *parent_kobj, + const char *name) +{ + return; +} + static inline int sysfs_create_file_ns(struct kobject *kobj, const struct attribute *attr, const void *ns) -- 2.2.1 ^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories to serve as mount points. 2015-05-16 2:09 ` [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories to serve as mount points Eric W. Biederman @ 2015-05-18 13:14 ` Greg Kroah-Hartman 0 siblings, 0 replies; 85+ messages in thread From: Greg Kroah-Hartman @ 2015-05-18 13:14 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux Containers, linux-fsdevel, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Tejun Heo On Fri, May 15, 2015 at 09:09:53PM -0500, Eric W. Biederman wrote: > > Add two functions sysfs_create_mount_point and sysfs_remove_mount_point > that hang a permanently empty directory off of a kobject or remove a > permanently emptpy directory hanging from a kobject. Export these new > functions so modular filesystems can use them. > > Cc: stable@vger.kernel.org > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > --- > fs/sysfs/dir.c | 34 ++++++++++++++++++++++++++++++++++ > include/linux/sysfs.h | 16 ++++++++++++++++ > 2 files changed, 50 insertions(+) Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> ^ permalink raw reply [flat|nested] 85+ messages in thread
* [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_mount_point [not found] ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> ` (6 preceding siblings ...) 2015-05-16 2:09 ` [CFT][PATCH 08/10] sysfs: Add support for permanently empty directories to serve as mount points Eric W. Biederman @ 2015-05-16 2:10 ` Eric W. Biederman 2015-05-18 13:14 ` Greg Kroah-Hartman 2015-05-16 2:11 ` [CFT][PATCH 10/10] mnt: Update fs_fully_visible to test for permanently empty directories Eric W. Biederman 2015-05-22 17:39 ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Eric W. Biederman 9 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-05-16 2:10 UTC (permalink / raw) To: Linux Containers Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo This allows for better documentation in the code and it allows for a simpler and fully correct version of fs_fully_visible to be written. The mount points converted and their filesystems are: /sys/hypervisor/s390/ s390_hypfs /sys/kernel/config/ configfs /sys/kernel/debug/ debugfs /sys/firmware/efi/efivars/ efivarfs /sys/fs/fuse/connections/ fusectl /sys/fs/pstore/ pstore /sys/kernel/tracing/ tracefs /sys/fs/cgroup/ cgroup /sys/kernel/security/ securityfs /sys/fs/selinux/ selinuxfs /sys/fs/smackfs/ smackfs Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: "Eric W. 
Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- arch/s390/hypfs/inode.c | 12 ++++-------- drivers/firmware/efi/efi.c | 6 ++---- fs/configfs/mount.c | 10 ++++------ fs/debugfs/inode.c | 11 ++++------- fs/fuse/inode.c | 9 +++------ fs/pstore/inode.c | 12 ++++-------- fs/tracefs/inode.c | 6 ++---- kernel/cgroup.c | 10 ++++------ security/inode.c | 10 ++++------ security/selinux/selinuxfs.c | 11 +++++------ security/smack/smackfs.c | 8 ++++---- 11 files changed, 40 insertions(+), 65 deletions(-) diff --git a/arch/s390/hypfs/inode.c b/arch/s390/hypfs/inode.c index d3f896a35b98..2eeb0a0f506d 100644 --- a/arch/s390/hypfs/inode.c +++ b/arch/s390/hypfs/inode.c @@ -456,8 +456,6 @@ static const struct super_operations hypfs_s_ops = { .show_options = hypfs_show_options, }; -static struct kobject *s390_kobj; - static int __init hypfs_init(void) { int rc; @@ -481,18 +479,16 @@ static int __init hypfs_init(void) rc = -ENODATA; goto fail_hypfs_sprp_exit; } - s390_kobj = kobject_create_and_add("s390", hypervisor_kobj); - if (!s390_kobj) { - rc = -ENOMEM; + rc = sysfs_create_mount_point(hypervisor_kobj, "s390"); + if (rc) goto fail_hypfs_diag0c_exit; - } rc = register_filesystem(&hypfs_type); if (rc) goto fail_filesystem; return 0; fail_filesystem: - kobject_put(s390_kobj); + sysfs_remove_mount_point(hypervisor_kobj, "s390"); fail_hypfs_diag0c_exit: hypfs_diag0c_exit(); fail_hypfs_sprp_exit: @@ -510,7 +506,7 @@ fail_dbfs_exit: static void __exit hypfs_exit(void) { unregister_filesystem(&hypfs_type); - kobject_put(s390_kobj); + sysfs_remove_mount_point(hypervisor_kobj, "s390"); hypfs_diag0c_exit(); hypfs_sprp_exit(); hypfs_vm_exit(); diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index 3061bb8629dc..e14363d12690 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -65,7 +65,6 @@ static int __init parse_efi_cmdline(char *str) early_param("efi", parse_efi_cmdline); static struct kobject *efi_kobj; -static struct kobject 
*efivars_kobj; /* * Let's not leave out systab information that snuck into @@ -212,10 +211,9 @@ static int __init efisubsys_init(void) goto err_remove_group; /* and the standard mountpoint for efivarfs */ - efivars_kobj = kobject_create_and_add("efivars", efi_kobj); - if (!efivars_kobj) { + error = sysfs_create_mount_point(efi_kobj, "efivars"); + if (error) { pr_err("efivars: Subsystem registration failed.\n"); - error = -ENOMEM; goto err_remove_group; } diff --git a/fs/configfs/mount.c b/fs/configfs/mount.c index da94e41bdbf6..bca58da65e2b 100644 --- a/fs/configfs/mount.c +++ b/fs/configfs/mount.c @@ -129,8 +129,6 @@ void configfs_release_fs(void) } -static struct kobject *config_kobj; - static int __init configfs_init(void) { int err = -ENOMEM; @@ -141,8 +139,8 @@ static int __init configfs_init(void) if (!configfs_dir_cachep) goto out; - config_kobj = kobject_create_and_add("config", kernel_kobj); - if (!config_kobj) + err = sysfs_create_mount_point(kernel_kobj, "config"); + if (err) goto out2; err = register_filesystem(&configfs_fs_type); @@ -152,7 +150,7 @@ static int __init configfs_init(void) return 0; out3: pr_err("Unable to register filesystem!\n"); - kobject_put(config_kobj); + sysfs_remove_mount_point(kernel_kobj, "config"); out2: kmem_cache_destroy(configfs_dir_cachep); configfs_dir_cachep = NULL; @@ -163,7 +161,7 @@ out: static void __exit configfs_exit(void) { unregister_filesystem(&configfs_fs_type); - kobject_put(config_kobj); + sysfs_remove_mount_point(kernel_kobj, "config"); kmem_cache_destroy(configfs_dir_cachep); configfs_dir_cachep = NULL; } diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c index c1e7ffb0dab6..12756040ca20 100644 --- a/fs/debugfs/inode.c +++ b/fs/debugfs/inode.c @@ -716,20 +716,17 @@ bool debugfs_initialized(void) } EXPORT_SYMBOL_GPL(debugfs_initialized); - -static struct kobject *debug_kobj; - static int __init debugfs_init(void) { int retval; - debug_kobj = kobject_create_and_add("debug", kernel_kobj); - if (!debug_kobj) - 
return -EINVAL; + retval = sysfs_create_mount_point(kernel_kobj, "debug"); + if (retval) + return retval; retval = register_filesystem(&debug_fs_type); if (retval) - kobject_put(debug_kobj); + sysfs_remove_mount_point(kernel_kobj, "debug"); else debugfs_registered = true; diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 082ac1c97f39..18dacf9ed8ff 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -1238,7 +1238,6 @@ static void fuse_fs_cleanup(void) } static struct kobject *fuse_kobj; -static struct kobject *connections_kobj; static int fuse_sysfs_init(void) { @@ -1250,11 +1249,9 @@ static int fuse_sysfs_init(void) goto out_err; } - connections_kobj = kobject_create_and_add("connections", fuse_kobj); - if (!connections_kobj) { - err = -ENOMEM; + err = sysfs_create_mount_point(fuse_kobj, "connections"); + if (err) goto out_fuse_unregister; - } return 0; @@ -1266,7 +1263,7 @@ static int fuse_sysfs_init(void) static void fuse_sysfs_cleanup(void) { - kobject_put(connections_kobj); + sysfs_remove_mount_point(fuse_kobj, "connections"); kobject_put(fuse_kobj); } diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c index dc43b5f29305..3adcc4669fac 100644 --- a/fs/pstore/inode.c +++ b/fs/pstore/inode.c @@ -461,22 +461,18 @@ static struct file_system_type pstore_fs_type = { .kill_sb = pstore_kill_sb, }; -static struct kobject *pstore_kobj; - static int __init init_pstore_fs(void) { - int err = 0; + int err; /* Create a convenient mount point for people to access pstore */ - pstore_kobj = kobject_create_and_add("pstore", fs_kobj); - if (!pstore_kobj) { - err = -ENOMEM; + err = sysfs_create_mount_point(fs_kobj, "pstore"); + if (err) goto out; - } err = register_filesystem(&pstore_fs_type); if (err < 0) - kobject_put(pstore_kobj); + sysfs_remove_mount_point(fs_kobj, "pstore"); out: return err; diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index d92bdf3b079a..a43df11a163f 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -631,14 +631,12 @@ bool 
tracefs_initialized(void) return tracefs_registered; } -static struct kobject *trace_kobj; - static int __init tracefs_init(void) { int retval; - trace_kobj = kobject_create_and_add("tracing", kernel_kobj); - if (!trace_kobj) + retval = sysfs_create_mount_point(kernel_kobj, "tracing"); + if (retval) return -EINVAL; retval = register_filesystem(&trace_fs_type); diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 469dd547770c..e8a5491be756 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1924,8 +1924,6 @@ static struct file_system_type cgroup_fs_type = { .kill_sb = cgroup_kill_sb, }; -static struct kobject *cgroup_kobj; - /** * task_cgroup_path - cgroup path of a task in the first cgroup hierarchy * @task: target task @@ -5044,13 +5042,13 @@ int __init cgroup_init(void) ss->bind(init_css_set.subsys[ssid]); } - cgroup_kobj = kobject_create_and_add("cgroup", fs_kobj); - if (!cgroup_kobj) - return -ENOMEM; + err = sysfs_create_mount_point(fs_kobj, "cgroup"); + if (err) + return err; err = register_filesystem(&cgroup_fs_type); if (err < 0) { - kobject_put(cgroup_kobj); + sysfs_remove_mount_point(fs_kobj, "cgroup"); return err; } diff --git a/security/inode.c b/security/inode.c index 91503b79c5f8..0e37e4fba8fa 100644 --- a/security/inode.c +++ b/security/inode.c @@ -215,19 +215,17 @@ void securityfs_remove(struct dentry *dentry) } EXPORT_SYMBOL_GPL(securityfs_remove); -static struct kobject *security_kobj; - static int __init securityfs_init(void) { int retval; - security_kobj = kobject_create_and_add("security", kernel_kobj); - if (!security_kobj) - return -EINVAL; + retval = sysfs_create_mount_point(kernel_kobj, "security"); + if (retval) + return retval; retval = register_filesystem(&fs_type); if (retval) - kobject_put(security_kobj); + sysfs_remove_mount_point(kernel_kobj, "security"); return retval; } diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c index d2787cca1fcb..3d2201413028 100644 --- a/security/selinux/selinuxfs.c +++ 
b/security/selinux/selinuxfs.c @@ -1853,7 +1853,6 @@ static struct file_system_type sel_fs_type = { }; struct vfsmount *selinuxfs_mount; -static struct kobject *selinuxfs_kobj; static int __init init_sel_fs(void) { @@ -1862,13 +1861,13 @@ static int __init init_sel_fs(void) if (!selinux_enabled) return 0; - selinuxfs_kobj = kobject_create_and_add("selinux", fs_kobj); - if (!selinuxfs_kobj) - return -ENOMEM; + err = sysfs_create_mount_point(fs_kobj, "selinux"); + if (err) + return err; err = register_filesystem(&sel_fs_type); if (err) { - kobject_put(selinuxfs_kobj); + sysfs_remove_mount_point(fs_kobj, "selinux"); return err; } @@ -1887,7 +1886,7 @@ __initcall(init_sel_fs); #ifdef CONFIG_SECURITY_SELINUX_DISABLE void exit_sel_fs(void) { - kobject_put(selinuxfs_kobj); + sysfs_remove_mount_point(fs_kobj, "selinux"); kern_unmount(selinuxfs_mount); unregister_filesystem(&sel_fs_type); } diff --git a/security/smack/smackfs.c b/security/smack/smackfs.c index d9682985349e..ac4cac7c661a 100644 --- a/security/smack/smackfs.c +++ b/security/smack/smackfs.c @@ -2241,16 +2241,16 @@ static const struct file_operations smk_revoke_subj_ops = { .llseek = generic_file_llseek, }; -static struct kset *smackfs_kset; /** * smk_init_sysfs - initialize /sys/fs/smackfs * */ static int smk_init_sysfs(void) { - smackfs_kset = kset_create_and_add("smackfs", NULL, fs_kobj); - if (!smackfs_kset) - return -ENOMEM; + int err; + err = sysfs_create_mount_point(fs_kobj, "smackfs"); + if (err) + return err; return 0; } -- 2.2.1 ^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_mount_point 2015-05-16 2:10 ` [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_mount_point Eric W. Biederman @ 2015-05-18 13:14 ` Greg Kroah-Hartman 0 siblings, 0 replies; 85+ messages in thread From: Greg Kroah-Hartman @ 2015-05-18 13:14 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux Containers, linux-fsdevel, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Tejun Heo On Fri, May 15, 2015 at 09:10:42PM -0500, Eric W. Biederman wrote: > > This allows for better documentation in the code and > it allows for a simpler and fully correct version of > fs_fully_visible to be written. > > The mount points converted and their filesystems are: > /sys/hypervisor/s390/ s390_hypfs > /sys/kernel/config/ configfs > /sys/kernel/debug/ debugfs > /sys/firmware/efi/efivars/ efivarfs > /sys/fs/fuse/connections/ fusectl > /sys/fs/pstore/ pstore > /sys/kernel/tracing/ tracefs > /sys/fs/cgroup/ cgroup > /sys/kernel/security/ securityfs > /sys/fs/selinux/ selinuxfs > /sys/fs/smackfs/ smackfs > > Cc: stable@vger.kernel.org > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> ^ permalink raw reply [flat|nested] 85+ messages in thread
* [CFT][PATCH 10/10] mnt: Update fs_fully_visible to test for permanently empty directories [not found] ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> ` (7 preceding siblings ...) 2015-05-16 2:10 ` [CFT][PATCH 09/10] sysfs: Create mountpoints with sysfs_create_mount_point Eric W. Biederman @ 2015-05-16 2:11 ` Eric W. Biederman 2015-05-22 17:39 ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Eric W. Biederman 9 siblings, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-16 2:11 UTC (permalink / raw) To: Linux Containers Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo fs_fully_visible attempts to make fresh mounts of proc and sysfs give the mounter no more access to proc and sysfs than they could have obtained by creating a bind mount. One aspect of proc and sysfs that makes this particularly tricky is that there are other filesystems that typically mount on top of proc and sysfs. As those filesystems are mounted on empty directories, in practice it is safe to ignore them. However, testing to ensure filesystems are mounted on empty directories has not been something the in-kernel data structures have supported, so the current test for an empty directory, which checks to see if nlink <= 2, is a bit lacking. proc and sysfs have recently been modified to use the new empty_dir infrastructure to create all of their dedicated mount points. Instead of testing for S_ISDIR(inode->i_mode) && i_nlink <= 2 to see if a directory is empty, test for is_empty_dir_inode(inode). That small change guarantees mounts found on proc and sysfs really are safe to ignore, because the directories are not only empty but nothing can ever be added to them. This guarantees there is nothing to worry about when mounting proc and sysfs. 
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/namespace.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 3ede0669b8d2..eccd925c6e82 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3220,9 +3220,8 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags) /* Only worry about locked mounts */ if (!(mnt->mnt.mnt_flags & MNT_LOCKED)) continue; - if (!S_ISDIR(inode->i_mode)) - goto next; - if (inode->i_nlink > 2) + /* Is the directory permanently empty? */ + if (!is_empty_dir_inode(inode)) goto next; } /* Preserve the locked attributes */ -- 2.2.1 ^ permalink raw reply related [flat|nested] 85+ messages in thread
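One way the nlink <= 2 heuristic can be lacking, shown as a userspace model rather than kernel code. The sketch assumes the conventional directory link count (2 plus the number of subdirectories, so regular files do not raise it); the commit message itself does not spell out the failure mode. The struct model_inode, its fields, and both helper functions are hypothetical names for illustration; permanently_empty stands in for whatever is_empty_dir_inode reports.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the parts of struct inode that matter. */
struct model_inode {
	bool is_dir;
	unsigned int nlink;     /* assumed: 2 + number of subdirectories */
	unsigned int nr_files;  /* non-directory entries; not counted in nlink */
	bool permanently_empty; /* stand-in for the new empty-dir marking */
};

/* The old heuristic: a directory with at most two links is taken as empty. */
static bool old_looks_empty(const struct model_inode *inode)
{
	return inode->is_dir && inode->nlink <= 2;
}

/* The new test: only directories explicitly marked as such qualify. */
static bool new_is_empty(const struct model_inode *inode)
{
	return inode->permanently_empty;
}

/* Ground truth for the model: does the directory hold any entry at all? */
static bool model_has_entries(const struct model_inode *inode)
{
	return inode->nlink > 2 || inode->nr_files > 0;
}
```

Under that assumption, a directory holding only files passes the old test even though a mount on top of it would hide real content, while the explicit marking is true only for directories that were created as dedicated mount points.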
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <87siaxuvik.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> ` (8 preceding siblings ...) 2015-05-16 2:11 ` [CFT][PATCH 10/10] mnt: Update fs_fully_visible to test for permanently empty directories Eric W. Biederman @ 2015-05-22 17:39 ` Eric W. Biederman [not found] ` <87wq004im1.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 9 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-05-22 17:39 UTC (permalink / raw) To: Linux Containers Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, Serge E. Hallyn, Andy Lutomirski, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo, Seth Forshee I had hoped to get some Tested-By's on that patch series. Oh well. The fundamentals seem sound, and my biggest concern, the implicit nodev, does not apply, so I will put this patchset in linux-next and aim at merging it in the next merge window. Hopefully that will leave enough time to catch problems. Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <87wq004im1.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <87wq004im1.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-05-22 18:59 ` Andy Lutomirski [not found] ` <CALCETrUhXBR5WQ6gXr9KzGc4=7tph7kzopY29Hug4g+FhOzEKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-05-28 14:08 ` Serge Hallyn 0 siblings, 2 replies; 85+ messages in thread From: Andy Lutomirski @ 2015-05-22 18:59 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux Containers, Linux FS Devel, Linux API, Serge E. Hallyn, Richard Weinberger, Kenton Varda, Michael Kerrisk-manpages, Stéphane Graber, Eric Windisch, Greg Kroah-Hartman, Tejun Heo, Seth Forshee On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: > I had hoped to get some Tested-By's on that patch series. Sorry, I've been totally swamped. I suspect that Sandstorm is okay, but I haven't had a chance to test it for real. Sandstorm makes only limited use of proc and sysfs in containers, but I'll see if I can test it for real this weekend. > > Oh well. The fundamentals seem sound, and my biggest concern the > implicit nodev does not apply so I will put this patchset in linux-next > and aim at merging it in the next merge window. Hopefully that will > leave enough time catch problems. > > Eric > -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <CALCETrUhXBR5WQ6gXr9KzGc4=7tph7kzopY29Hug4g+FhOzEKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <CALCETrUhXBR5WQ6gXr9KzGc4=7tph7kzopY29Hug4g+FhOzEKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-22 20:41 ` Eric W. Biederman 0 siblings, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-22 20:41 UTC (permalink / raw) To: Andy Lutomirski Cc: Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes: > On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >> I had hoped to get some Tested-By's on that patch series. > > Sorry, I've been totally swamped. > > I suspect that Sandstorm is okay, but I haven't had a chance to test > it for real. Sandstorm makes only limited use of proc and sysfs in > containers, but I'll see if I can test it for real this weekend. Thanks. Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) 2015-05-22 18:59 ` Andy Lutomirski [not found] ` <CALCETrUhXBR5WQ6gXr9KzGc4=7tph7kzopY29Hug4g+FhOzEKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-28 14:08 ` Serge Hallyn 2015-05-28 15:03 ` Eric W. Biederman 2015-05-28 19:36 ` Richard Weinberger 1 sibling, 2 replies; 85+ messages in thread From: Serge Hallyn @ 2015-05-28 14:08 UTC (permalink / raw) To: Andy Lutomirski Cc: Eric W. Biederman, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo Quoting Andy Lutomirski (luto@amacapital.net): > On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman > <ebiederm@xmission.com> wrote: > > I had hoped to get some Tested-By's on that patch series. > > Sorry, I've been totally swamped. > > I suspect that Sandstorm is okay, but I haven't had a chance to test > it for real. Sandstorm makes only limited use of proc and sysfs in > containers, but I'll see if I can test it for real this weekend. Testing this with unprivileged containers, I get lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted - error mounting sysfs on /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 > > Oh well. The fundamentals seem sound, and my biggest concern the > > implicit nodev does not apply so I will put this patchset in linux-next > > and aim at merging it in the next merge window. Hopefully that will > > leave enough time catch problems. > > > > Eric > > > > > > -- > Andy Lutomirski > AMA Capital Management, LLC > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) 2015-05-28 14:08 ` Serge Hallyn @ 2015-05-28 15:03 ` Eric W. Biederman 2015-05-28 17:33 ` Andy Lutomirski 2015-05-28 21:04 ` Serge E. Hallyn 2015-05-28 19:36 ` Richard Weinberger 1 sibling, 2 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-28 15:03 UTC (permalink / raw) To: Serge Hallyn Cc: Andy Lutomirski, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes: > Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org): >> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman >> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >> > I had hoped to get some Tested-By's on that patch series. >> >> Sorry, I've been totally swamped. >> >> I suspect that Sandstorm is okay, but I haven't had a chance to test >> it for real. Sandstorm makes only limited use of proc and sysfs in >> containers, but I'll see if I can test it for real this weekend. > > Testing this with unprivileged containers, I get > > lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted > - error mounting sysfs on > /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 Grr.. I was afraid this would break something. :( Looking at my system I see that sysfs is currently mounted "nosuid,nodev,noexec" Looking at the lxc-start code I don't see it as including any of those mount options. 
In practice for sysfs I think those options are meaningless (as there should be no devices and nothing executable in sysfs) but I can understand the past concerns with chmod on virtual filesystems that would incline people to use them, so I think the failure is reporting a legitimate security issue in the lxc userspace code where the unprivileged code is currently attempting to give greater access to sysfs than was given by the original mount of sysfs.

As nosuid,nodev,noexec should not impair the operation of sysfs it looks like you can always specify those options and just make this concern go away. Something like the untested patch below I expect.

diff --git a/src/lxc/conf.c b/src/lxc/conf.c
index 9870455b3cae..d9ccd03afe68 100644
--- a/src/lxc/conf.c
+++ b/src/lxc/conf.c
@@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha
 		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger", "%r/proc/sysrq-trigger", NULL, MS_BIND, NULL },
 		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL, "%r/proc/sysrq-trigger", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL },
 		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW, "proc", "%r/proc", "proc", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL },
-		{ LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", 0, NULL },
-		{ LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_RDONLY, NULL },
+		{ LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL },
+		{ LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL },
 		{ LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL },
 		{ LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "%r/sys", "%r/sys", NULL, MS_BIND, NULL },
 		{ LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, NULL, "%r/sys", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL },

Alternately you can read the flags off of the original mount of proc or sysfs.
diff --git a/src/lxc/conf.c b/src/lxc/conf.c
index 9870455b3cae..50ea49973e80 100644
--- a/src/lxc/conf.c
+++ b/src/lxc/conf.c
@@ -712,7 +712,9 @@ static unsigned long add_required_remount_flags(const char *s, const char *d,
 	struct statvfs sb;
 	unsigned long required_flags = 0;
 
-	if (!(flags & MS_REMOUNT))
+	if (!(flags & MS_REMOUNT) &&
+	    (strcmp(s, "proc") != 0) &&
+	    (strcmp(s, "sysfs") != 0))
 		return flags;
 
 	if (!s)

Eric
^ permalink raw reply related [flat|nested] 85+ messages in thread
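As a rough illustration of the second option above (reading the flags off the original mount), here is a minimal, untested sketch in C. It assumes a Linux libc where statvfs() reports the mount's restrictive options in f_flag. The helper names st_flags_to_ms_flags() and inherit_mount_flags() are hypothetical and are not part of lxc; lxc's own add_required_remount_flags() may differ in detail.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* glibc exposes ST_NODEV / ST_NOEXEC only with this */
#endif
#include <sys/mount.h>         /* MS_* flags for mount(2) */
#include <sys/statvfs.h>       /* statvfs(3) and ST_* flags */

/* Fallbacks using the Linux values, only if the libc headers hide them. */
#ifndef ST_NODEV
#define ST_NODEV 4
#endif
#ifndef ST_NOEXEC
#define ST_NOEXEC 8
#endif

/* Translate the f_flag bits reported by statvfs() into mount(2) flags. */
unsigned long st_flags_to_ms_flags(unsigned long f_flag)
{
	unsigned long ms = 0;

	if (f_flag & ST_RDONLY)
		ms |= MS_RDONLY;
	if (f_flag & ST_NOSUID)
		ms |= MS_NOSUID;
	if (f_flag & ST_NODEV)
		ms |= MS_NODEV;
	if (f_flag & ST_NOEXEC)
		ms |= MS_NOEXEC;
	return ms;
}

/*
 * Hypothetical helper: return `requested` with the restrictive flags of the
 * filesystem mounted at `existing` (for example "/sys") added to it.
 * If statvfs() fails, `requested` is returned unchanged.
 */
unsigned long inherit_mount_flags(const char *existing, unsigned long requested)
{
	struct statvfs sb;

	if (statvfs(existing, &sb) < 0)
		return requested;
	return requested | st_flags_to_ms_flags(sb.f_flag);
}
```

A caller would then pass something like inherit_mount_flags("/sys", 0) as the flags argument when mounting a fresh sysfs, so the new mount never asks for more access than the existing one grants.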
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) 2015-05-28 15:03 ` Eric W. Biederman @ 2015-05-28 17:33 ` Andy Lutomirski [not found] ` <CALCETrXXax28s9kMTQ-zDx0MttQWG4rg2y-oz3bSGiumSL=3sg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-05-28 21:04 ` Serge E. Hallyn 1 sibling, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-05-28 17:33 UTC (permalink / raw) To: Eric W. Biederman Cc: Serge Hallyn, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman <ebiederm@xmission.com> wrote: > Serge Hallyn <serge.hallyn@ubuntu.com> writes: > >> Quoting Andy Lutomirski (luto@amacapital.net): >>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman >>> <ebiederm@xmission.com> wrote: >>> > I had hoped to get some Tested-By's on that patch series. >>> >>> Sorry, I've been totally swamped. >>> >>> I suspect that Sandstorm is okay, but I haven't had a chance to test >>> it for real. Sandstorm makes only limited use of proc and sysfs in >>> containers, but I'll see if I can test it for real this weekend. >> >> Testing this with unprivileged containers, I get >> >> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted >> - error mounting sysfs on >> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 > > Grr.. I was afraid this would break something. :( > > Looking at my system I see that sysfs is currently mounted > "nosuid,nodev,noexec" > > Looking at the lxc-start code I don't see it as including any of those > mount options. 
In practice for sysfs I think those options are > meaningless (as there should be no devices and nothing executable in > sysfs) but I can understand the past concerns with chmod on virtual > filesystems that would incline people to use them, so I think the > failure is reporting a legitimate security issue in the lxc userspace > code where the the unprivileged code is currently attempting to give > greater access to sysfs than was given by the original mount of sysfs. > > As nosuid,nodev,noexec should not impair the operation of sysfs > operation it looks like you can always specify those options and just > make this concern go away. Linus is pretty strict about not breaking the ABI, and this definitely counts as breaking the ABI. There's an exception for security issues, but is there really a security issue here? That is, do we lose anything important if we just drop the offending part of the patch set? As you've said, there shouldn't be sensitive device nodes, executables, or setuid files in proc or sysfs in the first place. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <CALCETrXXax28s9kMTQ-zDx0MttQWG4rg2y-oz3bSGiumSL=3sg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <CALCETrXXax28s9kMTQ-zDx0MttQWG4rg2y-oz3bSGiumSL=3sg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-28 18:20 ` Kenton Varda [not found] ` <CAOP=4wid+N_80iyPpiVMN96_fuHZZRGtYQ6AOPn-HFBj2H6Vgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Kenton Varda @ 2015-05-28 18:20 UTC (permalink / raw) To: Andy Lutomirski Cc: Richard Weinberger, Greg Kroah-Hartman, Linux Containers, Serge Hallyn, Seth Forshee, Eric W. Biederman, Linux API, Linux FS Devel, Tejun Heo, Michael Kerrisk-manpages On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote: > On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >> Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes: >> >>> Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org): >>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman >>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >>>> > I had hoped to get some Tested-By's on that patch series. >>>> >>>> Sorry, I've been totally swamped. >>>> >>>> I suspect that Sandstorm is okay, but I haven't had a chance to test >>>> it for real. Sandstorm makes only limited use of proc and sysfs in >>>> containers, but I'll see if I can test it for real this weekend. >>> >>> Testing this with unprivileged containers, I get >>> >>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted >>> - error mounting sysfs on >>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 >> >> Grr.. I was afraid this would break something. :( >> >> Looking at my system I see that sysfs is currently mounted >> "nosuid,nodev,noexec" >> >> Looking at the lxc-start code I don't see it as including any of those >> mount options. 
In practice for sysfs I think those options are >> meaningless (as there should be no devices and nothing executable in >> sysfs) but I can understand the past concerns with chmod on virtual >> filesystems that would incline people to use them, so I think the >> failure is reporting a legitimate security issue in the lxc userspace >> code where the the unprivileged code is currently attempting to give >> greater access to sysfs than was given by the original mount of sysfs. >> >> As nosuid,nodev,noexec should not impair the operation of sysfs >> operation it looks like you can always specify those options and just >> make this concern go away. > > Linus is pretty strict about not breaking the ABI, and this definitely > counts as breaking the ABI. There's an exception for security issues, > but is there really a security issue here? That is, do we lose > anything important if we just drop the offending part of the patch > set? As you've said, there shouldn't be sensitive device nodes, > executables, or setuid files in proc or sysfs in the first place. Speaking as a user of the mount() interfaces, I really think it would be less confusing overall if mount() simply ignored the requested flags when the caller doesn't have a choice. That is, in cases where mount() currently fails with EPERM when not given, say, MS_NOSUID, it should instead just pretend the caller actually set MS_NOSUID and go ahead with a nosuid mount. Or put another way, the absence of MS_NOSUID should not be interpreted as "remove the nosuid bit" but rather "don't set the nosuid bit if not required". Consider: - This approach will actually cause lxc to have the correct behavior, without any changes to lxc. I suspect that this generalizes: In the vast majority of cases, when users have failed to set MS_NOSUID, it's not because they are explicitly requesting that the flag be turned off, but rather that they didn't know it mattered. 
- If a user actually *does* expect not passing MS_NOSUID to remove the nosuid bit, and they find instead that the nosuid bit is silently kept, I don't think they'll be confused: it's pretty obvious in context that this must be for security reasons. - On the other hand, the current behavior *is* very confusing: mount() returns EPERM because of rules the caller probably doesn't know anything about. I've spent a fair amount of time frustrated by this sort of thing. -Kenton ^ permalink raw reply [flat|nested] 85+ messages in thread
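The two behaviours contrasted above can be modelled in a few lines of userspace C. This is only a sketch of the flag arithmetic, not the kernel's implementation: the function names check_flags_strict() and apply_flags_implicit() and the RESTRICTIVE_FLAGS set are assumptions made for illustration.

```c
#include <errno.h>
#include <sys/mount.h>   /* MS_RDONLY, MS_NOSUID, MS_NODEV, MS_NOEXEC */

/* Flags that restrict a mount; an unprivileged caller must not drop them. */
#define RESTRICTIVE_FLAGS (MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC)

/*
 * Strict policy: refuse the request if it omits any restrictive flag that
 * is set on the existing mount. Returns 0 if allowed, -EPERM otherwise.
 */
int check_flags_strict(unsigned long existing, unsigned long requested)
{
	unsigned long required = existing & RESTRICTIVE_FLAGS;

	if (required & ~requested)
		return -EPERM;
	return 0;
}

/*
 * Implicit policy: never refuse; quietly add whichever restrictive flags
 * the existing mount carries to the caller's request.
 */
unsigned long apply_flags_implicit(unsigned long existing, unsigned long requested)
{
	return requested | (existing & RESTRICTIVE_FLAGS);
}
```

With an existing sysfs mounted nosuid,nodev,noexec and a request of "flags 0", as in the lxc-start error quoted earlier, the strict model returns -EPERM while the implicit model yields a nosuid,nodev,noexec mount.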
[parent not found: <CAOP=4wid+N_80iyPpiVMN96_fuHZZRGtYQ6AOPn-HFBj2H6Vgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <CAOP=4wid+N_80iyPpiVMN96_fuHZZRGtYQ6AOPn-HFBj2H6Vgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-28 19:14 ` Eric W. Biederman [not found] ` <87fv6gikfn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-05-29 0:35 ` Andy Lutomirski 0 siblings, 2 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-28 19:14 UTC (permalink / raw) To: Kenton Varda Cc: Andy Lutomirski, Serge Hallyn, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo Kenton Varda <kenton-AuYgBwuPrUQTaNkGU808tA@public.gmane.org> writes: > On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote: >> On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman >> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >>> Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes: >>> >>>> Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org): >>>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman >>>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >>>>> > I had hoped to get some Tested-By's on that patch series. >>>>> >>>>> Sorry, I've been totally swamped. >>>>> >>>>> I suspect that Sandstorm is okay, but I haven't had a chance to test >>>>> it for real. Sandstorm makes only limited use of proc and sysfs in >>>>> containers, but I'll see if I can test it for real this weekend. >>>> >>>> Testing this with unprivileged containers, I get >>>> >>>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted >>>> - error mounting sysfs on >>>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 >>> >>> Grr.. I was afraid this would break something. 
:( >>> >>> Looking at my system I see that sysfs is currently mounted >>> "nosuid,nodev,noexec" >>> >>> Looking at the lxc-start code I don't see it as including any of those >>> mount options. In practice for sysfs I think those options are >>> meaningless (as there should be no devices and nothing executable in >>> sysfs) but I can understand the past concerns with chmod on virtual >>> filesystems that would incline people to use them, so I think the >>> failure is reporting a legitimate security issue in the lxc userspace >>> code where the the unprivileged code is currently attempting to give >>> greater access to sysfs than was given by the original mount of sysfs. >>> >>> As nosuid,nodev,noexec should not impair the operation of sysfs >>> operation it looks like you can always specify those options and just >>> make this concern go away. >> >> Linus is pretty strict about not breaking the ABI, and this definitely >> counts as breaking the ABI. There's an exception for security issues, >> but is there really a security issue here? That is, do we lose >> anything important if we just drop the offending part of the patch >> set? As you've said, there shouldn't be sensitive device nodes, >> executables, or setuid files in proc or sysfs in the first place.

We do need to enforce retaining the existing mount flags one way or another. Where this really matters is with MS_RDONLY. We don't want any old user to be able to mount /proc read-write when root mounted it read-only. There is a very real attack vector there. That attack almost works in a docker container today and is avoided simply because docker mounts over a few files on proc.

Which leads to the second side of the reason for these changes. I am fixing a very small but long-standing ABI break. That is, in some small ways I broke some sandboxes, and when I realized they were broken I could not think how to fix the code until now.
It is the goal that user namespaces don't introduce anything for people to worry about security-wise more than simply the ability to execute more kernel code. So at least when the kernel implementation is correct, developers of existing applications simply do not need to care. Sadly we are not quite there yet.

> Speaking as a user of the mount() interfaces, I really think it would > be less confusing overall if mount() simply ignored the requested > flags when the caller doesn't have a choice. That is, in cases where > mount() currently fails with EPERM when not given, say, MS_NOSUID, it > should instead just pretend the caller actually set MS_NOSUID and go > ahead with a nosuid mount. Or put another way, the absence of > MS_NOSUID should not be interpreted as "remove the nosuid bit" but > rather "don't set the nosuid bit if not required".

I am conflicted. Implicits are nice but confusing. If we can do something reliable and robust and maintainable here that is truly worth the cost I am all for it.

If I mount proc read-write I likely want to be able to write to proc files, and I will be much happier if the mount fails than if a bazillion syscalls later something else fails when it tries to write to proc.

> Consider: > > - This approach will actually cause lxc to have the correct behavior, > without any changes to lxc. I suspect that this generalizes: In the > vast majority of cases, when users have failed to set MS_NOSUID, it's > not because they are explicitly requesting that the flag be turned > off, but rather that they didn't know it mattered. > > - If a user actually *does* expect not passing MS_NOSUID to remove the > nosuid bit, and they find instead that the nosuid bit is silently > kept, I don't think they'll be confused: it's pretty obvious in > context that this must be for security reasons. > > - On the other hand, the current behavior *is* very confusing: mount() > returns EPERM because of rules the caller probably doesn't know > anything about.
I've spent a fair amount of time frustrated by this > sort of thing.

My sympathies. This all started with an "oh crap, we overlooked corner case X and it actually matters", and the fixes were quite likely a little bit hasty.

The only case where this actually matters is a remount, inside of a user namespace, of filesystems that were mounted outside of the user namespace. And mounting new instances of proc and sysfs winds up being a weird instance of that nonsense.

But please someone test sandstorm with this patchset and tell me if it bites you. The impetus to find a way to avoid breaking slightly buggy userspace is higher if it is more than unprivileged lxc that is broken.

Eric
^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <87fv6gikfn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <87fv6gikfn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-05-28 20:12 ` Kenton Varda 2015-05-28 20:47 ` Richard Weinberger 2015-05-29 0:30 ` Andy Lutomirski 1 sibling, 1 reply; 85+ messages in thread From: Kenton Varda @ 2015-05-28 20:12 UTC (permalink / raw) To: Eric W. Biederman Cc: Andy Lutomirski, Serge Hallyn, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo On Thu, May 28, 2015 at 12:14 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: > But please someone test sandstorm with this patchset and tell me if it > bites you. The impetus to find a way to avoid breaking slightly buggy > userspace is higher if it is more than unprivileged lxc that is broken.

One of these days I'm going to learn how to compile and test kernels again (last time I did it was 1999). Unfortunately I don't think I have time at the moment, but hopefully Andy can do it.

I note, though, that we only have three mount() calls in the sandstorm codebase that seem like they could be affected:

run-bundle.c++:1264: KJ_SYSCALL(mount("proc", "proc", "proc", MS_NOSUID | MS_NODEV | MS_NOEXEC, ""));
minibox.c++:251: KJ_SYSCALL(mount("proc", vpath.cStr(), "proc", MS_NOSUID | MS_NODEV | MS_NOEXEC, ""),
supervisor.c++:921: KJ_SYSCALL(mount("/proc", "proc", nullptr, MS_BIND | MS_REC, nullptr));

The first two seem like they should be fine since they set all the flags (except readonly, which would be inappropriate for proc). I guess my habit of setting every security flag I see came in handy.

The third case looks like it will be broken, BUT this line is in a debug-only code path, so I don't care. Also we have the ability to push any needed update within 24 hours, so we're generally in good shape.

We never mount sysfs in Sandstorm.
> If I mount proc read-write I likely want to be able to write to proc > files, and I will be much happier if the mount fails than if a bazillion > syscalls later something else fails when it tries to write to proc.

I'm not sure that's true. Consider the broader context:

1) Your system's /proc is mounted read-only.
2) Now you're trying to mount a new proc in a new pid namespace, and you do *not* specify MS_RDONLY.

What should we expect here? Let's back off a bit and state user intent:

1) The system administrator has set a system-wide policy that /proc may only be read, not written.
2) You made a PID namespace and it needed its own proc.

It seems intuitive here that the administrator's policy should apply in the namespace. Certainly everyone using the system and/or all software on the system already needs to be aware of this policy, since it's unusual and will break things. Running software on this system outside of any container already has the problem that syscalls randomly break, so why should it be surprising when this happens inside the container as well? Why do we need to go out of our way to break at mount() time?

-Kenton
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) 2015-05-28 20:12 ` Kenton Varda @ 2015-05-28 20:47 ` Richard Weinberger 2015-05-28 21:07 ` Kenton Varda 0 siblings, 1 reply; 85+ messages in thread From: Richard Weinberger @ 2015-05-28 20:47 UTC (permalink / raw) To: Kenton Varda, Eric W. Biederman Cc: Andy Lutomirski, Serge Hallyn, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages, Linux FS Devel, Tejun Heo Am 28.05.2015 um 22:12 schrieb Kenton Varda: > We never mount sysfs in Sandstorm. sysfs is ABI and applications depend on it. Even glibc is using sysfs. Currently it has fallback paths but these may go away... Thanks, //richard ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) 2015-05-28 20:47 ` Richard Weinberger @ 2015-05-28 21:07 ` Kenton Varda [not found] ` <CAOP=4wiAA4SqvMn_rQJHOjg6M-75bi_G9Fx8ENgVnYdkT5WVQA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Kenton Varda @ 2015-05-28 21:07 UTC (permalink / raw) To: Richard Weinberger Cc: Eric W. Biederman, Andy Lutomirski, Serge Hallyn, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages, Linux FS Devel, Tejun Heo On Thu, May 28, 2015 at 1:47 PM, Richard Weinberger <richard@nod.at> wrote: > Am 28.05.2015 um 22:12 schrieb Kenton Varda: >> We never mount sysfs in Sandstorm. > > sysfs is ABI and applications depend on it. > Even glibc is using sysfs. Currently it has > fallback paths but these may go away... Off-topic, but Sandstorm isn't intended to provide a full Linux ABI. It is intended to provide a secure sandbox that can run apps that have been explicitly ported to Sandstorm. More background if you're interested: https://github.com/sandstorm-io/sandstorm/wiki/Security-Practices-Overview#server-sandboxing https://blog.sandstorm.io/news/2014-08-13-sandbox-security.html -Kenton ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <CAOP=4wiAA4SqvMn_rQJHOjg6M-75bi_G9Fx8ENgVnYdkT5WVQA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <CAOP=4wiAA4SqvMn_rQJHOjg6M-75bi_G9Fx8ENgVnYdkT5WVQA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-28 21:12 ` Richard Weinberger 0 siblings, 0 replies; 85+ messages in thread From: Richard Weinberger @ 2015-05-28 21:12 UTC (permalink / raw) To: Kenton Varda Cc: Linux API, Linux Containers, Serge Hallyn, Andy Lutomirski, Seth Forshee, Eric W. Biederman, Greg Kroah-Hartman, Linux FS Devel, Tejun Heo, Michael Kerrisk-manpages Am 28.05.2015 um 23:07 schrieb Kenton Varda: > On Thu, May 28, 2015 at 1:47 PM, Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> wrote: >> Am 28.05.2015 um 22:12 schrieb Kenton Varda: >>> We never mount sysfs in Sandstorm. >> >> sysfs is ABI and applications depend on it. >> Even glibc is using sysfs. Currently it has >> fallback paths but these may go away... > > Off-topic, but Sandstorm isn't intended to provide a full Linux ABI. > It is intended to provide a secure sandbox that can run apps that have > been explicitly ported to Sandstorm. More background if you're interested: Ahh, the application needs to be Sandstorm aware. I was missing that detail. Thanks for pointing that out! Thanks, //richard ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <87fv6gikfn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-05-28 20:12 ` Kenton Varda @ 2015-05-29 0:30 ` Andy Lutomirski 1 sibling, 0 replies; 85+ messages in thread From: Andy Lutomirski @ 2015-05-29 0:30 UTC (permalink / raw) To: Eric W. Biederman Cc: Seth Forshee, Kenton Varda, Richard Weinberger, Linux Containers, Serge Hallyn, Linux FS Devel, Michael Kerrisk-manpages, Greg Kroah-Hartman, Tejun Heo, Linux API On May 28, 2015 12:19 PM, "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: > > Kenton Varda <kenton-AuYgBwuPrUQTaNkGU808tA@public.gmane.org> writes: > > > On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote: > >> On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman > >> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: > >>> Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes: > >>> > >>>> Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org): > >>>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman > >>>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: > >>>>> > I had hoped to get some Tested-By's on that patch series. > >>>>> > >>>>> Sorry, I've been totally swamped. > >>>>> > >>>>> I suspect that Sandstorm is okay, but I haven't had a chance to test > >>>>> it for real. Sandstorm makes only limited use of proc and sysfs in > >>>>> containers, but I'll see if I can test it for real this weekend. > >>>> > >>>> Testing this with unprivileged containers, I get > >>>> > >>>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted > >>>> - error mounting sysfs on > >>>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 > >>> > >>> Grr.. I was afraid this would break something. 
:( > >>> > >>> Looking at my system I see that sysfs is currently mounted > >>> "nosuid,nodev,noexec" > >>> > >>> Looking at the lxc-start code I don't see it as including any of those > >>> mount options. In practice for sysfs I think those options are > >>> meaningless (as there should be no devices and nothing executable in > >>> sysfs) but I can understand the past concerns with chmod on virtual > >>> filesystems that would incline people to use them, so I think the > >>> failure is reporting a legitimate security issue in the lxc userspace > >>> code where the the unprivileged code is currently attempting to give > >>> greater access to sysfs than was given by the original mount of sysfs. > >>> > >>> As nosuid,nodev,noexec should not impair the operation of sysfs > >>> operation it looks like you can always specify those options and just > >>> make this concern go away. > >> > >> Linus is pretty strict about not breaking the ABI, and this definitely > >> counts as breaking the ABI. There's an exception for security issues, > >> but is there really a security issue here? That is, do we lose > >> anything important if we just drop the offending part of the patch > >> set? As you've said, there shouldn't be sensitive device nodes, > >> executables, or setuid files in proc or sysfs in the first place. > > We do need to enforce retaining the existing mount flags one way or > another. Where this really matters is with MS_RDONLY. We don't want > any old user to be able to mount /proc read-write when root mounted it > read-only. There is a very real attack vector there. That attack > almost works in docker container today and is avoided simply because > docker mounts over a few files on proc. You could drop the nosuid, noexec, and nodev changes and keep just the ro part. The ro part is probably not an ABI break in the sense of something that actually breaks real programs. > > Which leads to the second side of the reason for these changes. 
I am > fixing a very small but long standing ABI break. That is in some small > ways I broke some sandboxes and when I realized they were broken I could > not imagine think how to fix the code until now. > > It is the goal that user namespaces don't introduce anything for people > to worry about security wise more than simply the ability to execute > more kernel code. So at least when the kernel implementation is correct > developers of existing applications simply do not need care. Sadly we are > not quite there yet. > > > Speaking as a user of the mount() interfaces, I really think it would > > be less confusing overall if mount() simply ignored the requested > > flags when the caller doesn't have a choice. That is, in cases where > > mount() currently fails with EPERM when not given, say, MS_NOSUID, it > > should instead just pretend the caller actually set MS_NOSUID and go > > ahead with a nosuid mount. Or put another way, the absence of > > MS_NOSUID should not be interpreted as "remove the nosuid bit" but > > rather "don't set the nosuid bit if not required". > > I am conflicted. Implicits are nice but confusing. If we can do > something reliable and robust and maintainable here that is truly worth > the cost I am all for it. > > If I mount proc read-write I likely want to be able to write to proc > files, and I will be much happier if the mount fails than if a bazillion > syscalls later something else fails when it tries to write to proc. I agree. I don't like the implicit thing. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) 2015-05-28 19:14 ` Eric W. Biederman [not found] ` <87fv6gikfn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-05-29 0:35 ` Andy Lutomirski [not found] ` <CALCETrXO21Y7PR=pKqaqJb1YZArNyjAv7Z-J44O53FcfLM_0Tw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-05-29 0:35 UTC (permalink / raw) To: Eric W. Biederman Cc: Kenton Varda, Serge Hallyn, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo [resend due to HTML. Sorry.] On May 28, 2015 12:19 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote: > > Kenton Varda <kenton@sandstorm.io> writes: > > > On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto@amacapital.net> wrote: > >> On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman > >> <ebiederm@xmission.com> wrote: > >>> Serge Hallyn <serge.hallyn@ubuntu.com> writes: > >>> > >>>> Quoting Andy Lutomirski (luto@amacapital.net): > >>>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman > >>>>> <ebiederm@xmission.com> wrote: > >>>>> > I had hoped to get some Tested-By's on that patch series. > >>>>> > >>>>> Sorry, I've been totally swamped. > >>>>> > >>>>> I suspect that Sandstorm is okay, but I haven't had a chance to test > >>>>> it for real. Sandstorm makes only limited use of proc and sysfs in > >>>>> containers, but I'll see if I can test it for real this weekend. > >>>> > >>>> Testing this with unprivileged containers, I get > >>>> > >>>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted > >>>> - error mounting sysfs on > >>>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 > >>> > >>> Grr.. I was afraid this would break something. 
:( > >>> > >>> Looking at my system I see that sysfs is currently mounted > >>> "nosuid,nodev,noexec" > >>> > >>> Looking at the lxc-start code I don't see it as including any of those > >>> mount options. In practice for sysfs I think those options are > >>> meaningless (as there should be no devices and nothing executable in > >>> sysfs) but I can understand the past concerns with chmod on virtual > >>> filesystems that would incline people to use them, so I think the > >>> failure is reporting a legitimate security issue in the lxc userspace > >>> code where the the unprivileged code is currently attempting to give > >>> greater access to sysfs than was given by the original mount of sysfs. > >>> > >>> As nosuid,nodev,noexec should not impair the operation of sysfs > >>> operation it looks like you can always specify those options and just > >>> make this concern go away. > >> > >> Linus is pretty strict about not breaking the ABI, and this definitely > >> counts as breaking the ABI. There's an exception for security issues, > >> but is there really a security issue here? That is, do we lose > >> anything important if we just drop the offending part of the patch > >> set? As you've said, there shouldn't be sensitive device nodes, > >> executables, or setuid files in proc or sysfs in the first place. > > We do need to enforce retaining the existing mount flags one way or > another. Where this really matters is with MS_RDONLY. We don't want > any old user to be able to mount /proc read-write when root mounted it > read-only. There is a very real attack vector there. That attack > almost works in docker container today and is avoided simply because > docker mounts over a few files on proc. You could drop the nosuid, noexec, and nodev changes and keep just the ro part. The ro part is probably not an ABI break in the sense of something that actually breaks real programs. > > Which leads to the second side of the reason for these changes. 
I am > fixing a very small but long standing ABI break. That is in some small > ways I broke some sandboxes and when I realized they were broken I could > not imagine think how to fix the code until now. > > It is the goal that user namespaces don't introduce anything for people > to worry about security wise more than simply the ability to execute > more kernel code. So at least when the kernel implementation is correct > developers of existing applications simply do not need care. Sadly we are > not quite there yet. > > > Speaking as a user of the mount() interfaces, I really think it would > > be less confusing overall if mount() simply ignored the requested > > flags when the caller doesn't have a choice. That is, in cases where > > mount() currently fails with EPERM when not given, say, MS_NOSUID, it > > should instead just pretend the caller actually set MS_NOSUID and go > > ahead with a nosuid mount. Or put another way, the absence of > > MS_NOSUID should not be interpreted as "remove the nosuid bit" but > > rather "don't set the nosuid bit if not required". > > I am conflicted. Implicits are nice but confusing. If we can do > something reliable and robust and maintainable here that is truly worth > the cost I am all for it. > > If I mount proc read-write I likely want to be able to write to proc > files, and I will be much happier if the mount fails than if a bazillion > syscalls later something else fails when it tries to write to proc. I agree. I don't like the implicit thing. ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <CALCETrXO21Y7PR=pKqaqJb1YZArNyjAv7Z-J44O53FcfLM_0Tw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <CALCETrXO21Y7PR=pKqaqJb1YZArNyjAv7Z-J44O53FcfLM_0Tw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-29 4:36 ` Eric W. Biederman [not found] ` <87fv6g80g7.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-05-29 4:36 UTC (permalink / raw) To: Andy Lutomirski Cc: Kenton Varda, Serge Hallyn, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes: > On May 28, 2015 12:19 PM, "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >> Kenton Varda <kenton-AuYgBwuPrUQTaNkGU808tA@public.gmane.org> writes: >> >> We do need to enforce retaining the existing mount flags one way or >> another. Where this really matters is with MS_RDONLY. We don't want >> any old user to be able to mount /proc read-write when root mounted it >> read-only. There is a very real attack vector there. That attack >> almost works in docker container today and is avoided simply because >> docker mounts over a few files on proc. > > You could drop the nosuid, noexec, and nodev changes and keep just the > ro part. The ro part is probably not an ABI break in the sense of > something that actually breaks real programs. As a change simply removing the code from the existing patches that worries about nosuid, noexec, and the nodev flags is certainly doable. It is the best proposal I have heard so far. I remain unconvinced about ignoring those flags: - There are clearly people who think it matters (or else proc and sysfs would not have those flags specified). - There have been times when it actually has mattered. Aka when files like /proc/self/env could be chmodded and used for privilege escalation. 
- The code in lxc and libvirt-lxc so far has been clearly buggy. * lxc only has problems with sysfs (in some configurations). * libvirt-lxc only has problems on a bind mount remount of proc after remounting proc properly. So I am leaning towards enforcing all of the mount flags including nosuid, noexec, and nodev. Then when the next subtle bug in proc or sysfs with respect to chmod shows up I will be able to sleep soundly at night because the mount flags of those filesystems allow a mitigation, and I did not sabotage the mitigation. Plus contemplating code that just enforces a couple of mount flags but not all of them feels wrong. I don't think it is actually a maintainable position to just enforce a couple of those flags. If nothing else I would expect someone to look at the code and to generate a bug fix to start enforcing the rest of the flags. Or perhaps it is in a few years' time and something gets refactored and the enforcing starts happening by virtue of using a new common function that no-one realizes will be a problem. Additionally if we don't enforce nosuid, noexec, and nodev people are going to ask questions that will be hard to explain. When what is truly desirable is to say that sysfs and proc are a little odd but they don't allow anything that a bind mount won't. I can be persuaded otherwise but right now I do think the kernel code needs to enforce nosuid, noexec, and nodev as it is a security issue (if only a defence in depth one), and a maintenance issue as I do not believe in the long term it is a maintainable or an explicable position. >> > Speaking as a user of the mount() interfaces, I really think it would >> > be less confusing overall if mount() simply ignored the requested >> > flags when the caller doesn't have a choice. That is, in cases where >> > mount() currently fails with EPERM when not given, say, MS_NOSUID, it >> > should instead just pretend the caller actually set MS_NOSUID and go >> > ahead with a nosuid mount.
Or put another way, the absence of >> > MS_NOSUID should not be interpreted as "remove the nosuid bit" but >> > rather "don't set the nosuid bit if not required". >> >> I am conflicted. Implicits are nice but confusing. If we can do >> something reliable and robust and maintainable here that is truly worth >> the cost I am all for it. >> >> If I mount proc read-write I likely want to be able to write to proc >> files, and I will be much happier if the mount fails than if a bazillion >> syscalls later something else fails when it tries to write to proc. > > I agree. I don't like the implicit thing. My memory returns of our last round of looking at this and for whatever its warts the existing mount API for remounting filesystems needs to have the flags have exactly the same meaning as at mount time. There are existing userspace applications that depend on that behavior. Implicits for only the locked mount flags are a little different but still ick. Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <87fv6g80g7.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <87fv6g80g7.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-05-29 4:54 ` Kenton Varda 2015-05-29 17:49 ` Andy Lutomirski 1 sibling, 0 replies; 85+ messages in thread From: Kenton Varda @ 2015-05-29 4:54 UTC (permalink / raw) To: Eric W. Biederman Cc: Richard Weinberger, Greg Kroah-Hartman, Linux Containers, Serge Hallyn, Andy Lutomirski, Seth Forshee, Michael Kerrisk-manpages, Linux API, Linux FS Devel, Tejun Heo On Thu, May 28, 2015 at 9:36 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: > Implicits for only the locked mount flags is a little different but > still ick. FWIW, I only ever meant to advocate for this for locked flags, i.e. cases where the only other option is to throw EPERM. Clearly when the user has permission, the exact requested flags should be applied, or all kinds of things break. It seems to me that if we can fix the security issue without breaking userspace, we should. Sometimes we end up with icky APIs to avoid breaking userspace. (Though IMO implicitly preserving locked bits is not icky at all.) -Kenton ^ permalink raw reply [flat|nested] 85+ messages in thread
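The two behaviours being debated for locked mount flags (fail with EPERM, or implicitly preserve the locked bits) can be contrasted in a small userspace model. This is only a sketch: the `F_*` bits and the `check_strict` / `check_implicit` helpers are hypothetical names invented for illustration, they are not the kernel's `MNT_*` values, and nothing here is taken from fs/namespace.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative flag bits for this model only (not the kernel's MNT_* values). */
enum {
	F_RDONLY = 1 << 0,
	F_NOSUID = 1 << 1,
	F_NODEV  = 1 << 2,
	F_NOEXEC = 1 << 3,
};

/*
 * Strict policy: the request must carry every locked flag,
 * otherwise the mount is refused with -EPERM.
 */
static int check_strict(unsigned locked, unsigned requested, unsigned *result)
{
	assert(result != NULL);
	if ((locked & ~requested) != 0)
		return -EPERM;
	*result = requested;
	return 0;
}

/*
 * Implicit policy: locked flags the caller left out are silently
 * added, so the mount always succeeds but may differ from the request.
 */
static int check_implicit(unsigned locked, unsigned requested, unsigned *result)
{
	assert(result != NULL);
	*result = requested | locked;
	return 0;
}
```

With `F_RDONLY` locked and a request that omits it, `check_strict` returns `-EPERM`, while `check_implicit` succeeds and hands back a read-only mount. The second outcome is the case Eric objects to for proc: the caller asked for read-write and only finds out later, when a write fails.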
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <87fv6g80g7.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-05-29 4:54 ` Kenton Varda @ 2015-05-29 17:49 ` Andy Lutomirski 2015-06-03 21:13 ` Eric W. Biederman 1 sibling, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-05-29 17:49 UTC (permalink / raw) To: Eric W. Biederman Cc: Richard Weinberger, Seth Forshee, Greg Kroah-Hartman, Linux Containers, Serge Hallyn, Kenton Varda, Michael Kerrisk-manpages, Linux API, Linux FS Devel, Tejun Heo On Thu, May 28, 2015 at 9:36 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: > Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes: >> On May 28, 2015 12:19 PM, "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >>> Kenton Varda <kenton-AuYgBwuPrUQTaNkGU808tA@public.gmane.org> writes: >>> >>> We do need to enforce retaining the existing mount flags one way or >>> another. Where this really matters is with MS_RDONLY. We don't want >>> any old user to be able to mount /proc read-write when root mounted it >>> read-only. There is a very real attack vector there. That attack >>> almost works in docker container today and is avoided simply because >>> docker mounts over a few files on proc. >> >> You could drop the nosuid, noexec, and nodev changes and keep just the >> ro part. The ro part is probably not an ABI break in the sense of >> something that actually breaks real programs. > > As a change simply removing the code from the existing patches that > worries about nosuid, noexec, and the nodev flags is certainly doable. > It is the best proposal I have heard so far. > > I remain unconvinced about ignoring those flags: > - There are clearly people who think it matters (or else proc and sysfs > would not have those flags specified). > > - There have been times when it actually has mattered. 
> Aka when files like /proc/self/env could be chmodded and used for > privilege escalation. > > - The code in lxc and libvirt-lxc so far has been clearly buggy. > * lxc only has problems with sysfs (in some configurations). > * libvirt-lxc only has problems on a bind mount remount of > proc after remounting proc properly. > > So I am leaning towards enforcing all of the mount flags including > nosuid, noexec, and nodev. Then when the next subtle bug in proc or > sysfs with respect to chmod shows up I will be able to sleep soundly at > night because the mount flags of those filesystems allow a mitigation, > and I did not sabotage the mitigation. One option would be to break the nosuid, nodev, and noexec parts into their own patch and then avoid tagging that patch for -stable if at all possible. It would be nice to avoid another -stable ABI break if at all possible. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) 2015-05-29 17:49 ` Andy Lutomirski @ 2015-06-03 21:13 ` Eric W. Biederman [not found] ` <87k2vkebri.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-06-03 21:13 UTC (permalink / raw) To: Andy Lutomirski Cc: Kenton Varda, Serge Hallyn, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo Andy Lutomirski <luto@amacapital.net> writes: > One option would be to break the nosuid, nodev, and noexec parts into > their own patch and then avoid tagging that patch for -stable if at > all possible. It would be nice to avoid another -stable ABI break if > at all possible. So I don't think we actually have anything that could be called an ABI break in the whole mess, but it is definitely a behavioral change that is a regression for lxc and libvirt-lxc that prevents them from starting. nodev does not actually matter because of the implicit silliness that is being added right now. We do want those programs fixed and after those programs are fixed we can safely begin failing mount when those attributes are being cleared in a fresh mount. So it looks to me like the best thing to do is to print a warning whenever lxc or libvirt-lxc gets it wrong, which should ensure the authors are sufficiently pestered that in a kernel release or 3 we can begin enforcing those attributes. Especially as the discussion on the fix for those applications has already begun. And if folks would double check the patch I am going to post in a moment to ensure that lxc and libvirt-lxc continue to start I would appreciate it. Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <87k2vkebri.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible [not found] ` <87k2vkebri.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-06-03 21:15 ` Eric W. Biederman [not found] ` <87eglseboh.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-06-04 5:19 ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Greg Kroah-Hartman 1 sibling, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-06-03 21:15 UTC (permalink / raw) To: Andy Lutomirski Cc: Richard Weinberger, Seth Forshee, Greg Kroah-Hartman, Linux Containers, Serge Hallyn, Kenton Varda, Michael Kerrisk-manpages, Linux API, Linux FS Devel, Tejun Heo Not allowing programs to clear nosuid, nodev, and noexec on new mounts of sysfs or proc will cause lxc and libvirt-lxc to fail to start (a regression). There are no device nodes or executables on sysfs or proc today which means clearing these flags is harmless today. Instead of failing the fresh mounts of sysfs and proc emit a warning when these flags are improperly cleared. We only reach this point because lxc and libvirt-lxc clear mount flags they had not intended to. In a couple of kernel releases when lxc and libvirt-lxc have been fixed we can start failing fresh mounts of proc and sysfs that clear nosuid, nodev and noexec. Userspace clearly means to enforce those attributes and historically they have avoided bugs. Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: "Eric W.
Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/namespace.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/fs/namespace.c b/fs/namespace.c index eccd925c6e82..eaa49b628d28 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3198,6 +3198,7 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags) if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) && !(new_flags & MNT_READONLY)) continue; +#if 0 /* Avoid unnecessary regressions */ if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) && !(new_flags & MNT_NODEV)) continue; @@ -3207,6 +3208,7 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags) if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) && !(new_flags & MNT_NOEXEC)) continue; +#endif if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) && ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK))) continue; @@ -3226,10 +3228,35 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags) } /* Preserve the locked attributes */ *new_mnt_flags |= mnt->mnt.mnt_flags & (MNT_LOCK_READONLY | \ + /* Avoid unnecessary regressions \ MNT_LOCK_NODEV | \ MNT_LOCK_NOSUID | \ MNT_LOCK_NOEXEC | \ + */ \ MNT_LOCK_ATIME); + /* For now, warn about the "harmless" but invalid mnt flags */ + { + bool nodev = false, nosuid = false, noexec = false; + if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) && + !(new_flags & MNT_NODEV)) + nodev = true; + if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) && + !(new_flags & MNT_NOSUID)) + nosuid = true; + if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) && + !(new_flags & MNT_NOEXEC)) + noexec = true; + + if ((nodev || nosuid || noexec) && printk_ratelimit()) { + printk(KERN_INFO + "warning: process `%s' clears %s%s%sin mount of %s\n", + current->comm, + nodev ? "nodev ":"", + nosuid ? "nosuid ":"", + noexec ? "noexec ":"", + type->name); + } + } visible = true; goto found; next: ; -- 2.2.1 ^ permalink raw reply related [flat|nested] 85+ messages in thread
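The effect of the patch above on the flag checks can be modelled in userspace: a cleared read-only lock still disqualifies the mount, while cleared nodev, nosuid and noexec locks are only reported. This is a simplified sketch under stated assumptions: the `M_*` bits, `struct flag_verdict` and `check_new_mount` are hypothetical names, not kernel identifiers. It also looks at a single existing mount, whereas fs_fully_visible walks every mount in the namespace and merely skips the ones that do not qualify, and it leaves out the atime check.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits for this model only (not the kernel's MNT_* values). */
enum {
	M_RDONLY = 1 << 0,
	M_NOSUID = 1 << 1,
	M_NODEV  = 1 << 2,
	M_NOEXEC = 1 << 3,
};

struct flag_verdict {
	bool allowed;   /* false: this existing mount does not qualify */
	unsigned warn;  /* locked flags the request cleared, reported only */
};

/*
 * Model of the per-mount flag test after the patch:
 * read-only is still enforced, the other three only warn.
 */
static struct flag_verdict check_new_mount(unsigned locked, unsigned requested)
{
	const unsigned enforced  = M_RDONLY;
	const unsigned warn_only = M_NOSUID | M_NODEV | M_NOEXEC;
	unsigned cleared = locked & ~requested;
	struct flag_verdict v;

	assert((enforced & warn_only) == 0);
	v.allowed = (cleared & enforced) == 0;
	v.warn = v.allowed ? (cleared & warn_only) : 0;
	return v;
}
```

A non-zero `warn` mask corresponds to the case where the patch would print its rate-limited "clears nodev/nosuid/noexec" message instead of refusing the mount.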
[parent not found: <87eglseboh.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible (take 2) [not found] ` <87eglseboh.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-06-04 4:35 ` Eric W. Biederman [not found] ` <874mmodral.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-06-05 0:46 ` [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible Andy Lutomirski 1 sibling, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-06-04 4:35 UTC (permalink / raw) To: Andy Lutomirski Cc: Kenton Varda, Serge Hallyn, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo Not allowing programs to clear nosuid and noexec on new mounts of sysfs or proc will cause lxc and libvirt-lxc to fail to start (a regression). There are no executable files on sysfs or proc today which means clearing these flags is harmless today. Instead of failing the fresh mounts of sysfs and proc emit a warning when these flags are improperly cleared. We only reach this point because lxc and libvirt-lxc clear mount flags they had not intended to. In a couple of kernel releases when lxc and libvirt-lxc have been fixed we can start failing fresh mounts of proc and sysfs that clear nosuid and noexec. Userspace clearly means to enforce those attributes and enforcing these attributes has historically avoided bugs in the setattr implementations of proc and sysfs. Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- Now with warning on problematic remounts as well. nodev is also ignored because it is not currently problematic.
fs/namespace.c | 33 +++++++++++++++++++++++++++++++++ include/linux/mount.h | 5 +++++ 2 files changed, 38 insertions(+) diff --git a/fs/namespace.c b/fs/namespace.c index eccd925c6e82..3c3f8172c734 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2162,6 +2162,18 @@ static int do_remount(struct path *path, int flags, int mnt_flags, ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK))) { return -EPERM; } + if ((mnt->mnt.mnt_flags & MNT_WARN_NOSUID) && + !(mnt_flags & MNT_NOSUID) && printk_ratelimit()) { + printk(KERN_INFO + "warning: process `%s' clears nosuid in remount of %s\n", + current->comm, sb->s_type->name); + } + if ((mnt->mnt.mnt_flags & MNT_WARN_NOEXEC) && + !(mnt_flags & MNT_NOEXEC) && printk_ratelimit()) { + printk(KERN_INFO + "warning: process `%s' clears noexec in remount of %s\n", + current->comm, sb->s_type->name); + } err = security_sb_remount(sb, data); if (err) @@ -3201,12 +3213,14 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags) if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) && !(new_flags & MNT_NODEV)) continue; +#if 0 /* Avoid unnecessary regressions */ if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) && !(new_flags & MNT_NOSUID)) continue; if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) && !(new_flags & MNT_NOEXEC)) continue; +#endif if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) && ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK))) continue; @@ -3227,9 +3241,28 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags) /* Preserve the locked attributes */ *new_mnt_flags |= mnt->mnt.mnt_flags & (MNT_LOCK_READONLY | \ MNT_LOCK_NODEV | \ + /* Avoid unnecessary regressions \ MNT_LOCK_NOSUID | \ MNT_LOCK_NOEXEC | \ + */ \ MNT_LOCK_ATIME); + /* For now, warn about the "harmless" but invalid mnt flags */ + if (mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) { + *new_mnt_flags |= MNT_WARN_NOSUID; + if (!(new_flags & MNT_NOSUID) && printk_ratelimit()) { + printk(KERN_INFO + 
"warning: process `%s' clears nosuid in mount of %s\n", + current->comm, type->name); + } + } + if (mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) { + *new_mnt_flags |= MNT_WARN_NOEXEC; + if (!(new_flags & MNT_NOEXEC) && printk_ratelimit()) { + printk(KERN_INFO + "warning: process `%s' clears noexec in mount of %s\n", + current->comm, type->name); + } + } visible = true; goto found; next: ; diff --git a/include/linux/mount.h b/include/linux/mount.h index f822c3c11377..a9ac188413fd 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -52,6 +52,11 @@ struct mnt_namespace; #define MNT_INTERNAL 0x4000 +/* These warning options should be removed in a few kernel releases + * once userspace has been fixed. + */ +#define MNT_WARN_NOSUID 0x010000 +#define MNT_WARN_NOEXEC 0x020000 #define MNT_LOCK_ATIME 0x040000 #define MNT_LOCK_NOEXEC 0x080000 #define MNT_LOCK_NOSUID 0x100000 -- 2.2.1 ^ permalink raw reply related [flat|nested] 85+ messages in thread
[parent not found: <874mmodral.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible (take 2) [not found] ` <874mmodral.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-06-04 5:20 ` Greg Kroah-Hartman 0 siblings, 0 replies; 85+ messages in thread From: Greg Kroah-Hartman @ 2015-06-04 5:20 UTC (permalink / raw) To: Eric W. Biederman Cc: Andy Lutomirski, Kenton Varda, Serge Hallyn, Seth Forshee, Linux API, Linux Containers, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo On Wed, Jun 03, 2015 at 11:35:30PM -0500, Eric W. Biederman wrote: > > Not allowing programs to clear nosuid and noexec on new mounts of > sysfs or proc will cause lxc and libvirt-lxc to fail to start (a > regression). There are no executables files on sysfs or proc today > which means clearing these flags is harmless today. > > Instead of failing the fresh mounts of sysfs and proc emit a warning > when these flags are improprely cleared. We only reach this point > because lxc and libvirt-lxc clear flags they mount flags had not > intended to. > > In a couple of kernel releases when lxc and libvirt-lxc have been > fixed we can start failing fresh mounts proc and sysfs that clear > nosuid and noexec. Userspace clearly means to enforce those > attributes and enforcing these attributes have historically avoided > bugs in the setattr implementations of proc and sysfs. > > Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> > --- > > Now with warning on problematic remounts as well. > nodev is also ignored because it is not currently problematic. 
> > fs/namespace.c | 33 +++++++++++++++++++++++++++++++++ > include/linux/mount.h | 5 +++++ > 2 files changed, 38 insertions(+) > > diff --git a/fs/namespace.c b/fs/namespace.c > index eccd925c6e82..3c3f8172c734 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -2162,6 +2162,18 @@ static int do_remount(struct path *path, int flags, int mnt_flags, > ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK))) { > return -EPERM; > } > + if ((mnt->mnt.mnt_flags & MNT_WARN_NOSUID) && > + !(mnt_flags & MNT_NOSUID) && printk_ratelimit()) { > + printk(KERN_INFO > + "warning: process `%s' clears nosuid in remount of %s\n", > + current->comm, sb->s_type->name); > + } > + if ((mnt->mnt.mnt_flags & MNT_WARN_NOEXEC) && > + !(mnt_flags & MNT_NOEXEC) && printk_ratelimit()) { > + printk(KERN_INFO > + "warning: process `%s' clears noexec in remount of %s\n", > + current->comm, sb->s_type->name); > + } > > err = security_sb_remount(sb, data); > if (err) > @@ -3201,12 +3213,14 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags) > if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) && > !(new_flags & MNT_NODEV)) > continue; > +#if 0 /* Avoid unnecessary regressions */ > if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) && > !(new_flags & MNT_NOSUID)) > continue; > if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) && > !(new_flags & MNT_NOEXEC)) > continue; > +#endif > if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) && > ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK))) > continue; > @@ -3227,9 +3241,28 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags) > /* Preserve the locked attributes */ > *new_mnt_flags |= mnt->mnt.mnt_flags & (MNT_LOCK_READONLY | \ > MNT_LOCK_NODEV | \ > + /* Avoid unnecessary regressions \ > MNT_LOCK_NOSUID | \ > MNT_LOCK_NOEXEC | \ > + */ \ > MNT_LOCK_ATIME); > + /* For now, warn about the "harmless" but invalid mnt flags */ > + if (mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) { > + 
*new_mnt_flags |= MNT_WARN_NOSUID; > + if (!(new_flags & MNT_NOSUID) && printk_ratelimit()) { > + printk(KERN_INFO > + "warning: process `%s' clears nosuid in mount of %s\n", > + current->comm, type->name); > + } > + } > + if (mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) { > + *new_mnt_flags |= MNT_WARN_NOEXEC; > + if (!(new_flags & MNT_NOEXEC) && printk_ratelimit()) { > + printk(KERN_INFO > + "warning: process `%s' clears noexec in mount of %s\n", > + current->comm, type->name); > + } > + } Adding this to a stable kernel is not going to be ok, sorry. We can't start being noisy in system logs for things that were working just fine. greg k-h ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible [not found] ` <87eglseboh.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-06-04 4:35 ` [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible (take 2) Eric W. Biederman @ 2015-06-05 0:46 ` Andy Lutomirski [not found] ` <CALCETrWwtFaiaYGLoq4EPkrgcq9nEA2GseVfP3iBkbYZ8NfGPg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-06-05 0:46 UTC (permalink / raw) To: Eric W. Biederman Cc: Kenton Varda, Serge Hallyn, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo On Wed, Jun 3, 2015 at 2:15 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: > > Not allowing programs to clear nosuid, nodev, and noexec on new mounts > of sysfs or proc will cause lxc and libvirt-lxc to fail to start (a > regression). There are no device nodes or executables on sysfs or > proc today which means clearing these flags is harmless today. > > Instead of failing the fresh mounts of sysfs and proc emit a warning > when these flags are improperly cleared. We only reach this point > because lxc and libvirt-lxc clear mount flags they had not > intended to. > > In a couple of kernel releases when lxc and libvirt-lxc have been > fixed we can start failing fresh mounts of proc and sysfs that clear > nosuid, nodev and noexec. Userspace clearly means to enforce those > attributes and historically they have avoided bugs. At the very least, I think this should be folded in so that the ABI doesn't break in the middle of the series. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <CALCETrWwtFaiaYGLoq4EPkrgcq9nEA2GseVfP3iBkbYZ8NfGPg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible [not found] ` <CALCETrWwtFaiaYGLoq4EPkrgcq9nEA2GseVfP3iBkbYZ8NfGPg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-06-06 19:14 ` Eric W. Biederman 0 siblings, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-06-06 19:14 UTC (permalink / raw) To: Andy Lutomirski Cc: Kenton Varda, Serge Hallyn, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes: > On Wed, Jun 3, 2015 at 2:15 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >> >> Not allowing programs to clear nosuid, nodev, and noexec on new mounts >> of sysfs or proc will cause lxc and libvirt-lxc to fail to start (a >> regression). There are no device nodes or executables on sysfs or >> proc today which means clearing these flags is harmless today. >> >> Instead of failing the fresh mounts of sysfs and proc emit a warning >> when these flags are improperly cleared. We only reach this point >> because lxc and libvirt-lxc clear mount flags they had not >> intended to. >> >> In a couple of kernel releases when lxc and libvirt-lxc have been >> fixed we can start failing fresh mounts of proc and sysfs that clear >> nosuid, nodev and noexec. Userspace clearly means to enforce those >> attributes and historically they have avoided bugs. > > At the very least, I think this should be folded in so that the ABI > doesn't break in the middle of the series. Nothing in any of these patches has ever broken the ABI. The bits have always been interpreted with the same meaning. I have been going back and forth on exactly the best way to handle this because I don't like breaking working executables even for valid reasons. I think I have finally reached my personal peace on this issue.
Not requiring the presence of nosuid and noexec on a fresh mount of proc and sysfs if the original mount has nosuid or noexec is a security issue as what proc and sysfs implement in the future cannot be known. The one possible way to remedy this is to implicitly add nosuid and noexec as appropriate; unfortunately that would break the ABI as it changes the interpretation of the bits in the userspace interface, and the day proc or sysfs changes and we honestly and truly want to enable suid executables on proc or sysfs we would not be able to. :( So implicitly adding attributes is out. As the current implementations of proc and sysfs are known I agree it does not make sense to backport the enforcement of nosuid and noexec. So I have split the patch. See my for-testing branch and shortly my for-next branch. It only takes two or three line patches in the affected userspace executables, and a 5 minute test. So a warning printk does not actually make sense. If the authors of lxc and libvirt-lxc have not taken the time to fix their code by the time this code lands in a stable release (in 2 months or so) no amount of other warnings are going to be enough. Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
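The "two or three line" userspace fix mentioned above might look roughly like the following: read the flags of the already-mounted instance with statvfs(3) and pass the matching MS_* bits to mount(2) for the fresh instance. This is a minimal sketch, not the actual lxc or libvirt-lxc change. `ms_flags_from_statvfs` and `mount_preserving_flags` are hypothetical helper names, the code assumes glibc on Linux (where `ST_NODEV` and `ST_NOEXEC` are GNU extensions), and it ignores the atime flags that the series also locks.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc: exposes ST_NODEV and ST_NOEXEC */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mount.h>
#include <sys/statvfs.h>

/* Translate the statvfs f_flag bits of an existing mount into MS_* flags. */
static unsigned long ms_flags_from_statvfs(unsigned long f_flag)
{
	unsigned long ms = 0;

	if (f_flag & ST_RDONLY)
		ms |= MS_RDONLY;
	if (f_flag & ST_NOSUID)
		ms |= MS_NOSUID;
	if (f_flag & ST_NODEV)
		ms |= MS_NODEV;
	if (f_flag & ST_NOEXEC)
		ms |= MS_NOEXEC;
	return ms;
}

/*
 * Hypothetical helper: mount a fresh instance of `fstype` on `target`,
 * carrying over the flags of the instance already mounted at `existing`.
 * Returns -1 (with errno set) if either statvfs() or mount() fails.
 */
static int mount_preserving_flags(const char *existing, const char *target,
				  const char *fstype, unsigned long extra)
{
	struct statvfs sb;

	assert(existing != NULL && target != NULL && fstype != NULL);
	if (statvfs(existing, &sb) < 0)
		return -1;
	return mount(fstype, target, fstype,
		     extra | ms_flags_from_statvfs(sb.f_flag), NULL);
}
```

A container runtime could, for example, call `mount_preserving_flags("/sys", new_sys_path, "sysfs", 0)` inside its new namespaces, where `new_sys_path` stands for whatever target directory it already uses. Because the locked flags are always included in the request, the call should behave the same whether the kernel enforces them or not.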
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <87k2vkebri.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-06-03 21:15 ` [CFT][PATCH 11/10] mnt: Avoid unnecessary regressions in fs_fully_visible Eric W. Biederman @ 2015-06-04 5:19 ` Greg Kroah-Hartman 2015-06-04 6:27 ` Eric W. Biederman 1 sibling, 1 reply; 85+ messages in thread From: Greg Kroah-Hartman @ 2015-06-04 5:19 UTC (permalink / raw) To: Eric W. Biederman Cc: Seth Forshee, Linux API, Linux Containers, Serge Hallyn, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote: > Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes: > > > One option would be to break the nosuid, nodev, and noexec parts into > > their own patch and then avoid tagging that patch for -stable if at > > all possible. It would be nice to avoid another -stable ABI break if > > at all possible. > > So I don't think we actually have anything that could be called an ABI > break in the whole mess, but it is definitely a behavioral change that > is a regression for lxc and libvirt-lxc that prevents them from starting. > > nodev does not actually matter because of the implicit silliness that > is being added right now. > > We do want those programs fixed and after those programs are fixed we > can safely begin failing mount when those attributes are being cleared > in a fresh mount. > > So it looks to me like the best thing to do is to print a warning > whenever lxc or libvirt-lxc gets it wrong, which should ensure the > authors are sufficiently pestered that in a kernel release or 3 we can > begin enforcing those attributes. Especially as the discussion on the > fix for those applications has already begun. 
"pestering" never works, look at some of the SCSI drivers for examples of how a distro will just patch out the "warning this driver is using an old api and needs to be fixed" messages. You can't break stuff like this, people will get upset :( greg k-h ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) 2015-06-04 5:19 ` [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) Greg Kroah-Hartman @ 2015-06-04 6:27 ` Eric W. Biederman [not found] ` <87h9qo6la9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-06-04 6:27 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Andy Lutomirski, Kenton Varda, Serge Hallyn, Seth Forshee, Linux API, Linux Containers, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes: > On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote: >> Andy Lutomirski <luto@amacapital.net> writes: >> >> > One option would be to break the nosuid, nodev, and noexec parts into >> > their own patch and then avoid tagging that patch for -stable if at >> > all possible. It would be nice to avoid another -stable ABI break if >> > at all possible. >> >> So I don't think we actually have anything that could be called an ABI >> break in the whole mess, but it is definitely a behavioral change that >> is a regression for lxc and libvirt-lxc that prevents them from starting. >> >> nodev does not actually matter because of the implicit silliness that >> is being added right now. >> >> We do want those programs fixed and after those programs are fixed we >> can safely begin failing mount when those attributes are being cleared >> in a fresh mount. >> >> So it looks to me like the best thing to do is to print a warning >> whenever lxc or libvirt-lxc gets it wrong, which should ensure the >> authors are sufficiently pestered that in a kernel release or 3 we can >> begin enforcing those attributes. Especially as the discussion on the >> fix for those applications has already begun. 
> > "pestering" never works, look at some of the SCSI drivers for examples > of how a distro will just patch out the "warning this driver is using an > old api and needs to be fixed" messages. > You can't break stuff like this, people will get upset :( A) To the best of my knowledge there are two programs on the face of the planet where this matters (lxc and libvirt-lxc). B) The code in those two programs is buggy. That is, the code in those two programs does not do what the authors intended, so fixing those programs is something that should be done regardless of what I do in the kernel. I have already reached out to the developers of those programs. The pestering in the kernel is a form of reminder, not the primary source of communication. C) These bugs really are security holes. Currently they do not appear exploitable (thank goodness) but they are security holes. Since they are not currently exploitable it does make sense to give people a little time to get their act together. The bugs are larger than the case that is being hit here; this is just where they are noticed. D) Letting people know that there is a problem as part of a larger effort has actually worked for me. Distros have stopped enabling the sysctl system call. E) Given that I have not audited sysfs and proc closely in recent years I may actually be wrong. Those bugs may actually be exploitable. All it takes is chmod to be supported on one file that can be made executable. That bug has existed in the past and I don't doubt someone will overlook something and we will see the bug again in the future. So it is my best judgment that I disable the code that stops containers from starting and just make it a warning (for now). Then in a release or so I start failing these operations instead of warning. This is the most fair and reasonable I can see to be. The only other choice I can see is to say: I don't care, it is a security issue, I am breaking your sloppy insecure code.
Am I being too nice with these security bugs? Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <87h9qo6la9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <87h9qo6la9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-06-04 7:34 ` Eric W. Biederman 2015-06-16 12:23 ` Daniel P. Berrange 1 sibling, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-06-04 7:34 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Seth Forshee, Linux API, Linux Containers, Serge Hallyn, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: > Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> writes: > >> On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote: >>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes: >>> >>> > One option would be to break the nosuid, nodev, and noexec parts into >>> > their own patch and then avoid tagging that patch for -stable if at >>> > all possible. It would be nice to avoid another -stable ABI break if >>> > at all possible. >>> >>> So I don't think we actually have anything that could be called an ABI >>> break in the whole mess, but it is definitely a behavioral change that >>> is a regression for lxc and libvirt-lxc that prevents them from starting. >>> >>> nodev does not actually matter because of the implicit silliness that >>> is being added right now. >>> >>> We do want those programs fixed and after those programs are fixed we >>> can safely begin failing mount when those attributes are being cleared >>> in a fresh mount. >>> >>> So it looks to me like the best thing to do is to print a warning >>> whenever lxc or libvirt-lxc gets it wrong, which should ensure the >>> authors are sufficiently pestered that in a kernel release or 3 we can >>> begin enforcing those attributes. Especially as the discussion on the >>> fix for those applications has already begun. 
>> >> "pestering" never works, look at some of the SCSI drivers for examples >> of how a distro will just patch out the "warning this driver is using an >> old api and needs to be fixed" messages. > >> You can't break stuff like this, people will get upset :( > > A) To the best of my knowledge there are two programs on the face of the > planet where this matters. (lxc and libvirt-lxc) > > B) The code in those two programs is buggy. That is the code in those > two programs does not do what the authors intended. That is fixing > those programs is something that should be done regardless of what > I do in the kernel. I have already reached out to the developers of > those programs. The pestering in the kernel is a form of reminder, > not the primary source of communication. > > C) These bugs really are security holes. Currently they do not appear > exploitable (thank goodness) but they are security holes. > > Since they are not currently exploitable it does make sense > to give people a little time to get their act together. > > The bugs are larger then the case that is being hit here, > this is just where they are noticed. > > D) Letting people know that there is a problem as part of a larger > effort has actually worked for me. Distro's have stopped enabling > the sysctl system call. > > E) Given that I have not audited sysfs and proc closely in recent years > I may actually be wrong. Those bugs may actually be exploitable. > All it takes is chmod to be supported on one file that can be made > executable. That bug has existed in the past and I don't doubt > someone will overlook something and we will see the bug again in the > future. > > So it is my best judgment that I disable the code that stops > containers from starting and just making it a warning (for now). > Then in a release or so I start failing these operations instead of > warning. > > This is the most fair and reasonable I can see to be. 
> > The only other choice I can see is to say I don't care it is a security > issue I am breaking your sloopy insecure code. > > Am I being too nice with these security bugs? Thinking about it a little more. There is a possibility that sometime in the future someone will deliberately add a suid executable as a file in proc or sysfs and have a good reason for doing so. Some sysadmin or sandbox builder with special requirements may then disable suid and exec on proc because in their sandbox (not linux in general) having access to that executable is a bad thing. At which point we have an exploitable security issue if nosuid and noexec are not enforced. Or in other words I am not smarter than the bad guys. This is a security issue. I can not ignore nosuid and noexec indefinitely. I have to make those cases fail at some point. At that point current unfixed versions of lxc and libvirt-lxc will break. A warning is the nicest I can imagine being. Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <87h9qo6la9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-06-04 7:34 ` Eric W. Biederman @ 2015-06-16 12:23 ` Daniel P. Berrange 1 sibling, 0 replies; 85+ messages in thread From: Daniel P. Berrange @ 2015-06-16 12:23 UTC (permalink / raw) To: Eric W. Biederman Cc: Greg Kroah-Hartman, Seth Forshee, Linux API, Linux Containers, Serge Hallyn, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, Linux FS Devel, Tejun Heo On Thu, Jun 04, 2015 at 01:27:10AM -0500, Eric W. Biederman wrote: > Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> writes: > > > On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote: > >> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes: > >> > >> > One option would be to break the nosuid, nodev, and noexec parts into > >> > their own patch and then avoid tagging that patch for -stable if at > >> > all possible. It would be nice to avoid another -stable ABI break if > >> > at all possible. > >> > >> So I don't think we actually have anything that could be called an ABI > >> break in the whole mess, but it is definitely a behavioral change that > >> is a regression for lxc and libvirt-lxc that prevents them from starting. > >> > >> nodev does not actually matter because of the implicit silliness that > >> is being added right now. > >> > >> We do want those programs fixed and after those programs are fixed we > >> can safely begin failing mount when those attributes are being cleared > >> in a fresh mount. > >> > >> So it looks to me like the best thing to do is to print a warning > >> whenever lxc or libvirt-lxc gets it wrong, which should ensure the > >> authors are sufficiently pestered that in a kernel release or 3 we can > >> begin enforcing those attributes. 
Especially as the discussion on the > >> fix for those applications has already begun. > > > > "pestering" never works, look at some of the SCSI drivers for examples > > of how a distro will just patch out the "warning this driver is using an > > old api and needs to be fixed" messages. > > > You can't break stuff like this, people will get upset :( > > A) To the best of my knowledge there are two programs on the face of the > planet where this matters. (lxc and libvirt-lxc) > > B) The code in those two programs is buggy. That is the code in those > two programs does not do what the authors intended. That is fixing > those programs is something that should be done regardless of what > I do in the kernel. I have already reached out to the developers of > those programs. The pestering in the kernel is a form of reminder, > not the primary source of communication. > > C) These bugs really are security holes. Currently they do not appear > exploitable (thank goodness) but they are security holes. > > Since they are not currently exploitable it does make sense > to give people a little time to get their act together. > > The bugs are larger then the case that is being hit here, > this is just where they are noticed. > > D) Letting people know that there is a problem as part of a larger > effort has actually worked for me. Distro's have stopped enabling > the sysctl system call. > > E) Given that I have not audited sysfs and proc closely in recent years > I may actually be wrong. Those bugs may actually be exploitable. > All it takes is chmod to be supported on one file that can be made > executable. That bug has existed in the past and I don't doubt > someone will overlook something and we will see the bug again in the > future. > > So it is my best judgment that I disable the code that stops > containers from starting and just making it a warning (for now). > Then in a release or so I start failing these operations instead of > warning. 
> > This is the most fair and reasonable I can see to be. While I generally like & support the kernel standard that userspace must never be broken, as libvirt LXC maintainer I think what Eric proposes is acceptable from the libvirt POV. We'll get the fix into libvirt LXC in this month's release and backport it to our stable branches. So as long as there are a few months/releases grace period between this being a kernel warning and it turning into a hard error, libvirt users will have the fix already, or at least have it easily available to them. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) 2015-05-28 15:03 ` Eric W. Biederman 2015-05-28 17:33 ` Andy Lutomirski @ 2015-05-28 21:04 ` Serge E. Hallyn [not found] ` <20150528210438.GA14849-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> 1 sibling, 1 reply; 85+ messages in thread From: Serge E. Hallyn @ 2015-05-28 21:04 UTC (permalink / raw) To: Eric W. Biederman Cc: Serge Hallyn, Richard Weinberger, Kenton Varda, Linux API, Linux Containers, Andy Lutomirski, Seth Forshee, Michael Kerrisk-manpages, Greg Kroah-Hartman, Linux FS Devel, Tejun Heo On Thu, May 28, 2015 at 10:03:28AM -0500, Eric W. Biederman wrote: > Serge Hallyn <serge.hallyn@ubuntu.com> writes: > > > Quoting Andy Lutomirski (luto@amacapital.net): > >> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman > >> <ebiederm@xmission.com> wrote: > >> > I had hoped to get some Tested-By's on that patch series. > >> > >> Sorry, I've been totally swamped. > >> > >> I suspect that Sandstorm is okay, but I haven't had a chance to test > >> it for real. Sandstorm makes only limited use of proc and sysfs in > >> containers, but I'll see if I can test it for real this weekend. > > > > Testing this with unprivileged containers, I get > > > > lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted > > - error mounting sysfs on > > /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 > > Grr.. I was afraid this would break something. :( > > Looking at my system I see that sysfs is currently mounted > "nosuid,nodev,noexec" > > Looking at the lxc-start code I don't see it as including any of those > mount options. 
In practice for sysfs I think those options are > meaningless (as there should be no devices and nothing executable in > sysfs) but I can understand the past concerns with chmod on virtual > filesystems that would incline people to use them, so I think the > failure is reporting a legitimate security issue in the lxc userspace > code where the the unprivileged code is currently attempting to give > greater access to sysfs than was given by the original mount of sysfs. > > As nosuid,nodev,noexec should not impair the operation of sysfs > operation it looks like you can always specify those options and just > make this concern go away. > > Something like the untested patch below I expect. > > diff --git a/src/lxc/conf.c b/src/lxc/conf.c > index 9870455b3cae..d9ccd03afe68 100644 > --- a/src/lxc/conf.c > +++ b/src/lxc/conf.c > @@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha > { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger", "%r/proc/sysrq-trigger", NULL, MS_BIND, NULL }, > { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL, "%r/proc/sysrq-trigger", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL }, > { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW, "proc", "%r/proc", "proc", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, > - { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", 0, NULL }, > - { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_RDONLY, NULL }, > + { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, > + { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL }, > { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, > { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "%r/sys", "%r/sys", NULL, MS_BIND, NULL }, > { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, NULL, "%r/sys", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL }, fwiw - the first one 
works, the second one does not due to an apparent inability to statvfs the origin. > Alternately you can read the flags off of the original mount of proc or sysfs. > > diff --git a/src/lxc/conf.c b/src/lxc/conf.c > index 9870455b3cae..50ea49973e80 100644 > --- a/src/lxc/conf.c > +++ b/src/lxc/conf.c > @@ -712,7 +712,9 @@ static unsigned long add_required_remount_flags(const char *s, const char *d, > struct statvfs sb; > unsigned long required_flags = 0; > > - if (!(flags & MS_REMOUNT)) > + if (!(flags & MS_REMOUNT) && > + (strcmp(s, "proc") != 0) && > + (strcmp(s, "sysfs") != 0)) > return flags; > > if (!s) > > Eric > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <20150528210438.GA14849-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>]
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <20150528210438.GA14849-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> @ 2015-05-28 21:42 ` Eric W. Biederman 2015-05-28 21:52 ` Serge E. Hallyn 0 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-05-28 21:42 UTC (permalink / raw) To: Serge E. Hallyn Cc: Serge Hallyn, Richard Weinberger, Kenton Varda, Linux API, Linux Containers, Andy Lutomirski, Seth Forshee, Michael Kerrisk-manpages, Greg Kroah-Hartman, Linux FS Devel, Tejun Heo "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > On Thu, May 28, 2015 at 10:03:28AM -0500, Eric W. Biederman wrote: >> Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes: >> >> > Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org): >> >> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman >> >> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >> >> > I had hoped to get some Tested-By's on that patch series. >> >> >> >> Sorry, I've been totally swamped. >> >> >> >> I suspect that Sandstorm is okay, but I haven't had a chance to test >> >> it for real. Sandstorm makes only limited use of proc and sysfs in >> >> containers, but I'll see if I can test it for real this weekend. >> > >> > Testing this with unprivileged containers, I get >> > >> > lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted >> > - error mounting sysfs on >> > /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 >> >> Grr.. I was afraid this would break something. :( >> >> Looking at my system I see that sysfs is currently mounted >> "nosuid,nodev,noexec" >> >> Looking at the lxc-start code I don't see it as including any of those >> mount options. 
In practice for sysfs I think those options are >> meaningless (as there should be no devices and nothing executable in >> sysfs) but I can understand the past concerns with chmod on virtual >> filesystems that would incline people to use them, so I think the >> failure is reporting a legitimate security issue in the lxc userspace >> code where the the unprivileged code is currently attempting to give >> greater access to sysfs than was given by the original mount of sysfs. >> >> As nosuid,nodev,noexec should not impair the operation of sysfs >> operation it looks like you can always specify those options and just >> make this concern go away. >> >> Something like the untested patch below I expect. >> >> diff --git a/src/lxc/conf.c b/src/lxc/conf.c >> index 9870455b3cae..d9ccd03afe68 100644 >> --- a/src/lxc/conf.c >> +++ b/src/lxc/conf.c >> @@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha >> { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger", "%r/proc/sysrq-trigger", NULL, MS_BIND, NULL }, >> { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL, "%r/proc/sysrq-trigger", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL }, >> { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW, "proc", "%r/proc", "proc", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, >> - { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", 0, NULL }, >> - { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_RDONLY, NULL }, >> + { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, >> + { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL }, >> { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, >> { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "%r/sys", "%r/sys", NULL, MS_BIND, NULL }, >> { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, NULL, "%r/sys", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL 
}, > > fwiw - the first one works, the second one does not due to an apparent > inability to statvfs the origin. Good to hear. That confirms there are no other gotchas waiting in the wings. Apparently my second suggested patch is buggy due to an invalid source string. The source would need to be %r/proc or %r/sysfs to use statvfs productively. >> Alternately you can read the flags off of the original mount of proc or sysfs. >> >> diff --git a/src/lxc/conf.c b/src/lxc/conf.c >> index 9870455b3cae..50ea49973e80 100644 >> --- a/src/lxc/conf.c >> +++ b/src/lxc/conf.c >> @@ -712,7 +712,9 @@ static unsigned long add_required_remount_flags(const char *s, const char *d, >> struct statvfs sb; >> unsigned long required_flags = 0; >> >> - if (!(flags & MS_REMOUNT)) >> + if (!(flags & MS_REMOUNT) && >> + (strcmp(s, "proc") != 0) && >> + (strcmp(s, "sysfs") != 0)) >> return flags; >> >> if (!s) >> >> Eric >> _______________________________________________ >> Containers mailing list >> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org >> https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) 2015-05-28 21:42 ` Eric W. Biederman @ 2015-05-28 21:52 ` Serge E. Hallyn 0 siblings, 0 replies; 85+ messages in thread From: Serge E. Hallyn @ 2015-05-28 21:52 UTC (permalink / raw) To: Eric W. Biederman Cc: Serge E. Hallyn, Serge Hallyn, Richard Weinberger, Kenton Varda, Linux API, Linux Containers, Andy Lutomirski, Seth Forshee, Michael Kerrisk-manpages, Greg Kroah-Hartman, Linux FS Devel, Tejun Heo On Thu, May 28, 2015 at 04:42:34PM -0500, Eric W. Biederman wrote: > "Serge E. Hallyn" <serge@hallyn.com> writes: > > > On Thu, May 28, 2015 at 10:03:28AM -0500, Eric W. Biederman wrote: > >> Serge Hallyn <serge.hallyn@ubuntu.com> writes: > >> > >> > Quoting Andy Lutomirski (luto@amacapital.net): > >> >> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman > >> >> <ebiederm@xmission.com> wrote: > >> >> > I had hoped to get some Tested-By's on that patch series. > >> >> > >> >> Sorry, I've been totally swamped. > >> >> > >> >> I suspect that Sandstorm is okay, but I haven't had a chance to test > >> >> it for real. Sandstorm makes only limited use of proc and sysfs in > >> >> containers, but I'll see if I can test it for real this weekend. > >> > > >> > Testing this with unprivileged containers, I get > >> > > >> > lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted > >> > - error mounting sysfs on > >> > /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 > >> > >> Grr.. I was afraid this would break something. :( > >> > >> Looking at my system I see that sysfs is currently mounted > >> "nosuid,nodev,noexec" > >> > >> Looking at the lxc-start code I don't see it as including any of those > >> mount options. 
In practice for sysfs I think those options are > >> meaningless (as there should be no devices and nothing executable in > >> sysfs) but I can understand the past concerns with chmod on virtual > >> filesystems that would incline people to use them, so I think the > >> failure is reporting a legitimate security issue in the lxc userspace > >> code where the the unprivileged code is currently attempting to give > >> greater access to sysfs than was given by the original mount of sysfs. > >> > >> As nosuid,nodev,noexec should not impair the operation of sysfs > >> operation it looks like you can always specify those options and just > >> make this concern go away. > >> > >> Something like the untested patch below I expect. > >> > >> diff --git a/src/lxc/conf.c b/src/lxc/conf.c > >> index 9870455b3cae..d9ccd03afe68 100644 > >> --- a/src/lxc/conf.c > >> +++ b/src/lxc/conf.c > >> @@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha > >> { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger", "%r/proc/sysrq-trigger", NULL, MS_BIND, NULL }, > >> { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL, "%r/proc/sysrq-trigger", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL }, > >> { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW, "proc", "%r/proc", "proc", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, > >> - { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", 0, NULL }, > >> - { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_RDONLY, NULL }, > >> + { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, > >> + { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL }, > >> { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, > >> { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "%r/sys", "%r/sys", NULL, MS_BIND, NULL }, > >> { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, 
NULL, "%r/sys", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL }, > > > > fwiw - the first one works, the second one does not due to an apparent > > inability to statvfs the origin. > > Good to hear. That confirms there are no other gotchas waiting in the > wings. > > Apparently my second suggested patch is buggy due to an invalid source > string. The source would need to be %r/proc or %r/sysfs to use statvfs > productively. Right, in these cases they're only passing in "sysfs". The first way is more explicit anyway (though may not help some people who have a "lxc.mount.entry = sysfs sys sysfs ro 0 0" line in their configuration instead, so maybe we'll have to go with the second after all, d'oh. I'll have to look into it next week) ^ permalink raw reply [flat|nested] 85+ messages in thread
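The second approach discussed here (reading the flags off an existing mount) might be sketched as follows. This is a minimal sketch and not the lxc code: st_to_ms_flags() and inherit_mount_flags() are hypothetical helpers, and the sketch assumes the caller passes a path where the filesystem is already mounted, not a filesystem type name such as "sysfs", since statvfs() needs a real path. ST_NODEV and ST_NOEXEC are glibc extensions, so fallback definitions with the Linux values are supplied in case the headers do not expose them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* asks glibc to expose ST_NODEV and ST_NOEXEC */
#endif
#include <sys/mount.h>         /* MS_* flags for mount(2) */
#include <sys/statvfs.h>       /* statvfs(), ST_* flags */

/* Fallbacks with the Linux values, used only if glibc did not expose them. */
#ifndef ST_NODEV
#define ST_NODEV 4
#endif
#ifndef ST_NOEXEC
#define ST_NOEXEC 8
#endif

/* Translate statvfs f_flag bits into the matching mount(2) flag bits. */
static unsigned long st_to_ms_flags(unsigned long f_flag)
{
    unsigned long ms = 0;

    if (f_flag & ST_RDONLY)
        ms |= MS_RDONLY;
    if (f_flag & ST_NOSUID)
        ms |= MS_NOSUID;
    if (f_flag & ST_NODEV)
        ms |= MS_NODEV;
    if (f_flag & ST_NOEXEC)
        ms |= MS_NOEXEC;
    return ms;
}

/*
 * Hypothetical helper: OR the restrictive flags found on an existing
 * mount into the flags requested for a new mount. existing_path must
 * be a path on the existing mount. If statvfs() fails, the requested
 * flags are returned unchanged.
 */
static unsigned long inherit_mount_flags(const char *existing_path,
                                         unsigned long requested)
{
    struct statvfs sb;

    if (statvfs(existing_path, &sb) < 0)
        return requested;
    return requested | st_to_ms_flags(sb.f_flag);
}
```

A caller would pass something like the host's /proc or /sys as existing_path and hand the result to mount(2). Which path is the right one to inspect inside a container setup is a design decision this sketch leaves open.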
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) 2015-05-28 14:08 ` Serge Hallyn 2015-05-28 15:03 ` Eric W. Biederman @ 2015-05-28 19:36 ` Richard Weinberger [not found] ` <55676E32.3050006-/L3Ra7n9ekc@public.gmane.org> 1 sibling, 1 reply; 85+ messages in thread From: Richard Weinberger @ 2015-05-28 19:36 UTC (permalink / raw) To: Serge Hallyn, Andy Lutomirski Cc: Eric W. Biederman, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Kenton Varda, Michael Kerrisk-manpages, Linux FS Devel, Tejun Heo On 28.05.2015 at 16:08, Serge Hallyn wrote: > Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org): >> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman >> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >>> I had hoped to get some Tested-By's on that patch series. >> >> Sorry, I've been totally swamped. >> >> I suspect that Sandstorm is okay, but I haven't had a chance to test >> it for real. Sandstorm makes only limited use of proc and sysfs in >> containers, but I'll see if I can test it for real this weekend. > > Testing this with unprivileged containers, I get > > lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted - error mounting sysfs on /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 > FWIW, it also breaks libvirt-lxc: Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted Thanks, //richard ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <55676E32.3050006-/L3Ra7n9ekc@public.gmane.org>]
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <55676E32.3050006-/L3Ra7n9ekc@public.gmane.org> @ 2015-05-28 19:57 ` Eric W. Biederman 2015-05-28 20:30 ` Richard Weinberger 0 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-05-28 19:57 UTC (permalink / raw) To: Richard Weinberger Cc: Kenton Varda, Greg Kroah-Hartman, Linux Containers, Serge Hallyn, Andy Lutomirski, Seth Forshee, Michael Kerrisk-manpages, Linux API, Linux FS Devel, Tejun Heo Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes: > On 28.05.2015 at 16:08, Serge Hallyn wrote: >> Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org): >>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman >>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >>>> I had hoped to get some Tested-By's on that patch series. >>> >>> Sorry, I've been totally swamped. >>> >>> I suspect that Sandstorm is okay, but I haven't had a chance to test >>> it for real. Sandstorm makes only limited use of proc and sysfs in >>> containers, but I'll see if I can test it for real this weekend. >> >> Testing this with unprivileged containers, I get >> >> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted - error mounting sysfs on /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 >> > > FWIW, it breaks also libvirt-lxc: > Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted Interesting. I had not anticipated a failure there. And it is failing in remount? Oh that is interesting. That implies that there is some flag of the original mount of /proc that the remount of /proc/sys is clearing. The flags specified are currently rdonly,remount,bind so I expect there are some other flags on proc that libvirt-lxc is clearing by accident and we did not fail before because the kernel was not enforcing things.
What are the mount flags in a working libvirt-lxc? Eric ^ permalink raw reply [flat|nested] 85+ messages in thread
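The flags=1021 value in the error message is hexadecimal, and it decomposes into exactly the three flags named above (MS_RDONLY, MS_REMOUNT, MS_BIND). The following is a minimal sketch of that decoding; describe_mount_flags is a hypothetical helper written for this illustration, not part of libvirt or the kernel, and it only knows the handful of flags discussed in this thread.

```c
#include <stddef.h>
#include <string.h>
#include <sys/mount.h>

/* Hypothetical helper: write the names of the mount(2) flag bits set in
 * `flags` into `buf` as a comma-separated list. Only the flags that come
 * up in this thread are decoded; any other bit is silently ignored. */
static void describe_mount_flags(unsigned long flags, char *buf, size_t len)
{
    static const struct {
        unsigned long bit;
        const char *name;
    } names[] = {
        { MS_RDONLY,  "rdonly"  },
        { MS_NOSUID,  "nosuid"  },
        { MS_NODEV,   "nodev"   },
        { MS_NOEXEC,  "noexec"  },
        { MS_REMOUNT, "remount" },
        { MS_BIND,    "bind"    },
    };

    if (len == 0)
        return;
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        if (!(flags & names[i].bit))
            continue;
        /* Separate entries with a comma, never overrunning the buffer. */
        if (buf[0] != '\0')
            strncat(buf, ",", len - strlen(buf) - 1);
        strncat(buf, names[i].name, len - strlen(buf) - 1);
    }
}
```

Calling describe_mount_flags(0x1021, buf, sizeof(buf)) fills buf with the three names listed in the message above. None of nosuid, nodev or noexec is among them.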
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) 2015-05-28 19:57 ` Eric W. Biederman @ 2015-05-28 20:30 ` Richard Weinberger [not found] ` <55677AEF.1090809-/L3Ra7n9ekc@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Richard Weinberger @ 2015-05-28 20:30 UTC (permalink / raw) To: Eric W. Biederman Cc: Serge Hallyn, Andy Lutomirski, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Kenton Varda, Michael Kerrisk-manpages, Linux FS Devel, Tejun Heo Am 28.05.2015 um 21:57 schrieb Eric W. Biederman: >> FWIW, it breaks also libvirt-lxc: >> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted > > Interesting. I had not anticipated a failure there? And it is failing > in remount? Oh that is interesting. > > That implies that there is some flag of the original mount of /proc that > the remount of /proc/sys is clearing, and that previously > > The flags specified are current rdonly,remount,bind so I expect there > are some other flags on proc that libvirt-lxc is clearing by accident > and we did not fail before because the kernel was not enforcing things. Please see: http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933 lxcContainerMountBasicFS() and: http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850 lxcBasicMounts > What are the mount flags in a working libvirt-lxc? 
See:
test1:~ # cat /proc/self/mountinfo
147 100 0:30 /srv/container/test1/rootfs / rw,relatime - btrfs /dev/sda2 rw,space_cache
149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw
151 150 0:3 /sys/net/ipv4 /proc/sys/net/ipv4 rw,nosuid,nodev,noexec,relatime - proc proc rw
152 150 0:3 /sys/net/ipv6 /proc/sys/net/ipv6 rw,nosuid,nodev,noexec,relatime - proc proc rw
153 147 0:57 / /sys ro,nodev,relatime - sysfs sysfs rw
154 149 0:53 /meminfo /proc/meminfo rw,nosuid,nodev,relatime - fuse libvirt rw,user_id=0,group_id=0,allow_other
155 153 0:58 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,size=64k,mode=755,uid=10000,gid=10000
156 155 0:22 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu,cpuacct
157 155 0:21 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset
158 155 0:23 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
159 155 0:24 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
160 155 0:25 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
161 155 0:27 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
162 155 0:26 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls,net_prio
163 155 0:28 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event
164 155 0:19 /machine.slice/machine-lxc\134x2dtest1.scope /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
165 147 0:52 / /dev rw,nosuid,relatime - tmpfs devfs rw,size=64k,mode=755
166 165 0:51 / /dev/pts rw,nosuid,relatime - devpts devpts rw,gid=10005,mode=620,ptmxmode=666
167 165 0:51 /ptmx /dev/ptmx rw,nosuid,relatime - devpts devpts rw,gid=10005,mode=620,ptmxmode=666
101 165 0:55 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw,uid=10000,gid=10000
102 147 0:59 / /run rw,nosuid,nodev - tmpfs tmpfs rw,mode=755,uid=10000,gid=10000
103 165 0:54 / /dev/mqueue rw,nodev,relatime - mqueue mqueue rw
104 147 0:59 / /var/run rw,nosuid,nodev - tmpfs tmpfs rw,mode=755,uid=10000,gid=10000
105 147 0:59 /lock /var/lock rw,nosuid,nodev - tmpfs tmpfs rw,mode=755,uid=10000,gid=10000

If you need more info, please let me know. :-)

Thanks,
//richard
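To compare the per-mount options of /proc and /proc/sys programmatically rather than by eye, the leading fields of each mountinfo line can be parsed. This is a minimal sketch, assuming the field order visible in the listing above (mount ID, parent ID, major:minor, root, mount point, mount options); parse_mountinfo_line and has_mount_option are hypothetical helper names chosen for this illustration.

```c
#include <stdio.h>
#include <string.h>

/* The subset of a /proc/self/mountinfo line that matters here. */
struct mountinfo_entry {
    int id;
    int parent_id;
    char mount_point[256];
    char options[256];   /* per-mount options, e.g. "ro,nodev,relatime" */
};

/* Parse the leading fields of one mountinfo line.
 * Returns 0 on success, -1 if the line does not have the expected shape. */
static int parse_mountinfo_line(const char *line, struct mountinfo_entry *e)
{
    /* Skip major:minor and root; keep mount point and mount options. */
    int n = sscanf(line, "%d %d %*s %*s %255s %255s",
                   &e->id, &e->parent_id, e->mount_point, e->options);
    return n == 4 ? 0 : -1;
}

/* Return 1 if the comma-separated list `options` contains `opt` as a
 * whole token, 0 otherwise. */
static int has_mount_option(const char *options, const char *opt)
{
    size_t optlen = strlen(opt);
    const char *p = options;

    while (*p) {
        const char *end = strchr(p, ',');
        size_t toklen = end ? (size_t)(end - p) : strlen(p);

        if (toklen == optlen && strncmp(p, opt, optlen) == 0)
            return 1;
        if (!end)
            break;
        p = end + 1;
    }
    return 0;
}
```

Applied to entries 149 and 150 of the listing, this reports nosuid and noexec on /proc but not on /proc/sys.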
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <55677AEF.1090809-/L3Ra7n9ekc@public.gmane.org> @ 2015-05-28 21:32 ` Eric W. Biederman [not found] ` <87iobcfkwx.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-05-28 21:32 UTC (permalink / raw) To: Richard Weinberger Cc: Kenton Varda, Greg Kroah-Hartman, Linux Containers, Serge Hallyn, Andy Lutomirski, Seth Forshee, Michael Kerrisk-manpages, Linux API, Linux FS Devel, Tejun Heo Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes: > Am 28.05.2015 um 21:57 schrieb Eric W. Biederman: >>> FWIW, it breaks also libvirt-lxc: >>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted >> >> Interesting. I had not anticipated a failure there? And it is failing >> in remount? Oh that is interesting. >> >> That implies that there is some flag of the original mount of /proc that >> the remount of /proc/sys is clearing, and that previously >> >> The flags specified are current rdonly,remount,bind so I expect there >> are some other flags on proc that libvirt-lxc is clearing by accident >> and we did not fail before because the kernel was not enforcing things. > > Please see: > http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933 > lxcContainerMountBasicFS() > > and: > http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850 > lxcBasicMounts > >> What are the mount flags in a working libvirt-lxc? > > See: > test1:~ # cat /proc/self/mountinfo > 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw > 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw > If you need more info, please let me know. 
:-)

Oh interesting, I had not realized libvirt-lxc had grown an unprivileged
mode using user namespaces.

This does appear to be a classic remount bug, where you are not
preserving the permissions. It appears the fact that the code
failed to enforce locked permissions on the fresh mount of proc
was hiding this bug until now.

I expect what you actually want is the code below:

diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
index 9a9ae5c2aaf0..f008a7484bfe 100644
--- a/src/lxc/lxc_container.c
+++ b/src/lxc/lxc_container.c
@@ -850,7 +850,7 @@ typedef struct {
 
 static const virLXCBasicMountInfo lxcBasicMounts[] = {
     { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
-    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
+    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
     { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
     { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
     { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },

Or possibly just:

diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
index 9a9ae5c2aaf0..a60ccbd12bfc 100644
--- a/src/lxc/lxc_container.c
+++ b/src/lxc/lxc_container.c
@@ -850,7 +850,7 @@ typedef struct {
 
 static const virLXCBasicMountInfo lxcBasicMounts[] = {
     { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
-    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
+    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
     { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
     { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
     { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },

There is little point in making /proc/sys read-only in a user namespace,
as the permission checks are uid based and no one should have the global
uid 0 in your container. That makes mounting /proc/sys read-only rather
pointless.

Eric
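The failure mode described in this message can be modelled in a few lines. This is a simplified illustration of the rule as it is described in this thread (a remount must not drop a flag that is locked on the existing mount), not the kernel's actual implementation. The function names and the idea of expressing the locked set as MS_* bits are assumptions made for the example.

```c
#include <errno.h>
#include <sys/mount.h>

/* Hypothetical model: return the locked flags that a remount request
 * would clear. `locked` and `requested` are both masks of MS_* bits. */
static unsigned long dropped_locked_flags(unsigned long locked,
                                          unsigned long requested)
{
    return locked & ~requested;
}

/* Hypothetical model of the enforcement: 0 if every locked flag is kept,
 * -EPERM ("Operation not permitted") if any would be cleared. */
static int check_remount_flags(unsigned long locked, unsigned long requested)
{
    return dropped_locked_flags(locked, requested) ? -EPERM : 0;
}
```

With nosuid, nodev and noexec treated as locked, a request of MS_BIND|MS_REMOUNT|MS_RDONLY is rejected in this model, while the same request with MS_NOSUID|MS_NODEV|MS_NOEXEC added passes. That mirrors the intent of the first diff above.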
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <87iobcfkwx.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-05-28 21:46 ` Richard Weinberger [not found] ` <55678CCA.80807-/L3Ra7n9ekc@public.gmane.org> 2015-05-29 9:30 ` Richard Weinberger 1 sibling, 1 reply; 85+ messages in thread From: Richard Weinberger @ 2015-05-28 21:46 UTC (permalink / raw) To: Eric W. Biederman Cc: Kenton Varda, Greg Kroah-Hartman, Linux Containers, Serge Hallyn, Andy Lutomirski, Seth Forshee, Michael Kerrisk-manpages, Linux API, Linux FS Devel, Tejun Heo Am 28.05.2015 um 23:32 schrieb Eric W. Biederman: > Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes: > >> Am 28.05.2015 um 21:57 schrieb Eric W. Biederman: >>>> FWIW, it breaks also libvirt-lxc: >>>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted >>> >>> Interesting. I had not anticipated a failure there? And it is failing >>> in remount? Oh that is interesting. >>> >>> That implies that there is some flag of the original mount of /proc that >>> the remount of /proc/sys is clearing, and that previously >>> >>> The flags specified are current rdonly,remount,bind so I expect there >>> are some other flags on proc that libvirt-lxc is clearing by accident >>> and we did not fail before because the kernel was not enforcing things. >> >> Please see: >> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933 >> lxcContainerMountBasicFS() >> >> and: >> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850 >> lxcBasicMounts >> >>> What are the mount flags in a working libvirt-lxc? 
>> >> See: >> test1:~ # cat /proc/self/mountinfo >> 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw >> 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw > >> If you need more info, please let me know. :-) > > Oh interesting I had not realized libvirt-lxc had grown an unprivileged > mode using user namespaces. Yep. It works quite well. I've migrated all my containers from lxc to libvirt-lxc because libvirt-lxc had a working user-namespace implementation before lxc. > This does appear to be a classic remount bug, where you are not > preserving the permissions. It appears the fact that the code > failed to enforce locked permissions on the fresh mount of proc > was hiding this bug until now. > > I expect what you actually want is the code below: > > diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c > index 9a9ae5c2aaf0..f008a7484bfe 100644 > --- a/src/lxc/lxc_container.c > +++ b/src/lxc/lxc_container.c > @@ -850,7 +850,7 @@ typedef struct { > > static const virLXCBasicMountInfo lxcBasicMounts[] = { > { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false }, > - { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false }, > + { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false }, > { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true }, > { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true }, > { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false }, > > Or possibly just: > > diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c > index 9a9ae5c2aaf0..a60ccbd12bfc 100644 > --- a/src/lxc/lxc_container.c > +++ b/src/lxc/lxc_container.c > @@ -850,7 +850,7 @@ typedef struct { > > static const virLXCBasicMountInfo lxcBasicMounts[] = { > { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false }, > - { 
"/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false }, > + { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false }, > { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true }, > { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true }, > { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false }, I'll test your diff tomorrow with a fresh brain. I sent a similar patch to libvirt folks some time ago, looks like it got lost. ;-\ > As the there is little point in making /proc/sys read-only in a > user-namespace, as the permission checks are uid based and no-one should > have the global uid 0 in your container. Making mounting /proc/sys > read-only rather pointless. Yeah, I've been ranting about that for ages... libvirt-lxc contains a lot of cruft to make privileged container kind of secure. Some users still fear using the user-namespace. Thanks, //richard ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
From: Daniel P. Berrange @ 2015-06-16 12:30 UTC
To: Richard Weinberger
Cc: Eric W. Biederman, Kenton Varda, Greg Kroah-Hartman, Linux Containers, Serge Hallyn, Andy Lutomirski, Seth Forshee, Michael Kerrisk-manpages, Linux API, Linux FS Devel, Tejun Heo

On Thu, May 28, 2015 at 11:46:50PM +0200, Richard Weinberger wrote:
> On 28.05.2015 at 23:32, Eric W. Biederman wrote:
> > Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:
> >
> >> On 28.05.2015 at 21:57, Eric W. Biederman wrote:
> >>>> FWIW, it breaks also libvirt-lxc:
> >>>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted
> >>>
> >>> Interesting. I had not anticipated a failure there? And it is failing
> >>> in remount? Oh that is interesting.
> >>>
> >>> That implies that there is some flag of the original mount of /proc that
> >>> the remount of /proc/sys is clearing, and that previously
> >>>
> >>> The flags specified are current rdonly,remount,bind so I expect there
> >>> are some other flags on proc that libvirt-lxc is clearing by accident
> >>> and we did not fail before because the kernel was not enforcing things.
> >>
> >> Please see:
> >> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933
> >> lxcContainerMountBasicFS()
> >>
> >> and:
> >> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850
> >> lxcBasicMounts
> >>
> >>> What are the mount flags in a working libvirt-lxc?
> >>
> >> See:
> >> test1:~ # cat /proc/self/mountinfo
> >> 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> >> 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw
> >
> >> If you need more info, please let me know. :-)
> >
> > Oh interesting I had not realized libvirt-lxc had grown an unprivileged
> > mode using user namespaces.
>
> Yep. It works quite well. I've migrated all my containers from lxc
> to libvirt-lxc because libvirt-lxc had a working user-namespace
> implementation before lxc.
>
> > This does appear to be a classic remount bug, where you are not
> > preserving the permissions. It appears the fact that the code
> > failed to enforce locked permissions on the fresh mount of proc
> > was hiding this bug until now.
> >
> > I expect what you actually want is the code below:
> >
> > diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> > index 9a9ae5c2aaf0..f008a7484bfe 100644
> > --- a/src/lxc/lxc_container.c
> > +++ b/src/lxc/lxc_container.c
> > @@ -850,7 +850,7 @@ typedef struct {
> >
> >  static const virLXCBasicMountInfo lxcBasicMounts[] = {
> >      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> > -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> > +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> >      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
> >      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
> >      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> >
> > Or possibly just:
> >
> > diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> > index 9a9ae5c2aaf0..a60ccbd12bfc 100644
> > --- a/src/lxc/lxc_container.c
> > +++ b/src/lxc/lxc_container.c
> > @@ -850,7 +850,7 @@ typedef struct {
> >
> >  static const virLXCBasicMountInfo lxcBasicMounts[] = {
> >      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> > -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> > +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
> >      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
> >      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
> >      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>
> I'll test your diff tomorrow with a fresh brain.
> I sent a similar patch to libvirt folks some time ago, looks like it got lost. ;-\
>
> > As the there is little point in making /proc/sys read-only in a
> > user-namespace, as the permission checks are uid based and no-one should
> > have the global uid 0 in your container. Making mounting /proc/sys
> > read-only rather pointless.
>
> Yeah, I've been ranting about that for ages...
> libvirt-lxc contains a lot of cruft to make privileged container
> kind of secure. Some users still fear using the user-namespace.

Yes, we've discussed this before and I'd like to simplify this. The thing
that has been stopping me tackling it has been figuring out a way to ensure
we don't change semantics for existing deployed users, i.e. when RHEL-7
rebases to a newer libvirt, I don't want existing containers to suddenly
change their setup, because although the existing setup is sub-optimal,
some apps / users might be relying on its behaviour in ways I can't
predict.

I do believe I have figured out a way to allow backwards compatibility
now though, so we should be able to have another stab at simplifying
and removing this cruft for newly deployed containers.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
From: Richard Weinberger @ 2015-05-29 9:30 UTC
To: Eric W. Biederman
Cc: Kenton Varda, libvir-list-H+wXaHxf7aLQT0dZR+AlfA, Greg Kroah-Hartman, Linux Containers, Serge Hallyn, Andy Lutomirski, Seth Forshee, Michael Kerrisk-manpages, Linux API, Linux FS Devel, Tejun Heo, Cedric Bosdonnat

[CC'ing libvirt-lxc folks]

On 28.05.2015 at 23:32, Eric W. Biederman wrote:
> Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:
>
>> On 28.05.2015 at 21:57, Eric W. Biederman wrote:
>>>> FWIW, it breaks also libvirt-lxc:
>>>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted
>>>
>>> Interesting. I had not anticipated a failure there? And it is failing
>>> in remount? Oh that is interesting.
>>>
>>> That implies that there is some flag of the original mount of /proc that
>>> the remount of /proc/sys is clearing, and that previously
>>>
>>> The flags specified are current rdonly,remount,bind so I expect there
>>> are some other flags on proc that libvirt-lxc is clearing by accident
>>> and we did not fail before because the kernel was not enforcing things.
>>
>> Please see:
>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933
>> lxcContainerMountBasicFS()
>>
>> and:
>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850
>> lxcBasicMounts
>>
>>> What are the mount flags in a working libvirt-lxc?
>>
>> See:
>> test1:~ # cat /proc/self/mountinfo
>> 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
>> 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw
>
>> If you need more info, please let me know. :-)
>
> Oh interesting I had not realized libvirt-lxc had grown an unprivileged
> mode using user namespaces.
>
> This does appear to be a classic remount bug, where you are not
> preserving the permissions. It appears the fact that the code
> failed to enforce locked permissions on the fresh mount of proc
> was hiding this bug until now.
>
> I expect what you actually want is the code below:
>
> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> index 9a9ae5c2aaf0..f008a7484bfe 100644
> --- a/src/lxc/lxc_container.c
> +++ b/src/lxc/lxc_container.c
> @@ -850,7 +850,7 @@ typedef struct {
>
>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>
> Or possibly just:
>
> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> index 9a9ae5c2aaf0..a60ccbd12bfc 100644
> --- a/src/lxc/lxc_container.c
> +++ b/src/lxc/lxc_container.c
> @@ -850,7 +850,7 @@ typedef struct {
>
>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>
> As the there is little point in making /proc/sys read-only in a
> user-namespace, as the permission checks are uid based and no-one should
> have the global uid 0 in your container. Making mounting /proc/sys
> read-only rather pointless.

Eric, using the patch below I was able to spawn a user-namespace enabled container
using libvirt-lxc. :-)

I had to:
1. Disable the read-only mount of /proc/sys, which is useless anyway in the user-namespace case.
2. Disable the /proc/sys/net/ipv{4,6} bind mounts; this ugly hack is only needed for the non user-namespace case.
3. Remove MS_RDONLY from the sysfs mount (for the non user-namespace case we'd have to keep this, though).

Daniel, I'd take this as a chance to disable all the MS_RDONLY games if user namespaces are configured.
With Eric's fixes they hurt us. And as I wrote many times before, if root within the user-namespace
is able to do nasty things in /sys and /proc, that's a plain kernel bug which needs fixing. There is no
point in mounting these read-only, except for the case when no user-namespace is used.

diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
index 9a9ae5c..497e05f 100644
--- a/src/lxc/lxc_container.c
+++ b/src/lxc/lxc_container.c
@@ -850,10 +850,10 @@ typedef struct {
 
 static const virLXCBasicMountInfo lxcBasicMounts[] = {
     { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
-    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
-    { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
-    { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
-    { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
+    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
+    { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, true, false, true },
+    { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, true, false, true },
+    { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
     { "securityfs", "/sys/kernel/security", "securityfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, true, true, false },
 #if WITH_SELINUX
     { SELINUX_MOUNT, SELINUX_MOUNT, "selinuxfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, true, true, false },

Thanks,
//richard
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <556831CF.9040600-/L3Ra7n9ekc@public.gmane.org> @ 2015-05-29 17:41 ` Eric W. Biederman 0 siblings, 0 replies; 85+ messages in thread From: Eric W. Biederman @ 2015-05-29 17:41 UTC (permalink / raw) To: Richard Weinberger Cc: Serge Hallyn, Andy Lutomirski, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Kenton Varda, Michael Kerrisk-manpages, Linux FS Devel, Tejun Heo, libvir-list@redhat.com, Daniel P. Berrange, Cedric Bosdonnat Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes: > [CC'ing libvirt-lxc folks] > > Am 28.05.2015 um 23:32 schrieb Eric W. Biederman: >> Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes: >> >>> Am 28.05.2015 um 21:57 schrieb Eric W. Biederman: >>>>> FWIW, it breaks also libvirt-lxc: >>>>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted >>>> >>>> Interesting. I had not anticipated a failure there? And it is failing >>>> in remount? Oh that is interesting. >>>> >>>> That implies that there is some flag of the original mount of /proc that >>>> the remount of /proc/sys is clearing, and that previously >>>> >>>> The flags specified are current rdonly,remount,bind so I expect there >>>> are some other flags on proc that libvirt-lxc is clearing by accident >>>> and we did not fail before because the kernel was not enforcing things. >>> >>> Please see: >>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933 >>> lxcContainerMountBasicFS() >>> >>> and: >>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850 >>> lxcBasicMounts >>> >>>> What are the mount flags in a working libvirt-lxc? 
>>> >>> See: >>> test1:~ # cat /proc/self/mountinfo >>> 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw >>> 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw >> >>> If you need more info, please let me know. :-) >> >> Oh interesting I had not realized libvirt-lxc had grown an unprivileged >> mode using user namespaces. >> >> This does appear to be a classic remount bug, where you are not >> preserving the permissions. It appears the fact that the code >> failed to enforce locked permissions on the fresh mount of proc >> was hiding this bug until now. >> >> I expect what you actually want is the code below: >> >> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c >> index 9a9ae5c2aaf0..f008a7484bfe 100644 >> --- a/src/lxc/lxc_container.c >> +++ b/src/lxc/lxc_container.c >> @@ -850,7 +850,7 @@ typedef struct { >> >> static const virLXCBasicMountInfo lxcBasicMounts[] = { >> { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false }, >> - { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false }, >> + { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false }, >> { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true }, >> { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true }, >> { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false }, >> >> Or possibly just: >> >> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c >> index 9a9ae5c2aaf0..a60ccbd12bfc 100644 >> --- a/src/lxc/lxc_container.c >> +++ b/src/lxc/lxc_container.c >> @@ -850,7 +850,7 @@ typedef struct { >> >> static const virLXCBasicMountInfo lxcBasicMounts[] = { >> { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false }, >> - { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false }, >> + { "/proc/sys", "/proc/sys", NULL, 
MS_BIND|MS_RDONLY, true, false, false }, >> { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true }, >> { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true }, >> { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false }, >> >> As the there is little point in making /proc/sys read-only in a >> user-namespace, as the permission checks are uid based and no-one should >> have the global uid 0 in your container. Making mounting /proc/sys >> read-only rather pointless. > > Eric, using the patch below I was able to spawn a user-namespace enabled container > using libvirt-lxc. :-) I am glad. I am trying to figure out which set of changes were necessary vs just nice to have, to inform that part of the conversation that is asking is there a way we can avoid breaking userspace for this security fix. > I had to: > 1. Disable the read-only mount of /proc/sys which is anyway useless in > the user-namespace case. It is likely worth addressing the libvirt-lxc MS_REMOUNT code as it does not preserve any mount flags, or even have the capability to try. if (bindOverReadonly && mount(mnt_src, mnt->dst, NULL, MS_BIND|MS_REMOUNT|MS_RDONLY, NULL) < 0) { virReportSystemError(errno, _("Failed to re-mount %s on %s flags=%x"), mnt_src, mnt->dst, MS_BIND|MS_REMOUNT|MS_RDONLY); goto cleanup; } Aka the flags during remount are hard coded (which is buggy). So I believe even without the use of user-namespaces this code does the wrong thing. Likely statvfs needs to be called to get the existing mount flags and those should be applied during remount or possibly just the mount flags from the virLXCBasicMountInfo entry should be added. > 2. Disable the /proc/sys/net/ipv{4,6} bind mounts, this ugly hack is only needed for the non user-namespace case. *Scratches my head* Why was this necessary? 
Those are just plain bind mounts which do not need any remount magic, so
they should have just worked and preserved the existing mount flags.

I agree they are unnecessary in this context, but I would not expect
them to have caused problems or to have been "wrong".

> 3. Remove MS_RDONLY from the sysfs mount (For the non user-namespace case we'd have to keep this, though).

Ok. I can see this being necessary as well; it was missed in the first
pass because the code did not get this far.

The code flow for sysfs appears to trigger the bindOverReadonly code
because MS_RDONLY is set. The remount then clears the other mount flags
on sysfs. Previously we would not have enforced this, because sysfs with
a network namespace is a fresh mount (and that is the bug my patchset
fixes).

This very much looks like a bug in libvirt-lxc: it is clearing flags it
did not intend to clear.

> Daniel, I'd take this as a chance to disable all the MS_RDONLY games if user-namespace are configured.
> With Eric's fixes they hurt us. And as I wrote many times before if root within the user-namespace
> is able to do nasty things in /sys and /proc that's a plain kernel bug which needs fixing. There is no
> point in mounting these read-only. Except for the case then no user-namespace is used.
>
> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> index 9a9ae5c..497e05f 100644
> --- a/src/lxc/lxc_container.c
> +++ b/src/lxc/lxc_container.c
> @@ -850,10 +850,10 @@ typedef struct {
>
>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> -    { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
> -    { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
> -    { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
> +    { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, true, false, true },
> +    { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, true, false, true },
> +    { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
>      { "securityfs", "/sys/kernel/security", "securityfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, true, true, false },
>  #if WITH_SELINUX
>      { SELINUX_MOUNT, SELINUX_MOUNT, "selinuxfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, true, true, false },
>
> Thanks,
> //richard
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) 2015-05-29 9:30 ` Richard Weinberger [not found] ` <556831CF.9040600-/L3Ra7n9ekc@public.gmane.org> @ 2015-06-06 18:56 ` Eric W. Biederman [not found] ` <87mw0c1x8p.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 1 sibling, 1 reply; 85+ messages in thread From: Eric W. Biederman @ 2015-06-06 18:56 UTC (permalink / raw) To: Richard Weinberger Cc: Serge Hallyn, Andy Lutomirski, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Kenton Varda, Michael Kerrisk-manpages, Linux FS Devel, Tejun Heo, libvir-list, Daniel P. Berrange, Cedric Bosdonnat Richard Weinberger <richard@nod.at> writes: > [CC'ing libvirt-lxc folks] > > Am 28.05.2015 um 23:32 schrieb Eric W. Biederman: >> Richard Weinberger <richard@nod.at> writes: >> >>> Am 28.05.2015 um 21:57 schrieb Eric W. Biederman: >>>>> FWIW, it breaks also libvirt-lxc: >>>>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted >>>> >>>> Interesting. I had not anticipated a failure there? And it is failing >>>> in remount? Oh that is interesting. >>>> >>>> That implies that there is some flag of the original mount of /proc that >>>> the remount of /proc/sys is clearing, and that previously >>>> >>>> The flags specified are current rdonly,remount,bind so I expect there >>>> are some other flags on proc that libvirt-lxc is clearing by accident >>>> and we did not fail before because the kernel was not enforcing things. >>> >>> Please see: >>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933 >>> lxcContainerMountBasicFS() >>> >>> and: >>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850 >>> lxcBasicMounts >>> >>>> What are the mount flags in a working libvirt-lxc? 
>>>
>>> See:
>>> test1:~ # cat /proc/self/mountinfo
>>> 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
>>> 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw
>>
>>> If you need more info, please let me know. :-)
>>
>> Oh interesting I had not realized libvirt-lxc had grown an unprivileged
>> mode using user namespaces.
>>
>> This does appear to be a classic remount bug, where you are not
>> preserving the permissions. It appears the fact that the code
>> failed to enforce locked permissions on the fresh mount of proc
>> was hiding this bug until now.
>>
>> I expect what you actually want is the code below:
>>
>> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
>> index 9a9ae5c2aaf0..f008a7484bfe 100644
>> --- a/src/lxc/lxc_container.c
>> +++ b/src/lxc/lxc_container.c
>> @@ -850,7 +850,7 @@ typedef struct {
>>
>>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
>> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
>> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>>
>> Or possibly just:
>>
>> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
>> index 9a9ae5c2aaf0..a60ccbd12bfc 100644
>> --- a/src/lxc/lxc_container.c
>> +++ b/src/lxc/lxc_container.c
>> @@ -850,7 +850,7 @@ typedef struct {
>>
>>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
>> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
>> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
>>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>>
>> As the there is little point in making /proc/sys read-only in a
>> user-namespace, as the permission checks are uid based and no-one should
>> have the global uid 0 in your container. Making mounting /proc/sys
>> read-only rather pointless.
>
> Eric, using the patch below I was able to spawn a user-namespace enabled container
> using libvirt-lxc. :-)
>
> I had to:
> 1. Disable the read-only mount of /proc/sys which is anyway useless in the user-namespace case.
> 2. Disable the /proc/sys/net/ipv{4,6} bind mounts, this ugly hack is only needed for the non user-namespace case.
> 3. Remove MS_RDONLY from the sysfs mount (For the non user-namespace case we'd have to keep this, though).
>
> Daniel, I'd take this as a chance to disable all the MS_RDONLY games if user-namespace are configured.
> With Eric's fixes they hurt us. And as I wrote many times before if root within the user-namespace
> is able to do nasty things in /sys and /proc that's a plain kernel bug which needs fixing. There is no
> point in mounting these read-only. Except for the case then no user-namespace is used.
>

For clarity, the patch below appears to be the minimal change needed to
fix this security issue.

That is, add mnt_mflags back in when remounting something read-only.

The /proc/sys entry also needed to be updated so that it carries the
proper flags to be added back in.

I hope this helps.

diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
index 9a9ae5c2aaf0..11e9514e0761 100644
--- a/src/lxc/lxc_container.c
+++ b/src/lxc/lxc_container.c
@@ -850,7 +850,7 @@ typedef struct {
 
 static const virLXCBasicMountInfo lxcBasicMounts[] = {
     { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
-    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
+    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
     { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
     { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
     { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
@@ -1030,7 +1030,7 @@ static int lxcContainerMountBasicFS(bool userns_enabled,
 
         if (bindOverReadonly &&
             mount(mnt_src, mnt->dst, NULL,
-                  MS_BIND|MS_REMOUNT|MS_RDONLY, NULL) < 0) {
+                  MS_BIND|MS_REMOUNT|mnt_mflags|MS_RDONLY, NULL) < 0) {
             virReportSystemError(errno,
                                  _("Failed to re-mount %s on %s flags=%x"),
                                  mnt_src, mnt->dst,

Eric
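The effect of the second hunk above on the flag word can be checked with a small sketch. The function names `remount_ro_flags_old` and `remount_ro_flags_new` are hypothetical and exist only for this illustration; the parameter stands in for the `mnt_mflags` value the patched libvirt code passes, which is assumed here to hold the flags from the table entry. With glibc's Linux `MS_*` values, the old hard-coded combination is the `flags=1021` (hex) seen in the error message quoted earlier in the thread.

```c
#include <sys/mount.h>

/* Hypothetical: the flag word the unpatched remount passed, regardless of
 * which flags the original mount carried. */
unsigned long remount_ro_flags_old(void)
{
    return MS_BIND | MS_REMOUNT | MS_RDONLY;
}

/* Hypothetical: the flag word after the patch, which ORs the entry's own
 * mount flags back in so that nosuid/nodev/noexec are not silently cleared. */
unsigned long remount_ro_flags_new(unsigned long mnt_mflags)
{
    return MS_BIND | MS_REMOUNT | mnt_mflags | MS_RDONLY;
}
```

For the patched `/proc/sys` entry the result keeps `MS_NOSUID`, `MS_NODEV` and `MS_NOEXEC` set, which is what the kernel now requires when those flags are locked on the underlying proc mount.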
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2) [not found] ` <87mw0c1x8p.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-06-16 12:31 ` Daniel P. Berrange [not found] ` <20150616123148.GB18689-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 85+ messages in thread From: Daniel P. Berrange @ 2015-06-16 12:31 UTC (permalink / raw) To: Eric W. Biederman Cc: Richard Weinberger, Serge Hallyn, Andy Lutomirski, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Kenton Varda, Michael Kerrisk-manpages, Linux FS Devel, Tejun Heo, libvir-list-H+wXaHxf7aLQT0dZR+AlfA, Cedric Bosdonnat On Sat, Jun 06, 2015 at 01:56:54PM -0500, Eric W. Biederman wrote: > Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes: > > > [CC'ing libvirt-lxc folks] > > > > Am 28.05.2015 um 23:32 schrieb Eric W. Biederman: > >> Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes: > >> > >>> Am 28.05.2015 um 21:57 schrieb Eric W. Biederman: > >>>>> FWIW, it breaks also libvirt-lxc: > >>>>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted > >>>> > >>>> Interesting. I had not anticipated a failure there? And it is failing > >>>> in remount? Oh that is interesting. > >>>> > >>>> That implies that there is some flag of the original mount of /proc that > >>>> the remount of /proc/sys is clearing, and that previously > >>>> > >>>> The flags specified are current rdonly,remount,bind so I expect there > >>>> are some other flags on proc that libvirt-lxc is clearing by accident > >>>> and we did not fail before because the kernel was not enforcing things. 
> >>> > >>> Please see: > >>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933 > >>> lxcContainerMountBasicFS() > >>> > >>> and: > >>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850 > >>> lxcBasicMounts > >>> > >>>> What are the mount flags in a working libvirt-lxc? > >>> > >>> See: > >>> test1:~ # cat /proc/self/mountinfo > >>> 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw > >>> 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw > >> > >>> If you need more info, please let me know. :-) > >> > >> Oh interesting I had not realized libvirt-lxc had grown an unprivileged > >> mode using user namespaces. > >> > >> This does appear to be a classic remount bug, where you are not > >> preserving the permissions. It appears the fact that the code > >> failed to enforce locked permissions on the fresh mount of proc > >> was hiding this bug until now. 
> >> > >> I expect what you actually want is the code below: > >> > >> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c > >> index 9a9ae5c2aaf0..f008a7484bfe 100644 > >> --- a/src/lxc/lxc_container.c > >> +++ b/src/lxc/lxc_container.c > >> @@ -850,7 +850,7 @@ typedef struct { > >> > >> static const virLXCBasicMountInfo lxcBasicMounts[] = { > >> { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false }, > >> - { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false }, > >> + { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false }, > >> { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true }, > >> { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true }, > >> { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false }, > >> > >> Or possibly just: > >> > >> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c > >> index 9a9ae5c2aaf0..a60ccbd12bfc 100644 > >> --- a/src/lxc/lxc_container.c > >> +++ b/src/lxc/lxc_container.c > >> @@ -850,7 +850,7 @@ typedef struct { > >> > >> static const virLXCBasicMountInfo lxcBasicMounts[] = { > >> { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false }, > >> - { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false }, > >> + { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false }, > >> { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true }, > >> { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true }, > >> { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false }, > >> > >> As the there is little point in making /proc/sys read-only in a > >> user-namespace, as the permission checks are uid based and no-one should > >> have the global uid 0 in your container. 
Making mounting /proc/sys > >> read-only rather pointless. > > > > Eric, using the patch below I was able to spawn a user-namespace enabled container > > using libvirt-lxc. :-) > > > > I had to: > > 1. Disable the read-only mount of /proc/sys which is anyway useless in the user-namespace case. > > 2. Disable the /proc/sys/net/ipv{4,6} bind mounts, this ugly hack is only needed for the non user-namespace case. > > 3. Remove MS_RDONLY from the sysfs mount (For the non user-namespace case we'd have to keep this, though). > > > > Daniel, I'd take this as a chance to disable all the MS_RDONLY games if user-namespace are configured. > > With Eric's fixes they hurt us. And as I wrote many times before if root within the user-namespace > > is able to do nasty things in /sys and /proc that's a plain kernel bug which needs fixing. There is no > > point in mounting these read-only. Except for the case then no user-namespace is used. > > > > For clarity the patch below appears to be the minimal change needed to > fix this security issue. > > AKA add mnt_mflags in when remounting something read-only. > > /proc/sys needed to be updated so it had the proper flags to be added > back in. > > I hope this helps. 
>
> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> index 9a9ae5c2aaf0..11e9514e0761 100644
> --- a/src/lxc/lxc_container.c
> +++ b/src/lxc/lxc_container.c
> @@ -850,7 +850,7 @@ typedef struct {
>
>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> @@ -1030,7 +1030,7 @@ static int lxcContainerMountBasicFS(bool userns_enabled,
>
>          if (bindOverReadonly &&
>              mount(mnt_src, mnt->dst, NULL,
> -                  MS_BIND|MS_REMOUNT|MS_RDONLY, NULL) < 0) {
> +                  MS_BIND|MS_REMOUNT|mnt_mflags|MS_RDONLY, NULL) < 0) {
>              virReportSystemError(errno,
>                                   _("Failed to re-mount %s on %s flags=%x"),
>                                   mnt_src, mnt->dst,

Thanks Richard / Eric for the suggested patches. I'll apply Eric's
simplified patch to libvirt now, and backport it to our stable
libvirt branches.

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
* Re: [CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)
From: Richard Weinberger @ 2015-06-16 12:46 UTC
To: Daniel P. Berrange, Eric W. Biederman
Cc: Serge Hallyn, Andy Lutomirski, Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Kenton Varda, Michael Kerrisk-manpages, Linux FS Devel, Tejun Heo, libvir-list-H+wXaHxf7aLQT0dZR+AlfA, Cedric Bosdonnat

On 16.06.2015 at 14:31, Daniel P. Berrange wrote:
> Thanks Richard / Eric for the suggested patches. I'll apply Eric's
> simplified patch to libvirt now, and backport it to our stable
> libvirt branches.

Thank you Daniel!
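For anyone testing the libvirt fix, one way to confirm that a remount kept its flags is to look at the per-mount options field of /proc/self/mountinfo, as in the output quoted earlier in the thread, where /proc/sys showed `ro,nodev,relatime` and had lost `nosuid` and `noexec`. The sketch below is an assumption-laden illustration rather than anything from libvirt or the kernel: `mountinfo_has_option` is a hypothetical helper that checks the sixth space-separated field of a single mountinfo line.

```c
#include <string.h>

/* Return a pointer to the start of the n-th (1-based) space-separated field
 * of line, or NULL if the line has fewer fields. */
static const char *nth_field(const char *line, int n)
{
    const char *p = line;

    while (*p == ' ')
        p++;
    for (int i = 1; i < n; i++) {
        p = strchr(p, ' ');
        if (!p)
            return NULL;
        while (*p == ' ')
            p++;
    }
    return *p ? p : NULL;
}

/* Hypothetical helper: return 1 if the per-mount options (field 6 of a
 * /proc/self/mountinfo line) contain opt as a whole comma-separated token. */
int mountinfo_has_option(const char *line, const char *opt)
{
    const char *p = nth_field(line, 6);
    size_t optlen = strlen(opt);

    if (!p || optlen == 0)
        return 0;

    while (*p && *p != ' ') {
        size_t len = strcspn(p, ", "); /* token ends at ',' or end of field */
        if (len == optlen && strncmp(p, opt, len) == 0)
            return 1;
        p += len;
        if (*p == ',')
            p++;
    }
    return 0;
}
```

Only field 6 is examined on purpose: the trailing `rw` after the `-` separator is a superblock option, not a per-mount flag, so it does not say whether the mount itself is read-only.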