* [GIT PULL] User namespace related fixes for v4.2 @ 2015-06-26 20:50 Eric W. Biederman [not found] ` <CA+55aFysKDXr2HEwNzm3z9QOw=E4ZeWcvYQ-xLhy5_k+rGbeRg@mail.gmail.com> [not found] ` <87381eyz26.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 0 siblings, 2 replies; 16+ messages in thread From: Eric W. Biederman @ 2015-06-26 20:50 UTC (permalink / raw) To: Linus Torvalds Cc: Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande Date: Fri, 22 May 2015 15:41:45 -0500 (4 weeks, 6 days, 23 hours ago) Linus, Please pull the for-linus branch from the git tree: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus HEAD: 81909cb3350299977a88f72264651f6cec06c836 mnt: Avoid unnecessary regressions in fs_fully_visible Long ago and far away when user namespaces where young and I was a more optimistic man it was realized that allowing fresh mounts of proc and sysfs with only user namespace permissions could violate the basic rule that only root gets to decide if proc or sysfs should be mounted at all. Some hacks were put in place to reduce the worst of the damage could be done, and the common sense rule was adopted that fresh mounts of proc and sysfs should allow no more than bind mounts of proc and sysfs. Unfortunately that rule has not been fully enforced. There are two kinds of gaps in that enforcement. Only filesystems mount on empty directories on proc and sysfs should be ignored but the test for empty directories was insufficient. So this patchset requires directories on proc, sysctl and sysfs that are will always be empty to be created specially. Every other technique is lossy as an ordinary directory can dynamically be added to later. This actually makes this code in the kernel a smidge clearer about it's purpose. I asked container developers from the various container projects to help test this and no holes were found in the set of mount points on proc and sysfs that this patchset identifies. This set of changes also starts enforcing the mount flags of fresh mounts of proc and sysfs are consistent with the existing mount of proc and sysfs. I expected this to be the boring part of this patchset but unfortunately userspace has been stupid and extra work has to be done to avoid regressions. The atime, read-only, and nodev attributes were not a problem and as such are enforced absolutely. People have been winding up mounting proc and sysfs in contaners with nosuid and noexec clear, when the global root had set nosuid and noexec. In practice this does not make a hill of beans difference today because currently there are no exectuables on proc and sysfs. Unfortunately that can not be guaranteed in the future. People refactor code and bugs get reintroduced, or people find a good reason to do something that today seems ludicrous. Give people 20 more years and who knows what will happen. The libvirt-lxc and lxc developers have been contacted so they can correct the bugs where they clear noexec and nosuid on proc and sysfs through oversights when they wrote their code. Thos bugs should be fixed in those projects shortly. These bugs are an issue however libvirt-lxc or lxc create containers. However they only violate kernel permission checks in the case of containers created by unprivileged users, which is a niche case today. Therefore this changeset marks for backporting the attribute enforcement that do not cause regressions in the existing userspace. Implements enforcement of nosuid and noexec. Then disables that enforcement of nosuid and nosexec and replaces that enforcment with a big fat warning. Userspace should be fixed before 4.2 ships so I do not expect these warnings to fire. However the warnings give userspace time to get their act together. I am optimistic that all of userspace that cares will be fixed and for v4.3 I can remove the warning messages and enforce the attribute checks. It is a fine line on the regression front and I hate walking it, but now is the best time to address the issue of clearing attributes that should not be cleared before lots of unprivileged container implementations accumulate, and before nosid and noexec proc and sysfs matter. This set of changes also addresses how open file descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of /proc/<pid>/fd has been triggering a WARN_ON that has not been meaningful in nearly a decade, and is actively wrong now. An old bug (2 years?) in /proc/<pid>/mountinfo where bind mounts of these descriptors were not meaningfully show is fixed. Eric W. Biederman (14): mnt: Refactor the logic for mounting sysfs and proc in a user namespace mnt: Modify fs_fully_visible to deal with locked ro nodev and atime mnt: Modify fs_fully_visible to deal with locked nosuid and noexec vfs: Ignore unlocked mounts in fs_fully_visible fs: Add helper functions for permanently empty directories. sysctl: Allow creating permanently empty directories that serve as mountpoints. proc: Allow creating permanently empty directories that serve as mount points kernfs: Add support for always empty directories. sysfs: Add support for permanently empty directories to serve as mount points. sysfs: Create mountpoints with sysfs_create_mount_point mnt: Update fs_fully_visible to test for permanently empty directories vfs: Remove incorrect debugging WARN in prepend_path nsfs: Add a show_path method to fix mountinfo mnt: Avoid unnecessary regressions in fs_fully_visible arch/s390/hypfs/inode.c | 12 ++---- drivers/firmware/efi/efi.c | 6 +-- fs/configfs/mount.c | 10 ++--- fs/dcache.c | 11 ----- fs/debugfs/inode.c | 11 ++--- fs/fuse/inode.c | 9 ++--- fs/kernfs/dir.c | 38 +++++++++++++++++- fs/kernfs/inode.c | 2 + fs/libfs.c | 96 ++++++++++++++++++++++++++++++++++++++++++++ fs/namespace.c | 80 +++++++++++++++++++++++++++++++++--- fs/nsfs.c | 10 +++++ fs/proc/generic.c | 23 +++++++++++ fs/proc/inode.c | 4 ++ fs/proc/internal.h | 6 +++ fs/proc/proc_sysctl.c | 37 +++++++++++++++++ fs/proc/root.c | 9 ++--- fs/pstore/inode.c | 12 ++---- fs/sysfs/dir.c | 34 ++++++++++++++++ fs/sysfs/mount.c | 5 +-- fs/tracefs/inode.c | 6 +-- include/linux/fs.h | 4 +- include/linux/kernfs.h | 3 ++ include/linux/mount.h | 5 +++ include/linux/sysctl.h | 3 ++ include/linux/sysfs.h | 15 +++++++ kernel/cgroup.c | 10 ++--- kernel/sysctl.c | 8 +--- security/inode.c | 10 ++--- security/selinux/selinuxfs.c | 11 +++-- security/smack/smackfs.c | 8 ++-- 30 files changed, 397 insertions(+), 101 deletions(-) ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <CA+55aFysKDXr2HEwNzm3z9QOw=E4ZeWcvYQ-xLhy5_k+rGbeRg@mail.gmail.com>]
[parent not found: <CA+55aFysKDXr2HEwNzm3z9QOw=E4ZeWcvYQ-xLhy5_k+rGbeRg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [GIT PULL] User namespace related fixes for v4.2 [not found] ` <CA+55aFysKDXr2HEwNzm3z9QOw=E4ZeWcvYQ-xLhy5_k+rGbeRg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-06-29 21:13 ` Eric W. Biederman [not found] ` <87pp4eqktr.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> [not found] ` <CA+55aFw-DK-xDC-3HYa=BMX8WNyQgT9O01tihrAS9+-7PPj_jA@mail.gmail.com> 0 siblings, 2 replies; 16+ messages in thread From: Eric W. Biederman @ 2015-06-29 21:13 UTC (permalink / raw) To: Linus Torvalds Cc: Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> writes: > On Fri, Jun 26, 2015 at 1:50 PM, Eric W. Biederman > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >> >> Therefore this changeset marks for backporting the attribute enforcement >> that do not cause regressions in the existing userspace. Implements >> enforcement of nosuid and noexec. Then disables that enforcement of >> nosuid and nosexec and replaces that enforcment with a big fat warning. >> Userspace should be fixed before 4.2 ships so I do not expect these >> warnings to fire. > > Eric, that is *not* how this works. > > If people have old user-space binaries, we do not require them to be > updated. So it doesn't matter one whit if "Userspace should be fixed > before 4.2 ships", because it is entirely irrelevant if the upstream > project stops doing something, when users want to be able to upgrade > their kernels regardless of whether they've upgraded their system > apps. > > I'm going to hold off on pulling this, because I feel you don't > understand the regression rules. That is not the issue. What happen is I found myself between a rock and a hard place and I did not possess sufficient creativity to code my way out. Fearing regressions I sought out people to test these changes, on the applications most likely to care. I reduced the change from breaking userspace to a warning that userspace is being ludicriously stupid. I worked with the applications to get their bugs fixed. > I suggest we instead just always set nosuid and noexec for /proc and > /sys mounts, and make this whole thing a complete non-issue. Doing exactly as you suggest will be user visible (mount flags), and without care is likely to break remount. Can you live with the patch below and committing to never supporting executables on proc and sysfs? With that I can solve all of my concerns, without affecting the existing userspace programs. > Instead of this crazy "let's warn about it and plan on breaking old > existing setups". That's _wrong_. It's so fundamentally wrong that I > will not pull from people who do not understand this. > > The reason we have that "no regression" rule is not so that we fix up > bugs. It's because peopel should always feel safe upgrading their > kernel, and basically _know_ that kernel developers consider it > unacceptable to break user space. It should be a warm fuzzy feeling - > the feeling that we try our best, and if we ever fail because we > missed something or really believed that it can't ever matter, we'll > jump on it and we won't be making any excuses for our bugs. Because > breaking user space is a bug. > > Kernel developers who don't understand "it is unacceptable to break > user space" shouldn't be kernel developers. It is not that I do not understand it is that I had a failure of imagination. I have been agonizing about this issue since I have encountered it trying to think of a better way. Because of another failure of my imagination enabling user namespaces has introduced a number of security regressions into the kernel, because primarily I overlooked the effects of fine grained sharing in the mount namespace and I have been working carefully and dilligent to get those security regressions fixed. Because if the user namespace by it's designed semantics opens up security holes it is a failure. Crap happens and we occassionally over look things with the a greater amount of code exposed to unprivileged users, but that is no excuse for doing everything in my power to make user namespaces as safe to use as linux without user namespaces. Almost the entirety of my pull request is addressing that unfortunate regression in security. Until your comments suggested to me that it was acceptable to permanently bar exectuables from proc and sysfs I did not see a way to address this security hole (that does not currently appear exploitable), but has been exploitable in the past, without breaking something. Breaking two exectuables that will be unsafe to use at some point if I did not get this fixed seemed the least damage I could do. Hopefully you can live with permanently (and programatically) barring exectuables from proc and sysfs and I can then move forward without reworking things so I can fix this without breaking anything. Eric ------------------- cut here -------------------- From: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> Date: Mon, 29 Jun 2015 14:42:03 -0500 Subject: [PATCH] vfs: Commit to never having exectuables on proc and sysfs. Add a new flag to file_system_type for filesystems that are never exepected to support executables. Test that flag where MNT_NOEXEC is tested today, so that user visible changes to mount flags are not necessary. The only user visible effect will be that exectuables will be treated as if the execute bit is cleared, as happens today when the MNT_NOEXEC flag is set on a mount. Set the new flag on proc and sysfs. As proc and sysfs do not implement executables today there are no differences for userspace to notice. The point of this exercise is that there are some applications that due to oversites of their programmers do not set nosid and noexec when they mount fresh copies of proc and sysfs today, and would become instant security holes if we implemented exectubles especially suid executables on proc and sysfs today. With this change it becomes less necessary to vet each change to proc and sysfs very carefully to ensure that executable files are not implemented and to ensure that chattr can not create executable files on proc and sysfs. Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/exec.c | 10 ++++++++-- fs/open.c | 2 +- fs/proc/root.c | 2 +- fs/sysfs/mount.c | 2 +- include/linux/fs.h | 3 +++ kernel/sys.c | 3 +-- mm/mmap.c | 4 ++-- mm/nommu.c | 2 +- security/security.c | 2 +- 9 files changed, 19 insertions(+), 11 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 1977c2a553ac..1e063854571b 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -98,6 +98,12 @@ static inline void put_binfmt(struct linux_binfmt * fmt) module_put(fmt->module); } +bool path_noexec(const struct path *path) +{ + return (path->mnt->mnt_flags & MNT_NOEXEC) || + (path->mnt->mnt_sb->s_type->fs_flags & FS_NOEXEC); +} + #ifdef CONFIG_USELIB /* * Note that a shared library must be both readable and executable due to @@ -132,7 +138,7 @@ SYSCALL_DEFINE1(uselib, const char __user *, library) goto exit; error = -EACCES; - if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) + if (path_noexec(&file->f_path)) goto exit; fsnotify_open(file); @@ -777,7 +783,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags) if (!S_ISREG(file_inode(file)->i_mode)) goto exit; - if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) + if (path_noexec(&file->f_path)) goto exit; err = deny_write_access(file); diff --git a/fs/open.c b/fs/open.c index e0250bdcc440..9fbdb1bae049 100644 --- a/fs/open.c +++ b/fs/open.c @@ -375,7 +375,7 @@ retry: * with the "noexec" flag. */ res = -EACCES; - if (path.mnt->mnt_flags & MNT_NOEXEC) + if (path_noexec(&path)) goto out_path_release; } diff --git a/fs/proc/root.c b/fs/proc/root.c index b7fa4bfe896a..7e39312580d4 100644 --- a/fs/proc/root.c +++ b/fs/proc/root.c @@ -159,7 +159,7 @@ static struct file_system_type proc_fs_type = { .name = "proc", .mount = proc_mount, .kill_sb = proc_kill_sb, - .fs_flags = FS_USERNS_MOUNT, + .fs_flags = FS_NOEXEC | FS_USERNS_MOUNT, }; void __init proc_root_init(void) diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c index 8a49486bf30c..9cd8667feb94 100644 --- a/fs/sysfs/mount.c +++ b/fs/sysfs/mount.c @@ -58,7 +58,7 @@ static struct file_system_type sysfs_fs_type = { .name = "sysfs", .mount = sysfs_mount, .kill_sb = sysfs_kill_sb, - .fs_flags = FS_USERNS_MOUNT, + .fs_flags = FS_NOEXEC | FS_USERNS_MOUNT, }; int __init sysfs_init(void) diff --git a/include/linux/fs.h b/include/linux/fs.h index e351da4a934f..9e44c6d81bb2 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1916,6 +1916,7 @@ struct file_system_type { #define FS_HAS_SUBTYPE 4 #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */ #define FS_USERNS_DEV_MOUNT 16 /* A userns mount does not imply MNT_NODEV */ +#define FS_NOEXEC 32 /* FS will not support executables */ #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */ struct dentry *(*mount) (struct file_system_type *, int, const char *, void *); @@ -3018,4 +3019,6 @@ static inline bool dir_relax(struct inode *inode) return !IS_DEADDIR(inode); } +extern bool path_noexec(const struct path *path); + #endif /* _LINUX_FS_H */ diff --git a/kernel/sys.c b/kernel/sys.c index 259fda25eb6b..fa2f2f671a5c 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1668,8 +1668,7 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd) * overall picture. */ err = -EACCES; - if (!S_ISREG(inode->i_mode) || - exe.file->f_path.mnt->mnt_flags & MNT_NOEXEC) + if (!S_ISREG(inode->i_mode) || path_noexec(&exe.file->f_path)) goto exit; err = inode_permission(inode, MAY_EXEC); diff --git a/mm/mmap.c b/mm/mmap.c index aa632ade2be7..f126923ce683 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1268,7 +1268,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, * mounted, in which case we dont add PROT_EXEC.) */ if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC)) - if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC))) + if (!(file && path_noexec(&file->f_path))) prot |= PROT_EXEC; if (!(flags & MAP_FIXED)) @@ -1337,7 +1337,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, case MAP_PRIVATE: if (!(file->f_mode & FMODE_READ)) return -EACCES; - if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) { + if (path_noexec(&file->f_path)) { if (vm_flags & VM_EXEC) return -EPERM; vm_flags &= ~VM_MAYEXEC; diff --git a/mm/nommu.c b/mm/nommu.c index 05e7447d960b..5fdec8885256 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1035,7 +1035,7 @@ static int validate_mmap_request(struct file *file, /* handle executable mappings and implied executable * mappings */ - if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) { + if (path_noexec(&file->f_path)) { if (prot & PROT_EXEC) return -EPERM; } else if ((prot & PROT_READ) && !(prot & PROT_EXEC)) { diff --git a/security/security.c b/security/security.c index 595fffab48b0..062f3c997fdc 100644 --- a/security/security.c +++ b/security/security.c @@ -776,7 +776,7 @@ static inline unsigned long mmap_prot(struct file *file, unsigned long prot) * ditto if it's not on noexec mount, except that on !MMU we need * NOMMU_MAP_EXEC (== VM_MAYEXEC) in this case */ - if (!(file->f_path.mnt->mnt_flags & MNT_NOEXEC)) { + if (!path_noexec(&file->f_path)) { #ifndef CONFIG_MMU if (file->f_op->mmap_capabilities) { unsigned caps = file->f_op->mmap_capabilities(file); -- 2.2.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
[parent not found: <87pp4eqktr.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [GIT PULL] User namespace related fixes for v4.2 [not found] ` <87pp4eqktr.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-07-03 22:10 ` Linus Torvalds 0 siblings, 0 replies; 16+ messages in thread From: Linus Torvalds @ 2015-07-03 22:10 UTC (permalink / raw) To: Eric W. Biederman, Al Viro Cc: Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, <linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, Tejun Heo, Ivan Delalande On Mon, Jun 29, 2015 at 2:13 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: > > Can you live with the patch below and committing to never supporting > executables on proc and sysfs? Sure. I don't think executables make any sense what-so-ever in those filesystems. I think it's fine saying that /proc and /sys cannot have executables in them, and then use that flag to just ignore the relevant mount flags. Al, comments? Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <CA+55aFw-DK-xDC-3HYa=BMX8WNyQgT9O01tihrAS9+-7PPj_jA@mail.gmail.com>]
[parent not found: <CA+55aFw-DK-xDC-3HYa=BMX8WNyQgT9O01tihrAS9+-7PPj_jA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [GIT PULL] User namespace related fixes for v4.2 [not found] ` <CA+55aFw-DK-xDC-3HYa=BMX8WNyQgT9O01tihrAS9+-7PPj_jA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-07-04 23:11 ` Al Viro [not found] ` <20150704231118.GT17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> [not found] ` <87mvz4yomp.fsf_-_@x220.int.ebiederm.org> 0 siblings, 2 replies; 16+ messages in thread From: Al Viro @ 2015-07-04 23:11 UTC (permalink / raw) To: Linus Torvalds Cc: Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Tejun Heo, Eric W. Biederman, Richard Weinberger, <linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, Michael Kerrisk-manpages, Ivan Delalande On Fri, Jul 03, 2015 at 03:10:40PM -0700, Linus Torvalds wrote: > On Mon, Jun 29, 2015 at 2:13 PM, Eric W. Biederman > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: > > > > Can you live with the patch below and committing to never supporting > > executables on proc and sysfs? > > Sure. I don't think executables make any sense what-so-ever in those > filesystems. I think it's fine saying that /proc and /sys cannot have > executables in them, and then use that flag to just ignore the > relevant mount flags. > > Al, comments? I can live with that, but I would prefer that to be a superblock flag force-set in ->mount() (and preserved in ->remount_fs()) rather than Yet Another FS Type Flag. OTOH, it's not hard to change afterwards. Al, bloody annoyed by having spent hours debugging an odd corruption in merge candidate, only to find that it correlated to temperature of the host ;-/ Seem to be all gone after replacing CPU fan and cleaning the mess under it... ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <20150704231118.GT17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>]
* [REVIEW][PATCH 0/2] noexec on proc and sysfs [not found] ` <20150704231118.GT17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> @ 2015-07-10 16:16 ` Eric W. Biederman 0 siblings, 0 replies; 16+ messages in thread From: Eric W. Biederman @ 2015-07-10 16:16 UTC (permalink / raw) To: Linux Containers Cc: Seth Forshee, Linux API, Greg Kroah-Hartman, Al Viro, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande, Linus Torvalds Given the code I have seen executables especially suid root executable appearing on proc or sysfs will break userspace because there are current applications that depend on nosuid and noexec on proc and sysfs being meaningless. This patchset addes a new flag SB_I_NOEXEC to enforce that restriction, and to make it hard for a kernel developer to make the mistake of adding executables to sysfs or proc. The first patch has been updated since last time to a super block flags instead of a file_system type flag based on Al's suggestion. The code in fs_fully_visible to enforce nosuid and noexec when needed has also been added. At a practical level this code is a no-op on a slow path, to guard against future mistakes and to make auditing the kernel for this class of problem trivial. git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing Eric W. Biederman (2): vfs: Commit to never having exectuables on proc and sysfs. mnt: fs_fully_visible enforce noexec and nosuid if !SB_I_NOEXEC fs/exec.c | 10 ++++++++-- fs/namespace.c | 33 +++++++++++++++++++++++++-------- fs/open.c | 2 +- fs/proc/root.c | 2 ++ fs/sysfs/mount.c | 4 ++++ include/linux/fs.h | 3 +++ kernel/sys.c | 3 +-- mm/mmap.c | 4 ++-- mm/nommu.c | 2 +- security/security.c | 2 +- 10 files changed, 48 insertions(+), 17 deletions(-) ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <87mvz4yomp.fsf_-_@x220.int.ebiederm.org>]
[parent not found: <87mvz4yomp.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables on proc and sysfs. [not found] ` <87mvz4yomp.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-07-10 16:17 ` Eric W. Biederman 2015-07-10 16:18 ` [REVIEW][PATCH 2/2] mnt: fs_fully_visible enforce noexec and nosuid if !SB_I_NOEXEC Eric W. Biederman 1 sibling, 0 replies; 16+ messages in thread From: Eric W. Biederman @ 2015-07-10 16:17 UTC (permalink / raw) To: Linux Containers Cc: Seth Forshee, Linux API, Greg Kroah-Hartman, Al Viro, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande, Linus Torvalds Today proc and sysfs do not contain any executable files. Several applications today mount proc or sysfs without noexec and nosuid and then depend on there being no exectuables files on proc or sysfs. Having any executable files show on proc or sysfs would cause a user space visible regression, and most likely security problems. Therefore commit to never allowing executables on proc and sysfs by adding a new flag to mark them as filesystems without executables and enforce that flag. Test the flag where MNT_NOEXEC is tested today, so that the only user visible effect will be that exectuables will be treated as if the execute bit is cleared. The filesystems proc and sysfs do not currently incoporate any executable files so this does not result in any user visible effects. This makes it unnecessary to vet changes to proc and sysfs tightly for adding exectuable files or changes to chattr that would modify existing files, as no matter what the individual file say they will not be treated as exectuable files by the vfs. Not having to vet changes to closely is important as without this we are only one proc_create call (or another goof up in the implementation of notify_change) from having problematic executables on proc. Those mistakes are all too easy to make and would create a situation where there are security issues or the assumptions of some program having to be broken (and cause userspace regressions). Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/exec.c | 10 ++++++++-- fs/open.c | 2 +- fs/proc/root.c | 2 ++ fs/sysfs/mount.c | 4 ++++ include/linux/fs.h | 3 +++ kernel/sys.c | 3 +-- mm/mmap.c | 4 ++-- mm/nommu.c | 2 +- security/security.c | 2 +- 9 files changed, 23 insertions(+), 9 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 1977c2a553ac..b06623a9347f 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -98,6 +98,12 @@ static inline void put_binfmt(struct linux_binfmt * fmt) module_put(fmt->module); } +bool path_noexec(const struct path *path) +{ + return (path->mnt->mnt_flags & MNT_NOEXEC) || + (path->mnt->mnt_sb->s_iflags & SB_I_NOEXEC); +} + #ifdef CONFIG_USELIB /* * Note that a shared library must be both readable and executable due to @@ -132,7 +138,7 @@ SYSCALL_DEFINE1(uselib, const char __user *, library) goto exit; error = -EACCES; - if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) + if (path_noexec(&file->f_path)) goto exit; fsnotify_open(file); @@ -777,7 +783,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags) if (!S_ISREG(file_inode(file)->i_mode)) goto exit; - if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) + if (path_noexec(&file->f_path)) goto exit; err = deny_write_access(file); diff --git a/fs/open.c b/fs/open.c index e33dab287fa0..b6f1e96a7c0b 100644 --- a/fs/open.c +++ b/fs/open.c @@ -377,7 +377,7 @@ retry: * with the "noexec" flag. */ res = -EACCES; - if (path.mnt->mnt_flags & MNT_NOEXEC) + if (path_noexec(&path)) goto out_path_release; } diff --git a/fs/proc/root.c b/fs/proc/root.c index 68feb0f70e63..361ab4ee42fc 100644 --- a/fs/proc/root.c +++ b/fs/proc/root.c @@ -134,6 +134,8 @@ static struct dentry *proc_mount(struct file_system_type *fs_type, } sb->s_flags |= MS_ACTIVE; + /* User space would break if executables appear on proc */ + sb->s_iflags |= SB_I_NOEXEC; } return dget(sb->s_root); diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c index 1c6ac6fcee9f..f3db82071cfb 100644 --- a/fs/sysfs/mount.c +++ b/fs/sysfs/mount.c @@ -40,6 +40,10 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type, SYSFS_MAGIC, &new_sb, ns); if (IS_ERR(root) || !new_sb) kobj_ns_drop(KOBJ_NS_TYPE_NET, ns); + else if (new_sb) + /* Userspace would break if executables appear on sysfs */ + root->d_sb->s_iflags |= SB_I_NOEXEC; + return root; } diff --git a/include/linux/fs.h b/include/linux/fs.h index a0653e560c26..42912f8d286e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1244,6 +1244,7 @@ struct mm_struct; /* sb->s_iflags */ #define SB_I_CGROUPWB 0x00000001 /* cgroup-aware writeback enabled */ +#define SB_I_NOEXEC 0x00000002 /* Ignore executables on this fs */ /* Possible states of 'frozen' field */ enum { @@ -3030,4 +3031,6 @@ static inline bool dir_relax(struct inode *inode) return !IS_DEADDIR(inode); } +extern bool path_noexec(const struct path *path); + #endif /* _LINUX_FS_H */ diff --git a/kernel/sys.c b/kernel/sys.c index 259fda25eb6b..fa2f2f671a5c 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1668,8 +1668,7 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd) * overall picture. */ err = -EACCES; - if (!S_ISREG(inode->i_mode) || - exe.file->f_path.mnt->mnt_flags & MNT_NOEXEC) + if (!S_ISREG(inode->i_mode) || path_noexec(&exe.file->f_path)) goto exit; err = inode_permission(inode, MAY_EXEC); diff --git a/mm/mmap.c b/mm/mmap.c index aa632ade2be7..f126923ce683 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1268,7 +1268,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, * mounted, in which case we dont add PROT_EXEC.) */ if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC)) - if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC))) + if (!(file && path_noexec(&file->f_path))) prot |= PROT_EXEC; if (!(flags & MAP_FIXED)) @@ -1337,7 +1337,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, case MAP_PRIVATE: if (!(file->f_mode & FMODE_READ)) return -EACCES; - if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) { + if (path_noexec(&file->f_path)) { if (vm_flags & VM_EXEC) return -EPERM; vm_flags &= ~VM_MAYEXEC; diff --git a/mm/nommu.c b/mm/nommu.c index 58ea3643b9e9..ce17abf087ff 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1035,7 +1035,7 @@ static int validate_mmap_request(struct file *file, /* handle executable mappings and implied executable * mappings */ - if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) { + if (path_noexec(&file->f_path)) { if (prot & PROT_EXEC) return -EPERM; } else if ((prot & PROT_READ) && !(prot & PROT_EXEC)) { diff --git a/security/security.c b/security/security.c index 595fffab48b0..062f3c997fdc 100644 --- a/security/security.c +++ b/security/security.c @@ -776,7 +776,7 @@ static inline unsigned long mmap_prot(struct file *file, unsigned long prot) * ditto if it's not on noexec mount, except that on !MMU we need * NOMMU_MAP_EXEC (== VM_MAYEXEC) in this case */ - if (!(file->f_path.mnt->mnt_flags & MNT_NOEXEC)) { + if (!path_noexec(&file->f_path)) { #ifndef CONFIG_MMU if (file->f_op->mmap_capabilities) { unsigned caps = file->f_op->mmap_capabilities(file); -- 2.2.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [REVIEW][PATCH 2/2] mnt: fs_fully_visible enforce noexec and nosuid if !SB_I_NOEXEC [not found] ` <87mvz4yomp.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-07-10 16:17 ` [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables " Eric W. Biederman @ 2015-07-10 16:18 ` Eric W. Biederman 1 sibling, 0 replies; 16+ messages in thread From: Eric W. Biederman @ 2015-07-10 16:18 UTC (permalink / raw) To: Linux Containers Cc: Seth Forshee, Linux API, Greg Kroah-Hartman, Al Viro, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande, Linus Torvalds The filesystems proc and sysfs do not have executable files do not have exectuable files today and portions of userspace break if we do enforce nosuid and noexec consistency of nosuid and noexec flags between previous mounts and new mounts of proc and sysfs. Add the code to enforce consistency of the nosuid and noexec flags, and use the presence of SB_I_NOEXEC to signal that there is no need to bother. This results in a completely userspace invisible change that makes it clear fs_fully_visible can only skip the enforcement of noexec and nosuid because it is known the filesystems in question do not support executables. Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/namespace.c | 33 +++++++++++++++++++++++++-------- 1 file changed, 25 insertions(+), 8 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index c7cb8a526c05..ce428cadd41f 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3194,6 +3194,8 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags) down_read(&namespace_sem); list_for_each_entry(mnt, &ns->list, mnt_list) { struct mount *child; + int mnt_flags; + if (mnt->mnt.mnt_sb->s_type != type) continue; @@ -3203,17 +3205,30 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags) if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root) continue; + /* Read the mount flags and filter out flags that + * may safely be ignored. + */ + mnt_flags = mnt->mnt.mnt_flags; + if (mnt->mnt.mnt_sb->s_iflags & SB_I_NOEXEC) + mnt_flags &= ~(MNT_LOCK_NOSUID | MNT_LOCK_NOEXEC); + /* Verify the mount flags are equal to or more permissive * than the proposed new mount. */ - if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) && + if ((mnt_flags & MNT_LOCK_READONLY) && !(new_flags & MNT_READONLY)) continue; - if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) && + if ((mnt_flags & MNT_LOCK_NODEV) && !(new_flags & MNT_NODEV)) continue; - if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) && - ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK))) + if ((mnt_flags & MNT_LOCK_NOSUID) && + !(new_flags & MNT_NOSUID)) + continue; + if ((mnt_flags & MNT_LOCK_NOEXEC) && + !(new_flags & MNT_NOEXEC)) + continue; + if ((mnt_flags & MNT_LOCK_ATIME) && + ((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK))) continue; /* This mount is not fully visible if there are any @@ -3223,16 +3238,18 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags) list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) { struct inode *inode = child->mnt_mountpoint->d_inode; /* Only worry about locked mounts */ - if (!(mnt->mnt.mnt_flags & MNT_LOCKED)) + if (!(mnt_flags & MNT_LOCKED)) continue; /* Is the directory permanetly empty? */ if (!is_empty_dir_inode(inode)) goto next; } /* Preserve the locked attributes */ - *new_mnt_flags |= mnt->mnt.mnt_flags & (MNT_LOCK_READONLY | \ - MNT_LOCK_NODEV | \ - MNT_LOCK_ATIME); + *new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \ + MNT_LOCK_NODEV | \ + MNT_LOCK_NOSUID | \ + MNT_LOCK_NOEXEC | \ + MNT_LOCK_ATIME); visible = true; goto found; next: ; -- 2.2.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
[parent not found: <87h9pcyokc.fsf_-_@x220.int.ebiederm.org>]
[parent not found: <87h9pcyokc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables on proc and sysfs. [not found] ` <87h9pcyokc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-07-10 18:24 ` Richard Weinberger [not found] ` <55A00DE9.7060806-/L3Ra7n9ekc@public.gmane.org> [not found] ` <20150710193052.GB19824@kroah.com> 0 siblings, 2 replies; 16+ messages in thread From: Richard Weinberger @ 2015-07-10 18:24 UTC (permalink / raw) To: Eric W. Biederman, Linux Containers Cc: Seth Forshee, Linux API, Al Viro, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Greg Kroah-Hartman, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande, Linus Torvalds Am 10.07.2015 um 18:17 schrieb Eric W. Biederman: > > Today proc and sysfs do not contain any executable files. Several > applications today mount proc or sysfs without noexec and nosuid and > then depend on there being no exectuables files on proc or sysfs. > Having any executable files show on proc or sysfs would cause > a user space visible regression, and most likely security problems. > > Therefore commit to never allowing executables on proc and sysfs by > adding a new flag to mark them as filesystems without executables and > enforce that flag. > > Test the flag where MNT_NOEXEC is tested today, so that the only user > visible effect will be that exectuables will be treated as if the > execute bit is cleared. > > The filesystems proc and sysfs do not currently incoporate any > executable files so this does not result in any user visible effects. > > This makes it unnecessary to vet changes to proc and sysfs tightly for > adding exectuable files or changes to chattr that would modify > existing files, as no matter what the individual file say they will > not be treated as exectuable files by the vfs. > > Not having to vet changes to closely is important as without this we > are only one proc_create call (or another goof up in the > implementation of notify_change) from having problematic executables > on proc. Those mistakes are all too easy to make and would create > a situation where there are security issues or the assumptions of > some program having to be broken (and cause userspace regressions). Would it make sense to add SB_I_NOEXEC to more pseudo filesystems? Say pstore or devpts? Thanks, //richard ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <55A00DE9.7060806-/L3Ra7n9ekc@public.gmane.org>]
* Re: [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables on proc and sysfs. [not found] ` <55A00DE9.7060806-/L3Ra7n9ekc@public.gmane.org> @ 2015-07-10 19:30 ` Greg Kroah-Hartman 0 siblings, 0 replies; 16+ messages in thread From: Greg Kroah-Hartman @ 2015-07-10 19:30 UTC (permalink / raw) To: Richard Weinberger Cc: Seth Forshee, Linux API, Linux Containers, Al Viro, Andy Lutomirski, Kenton Varda, Tejun Heo, Eric W. Biederman, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk-manpages, Ivan Delalande, Linus Torvalds On Fri, Jul 10, 2015 at 08:24:41PM +0200, Richard Weinberger wrote: > Am 10.07.2015 um 18:17 schrieb Eric W. Biederman: > > > > Today proc and sysfs do not contain any executable files. Several > > applications today mount proc or sysfs without noexec and nosuid and > > then depend on there being no exectuables files on proc or sysfs. > > Having any executable files show on proc or sysfs would cause > > a user space visible regression, and most likely security problems. > > > > Therefore commit to never allowing executables on proc and sysfs by > > adding a new flag to mark them as filesystems without executables and > > enforce that flag. > > > > Test the flag where MNT_NOEXEC is tested today, so that the only user > > visible effect will be that exectuables will be treated as if the > > execute bit is cleared. > > > > The filesystems proc and sysfs do not currently incoporate any > > executable files so this does not result in any user visible effects. > > > > This makes it unnecessary to vet changes to proc and sysfs tightly for > > adding exectuable files or changes to chattr that would modify > > existing files, as no matter what the individual file say they will > > not be treated as exectuable files by the vfs. > > > > Not having to vet changes to closely is important as without this we > > are only one proc_create call (or another goof up in the > > implementation of notify_change) from having problematic executables > > on proc. Those mistakes are all too easy to make and would create > > a situation where there are security issues or the assumptions of > > some program having to be broken (and cause userspace regressions). > > Would it make sense to add SB_I_NOEXEC to more pseudo filesystems? > Say pstore or devpts? And configfs and cgroupfs? ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <20150710193052.GB19824@kroah.com>]
[parent not found: <20150710193052.GB19824-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>]
* Re: [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables on proc and sysfs. [not found] ` <20150710193052.GB19824-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> @ 2015-07-10 19:38 ` Richard Weinberger 0 siblings, 0 replies; 16+ messages in thread From: Richard Weinberger @ 2015-07-10 19:38 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Seth Forshee, Linux API, Linux Containers, Al Viro, Andy Lutomirski, Kenton Varda, Tejun Heo, Eric W. Biederman, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk-manpages, Ivan Delalande, Linus Torvalds Am 10.07.2015 um 21:30 schrieb Greg Kroah-Hartman: > On Fri, Jul 10, 2015 at 08:24:41PM +0200, Richard Weinberger wrote: >> Am 10.07.2015 um 18:17 schrieb Eric W. Biederman: >>> >>> Today proc and sysfs do not contain any executable files. Several >>> applications today mount proc or sysfs without noexec and nosuid and >>> then depend on there being no exectuables files on proc or sysfs. >>> Having any executable files show on proc or sysfs would cause >>> a user space visible regression, and most likely security problems. >>> >>> Therefore commit to never allowing executables on proc and sysfs by >>> adding a new flag to mark them as filesystems without executables and >>> enforce that flag. >>> >>> Test the flag where MNT_NOEXEC is tested today, so that the only user >>> visible effect will be that exectuables will be treated as if the >>> execute bit is cleared. >>> >>> The filesystems proc and sysfs do not currently incoporate any >>> executable files so this does not result in any user visible effects. >>> >>> This makes it unnecessary to vet changes to proc and sysfs tightly for >>> adding exectuable files or changes to chattr that would modify >>> existing files, as no matter what the individual file say they will >>> not be treated as exectuable files by the vfs. >>> >>> Not having to vet changes to closely is important as without this we >>> are only one proc_create call (or another goof up in the >>> implementation of notify_change) from having problematic executables >>> on proc. Those mistakes are all too easy to make and would create >>> a situation where there are security issues or the assumptions of >>> some program having to be broken (and cause userspace regressions). >> >> Would it make sense to add SB_I_NOEXEC to more pseudo filesystems? >> Say pstore or devpts? > > And configfs and cgroupfs? Yep. Any filesystem where exectuables do not make sense. :-) Thanks, //richard ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <55A01F4B.9010205@nod.at>]
[parent not found: <55A01F4B.9010205-/L3Ra7n9ekc@public.gmane.org>]
* Re: [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables on proc and sysfs. [not found] ` <55A01F4B.9010205-/L3Ra7n9ekc@public.gmane.org> @ 2015-07-10 20:00 ` Eric W. Biederman 0 siblings, 0 replies; 16+ messages in thread From: Eric W. Biederman @ 2015-07-10 20:00 UTC (permalink / raw) To: Richard Weinberger Cc: Seth Forshee, Greg Kroah-Hartman, Linux Containers, Al Viro, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Linux API, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande, Linus Torvalds Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes: > Am 10.07.2015 um 21:30 schrieb Greg Kroah-Hartman: >> On Fri, Jul 10, 2015 at 08:24:41PM +0200, Richard Weinberger wrote: >>> Am 10.07.2015 um 18:17 schrieb Eric W. Biederman: >>>> >>>> Today proc and sysfs do not contain any executable files. Several >>>> applications today mount proc or sysfs without noexec and nosuid and >>>> then depend on there being no exectuables files on proc or sysfs. >>>> Having any executable files show on proc or sysfs would cause >>>> a user space visible regression, and most likely security problems. >>>> >>>> Therefore commit to never allowing executables on proc and sysfs by >>>> adding a new flag to mark them as filesystems without executables and >>>> enforce that flag. >>>> >>>> Test the flag where MNT_NOEXEC is tested today, so that the only user >>>> visible effect will be that exectuables will be treated as if the >>>> execute bit is cleared. >>>> >>>> The filesystems proc and sysfs do not currently incoporate any >>>> executable files so this does not result in any user visible effects. >>>> >>>> This makes it unnecessary to vet changes to proc and sysfs tightly for >>>> adding exectuable files or changes to chattr that would modify >>>> existing files, as no matter what the individual file say they will >>>> not be treated as exectuable files by the vfs. >>>> >>>> Not having to vet changes to closely is important as without this we >>>> are only one proc_create call (or another goof up in the >>>> implementation of notify_change) from having problematic executables >>>> on proc. Those mistakes are all too easy to make and would create >>>> a situation where there are security issues or the assumptions of >>>> some program having to be broken (and cause userspace regressions). >>> >>> Would it make sense to add SB_I_NOEXEC to more pseudo filesystems? >>> Say pstore or devpts? >> >> And configfs and cgroupfs? > > Yep. Any filesystem where exectuables do not make sense. :-) With a threat model of a developer overlooking something and allows executables by accident? I certainly think it makes sense. Mostly because we are solving things at the vfs level. Which gives us one well tested solution instead of several filesystem specific solutions. That isn't quite the same problem that caused me to write this code, so I am not going to volunteer to write the patches for the additional filesystems. Eric ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <87381eyz26.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [GIT PULL] User namespace related fixes for v4.2 [not found] ` <87381eyz26.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-06-29 16:43 ` Linus Torvalds 2015-07-01 20:41 ` Eric W. Biederman 1 sibling, 0 replies; 16+ messages in thread From: Linus Torvalds @ 2015-06-29 16:43 UTC (permalink / raw) To: Eric W. Biederman Cc: Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel, Tejun Heo, Ivan Delalande On Fri, Jun 26, 2015 at 1:50 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: > > Therefore this changeset marks for backporting the attribute enforcement > that do not cause regressions in the existing userspace. Implements > enforcement of nosuid and noexec. Then disables that enforcement of > nosuid and nosexec and replaces that enforcment with a big fat warning. > Userspace should be fixed before 4.2 ships so I do not expect these > warnings to fire. Eric, that is *not* how this works. If people have old user-space binaries, we do not require them to be updated. So it doesn't matter one whit if "Userspace should be fixed before 4.2 ships", because it is entirely irrelevant if the upstream project stops doing something, when users want to be able to upgrade their kernels regardless of whether they've upgraded their system apps. I'm going to hold off on pulling this, because I feel you don't understand the regression rules. I suggest we instead just always set nosuid and noexec for /proc and /sys mounts, and make this whole thing a complete non-issue. Instead of this crazy "let's warn about it and plan on breaking old existing setups". That's _wrong_. It's so fundamentally wrong that I will not pull from people who do not understand this. The reason we have that "no regression" rule is not so that we fix up bugs. It's because peopel should always feel safe upgrading their kernel, and basically _know_ that kernel developers consider it unacceptable to break user space. It should be a warm fuzzy feeling - the feeling that we try our best, and if we ever fail because we missed something or really believed that it can't ever matter, we'll jump on it and we won't be making any excuses for our bugs. Because breaking user space is a bug. Kernel developers who don't understand "it is unacceptable to break user space" shouldn't be kernel developers. Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* [GIT PULL] User namespace related fixes for v4.2 [not found] ` <87381eyz26.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-06-29 16:43 ` [GIT PULL] User namespace related fixes for v4.2 Linus Torvalds @ 2015-07-01 20:41 ` Eric W. Biederman [not found] ` <878uazhapq.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> [not found] ` <20150706204748.GB22962@ubuntu-hedt> 1 sibling, 2 replies; 16+ messages in thread From: Eric W. Biederman @ 2015-07-01 20:41 UTC (permalink / raw) To: Linus Torvalds Cc: Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande Linus, Please pull the for-linus branch from the git tree: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus HEAD: 93e3bce6287e1fb3e60d3324ed08555b5bbafa89 vfs: Remove incorrect debugging WARN in prepend_path Colds suck and I was tired and pushing a little too hard at the start of this merge window and I included a few things in my previous pull request that I felt a were not quite ready. Those things have been dropped from my tree. My apologies that was irresponsible. Long ago and far away when user namespaces where young it was realized that allowing fresh mounts of proc and sysfs with only user namespace permissions could violate the basic rule that only root gets to decide if proc or sysfs should be mounted at all. Some hacks were put in place to reduce the worst of the damage could be done, and the common sense rule was adopted that fresh mounts of proc and sysfs should allow no more than bind mounts of proc and sysfs. Unfortunately that rule has not been fully enforced. There are two kinds of gaps in that enforcement. Only filesystems mounted on empty directories of proc and sysfs should be ignored but the test for empty directories was insufficient. So in my tree directories on proc, sysctl and sysfs that will always be empty are created specially. Every other technique is imperfect as an ordinary directory can have entries added even after a readdir returns and shows that the directory is empty. Special creation of directories for mount points makes the code in the kernel a smidge clearer about it's purpose. I asked container developers from the various container projects to help test this and no holes were found in the set of mount points on proc and sysfs that are created specially. This set of changes also starts enforcing the mount flags of fresh mounts of proc and sysfs are consistent with the existing mount of proc and sysfs. I expected this to be the boring part of the work but unfortunately unprivileged userspace winds up mounting fresh copies of proc and sysfs with noexec and nosuid clear when root set those flags on the previous mount of proc and sysfs. So for now only the atime, read-only and nodev attributes which userspace happens to keep consistent are enforced. Dealing with the noexec and nosuid attributes remains for another time. This set of changes also addresses an issue with how open file descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of /proc/<pid>/fd has been triggering a WARN_ON that has not been meaningful since it was added (as all of the code in the kernel was converted) and is not now actively wrong. There is also a short list of issues that have not been fixed yet that I will mention briefly. It is possible to rename a directory from below to above a bind mount. At which point any directory pointers below the renamed directory can be walked up to the root directory of the filesystem. With user namespaces enabled a bind mount of the bind mount can be created allowing the user to pick a directory whose children they can rename to outside of the bind mount. This is challenging to fix and doubly so because all obvious solutions must touch code that is in the performance part of pathname resolution. As mentioned above there is also a question of how to ensure that developers by accident or with purpose do not introduce exectuable files on sysfs and proc and in doing so introduce security regressions in the current userspace that will not be immediately obvious and as such are likely to require breaking userspace in painful ways once they are recognized. Eric W. Biederman (11): mnt: Refactor the logic for mounting sysfs and proc in a user namespace mnt: Modify fs_fully_visible to deal with locked ro nodev and atime vfs: Ignore unlocked mounts in fs_fully_visible fs: Add helper functions for permanently empty directories. sysctl: Allow creating permanently empty directories that serve as mountpoints. proc: Allow creating permanently empty directories that serve as mount points kernfs: Add support for always empty directories. sysfs: Add support for permanently empty directories to serve as mount points. sysfs: Create mountpoints with sysfs_create_mount_point mnt: Update fs_fully_visible to test for permanently empty directories vfs: Remove incorrect debugging WARN in prepend_path arch/s390/hypfs/inode.c | 12 ++---- drivers/firmware/efi/efi.c | 6 +-- fs/configfs/mount.c | 10 ++--- fs/dcache.c | 11 ----- fs/debugfs/inode.c | 11 ++--- fs/fuse/inode.c | 9 ++--- fs/kernfs/dir.c | 38 +++++++++++++++++- fs/kernfs/inode.c | 2 + fs/libfs.c | 96 ++++++++++++++++++++++++++++++++++++++++++++ fs/namespace.c | 39 +++++++++++++++--- fs/proc/generic.c | 23 +++++++++++ fs/proc/inode.c | 4 ++ fs/proc/internal.h | 6 +++ fs/proc/proc_sysctl.c | 37 +++++++++++++++++ fs/proc/root.c | 9 ++--- fs/pstore/inode.c | 12 ++---- fs/sysfs/dir.c | 34 ++++++++++++++++ fs/sysfs/mount.c | 5 +-- fs/tracefs/inode.c | 6 +-- include/linux/fs.h | 4 +- include/linux/kernfs.h | 3 ++ include/linux/sysctl.h | 3 ++ include/linux/sysfs.h | 15 +++++++ kernel/cgroup.c | 10 ++--- kernel/sysctl.c | 8 +--- security/inode.c | 10 ++--- security/selinux/selinuxfs.c | 11 +++-- security/smack/smackfs.c | 8 ++-- 28 files changed, 341 insertions(+), 101 deletions(-) Eric ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <878uazhapq.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]
* Re: [GIT PULL] User namespace related fixes for v4.2 [not found] ` <878uazhapq.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-07-06 20:47 ` Seth Forshee 0 siblings, 0 replies; 16+ messages in thread From: Seth Forshee @ 2015-07-06 20:47 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande, Linus Torvalds On Wed, Jul 01, 2015 at 03:41:37PM -0500, Eric W. Biederman wrote: > This set of changes also starts enforcing the mount flags of fresh > mounts of proc and sysfs are consistent with the existing mount of proc > and sysfs. I expected this to be the boring part of the work but > unfortunately unprivileged userspace winds up mounting fresh copies of > proc and sysfs with noexec and nosuid clear when root set those flags on > the previous mount of proc and sysfs. So for now only the atime, > read-only and nodev attributes which userspace happens to keep > consistent are enforced. Dealing with the noexec and nosuid attributes > remains for another time. Sorry to be the bearer of bad news, but I am seeing a regression in lxc with 4.2-rc1 due to this change. lxc is doing a fresh mount of sysfs that never specifies either read-only or nodev regardless of how sysfs has been mounted previously, and this is causing me to see mount failures because of the nodev check. If I comment out only the nodev check then the mount works on my system, but based on the code in lxc I don't think there's any guarantee at all of this mount having flags consistent with previous mounts. Seth ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <20150706204748.GB22962@ubuntu-hedt>]
* Re: [GIT PULL] User namespace related fixes for v4.2 [not found] ` <20150706204748.GB22962@ubuntu-hedt> @ 2015-07-06 21:24 ` Eric W. Biederman [not found] ` <E81DECCD-9B19-4D42-BE43-5987DE7B05DB@xmission.com> 1 sibling, 0 replies; 16+ messages in thread From: Eric W. Biederman @ 2015-07-06 21:24 UTC (permalink / raw) To: Seth Forshee Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande, Linus Torvalds On July 6, 2015 3:47:48 PM CDT, Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> wrote: >On Wed, Jul 01, 2015 at 03:41:37PM -0500, Eric W. Biederman wrote: >> This set of changes also starts enforcing the mount flags of fresh >> mounts of proc and sysfs are consistent with the existing mount of >proc >> and sysfs. I expected this to be the boring part of the work but >> unfortunately unprivileged userspace winds up mounting fresh copies >of >> proc and sysfs with noexec and nosuid clear when root set those flags >on >> the previous mount of proc and sysfs. So for now only the atime, >> read-only and nodev attributes which userspace happens to keep >> consistent are enforced. Dealing with the noexec and nosuid >attributes >> remains for another time. > >Sorry to be the bearer of bad news, but I am seeing a regression in lxc >with 4.2-rc1 due to this change. lxc is doing a fresh mount of sysfs >that never specifies either read-only or nodev regardless of how sysfs >has been mounted previously, and this is causing me to see mount >failures because of the nodev check. > >If I comment out only the nodev check then the mount works on my >system, >but based on the code in lxc I don't think there's any guarantee at all >of this mount having flags consistent with previous mounts. Seth you are testing your inprogress patchset that modifies how nodev works aren't you? In rc1 nodev is always forced on a mount in a user namespace. There is a fairly easy fix to the nodev cleanup in your patchset, but it takes a few lines of code change in fs_fully_visible. Essentially after we get the better nodev enforcement, fs_fully_visible does not need to bother with nodev. Eric ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <E81DECCD-9B19-4D42-BE43-5987DE7B05DB@xmission.com>]
[parent not found: <E81DECCD-9B19-4D42-BE43-5987DE7B05DB-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: [GIT PULL] User namespace related fixes for v4.2 [not found] ` <E81DECCD-9B19-4D42-BE43-5987DE7B05DB-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2015-07-06 22:25 ` Seth Forshee 0 siblings, 0 replies; 16+ messages in thread From: Seth Forshee @ 2015-07-06 22:25 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande, Linus Torvalds On Mon, Jul 06, 2015 at 04:24:00PM -0500, Eric W. Biederman wrote: > > > On July 6, 2015 3:47:48 PM CDT, Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> wrote: > >On Wed, Jul 01, 2015 at 03:41:37PM -0500, Eric W. Biederman wrote: > >> This set of changes also starts enforcing the mount flags of fresh > >> mounts of proc and sysfs are consistent with the existing mount of > >proc > >> and sysfs. I expected this to be the boring part of the work but > >> unfortunately unprivileged userspace winds up mounting fresh copies > >of > >> proc and sysfs with noexec and nosuid clear when root set those flags > >on > >> the previous mount of proc and sysfs. So for now only the atime, > >> read-only and nodev attributes which userspace happens to keep > >> consistent are enforced. Dealing with the noexec and nosuid > >attributes > >> remains for another time. > > > >Sorry to be the bearer of bad news, but I am seeing a regression in lxc > >with 4.2-rc1 due to this change. lxc is doing a fresh mount of sysfs > >that never specifies either read-only or nodev regardless of how sysfs > >has been mounted previously, and this is causing me to see mount > >failures because of the nodev check. > > > >If I comment out only the nodev check then the mount works on my > >system, > >but based on the code in lxc I don't think there's any guarantee at all > >of this mount having flags consistent with previous mounts. > > Seth you are testing your inprogress patchset that > modifies how nodev works aren't you? > > In rc1 nodev is always forced on a mount in a user namespace. > > There is a fairly easy fix to the nodev cleanup in your > patchset, but it takes a few lines of code change in > fs_fully_visible. Essentially after we get the better > nodev enforcement, fs_fully_visible does not need > to bother with nodev. Drat, you're right. I built an unmodified 4.2-rc1 but I apparently I had booted to the wrong kernel when I thought I was testing it. Without the extra patches it's fine; sorry for the noise. Seth ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2015-07-10 20:00 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-06-26 20:50 [GIT PULL] User namespace related fixes for v4.2 Eric W. Biederman [not found] ` <CA+55aFysKDXr2HEwNzm3z9QOw=E4ZeWcvYQ-xLhy5_k+rGbeRg@mail.gmail.com> [not found] ` <CA+55aFysKDXr2HEwNzm3z9QOw=E4ZeWcvYQ-xLhy5_k+rGbeRg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-06-29 21:13 ` Eric W. Biederman [not found] ` <87pp4eqktr.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-07-03 22:10 ` Linus Torvalds [not found] ` <CA+55aFw-DK-xDC-3HYa=BMX8WNyQgT9O01tihrAS9+-7PPj_jA@mail.gmail.com> [not found] ` <CA+55aFw-DK-xDC-3HYa=BMX8WNyQgT9O01tihrAS9+-7PPj_jA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-07-04 23:11 ` Al Viro [not found] ` <20150704231118.GT17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> 2015-07-10 16:16 ` [REVIEW][PATCH 0/2] noexec on proc and sysfs Eric W. Biederman [not found] ` <87mvz4yomp.fsf_-_@x220.int.ebiederm.org> [not found] ` <87mvz4yomp.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-07-10 16:17 ` [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables " Eric W. Biederman 2015-07-10 16:18 ` [REVIEW][PATCH 2/2] mnt: fs_fully_visible enforce noexec and nosuid if !SB_I_NOEXEC Eric W. Biederman [not found] ` <87h9pcyokc.fsf_-_@x220.int.ebiederm.org> [not found] ` <87h9pcyokc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-07-10 18:24 ` [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables on proc and sysfs Richard Weinberger [not found] ` <55A00DE9.7060806-/L3Ra7n9ekc@public.gmane.org> 2015-07-10 19:30 ` Greg Kroah-Hartman [not found] ` <20150710193052.GB19824@kroah.com> [not found] ` <20150710193052.GB19824-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> 2015-07-10 19:38 ` Richard Weinberger [not found] ` <55A01F4B.9010205@nod.at> [not found] ` <55A01F4B.9010205-/L3Ra7n9ekc@public.gmane.org> 2015-07-10 20:00 ` Eric W. Biederman [not found] ` <87381eyz26.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-06-29 16:43 ` [GIT PULL] User namespace related fixes for v4.2 Linus Torvalds 2015-07-01 20:41 ` Eric W. Biederman [not found] ` <878uazhapq.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 2015-07-06 20:47 ` Seth Forshee [not found] ` <20150706204748.GB22962@ubuntu-hedt> 2015-07-06 21:24 ` Eric W. Biederman [not found] ` <E81DECCD-9B19-4D42-BE43-5987DE7B05DB@xmission.com> [not found] ` <E81DECCD-9B19-4D42-BE43-5987DE7B05DB-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2015-07-06 22:25 ` Seth Forshee
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).