containers.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [GIT PULL] User namespace related fixes for v4.2
@ 2015-06-26 20:50 Eric W. Biederman
       [not found] ` <CA+55aFysKDXr2HEwNzm3z9QOw=E4ZeWcvYQ-xLhy5_k+rGbeRg@mail.gmail.com>
       [not found] ` <87381eyz26.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 2 replies; 16+ messages in thread
From: Eric W. Biederman @ 2015-06-26 20:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman,
	Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages,
	Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Ivan Delalande


Date: Fri, 22 May 2015 15:41:45 -0500 (4 weeks, 6 days, 23 hours ago)

Linus,

Please pull the for-linus branch from the git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

   HEAD: 81909cb3350299977a88f72264651f6cec06c836 mnt: Avoid unnecessary regressions in fs_fully_visible

Long ago and far away when user namespaces where young and I was a more
optimistic man it was realized that allowing fresh mounts of proc and
sysfs with only user namespace permissions could violate the basic rule
that only root gets to decide if proc or sysfs should be mounted at all.

Some hacks were put in place to reduce the worst of the damage could be
done, and the common sense rule was adopted that fresh mounts of proc
and sysfs should allow no more than bind mounts of proc and sysfs.
Unfortunately that rule has not been fully enforced.

There are two kinds of gaps in that enforcement.  Only filesystems mount
on empty directories on proc and sysfs should be ignored but the test
for empty directories was insufficient.  So this patchset requires
directories on proc, sysctl and sysfs that are will always be empty to
be created specially.  Every other technique is lossy as an ordinary
directory can dynamically be added to later.  This actually makes this
code in the kernel a smidge clearer about it's purpose.  I asked
container developers from the various container projects to help test
this and no holes were found in the set of mount points on proc and
sysfs that this patchset identifies.

This set of changes also starts enforcing the mount flags of fresh
mounts of proc and sysfs are consistent with the existing mount of proc
and sysfs.  I expected this to be the boring part of this patchset but
unfortunately userspace has been stupid and extra work has to be done to
avoid regressions.  The atime, read-only, and nodev attributes were not
a problem and as such are enforced absolutely.

People have been winding up mounting proc and sysfs in contaners with
nosuid and noexec clear, when the global root had set nosuid and noexec.
In practice this does not make a hill of beans difference today because
currently there are no exectuables on proc and sysfs.  Unfortunately
that can not be guaranteed in the future.  People refactor code and bugs
get reintroduced, or people find a good reason to do something that
today seems ludicrous.  Give people 20 more years and who knows what
will happen.

The libvirt-lxc and lxc developers have been contacted so they can
correct the bugs where they clear noexec and nosuid on proc and sysfs
through oversights when they wrote their code.  Thos bugs should be
fixed in those projects shortly.  These bugs are an issue however
libvirt-lxc or lxc create containers.  However they only violate kernel
permission checks in the case of containers created by unprivileged
users, which is a niche case today.

Therefore this changeset marks for backporting the attribute enforcement
that do not cause regressions in the existing userspace. Implements
enforcement of nosuid and noexec.  Then disables that enforcement of
nosuid and nosexec and replaces that enforcment with a big fat warning.
Userspace should be fixed before 4.2 ships so I do not expect these
warnings to fire.  However the warnings give userspace time to get their
act together.  I am optimistic that all of userspace that cares will be
fixed and for v4.3 I can remove the warning messages and enforce the
attribute checks.

It is a fine line on the regression front and I hate walking it, but now
is the best time to address the issue of clearing attributes that should
not be cleared before lots of unprivileged container implementations
accumulate, and before nosid and noexec proc and sysfs matter.

This set of changes also addresses how open file descriptors from
/proc/<pid>/ns/* are displayed.  Recently readlink of /proc/<pid>/fd has
been triggering a WARN_ON that has not been meaningful in nearly a
decade, and is actively wrong now.  An old bug (2 years?) in
/proc/<pid>/mountinfo where bind mounts of these descriptors were
not meaningfully show is fixed.

Eric W. Biederman (14):
      mnt: Refactor the logic for mounting sysfs and proc in a user namespace
      mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
      mnt: Modify fs_fully_visible to deal with locked nosuid and noexec
      vfs: Ignore unlocked mounts in fs_fully_visible
      fs: Add helper functions for permanently empty directories.
      sysctl: Allow creating permanently empty directories that serve as mountpoints.
      proc: Allow creating permanently empty directories that serve as mount points
      kernfs: Add support for always empty directories.
      sysfs: Add support for permanently empty directories to serve as mount points.
      sysfs: Create mountpoints with sysfs_create_mount_point
      mnt: Update fs_fully_visible to test for permanently empty directories
      vfs: Remove incorrect debugging WARN in prepend_path
      nsfs: Add a show_path method to fix mountinfo
      mnt: Avoid unnecessary regressions in fs_fully_visible

 arch/s390/hypfs/inode.c      | 12 ++----
 drivers/firmware/efi/efi.c   |  6 +--
 fs/configfs/mount.c          | 10 ++---
 fs/dcache.c                  | 11 -----
 fs/debugfs/inode.c           | 11 ++---
 fs/fuse/inode.c              |  9 ++---
 fs/kernfs/dir.c              | 38 +++++++++++++++++-
 fs/kernfs/inode.c            |  2 +
 fs/libfs.c                   | 96 ++++++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c               | 80 +++++++++++++++++++++++++++++++++---
 fs/nsfs.c                    | 10 +++++
 fs/proc/generic.c            | 23 +++++++++++
 fs/proc/inode.c              |  4 ++
 fs/proc/internal.h           |  6 +++
 fs/proc/proc_sysctl.c        | 37 +++++++++++++++++
 fs/proc/root.c               |  9 ++---
 fs/pstore/inode.c            | 12 ++----
 fs/sysfs/dir.c               | 34 ++++++++++++++++
 fs/sysfs/mount.c             |  5 +--
 fs/tracefs/inode.c           |  6 +--
 include/linux/fs.h           |  4 +-
 include/linux/kernfs.h       |  3 ++
 include/linux/mount.h        |  5 +++
 include/linux/sysctl.h       |  3 ++
 include/linux/sysfs.h        | 15 +++++++
 kernel/cgroup.c              | 10 ++---
 kernel/sysctl.c              |  8 +---
 security/inode.c             | 10 ++---
 security/selinux/selinuxfs.c | 11 +++--
 security/smack/smackfs.c     |  8 ++--
 30 files changed, 397 insertions(+), 101 deletions(-)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [GIT PULL] User namespace related fixes for v4.2
       [not found] ` <87381eyz26.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-06-29 16:43   ` Linus Torvalds
  2015-07-01 20:41   ` Eric W. Biederman
  1 sibling, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2015-06-29 16:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman,
	Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages,
	Richard Weinberger, linux-fsdevel, Tejun Heo, Ivan Delalande

On Fri, Jun 26, 2015 at 1:50 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> Therefore this changeset marks for backporting the attribute enforcement
> that do not cause regressions in the existing userspace. Implements
> enforcement of nosuid and noexec.  Then disables that enforcement of
> nosuid and nosexec and replaces that enforcment with a big fat warning.
> Userspace should be fixed before 4.2 ships so I do not expect these
> warnings to fire.

Eric, that is *not* how this works.

If people have old user-space binaries, we do not require them to be
updated. So it doesn't matter one whit if "Userspace should be fixed
before 4.2 ships", because it is entirely irrelevant if the upstream
project stops doing something, when users want to be able to upgrade
their kernels regardless of whether they've upgraded their system
apps.

I'm going to hold off on pulling this, because I feel you don't
understand the regression rules.

I suggest we instead just always set nosuid and noexec for /proc and
/sys mounts, and make this whole thing a complete non-issue.

Instead of this crazy "let's warn about it and plan on breaking old
existing setups". That's _wrong_. It's so fundamentally wrong that I
will not pull from people who do not understand this.

The reason we have that "no regression" rule is not so that we fix up
bugs. It's because peopel should always feel safe upgrading their
kernel, and basically _know_ that kernel developers consider it
unacceptable to break user space. It should be a warm fuzzy feeling -
the feeling that we try our best, and if we ever fail because we
missed something or really believed that it can't ever matter, we'll
jump on it and we won't be making any excuses for our bugs. Because
breaking user space is a bug.

Kernel developers who don't understand "it is unacceptable to break
user space" shouldn't be kernel developers.

                 Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [GIT PULL] User namespace related fixes for v4.2
       [not found]   ` <CA+55aFysKDXr2HEwNzm3z9QOw=E4ZeWcvYQ-xLhy5_k+rGbeRg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-06-29 21:13     ` Eric W. Biederman
       [not found]       ` <87pp4eqktr.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
       [not found]       ` <CA+55aFw-DK-xDC-3HYa=BMX8WNyQgT9O01tihrAS9+-7PPj_jA@mail.gmail.com>
  0 siblings, 2 replies; 16+ messages in thread
From: Eric W. Biederman @ 2015-06-29 21:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman,
	Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages,
	Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Ivan Delalande

Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> writes:

> On Fri, Jun 26, 2015 at 1:50 PM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>
>> Therefore this changeset marks for backporting the attribute enforcement
>> that do not cause regressions in the existing userspace. Implements
>> enforcement of nosuid and noexec.  Then disables that enforcement of
>> nosuid and nosexec and replaces that enforcment with a big fat warning.
>> Userspace should be fixed before 4.2 ships so I do not expect these
>> warnings to fire.
>
> Eric, that is *not* how this works.
>
> If people have old user-space binaries, we do not require them to be
> updated. So it doesn't matter one whit if "Userspace should be fixed
> before 4.2 ships", because it is entirely irrelevant if the upstream
> project stops doing something, when users want to be able to upgrade
> their kernels regardless of whether they've upgraded their system
> apps.
>
> I'm going to hold off on pulling this, because I feel you don't
> understand the regression rules.

That is not the issue.  What happen is I found myself between a rock and
a hard place and I did not possess sufficient creativity to code my way
out.

Fearing regressions I sought out people to test these changes, on the
applications most likely to care.

I reduced the change from breaking userspace to a warning that userspace
is being ludicriously stupid.

I worked with the applications to get their bugs fixed.

> I suggest we instead just always set nosuid and noexec for /proc and
> /sys mounts, and make this whole thing a complete non-issue.

Doing exactly as you suggest will be user visible (mount flags), and
without care is likely to break remount.

Can you live with the patch below and committing to never supporting
executables on proc and sysfs?

With that I can solve all of my concerns, without affecting the existing
userspace programs.

> Instead of this crazy "let's warn about it and plan on breaking old
> existing setups". That's _wrong_. It's so fundamentally wrong that I
> will not pull from people who do not understand this.
>
> The reason we have that "no regression" rule is not so that we fix up
> bugs. It's because peopel should always feel safe upgrading their
> kernel, and basically _know_ that kernel developers consider it
> unacceptable to break user space. It should be a warm fuzzy feeling -
> the feeling that we try our best, and if we ever fail because we
> missed something or really believed that it can't ever matter, we'll
> jump on it and we won't be making any excuses for our bugs. Because
> breaking user space is a bug.
>
> Kernel developers who don't understand "it is unacceptable to break
> user space" shouldn't be kernel developers.

It is not that I do not understand it is that I had a failure of
imagination.  I have been agonizing about this issue since I have
encountered it trying to think of a better way.

Because of another failure of my imagination enabling user namespaces
has introduced a number of security regressions into the kernel, because
primarily I overlooked the effects of fine grained sharing in the mount
namespace and I have been working carefully and dilligent to get those
security regressions fixed.  Because if the user namespace by it's
designed semantics opens up security holes it is a failure.  Crap
happens and we occassionally over look things with the a greater amount
of code exposed to unprivileged users, but that is no excuse for doing
everything in my power to make user namespaces as safe to use as linux
without user namespaces.

Almost the entirety of my pull request is addressing that unfortunate
regression in security.

Until your comments suggested to me that it was acceptable to
permanently bar exectuables from proc and sysfs I did not see a way to
address this security hole (that does not currently appear exploitable),
but has been exploitable in the past, without breaking something.

Breaking two exectuables that will be unsafe to use at some point if I
did not get this fixed seemed the least damage I could do.

Hopefully you can live with permanently (and programatically) barring
exectuables from proc and sysfs and I can then move forward without 
reworking things so I can fix this without breaking anything.

Eric

------------------- cut here --------------------

From: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Date: Mon, 29 Jun 2015 14:42:03 -0500
Subject: [PATCH] vfs: Commit to never having exectuables on proc and sysfs.

Add a new flag to file_system_type for filesystems that are never
exepected to support executables.  Test that flag where MNT_NOEXEC is
tested today, so that user visible changes to mount flags are not
necessary.  The only user visible effect will be that exectuables
will be treated as if the execute bit is cleared, as happens today
when the MNT_NOEXEC flag is set on a mount.

Set the new flag on proc and sysfs.  As proc and sysfs do not implement
executables today there are no differences for userspace to notice.

The point of this exercise is that there are some applications that
due to oversites of their programmers do not set nosid and noexec when
they mount fresh copies of proc and sysfs today, and would become instant
security holes if we implemented exectubles especially suid executables
on proc and sysfs today.

With this change it becomes less necessary to vet each change to
proc and sysfs very carefully to ensure that executable files are
not implemented and to ensure that chattr can not create executable
files on proc and sysfs.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/exec.c           | 10 ++++++++--
 fs/open.c           |  2 +-
 fs/proc/root.c      |  2 +-
 fs/sysfs/mount.c    |  2 +-
 include/linux/fs.h  |  3 +++
 kernel/sys.c        |  3 +--
 mm/mmap.c           |  4 ++--
 mm/nommu.c          |  2 +-
 security/security.c |  2 +-
 9 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 1977c2a553ac..1e063854571b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -98,6 +98,12 @@ static inline void put_binfmt(struct linux_binfmt * fmt)
 	module_put(fmt->module);
 }
 
+bool path_noexec(const struct path *path)
+{
+	return (path->mnt->mnt_flags & MNT_NOEXEC) ||
+	       (path->mnt->mnt_sb->s_type->fs_flags & FS_NOEXEC);
+}
+
 #ifdef CONFIG_USELIB
 /*
  * Note that a shared library must be both readable and executable due to
@@ -132,7 +138,7 @@ SYSCALL_DEFINE1(uselib, const char __user *, library)
 		goto exit;
 
 	error = -EACCES;
-	if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
+	if (path_noexec(&file->f_path))
 		goto exit;
 
 	fsnotify_open(file);
@@ -777,7 +783,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
 	if (!S_ISREG(file_inode(file)->i_mode))
 		goto exit;
 
-	if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
+	if (path_noexec(&file->f_path))
 		goto exit;
 
 	err = deny_write_access(file);
diff --git a/fs/open.c b/fs/open.c
index e0250bdcc440..9fbdb1bae049 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -375,7 +375,7 @@ retry:
 		 * with the "noexec" flag.
 		 */
 		res = -EACCES;
-		if (path.mnt->mnt_flags & MNT_NOEXEC)
+		if (path_noexec(&path))
 			goto out_path_release;
 	}
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index b7fa4bfe896a..7e39312580d4 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -159,7 +159,7 @@ static struct file_system_type proc_fs_type = {
 	.name		= "proc",
 	.mount		= proc_mount,
 	.kill_sb	= proc_kill_sb,
-	.fs_flags	= FS_USERNS_MOUNT,
+	.fs_flags	= FS_NOEXEC | FS_USERNS_MOUNT,
 };
 
 void __init proc_root_init(void)
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 8a49486bf30c..9cd8667feb94 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -58,7 +58,7 @@ static struct file_system_type sysfs_fs_type = {
 	.name		= "sysfs",
 	.mount		= sysfs_mount,
 	.kill_sb	= sysfs_kill_sb,
-	.fs_flags	= FS_USERNS_MOUNT,
+	.fs_flags	= FS_NOEXEC | FS_USERNS_MOUNT,
 };
 
 int __init sysfs_init(void)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e351da4a934f..9e44c6d81bb2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1916,6 +1916,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_USERNS_DEV_MOUNT	16 /* A userns mount does not imply MNT_NODEV */
+#define FS_NOEXEC		32 /* FS will not support executables */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
@@ -3018,4 +3019,6 @@ static inline bool dir_relax(struct inode *inode)
 	return !IS_DEADDIR(inode);
 }
 
+extern bool path_noexec(const struct path *path);
+
 #endif /* _LINUX_FS_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..fa2f2f671a5c 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1668,8 +1668,7 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 	 * overall picture.
 	 */
 	err = -EACCES;
-	if (!S_ISREG(inode->i_mode)	||
-	    exe.file->f_path.mnt->mnt_flags & MNT_NOEXEC)
+	if (!S_ISREG(inode->i_mode) || path_noexec(&exe.file->f_path))
 		goto exit;
 
 	err = inode_permission(inode, MAY_EXEC);
diff --git a/mm/mmap.c b/mm/mmap.c
index aa632ade2be7..f126923ce683 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1268,7 +1268,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 	 *  mounted, in which case we dont add PROT_EXEC.)
 	 */
 	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
-		if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC)))
+		if (!(file && path_noexec(&file->f_path)))
 			prot |= PROT_EXEC;
 
 	if (!(flags & MAP_FIXED))
@@ -1337,7 +1337,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 		case MAP_PRIVATE:
 			if (!(file->f_mode & FMODE_READ))
 				return -EACCES;
-			if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) {
+			if (path_noexec(&file->f_path)) {
 				if (vm_flags & VM_EXEC)
 					return -EPERM;
 				vm_flags &= ~VM_MAYEXEC;
diff --git a/mm/nommu.c b/mm/nommu.c
index 05e7447d960b..5fdec8885256 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1035,7 +1035,7 @@ static int validate_mmap_request(struct file *file,
 
 		/* handle executable mappings and implied executable
 		 * mappings */
-		if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) {
+		if (path_noexec(&file->f_path)) {
 			if (prot & PROT_EXEC)
 				return -EPERM;
 		} else if ((prot & PROT_READ) && !(prot & PROT_EXEC)) {
diff --git a/security/security.c b/security/security.c
index 595fffab48b0..062f3c997fdc 100644
--- a/security/security.c
+++ b/security/security.c
@@ -776,7 +776,7 @@ static inline unsigned long mmap_prot(struct file *file, unsigned long prot)
 	 * ditto if it's not on noexec mount, except that on !MMU we need
 	 * NOMMU_MAP_EXEC (== VM_MAYEXEC) in this case
 	 */
-	if (!(file->f_path.mnt->mnt_flags & MNT_NOEXEC)) {
+	if (!path_noexec(&file->f_path)) {
 #ifndef CONFIG_MMU
 		if (file->f_op->mmap_capabilities) {
 			unsigned caps = file->f_op->mmap_capabilities(file);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [GIT PULL] User namespace related fixes for v4.2
       [not found] ` <87381eyz26.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-06-29 16:43   ` [GIT PULL] User namespace related fixes for v4.2 Linus Torvalds
@ 2015-07-01 20:41   ` Eric W. Biederman
       [not found]     ` <878uazhapq.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
       [not found]     ` <20150706204748.GB22962@ubuntu-hedt>
  1 sibling, 2 replies; 16+ messages in thread
From: Eric W. Biederman @ 2015-07-01 20:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman,
	Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages,
	Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Ivan Delalande

Linus,

Please pull the for-linus branch from the git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

   HEAD: 93e3bce6287e1fb3e60d3324ed08555b5bbafa89 vfs: Remove incorrect debugging WARN in prepend_path

Colds suck and I was tired and pushing a little too hard at the start of
this merge window and I included a few things in my previous pull
request that I felt a were not quite ready.  Those things have
been dropped from my tree.  My apologies that was irresponsible.

Long ago and far away when user namespaces where young it was realized
that allowing fresh mounts of proc and sysfs with only user namespace
permissions could violate the basic rule that only root gets to decide
if proc or sysfs should be mounted at all.

Some hacks were put in place to reduce the worst of the damage could be
done, and the common sense rule was adopted that fresh mounts of proc
and sysfs should allow no more than bind mounts of proc and sysfs.
Unfortunately that rule has not been fully enforced.

There are two kinds of gaps in that enforcement.  Only filesystems
mounted on empty directories of proc and sysfs should be ignored but the
test for empty directories was insufficient.  So in my tree directories
on proc, sysctl and sysfs that will always be empty are created
specially.  Every other technique is imperfect as an ordinary directory
can have entries added even after a readdir returns and shows that the
directory is empty.  Special creation of directories for mount points
makes the code in the kernel a smidge clearer about it's purpose.  I
asked container developers from the various container projects to help
test this and no holes were found in the set of mount points on proc and
sysfs that are created specially.

This set of changes also starts enforcing the mount flags of fresh
mounts of proc and sysfs are consistent with the existing mount of proc
and sysfs.  I expected this to be the boring part of the work but
unfortunately unprivileged userspace winds up mounting fresh copies of
proc and sysfs with noexec and nosuid clear when root set those flags on
the previous mount of proc and sysfs.  So for now only the atime,
read-only and nodev attributes which userspace happens to keep
consistent are enforced.  Dealing with the noexec and nosuid attributes
remains for another time.

This set of changes also addresses an issue with how open file
descriptors from /proc/<pid>/ns/* are displayed.  Recently readlink of
/proc/<pid>/fd has been triggering a WARN_ON that has not been
meaningful since it was added (as all of the code in the kernel was
converted) and is not now actively wrong.

There is also a short list of issues that have not been fixed yet that I
will mention briefly.

It is possible to rename a directory from below to above a bind mount.
At which point any directory pointers below the renamed directory can be
walked up to the root directory of the filesystem.  With user namespaces
enabled a bind mount of the bind mount can be created allowing the user
to pick a directory whose children they can rename to outside of the
bind mount.  This is challenging to fix and doubly so because all
obvious solutions must touch code that is in the performance part of
pathname resolution.

As mentioned above there is also a question of how to ensure that
developers by accident or with purpose do not introduce exectuable files
on sysfs and proc and in doing so introduce security regressions in the
current userspace that will not be immediately obvious and as such are
likely to require breaking userspace in painful ways once they are
recognized.

Eric W. Biederman (11):
      mnt: Refactor the logic for mounting sysfs and proc in a user namespace
      mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
      vfs: Ignore unlocked mounts in fs_fully_visible
      fs: Add helper functions for permanently empty directories.
      sysctl: Allow creating permanently empty directories that serve as mountpoints.
      proc: Allow creating permanently empty directories that serve as mount points
      kernfs: Add support for always empty directories.
      sysfs: Add support for permanently empty directories to serve as mount points.
      sysfs: Create mountpoints with sysfs_create_mount_point
      mnt: Update fs_fully_visible to test for permanently empty directories
      vfs: Remove incorrect debugging WARN in prepend_path

 arch/s390/hypfs/inode.c      | 12 ++----
 drivers/firmware/efi/efi.c   |  6 +--
 fs/configfs/mount.c          | 10 ++---
 fs/dcache.c                  | 11 -----
 fs/debugfs/inode.c           | 11 ++---
 fs/fuse/inode.c              |  9 ++---
 fs/kernfs/dir.c              | 38 +++++++++++++++++-
 fs/kernfs/inode.c            |  2 +
 fs/libfs.c                   | 96 ++++++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c               | 39 +++++++++++++++---
 fs/proc/generic.c            | 23 +++++++++++
 fs/proc/inode.c              |  4 ++
 fs/proc/internal.h           |  6 +++
 fs/proc/proc_sysctl.c        | 37 +++++++++++++++++
 fs/proc/root.c               |  9 ++---
 fs/pstore/inode.c            | 12 ++----
 fs/sysfs/dir.c               | 34 ++++++++++++++++
 fs/sysfs/mount.c             |  5 +--
 fs/tracefs/inode.c           |  6 +--
 include/linux/fs.h           |  4 +-
 include/linux/kernfs.h       |  3 ++
 include/linux/sysctl.h       |  3 ++
 include/linux/sysfs.h        | 15 +++++++
 kernel/cgroup.c              | 10 ++---
 kernel/sysctl.c              |  8 +---
 security/inode.c             | 10 ++---
 security/selinux/selinuxfs.c | 11 +++--
 security/smack/smackfs.c     |  8 ++--
 28 files changed, 341 insertions(+), 101 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [GIT PULL] User namespace related fixes for v4.2
       [not found]       ` <87pp4eqktr.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-07-03 22:10         ` Linus Torvalds
  0 siblings, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2015-07-03 22:10 UTC (permalink / raw)
  To: Eric W. Biederman, Al Viro
  Cc: Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman,
	Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages,
	Richard Weinberger,
	<linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Tejun Heo, Ivan Delalande

On Mon, Jun 29, 2015 at 2:13 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> Can you live with the patch below and committing to never supporting
> executables on proc and sysfs?

Sure. I don't think executables make any sense what-so-ever in those
filesystems. I think it's fine saying that /proc and /sys cannot have
executables in them, and then use that flag to just ignore the
relevant mount flags.

Al, comments?

                  Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [GIT PULL] User namespace related fixes for v4.2
       [not found]         ` <CA+55aFw-DK-xDC-3HYa=BMX8WNyQgT9O01tihrAS9+-7PPj_jA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-04 23:11           ` Al Viro
       [not found]             ` <20150704231118.GT17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
       [not found]             ` <87mvz4yomp.fsf_-_@x220.int.ebiederm.org>
  0 siblings, 2 replies; 16+ messages in thread
From: Al Viro @ 2015-07-04 23:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Seth Forshee, Linux API, Linux Containers, Greg Kroah-Hartman,
	Andy Lutomirski, Kenton Varda, Tejun Heo, Eric W. Biederman,
	Richard Weinberger,
	<linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Michael Kerrisk-manpages, Ivan Delalande

On Fri, Jul 03, 2015 at 03:10:40PM -0700, Linus Torvalds wrote:
> On Mon, Jun 29, 2015 at 2:13 PM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> >
> > Can you live with the patch below and committing to never supporting
> > executables on proc and sysfs?
> 
> Sure. I don't think executables make any sense what-so-ever in those
> filesystems. I think it's fine saying that /proc and /sys cannot have
> executables in them, and then use that flag to just ignore the
> relevant mount flags.
> 
> Al, comments?

	I can live with that, but I would prefer that to be a superblock
flag force-set in ->mount() (and preserved in ->remount_fs()) rather than
Yet Another FS Type Flag.  OTOH, it's not hard to change afterwards.

	Al, bloody annoyed by having spent hours debugging an odd corruption
in merge candidate, only to find that it correlated to temperature of the
host ;-/  Seem to be all gone after replacing CPU fan and cleaning the
mess under it...

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [GIT PULL] User namespace related fixes for v4.2
       [not found]     ` <878uazhapq.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-07-06 20:47       ` Seth Forshee
  0 siblings, 0 replies; 16+ messages in thread
From: Seth Forshee @ 2015-07-06 20:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski,
	Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande,
	Linus Torvalds

On Wed, Jul 01, 2015 at 03:41:37PM -0500, Eric W. Biederman wrote:
> This set of changes also starts enforcing the mount flags of fresh
> mounts of proc and sysfs are consistent with the existing mount of proc
> and sysfs.  I expected this to be the boring part of the work but
> unfortunately unprivileged userspace winds up mounting fresh copies of
> proc and sysfs with noexec and nosuid clear when root set those flags on
> the previous mount of proc and sysfs.  So for now only the atime,
> read-only and nodev attributes which userspace happens to keep
> consistent are enforced.  Dealing with the noexec and nosuid attributes
> remains for another time.

Sorry to be the bearer of bad news, but I am seeing a regression in lxc
with 4.2-rc1 due to this change. lxc is doing a fresh mount of sysfs
that never specifies either read-only or nodev regardless of how sysfs
has been mounted previously, and this is causing me to see mount
failures because of the nodev check.

If I comment out only the nodev check then the mount works on my system,
but based on the code in lxc I don't think there's any guarantee at all
of this mount having flags consistent with previous mounts.

Seth

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [GIT PULL] User namespace related fixes for v4.2
       [not found]     ` <20150706204748.GB22962@ubuntu-hedt>
@ 2015-07-06 21:24       ` Eric W. Biederman
       [not found]       ` <E81DECCD-9B19-4D42-BE43-5987DE7B05DB@xmission.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Eric W. Biederman @ 2015-07-06 21:24 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski,
	Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande,
	Linus Torvalds



On July 6, 2015 3:47:48 PM CDT, Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> wrote:
>On Wed, Jul 01, 2015 at 03:41:37PM -0500, Eric W. Biederman wrote:
>> This set of changes also starts enforcing the mount flags of fresh
>> mounts of proc and sysfs are consistent with the existing mount of
>proc
>> and sysfs.  I expected this to be the boring part of the work but
>> unfortunately unprivileged userspace winds up mounting fresh copies
>of
>> proc and sysfs with noexec and nosuid clear when root set those flags
>on
>> the previous mount of proc and sysfs.  So for now only the atime,
>> read-only and nodev attributes which userspace happens to keep
>> consistent are enforced.  Dealing with the noexec and nosuid
>attributes
>> remains for another time.
>
>Sorry to be the bearer of bad news, but I am seeing a regression in lxc
>with 4.2-rc1 due to this change. lxc is doing a fresh mount of sysfs
>that never specifies either read-only or nodev regardless of how sysfs
>has been mounted previously, and this is causing me to see mount
>failures because of the nodev check.
>
>If I comment out only the nodev check then the mount works on my
>system,
>but based on the code in lxc I don't think there's any guarantee at all
>of this mount having flags consistent with previous mounts.

Seth you are testing your inprogress patchset that
modifies how nodev works aren't you?

In rc1 nodev is always forced on a mount in a user namespace.

There is a fairly easy fix to the nodev cleanup in your
patchset, but it takes a few lines of code change in
fs_fully_visible.  Essentially after we get the better
nodev enforcement, fs_fully_visible does not need
to bother with nodev.

Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [GIT PULL] User namespace related fixes for v4.2
       [not found]         ` <E81DECCD-9B19-4D42-BE43-5987DE7B05DB-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2015-07-06 22:25           ` Seth Forshee
  0 siblings, 0 replies; 16+ messages in thread
From: Seth Forshee @ 2015-07-06 22:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Greg Kroah-Hartman, Andy Lutomirski,
	Kenton Varda, Michael Kerrisk-manpages, Richard Weinberger,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande,
	Linus Torvalds

On Mon, Jul 06, 2015 at 04:24:00PM -0500, Eric W. Biederman wrote:
> 
> 
> On July 6, 2015 3:47:48 PM CDT, Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> wrote:
> >On Wed, Jul 01, 2015 at 03:41:37PM -0500, Eric W. Biederman wrote:
> >> This set of changes also starts enforcing the mount flags of fresh
> >> mounts of proc and sysfs are consistent with the existing mount of
> >proc
> >> and sysfs.  I expected this to be the boring part of the work but
> >> unfortunately unprivileged userspace winds up mounting fresh copies
> >of
> >> proc and sysfs with noexec and nosuid clear when root set those flags
> >on
> >> the previous mount of proc and sysfs.  So for now only the atime,
> >> read-only and nodev attributes which userspace happens to keep
> >> consistent are enforced.  Dealing with the noexec and nosuid
> >attributes
> >> remains for another time.
> >
> >Sorry to be the bearer of bad news, but I am seeing a regression in lxc
> >with 4.2-rc1 due to this change. lxc is doing a fresh mount of sysfs
> >that never specifies either read-only or nodev regardless of how sysfs
> >has been mounted previously, and this is causing me to see mount
> >failures because of the nodev check.
> >
> >If I comment out only the nodev check then the mount works on my
> >system,
> >but based on the code in lxc I don't think there's any guarantee at all
> >of this mount having flags consistent with previous mounts.
> 
> Seth you are testing your inprogress patchset that
> modifies how nodev works aren't you?
> 
> In rc1 nodev is always forced on a mount in a user namespace.
> 
> There is a fairly easy fix to the nodev cleanup in your
> patchset, but it takes a few lines of code change in
> fs_fully_visible.  Essentially after we get the better
> nodev enforcement, fs_fully_visible does not need
> to bother with nodev.

Drat, you're right. I built an unmodified 4.2-rc1 but I apparently I had
booted to the wrong kernel when I thought I was testing it. Without the
extra patches it's fine; sorry for the noise.

Seth

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [REVIEW][PATCH 0/2] noexec on proc and sysfs
       [not found]             ` <20150704231118.GT17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-07-10 16:16               ` Eric W. Biederman
  0 siblings, 0 replies; 16+ messages in thread
From: Eric W. Biederman @ 2015-07-10 16:16 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, Linux API, Greg Kroah-Hartman, Al Viro,
	Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages,
	Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Ivan Delalande, Linus Torvalds


Given the code I have seen executables especially suid root executable
appearing on proc or sysfs will break userspace because there are
current applications that depend on nosuid and noexec on proc and sysfs
being meaningless.

This patchset addes a new flag SB_I_NOEXEC to enforce that restriction,
and to make it hard for a kernel developer to make the mistake of adding
executables to sysfs or proc.

The first patch has been updated since last time to a super block  flags
instead of a file_system type flag based on Al's suggestion.

The code in fs_fully_visible to enforce nosuid and noexec when needed
has also been added.

At a practical level this code is a no-op on a slow path, to guard
against future mistakes and to make auditing the kernel for this class
of problem trivial.

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

Eric W. Biederman (2):
      vfs: Commit to never having exectuables on proc and sysfs.
      mnt: fs_fully_visible enforce noexec and nosuid  if !SB_I_NOEXEC

 fs/exec.c           | 10 ++++++++--
 fs/namespace.c      | 33 +++++++++++++++++++++++++--------
 fs/open.c           |  2 +-
 fs/proc/root.c      |  2 ++
 fs/sysfs/mount.c    |  4 ++++
 include/linux/fs.h  |  3 +++
 kernel/sys.c        |  3 +--
 mm/mmap.c           |  4 ++--
 mm/nommu.c          |  2 +-
 security/security.c |  2 +-
 10 files changed, 48 insertions(+), 17 deletions(-)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables on proc and sysfs.
       [not found]               ` <87mvz4yomp.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-07-10 16:17                 ` Eric W. Biederman
  2015-07-10 16:18                 ` [REVIEW][PATCH 2/2] mnt: fs_fully_visible enforce noexec and nosuid if !SB_I_NOEXEC Eric W. Biederman
  1 sibling, 0 replies; 16+ messages in thread
From: Eric W. Biederman @ 2015-07-10 16:17 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, Linux API, Greg Kroah-Hartman, Al Viro,
	Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages,
	Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Ivan Delalande, Linus Torvalds


Today proc and sysfs do not contain any executable files.  Several
applications today mount proc or sysfs without noexec and nosuid and
then depend on there being no exectuables files on proc or sysfs.
Having any executable files show on proc or sysfs would cause
a user space visible regression, and most likely security problems.

Therefore commit to never allowing executables on proc and sysfs by
adding a new flag to mark them as filesystems without executables and
enforce that flag.

Test the flag where MNT_NOEXEC is tested today, so that the only user
visible effect will be that exectuables will be treated as if the
execute bit is cleared.

The filesystems proc and sysfs do not currently incoporate any
executable files so this does not result in any user visible effects.

This makes it unnecessary to vet changes to proc and sysfs tightly for
adding exectuable files or changes to chattr that would modify
existing files, as no matter what the individual file say they will
not be treated as exectuable files by the vfs.

Not having to vet changes to closely is important as without this we
are only one proc_create call (or another goof up in the
implementation of notify_change) from having problematic executables
on proc.  Those mistakes are all too easy to make and would create
a situation where there are security issues or the assumptions of
some program having to be broken (and cause userspace regressions).

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/exec.c           | 10 ++++++++--
 fs/open.c           |  2 +-
 fs/proc/root.c      |  2 ++
 fs/sysfs/mount.c    |  4 ++++
 include/linux/fs.h  |  3 +++
 kernel/sys.c        |  3 +--
 mm/mmap.c           |  4 ++--
 mm/nommu.c          |  2 +-
 security/security.c |  2 +-
 9 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 1977c2a553ac..b06623a9347f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -98,6 +98,12 @@ static inline void put_binfmt(struct linux_binfmt * fmt)
 	module_put(fmt->module);
 }
 
+bool path_noexec(const struct path *path)
+{
+	return (path->mnt->mnt_flags & MNT_NOEXEC) ||
+	       (path->mnt->mnt_sb->s_iflags & SB_I_NOEXEC);
+}
+
 #ifdef CONFIG_USELIB
 /*
  * Note that a shared library must be both readable and executable due to
@@ -132,7 +138,7 @@ SYSCALL_DEFINE1(uselib, const char __user *, library)
 		goto exit;
 
 	error = -EACCES;
-	if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
+	if (path_noexec(&file->f_path))
 		goto exit;
 
 	fsnotify_open(file);
@@ -777,7 +783,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
 	if (!S_ISREG(file_inode(file)->i_mode))
 		goto exit;
 
-	if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
+	if (path_noexec(&file->f_path))
 		goto exit;
 
 	err = deny_write_access(file);
diff --git a/fs/open.c b/fs/open.c
index e33dab287fa0..b6f1e96a7c0b 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -377,7 +377,7 @@ retry:
 		 * with the "noexec" flag.
 		 */
 		res = -EACCES;
-		if (path.mnt->mnt_flags & MNT_NOEXEC)
+		if (path_noexec(&path))
 			goto out_path_release;
 	}
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 68feb0f70e63..361ab4ee42fc 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -134,6 +134,8 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 		}
 
 		sb->s_flags |= MS_ACTIVE;
+		/* User space would break if executables appear on proc */
+		sb->s_iflags |= SB_I_NOEXEC;
 	}
 
 	return dget(sb->s_root);
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 1c6ac6fcee9f..f3db82071cfb 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -40,6 +40,10 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 				SYSFS_MAGIC, &new_sb, ns);
 	if (IS_ERR(root) || !new_sb)
 		kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
+	else if (new_sb)
+		/* Userspace would break if executables appear on sysfs */
+		root->d_sb->s_iflags |= SB_I_NOEXEC;
+
 	return root;
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a0653e560c26..42912f8d286e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1244,6 +1244,7 @@ struct mm_struct;
 
 /* sb->s_iflags */
 #define SB_I_CGROUPWB	0x00000001	/* cgroup-aware writeback enabled */
+#define SB_I_NOEXEC	0x00000002	/* Ignore executables on this fs */
 
 /* Possible states of 'frozen' field */
 enum {
@@ -3030,4 +3031,6 @@ static inline bool dir_relax(struct inode *inode)
 	return !IS_DEADDIR(inode);
 }
 
+extern bool path_noexec(const struct path *path);
+
 #endif /* _LINUX_FS_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..fa2f2f671a5c 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1668,8 +1668,7 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 	 * overall picture.
 	 */
 	err = -EACCES;
-	if (!S_ISREG(inode->i_mode)	||
-	    exe.file->f_path.mnt->mnt_flags & MNT_NOEXEC)
+	if (!S_ISREG(inode->i_mode) || path_noexec(&exe.file->f_path))
 		goto exit;
 
 	err = inode_permission(inode, MAY_EXEC);
diff --git a/mm/mmap.c b/mm/mmap.c
index aa632ade2be7..f126923ce683 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1268,7 +1268,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 	 *  mounted, in which case we dont add PROT_EXEC.)
 	 */
 	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
-		if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC)))
+		if (!(file && path_noexec(&file->f_path)))
 			prot |= PROT_EXEC;
 
 	if (!(flags & MAP_FIXED))
@@ -1337,7 +1337,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 		case MAP_PRIVATE:
 			if (!(file->f_mode & FMODE_READ))
 				return -EACCES;
-			if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) {
+			if (path_noexec(&file->f_path)) {
 				if (vm_flags & VM_EXEC)
 					return -EPERM;
 				vm_flags &= ~VM_MAYEXEC;
diff --git a/mm/nommu.c b/mm/nommu.c
index 58ea3643b9e9..ce17abf087ff 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1035,7 +1035,7 @@ static int validate_mmap_request(struct file *file,
 
 		/* handle executable mappings and implied executable
 		 * mappings */
-		if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) {
+		if (path_noexec(&file->f_path)) {
 			if (prot & PROT_EXEC)
 				return -EPERM;
 		} else if ((prot & PROT_READ) && !(prot & PROT_EXEC)) {
diff --git a/security/security.c b/security/security.c
index 595fffab48b0..062f3c997fdc 100644
--- a/security/security.c
+++ b/security/security.c
@@ -776,7 +776,7 @@ static inline unsigned long mmap_prot(struct file *file, unsigned long prot)
 	 * ditto if it's not on noexec mount, except that on !MMU we need
 	 * NOMMU_MAP_EXEC (== VM_MAYEXEC) in this case
 	 */
-	if (!(file->f_path.mnt->mnt_flags & MNT_NOEXEC)) {
+	if (!path_noexec(&file->f_path)) {
 #ifndef CONFIG_MMU
 		if (file->f_op->mmap_capabilities) {
 			unsigned caps = file->f_op->mmap_capabilities(file);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [REVIEW][PATCH 2/2] mnt: fs_fully_visible enforce noexec and nosuid if !SB_I_NOEXEC
       [not found]               ` <87mvz4yomp.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2015-07-10 16:17                 ` [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables " Eric W. Biederman
@ 2015-07-10 16:18                 ` Eric W. Biederman
  1 sibling, 0 replies; 16+ messages in thread
From: Eric W. Biederman @ 2015-07-10 16:18 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, Linux API, Greg Kroah-Hartman, Al Viro,
	Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages,
	Richard Weinberger, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Ivan Delalande, Linus Torvalds


The filesystems proc and sysfs do not have executable files do not
have exectuable files today and portions of userspace break if we do
enforce nosuid and noexec consistency of nosuid and noexec flags
between previous mounts and new mounts of proc and sysfs.

Add the code to enforce consistency of the nosuid and noexec flags,
and use the presence of SB_I_NOEXEC to signal that there is no need to
bother.

This results in a completely userspace invisible change that makes it
clear fs_fully_visible can only skip the enforcement of noexec and
nosuid because it is known the filesystems in question do not support
executables.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c7cb8a526c05..ce428cadd41f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3194,6 +3194,8 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 	down_read(&namespace_sem);
 	list_for_each_entry(mnt, &ns->list, mnt_list) {
 		struct mount *child;
+		int mnt_flags;
+
 		if (mnt->mnt.mnt_sb->s_type != type)
 			continue;
 
@@ -3203,17 +3205,30 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
 			continue;
 
+		/* Read the mount flags and filter out flags that
+		 * may safely be ignored.
+		 */
+		mnt_flags = mnt->mnt.mnt_flags;
+		if (mnt->mnt.mnt_sb->s_iflags & SB_I_NOEXEC)
+			mnt_flags &= ~(MNT_LOCK_NOSUID | MNT_LOCK_NOEXEC);
+
 		/* Verify the mount flags are equal to or more permissive
 		 * than the proposed new mount.
 		 */
-		if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) &&
+		if ((mnt_flags & MNT_LOCK_READONLY) &&
 		    !(new_flags & MNT_READONLY))
 			continue;
-		if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
+		if ((mnt_flags & MNT_LOCK_NODEV) &&
 		    !(new_flags & MNT_NODEV))
 			continue;
-		if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
-		    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
+		if ((mnt_flags & MNT_LOCK_NOSUID) &&
+		    !(new_flags & MNT_NOSUID))
+			continue;
+		if ((mnt_flags & MNT_LOCK_NOEXEC) &&
+		    !(new_flags & MNT_NOEXEC))
+			continue;
+		if ((mnt_flags & MNT_LOCK_ATIME) &&
+		    ((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
 			continue;
 
 		/* This mount is not fully visible if there are any
@@ -3223,16 +3238,18 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
 			struct inode *inode = child->mnt_mountpoint->d_inode;
 			/* Only worry about locked mounts */
-			if (!(mnt->mnt.mnt_flags & MNT_LOCKED))
+			if (!(mnt_flags & MNT_LOCKED))
 				continue;
 			/* Is the directory permanetly empty? */
 			if (!is_empty_dir_inode(inode))
 				goto next;
 		}
 		/* Preserve the locked attributes */
-		*new_mnt_flags |= mnt->mnt.mnt_flags & (MNT_LOCK_READONLY | \
-							MNT_LOCK_NODEV    | \
-							MNT_LOCK_ATIME);
+		*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
+					       MNT_LOCK_NODEV    | \
+					       MNT_LOCK_NOSUID   | \
+					       MNT_LOCK_NOEXEC   | \
+					       MNT_LOCK_ATIME);
 		visible = true;
 		goto found;
 	next:	;
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables on proc and sysfs.
       [not found]                 ` <87h9pcyokc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2015-07-10 18:24                   ` Richard Weinberger
       [not found]                     ` <55A00DE9.7060806-/L3Ra7n9ekc@public.gmane.org>
       [not found]                     ` <20150710193052.GB19824@kroah.com>
  0 siblings, 2 replies; 16+ messages in thread
From: Richard Weinberger @ 2015-07-10 18:24 UTC (permalink / raw)
  To: Eric W. Biederman, Linux Containers
  Cc: Seth Forshee, Linux API, Al Viro, Andy Lutomirski, Kenton Varda,
	Michael Kerrisk-manpages, Greg Kroah-Hartman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Ivan Delalande,
	Linus Torvalds

Am 10.07.2015 um 18:17 schrieb Eric W. Biederman:
> 
> Today proc and sysfs do not contain any executable files.  Several
> applications today mount proc or sysfs without noexec and nosuid and
> then depend on there being no exectuables files on proc or sysfs.
> Having any executable files show on proc or sysfs would cause
> a user space visible regression, and most likely security problems.
> 
> Therefore commit to never allowing executables on proc and sysfs by
> adding a new flag to mark them as filesystems without executables and
> enforce that flag.
> 
> Test the flag where MNT_NOEXEC is tested today, so that the only user
> visible effect will be that exectuables will be treated as if the
> execute bit is cleared.
> 
> The filesystems proc and sysfs do not currently incoporate any
> executable files so this does not result in any user visible effects.
> 
> This makes it unnecessary to vet changes to proc and sysfs tightly for
> adding exectuable files or changes to chattr that would modify
> existing files, as no matter what the individual file say they will
> not be treated as exectuable files by the vfs.
> 
> Not having to vet changes to closely is important as without this we
> are only one proc_create call (or another goof up in the
> implementation of notify_change) from having problematic executables
> on proc.  Those mistakes are all too easy to make and would create
> a situation where there are security issues or the assumptions of
> some program having to be broken (and cause userspace regressions).

Would it make sense to add SB_I_NOEXEC to more pseudo filesystems?
Say pstore or devpts?

Thanks,
//richard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables on proc and sysfs.
       [not found]                     ` <55A00DE9.7060806-/L3Ra7n9ekc@public.gmane.org>
@ 2015-07-10 19:30                       ` Greg Kroah-Hartman
  0 siblings, 0 replies; 16+ messages in thread
From: Greg Kroah-Hartman @ 2015-07-10 19:30 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Seth Forshee, Linux API, Linux Containers, Al Viro,
	Andy Lutomirski, Kenton Varda, Tejun Heo, Eric W. Biederman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk-manpages,
	Ivan Delalande, Linus Torvalds

On Fri, Jul 10, 2015 at 08:24:41PM +0200, Richard Weinberger wrote:
> Am 10.07.2015 um 18:17 schrieb Eric W. Biederman:
> > 
> > Today proc and sysfs do not contain any executable files.  Several
> > applications today mount proc or sysfs without noexec and nosuid and
> > then depend on there being no exectuables files on proc or sysfs.
> > Having any executable files show on proc or sysfs would cause
> > a user space visible regression, and most likely security problems.
> > 
> > Therefore commit to never allowing executables on proc and sysfs by
> > adding a new flag to mark them as filesystems without executables and
> > enforce that flag.
> > 
> > Test the flag where MNT_NOEXEC is tested today, so that the only user
> > visible effect will be that exectuables will be treated as if the
> > execute bit is cleared.
> > 
> > The filesystems proc and sysfs do not currently incoporate any
> > executable files so this does not result in any user visible effects.
> > 
> > This makes it unnecessary to vet changes to proc and sysfs tightly for
> > adding exectuable files or changes to chattr that would modify
> > existing files, as no matter what the individual file say they will
> > not be treated as exectuable files by the vfs.
> > 
> > Not having to vet changes to closely is important as without this we
> > are only one proc_create call (or another goof up in the
> > implementation of notify_change) from having problematic executables
> > on proc.  Those mistakes are all too easy to make and would create
> > a situation where there are security issues or the assumptions of
> > some program having to be broken (and cause userspace regressions).
> 
> Would it make sense to add SB_I_NOEXEC to more pseudo filesystems?
> Say pstore or devpts?

And configfs and cgroupfs?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables on proc and sysfs.
       [not found]                       ` <20150710193052.GB19824-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
@ 2015-07-10 19:38                         ` Richard Weinberger
  0 siblings, 0 replies; 16+ messages in thread
From: Richard Weinberger @ 2015-07-10 19:38 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Seth Forshee, Linux API, Linux Containers, Al Viro,
	Andy Lutomirski, Kenton Varda, Tejun Heo, Eric W. Biederman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk-manpages,
	Ivan Delalande, Linus Torvalds

Am 10.07.2015 um 21:30 schrieb Greg Kroah-Hartman:
> On Fri, Jul 10, 2015 at 08:24:41PM +0200, Richard Weinberger wrote:
>> Am 10.07.2015 um 18:17 schrieb Eric W. Biederman:
>>>
>>> Today proc and sysfs do not contain any executable files.  Several
>>> applications today mount proc or sysfs without noexec and nosuid and
>>> then depend on there being no exectuables files on proc or sysfs.
>>> Having any executable files show on proc or sysfs would cause
>>> a user space visible regression, and most likely security problems.
>>>
>>> Therefore commit to never allowing executables on proc and sysfs by
>>> adding a new flag to mark them as filesystems without executables and
>>> enforce that flag.
>>>
>>> Test the flag where MNT_NOEXEC is tested today, so that the only user
>>> visible effect will be that exectuables will be treated as if the
>>> execute bit is cleared.
>>>
>>> The filesystems proc and sysfs do not currently incoporate any
>>> executable files so this does not result in any user visible effects.
>>>
>>> This makes it unnecessary to vet changes to proc and sysfs tightly for
>>> adding exectuable files or changes to chattr that would modify
>>> existing files, as no matter what the individual file say they will
>>> not be treated as exectuable files by the vfs.
>>>
>>> Not having to vet changes to closely is important as without this we
>>> are only one proc_create call (or another goof up in the
>>> implementation of notify_change) from having problematic executables
>>> on proc.  Those mistakes are all too easy to make and would create
>>> a situation where there are security issues or the assumptions of
>>> some program having to be broken (and cause userspace regressions).
>>
>> Would it make sense to add SB_I_NOEXEC to more pseudo filesystems?
>> Say pstore or devpts?
> 
> And configfs and cgroupfs?

Yep. Any filesystem where exectuables do not make sense. :-)

Thanks,
//richard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables on proc and sysfs.
       [not found]                         ` <55A01F4B.9010205-/L3Ra7n9ekc@public.gmane.org>
@ 2015-07-10 20:00                           ` Eric W. Biederman
  0 siblings, 0 replies; 16+ messages in thread
From: Eric W. Biederman @ 2015-07-10 20:00 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Seth Forshee, Greg Kroah-Hartman, Linux Containers, Al Viro,
	Andy Lutomirski, Kenton Varda, Michael Kerrisk-manpages,
	Linux API, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	Ivan Delalande, Linus Torvalds

Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:

> Am 10.07.2015 um 21:30 schrieb Greg Kroah-Hartman:
>> On Fri, Jul 10, 2015 at 08:24:41PM +0200, Richard Weinberger wrote:
>>> Am 10.07.2015 um 18:17 schrieb Eric W. Biederman:
>>>>
>>>> Today proc and sysfs do not contain any executable files.  Several
>>>> applications today mount proc or sysfs without noexec and nosuid and
>>>> then depend on there being no exectuables files on proc or sysfs.
>>>> Having any executable files show on proc or sysfs would cause
>>>> a user space visible regression, and most likely security problems.
>>>>
>>>> Therefore commit to never allowing executables on proc and sysfs by
>>>> adding a new flag to mark them as filesystems without executables and
>>>> enforce that flag.
>>>>
>>>> Test the flag where MNT_NOEXEC is tested today, so that the only user
>>>> visible effect will be that exectuables will be treated as if the
>>>> execute bit is cleared.
>>>>
>>>> The filesystems proc and sysfs do not currently incoporate any
>>>> executable files so this does not result in any user visible effects.
>>>>
>>>> This makes it unnecessary to vet changes to proc and sysfs tightly for
>>>> adding exectuable files or changes to chattr that would modify
>>>> existing files, as no matter what the individual file say they will
>>>> not be treated as exectuable files by the vfs.
>>>>
>>>> Not having to vet changes to closely is important as without this we
>>>> are only one proc_create call (or another goof up in the
>>>> implementation of notify_change) from having problematic executables
>>>> on proc.  Those mistakes are all too easy to make and would create
>>>> a situation where there are security issues or the assumptions of
>>>> some program having to be broken (and cause userspace regressions).
>>>
>>> Would it make sense to add SB_I_NOEXEC to more pseudo filesystems?
>>> Say pstore or devpts?
>> 
>> And configfs and cgroupfs?
>
> Yep. Any filesystem where exectuables do not make sense. :-)

With a threat model of a developer overlooking something and allows
executables by accident?  I certainly think it makes sense.

Mostly because we are solving things at the vfs level.  Which gives us
one well tested solution instead of several filesystem specific
solutions.

That isn't quite the same problem that caused me to write this code,
so I am not going to volunteer to write the patches for the additional
filesystems. 

Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-07-10 20:00 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-26 20:50 [GIT PULL] User namespace related fixes for v4.2 Eric W. Biederman
     [not found] ` <CA+55aFysKDXr2HEwNzm3z9QOw=E4ZeWcvYQ-xLhy5_k+rGbeRg@mail.gmail.com>
     [not found]   ` <CA+55aFysKDXr2HEwNzm3z9QOw=E4ZeWcvYQ-xLhy5_k+rGbeRg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-06-29 21:13     ` Eric W. Biederman
     [not found]       ` <87pp4eqktr.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-07-03 22:10         ` Linus Torvalds
     [not found]       ` <CA+55aFw-DK-xDC-3HYa=BMX8WNyQgT9O01tihrAS9+-7PPj_jA@mail.gmail.com>
     [not found]         ` <CA+55aFw-DK-xDC-3HYa=BMX8WNyQgT9O01tihrAS9+-7PPj_jA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-07-04 23:11           ` Al Viro
     [not found]             ` <20150704231118.GT17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2015-07-10 16:16               ` [REVIEW][PATCH 0/2] noexec on proc and sysfs Eric W. Biederman
     [not found]             ` <87mvz4yomp.fsf_-_@x220.int.ebiederm.org>
     [not found]               ` <87mvz4yomp.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-07-10 16:17                 ` [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables " Eric W. Biederman
2015-07-10 16:18                 ` [REVIEW][PATCH 2/2] mnt: fs_fully_visible enforce noexec and nosuid if !SB_I_NOEXEC Eric W. Biederman
     [not found]               ` <87h9pcyokc.fsf_-_@x220.int.ebiederm.org>
     [not found]                 ` <87h9pcyokc.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-07-10 18:24                   ` [REVIEW][PATCH 1/2] vfs: Commit to never having exectuables on proc and sysfs Richard Weinberger
     [not found]                     ` <55A00DE9.7060806-/L3Ra7n9ekc@public.gmane.org>
2015-07-10 19:30                       ` Greg Kroah-Hartman
     [not found]                     ` <20150710193052.GB19824@kroah.com>
     [not found]                       ` <20150710193052.GB19824-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2015-07-10 19:38                         ` Richard Weinberger
     [not found]                       ` <55A01F4B.9010205@nod.at>
     [not found]                         ` <55A01F4B.9010205-/L3Ra7n9ekc@public.gmane.org>
2015-07-10 20:00                           ` Eric W. Biederman
     [not found] ` <87381eyz26.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-06-29 16:43   ` [GIT PULL] User namespace related fixes for v4.2 Linus Torvalds
2015-07-01 20:41   ` Eric W. Biederman
     [not found]     ` <878uazhapq.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2015-07-06 20:47       ` Seth Forshee
     [not found]     ` <20150706204748.GB22962@ubuntu-hedt>
2015-07-06 21:24       ` Eric W. Biederman
     [not found]       ` <E81DECCD-9B19-4D42-BE43-5987DE7B05DB@xmission.com>
     [not found]         ` <E81DECCD-9B19-4D42-BE43-5987DE7B05DB-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2015-07-06 22:25           ` Seth Forshee

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).