linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [GIT PULL] user namespace and namespace infrastructure changes for 3.8
@ 2012-12-11 21:17 Eric W. Biederman
  2012-12-13 19:24 ` Andy Lutomirski
  0 siblings, 1 reply; 27+ messages in thread
From: Eric W. Biederman @ 2012-12-11 21:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Linux Containers, Serge E. Hallyn


Linus,

Please pull the for-linus git tree from:

   git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

   HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers for the namespace file descriptors.

   This tree is against v3.7-rc3

While small this set of changes is very significant with respect to
containers in general and user namespaces in particular.  The user space
interface is now complete.

This set of changes adds support for unprivileged users to create user
namespaces and as a user namespace root to create other namespaces.  The
tyrrany of supporting suid root preventing unprivileged users from using
cool new kernel features is broken.

This set of changes completes the work on setns, adding support for
the pid, user, mount namespaces.

This set of changes includes a bunch of basic pid namespace
cleanups/simplifications.  Of particular significance is the rework of
the pid namespace cleanup so it no longer requires sending out tendrils
into all kinds of unexpected cleanup paths for operation.  At least one
case of broken error handling is fixed by this cleanup.

The files under /proc/<pid>/ns/ have been converted from regular files
to magic symlinks which prevents incorrect caching by the VFS, ensuring
the files always refer to the namespace the process is currently using
and ensuring that the ptrace_mayaccess permission checks are always
applied.

The files under /proc/<pid>/ns/ have been given stable inode numbers so
it is now possible to see if different processes share the same
namespaces.

Through the David Miller's net tree are changes to relax many of the
permission checks in the networking stack to allowing the user namespace
root to usefully use the networking stack.  Similar changes for the
mount namespace and the pid namespace are coming through my tree.

Two small nework namespace changes were double committed here and in
David Millers -net tree so that I could complete the work on the
/proc/<pid>/ns/ files in this tree.

The user namespace work that remains is converting, 9p, afs, ceph, cifs,
coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so they are safe to enable
when user namespaces are enabled, and implementing unprivileged mounts
of more than just /proc and /sys.

I had hoped to get through more of those changes this cycle but
I turned into a cold magnet this season and the UAPI changes caused
a lot of churn late into the 3.7 -rc cycle that made a stable starting
place hard to work from hard to find.

Eric W. Biederman (37):
      userns: Support autofs4 interacing with multiple user namespaces
      userns: Support fuse interacting with multiple user namespaces
      netns: Deduplicate and fix copy_net_ns when !CONFIG_NET_NS
      userns: make each net (net_ns) belong to a user_ns
      userns: On mips modify check_same_owner to use uid_eq
      procfs: Use the proc generic infrastructure for proc/self.
      procfs: Don't cache a pid in the root inode.
      pidns: Capture the user namespace and filter ns_last_pid
      pidns: Use task_active_pid_ns where appropriate
      pidns: Make the pidns proc mount/umount logic obvious.
      pidns: Don't allow new processes in a dead pid namespace.
      pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1
      pidns: Deny strange cases when creating pid namespaces.
      pidns: Add setns support
      pidns: Consolidate initialzation of special init task state
      pidns: Support unsharing the pid namespace.
      vfs: Allow chroot if you have CAP_SYS_CHROOT in your user namespace
      vfs: Add setns support for the mount namespace
      vfs: Add a user namespace reference from struct mnt_namespace
      vfs: Only support slave subtrees across different user namespaces
      vfs: Allow unprivileged manipulation of the mount namespace.
      userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
      userns: Allow unprivileged users to create user namespaces.
      userns: Allow chown and setgid preservation
      userns: Allow setting a userns mapping to your current uid.
      userns: Allow unprivileged users to create new namespaces
      userns: Allow unprivileged use of setns.
      userns: Make create_new_namespaces take a user_ns parameter
      userns: Kill task_user_ns
      userns: Implent proc namespace operations
      userns: Implement unshare of the user namespace
      procfs: Print task uids and gids in the userns that opened the proc file
      userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
      userns: Allow unprivilged mounts of proc and sysfs
      proc: Generalize proc inode allocation
      proc: Fix the namespace inode permission checks.
      proc: Usable inode numbers for the namespace file descriptors.

Zhao Hongjiang (1):
      userns: fix return value on mntns_install() failure

 arch/mips/kernel/mips-mt-fpaff.c          |    4 +-
 arch/powerpc/platforms/cell/spufs/sched.c |    2 +-
 arch/um/drivers/mconsole_kern.c           |    2 +-
 drivers/staging/android/binder.c          |    3 +-
 fs/attr.c                                 |   11 +-
 fs/autofs4/autofs_i.h                     |    8 +-
 fs/autofs4/dev-ioctl.c                    |    4 +-
 fs/autofs4/inode.c                        |   24 ++--
 fs/autofs4/waitq.c                        |    5 +-
 fs/exec.c                                 |    9 +-
 fs/fuse/dev.c                             |    4 +-
 fs/fuse/dir.c                             |   20 ++--
 fs/fuse/fuse_i.h                          |    4 +-
 fs/fuse/inode.c                           |   23 ++--
 fs/hppfs/hppfs.c                          |    2 +-
 fs/mount.h                                |    3 +
 fs/namespace.c                            |  211 ++++++++++++++++++++++++-----
 fs/open.c                                 |    2 +-
 fs/pnode.h                                |    1 +
 fs/proc/Makefile                          |    1 +
 fs/proc/array.c                           |    2 +-
 fs/proc/base.c                            |  169 +-----------------------
 fs/proc/generic.c                         |   26 ++--
 fs/proc/inode.c                           |    6 +-
 fs/proc/internal.h                        |    1 +
 fs/proc/namespaces.c                      |  185 ++++++++++++++++++++++---
 fs/proc/root.c                            |   17 +--
 fs/proc/self.c                            |   59 ++++++++
 fs/sysfs/mount.c                          |    1 +
 include/linux/cred.h                      |    2 -
 include/linux/fs.h                        |    2 +
 include/linux/ipc_namespace.h             |    9 +-
 include/linux/mnt_namespace.h             |    3 +-
 include/linux/nsproxy.h                   |    2 +-
 include/linux/pid_namespace.h             |   11 +-
 include/linux/proc_fs.h                   |   26 ++++-
 include/linux/user_namespace.h            |   10 ++
 include/linux/utsname.h                   |    7 +-
 include/net/net_namespace.h               |   26 +++-
 init/Kconfig                              |    2 -
 init/main.c                               |    1 -
 init/version.c                            |    2 +
 ipc/msgutil.c                             |    2 +
 ipc/namespace.c                           |   32 ++++-
 kernel/cgroup.c                           |    2 +-
 kernel/events/core.c                      |    2 +-
 kernel/exit.c                             |   12 --
 kernel/fork.c                             |   69 +++++++---
 kernel/nsproxy.c                          |   36 +++---
 kernel/pid.c                              |   47 ++++++-
 kernel/pid_namespace.c                    |  112 ++++++++++++---
 kernel/ptrace.c                           |   10 +-
 kernel/sched/core.c                       |   10 +-
 kernel/signal.c                           |    2 +-
 kernel/sysctl_binary.c                    |    2 +-
 kernel/user.c                             |    2 +
 kernel/user_namespace.c                   |  147 +++++++++++++++++---
 kernel/utsname.c                          |   33 ++++-
 net/core/net_namespace.c                  |   54 ++++++--
 security/yama/yama_lsm.c                  |   12 ++-
 60 files changed, 1026 insertions(+), 472 deletions(-)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
  2012-12-11 21:17 [GIT PULL] user namespace and namespace infrastructure changes for 3.8 Eric W. Biederman
@ 2012-12-13 19:24 ` Andy Lutomirski
  2012-12-13 22:01   ` Eric W. Biederman
  0 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2012-12-13 19:24 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linus Torvalds, containers, Linux Kernel Mailing List

On 12/11/2012 01:17 PM, Eric W. Biederman wrote:
> 
> Linus,
> 
> Please pull the for-linus git tree from:
> 
>    git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus
> 
>    HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers for the namespace file descriptors.
> 
>    This tree is against v3.7-rc3

You've just allowed unprivileged users to create new pid namespaces,
etc, by creating a new userns, then creating a new pid namespace inside
that userns, then setns-ing from outside the userns into the pid ns.  Is
this intentional?  (The mount ns is okay -- it checks for CAP_CHROOT on
setns.)


In user_namespace.c:

        /* Threaded many not enter a different user namespace */
        if (atomic_read(&current->mm->mm_users) > 1)
                return -EINVAL;

The comment has a typo.  Also, you're checking the wrong condition:
that's whether the vm is shared, not whether the thread group has more
than one member.

In any case, why are threads special here?



I think, although I haven't verified it, that these changes allow
CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain
CAP_MODULE): unshare the user namespace and then setfd yourself back.  I
think that setns should only grant caps when changing to a descendent
namespace.

Also in userns_install:

796         /* Don't allow gaining capabilities by reentering
797          * the same user namespace.
798          */
799         if (user_ns == current_user_ns())
800                 return -EINVAL;

Why?  You can trivially bypass this by creating a temporary user ns.
(If you're the owner of your own ns, then you can create a subsidiary
ns, map yourself into it, then setns back -- you'll still be the owner.)


unshare has a bug.  This code:

#define _GNU_SOURCE

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/prctl.h>
#include <sys/mman.h>
#include <sched.h>
#include <fcntl.h>

static void fail(const char *msg)
{
  perror(msg);
  exit(1);
}

int main()
{
  if (unshare(CLONE_NEWUSER) != 0)
    fail("CLONE_NEWUSER");

  if (open("/proc/self/uid_map", O_RDWR) == -1)
    perror("/proc/self/uid_map O_RDWR");

  int fd = open("/proc/self/uid_map", O_RDONLY);
  if (fd == -1) {
    perror("/proc/self/uid_map O_RDONLY");
  } else {
    char buf[4096];
    ssize_t len = read(fd, buf, sizeof(buf));
    if (len > 0)
      write(1, buf, len);
    else
      printf("read uid_map returned %d\n", (int)len);
  }
}

produces this output:

/proc/self/uid_map O_RDWR: Permission denied
read uid_map returned 0

With clone instead of unshare, it works.  I'm not quite sure what's
going on.



Also, I'm entirely unconvinced that the owner of a userns should
automatically have all caps (in the cap_capable sense) on a userns if
the task is inside that ns.  What's wrong with just using normal caps?
(Of course, the fact that caps don't usefully inherit is an issue --
there's a loooong thread going on right now about that.)  But doing this
enshrines root-has-caps even farther into ABI.  At least please make
this depend on !SECURE_NOROOT.


> 
> While small this set of changes is very significant with respect to
> containers in general and user namespaces in particular.  The user space
> interface is now complete.
> 
> This set of changes adds support for unprivileged users to create user
> namespaces and as a user namespace root to create other namespaces.  The
> tyrrany of supporting suid root preventing unprivileged users from using
> cool new kernel features is broken.
> 

no_new_privs already (kind of) did that :)

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
  2012-12-13 19:24 ` Andy Lutomirski
@ 2012-12-13 22:01   ` Eric W. Biederman
  2012-12-13 22:39     ` [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps Eric W. Biederman
  2012-12-13 23:02     ` [GIT PULL] user namespace and namespace infrastructure changes for 3.8 Andy Lutomirski
  0 siblings, 2 replies; 27+ messages in thread
From: Eric W. Biederman @ 2012-12-13 22:01 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Linus Torvalds, containers, Linux Kernel Mailing List

Andy Lutomirski <luto@amacapital.net> writes:

> On 12/11/2012 01:17 PM, Eric W. Biederman wrote:
>> 
>> Linus,
>> 
>> Please pull the for-linus git tree from:
>> 
>>    git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus
>> 
>>    HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers for the namespace file descriptors.
>> 
>>    This tree is against v3.7-rc3
>
> You've just allowed unprivileged users to create new pid namespaces,
> etc, by creating a new userns, then creating a new pid namespace inside
> that userns, then setns-ing from outside the userns into the pid ns.  Is
> this intentional?  (The mount ns is okay -- it checks for CAP_CHROOT on
> setns.)

Absolutely.  My commit messages talk about this.  I allow creating other
namespaces once inside a user namespace deliberately.  There is no
reason I know of to ban creation of pid and other namespaces once you
are inside of a user namespace.

But please also note the difference between capable and ns_capable.  Any
security check that is capable() still requires priviliges in the
initial user namespace.

> In user_namespace.c:
>
>         /* Threaded many not enter a different user namespace */
>         if (atomic_read(&current->mm->mm_users) > 1)
>                 return -EINVAL;
>
> The comment has a typo.  Also, you're checking the wrong condition:
> that's whether the vm is shared, not whether the thread group has more
> than one member.

Yes the comment should say.

Threaded processes may not enter a different user namespace.

As for the condition.  mm_users will equal one for a non-threaded
process.  And mm_users is the check we use in unshare to detect if
a threaded process calls unshare so I think the check seems perfectly
reasonable.  Especially since the vm must have more than one member if
there is more than one member in the thread group.

> In any case, why are threads special here?

You know I don't think I stopped to think about it.   The combination
of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user
namespace support was merged in 2008.

I do know that things can get really strange when you mix multiple
namespaces in a process.  tkill of your own threads will stop working.
Which access permissions should apply to files you mmap, file handles
you have open, the core dumper etc.

We do allow setresuid per thread so we might be able to cope
with a process that mixes with user namespaces in different threads,
but I would want a close review of things before we allow that kind of
sharing.

> I think, although I haven't verified it, that these changes allow
> CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain
> CAP_MODULE): unshare the user namespace and then setfd yourself back.  I
> think that setns should only grant caps when changing to a descendent
> namespace.

(See the end.  A significant bug in cap_capable slipped in about
 3.5. cap_capable is only supposed to grant permissions to the owner
 of a user namespace if it is a child user namespace).

These changes do not allow CAP_SYS_ADMIN to bypass the bounding set.

The test:

	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
		return -EPERM;

verifies that the user namespace we are entering is a nested user
namespace, because we can only have CAP_SYS_ADMIN in our current
namespace and in nested user namespaces.

> Also in userns_install:
>
> 796         /* Don't allow gaining capabilities by reentering
> 797          * the same user namespace.
> 798          */
> 799         if (user_ns == current_user_ns())
> 800                 return -EINVAL;
>
> Why?

To keep processes that deliberately drop some capabilities from being
able to gain those capabilities back by reentering the current user
namespace.

Aka that test plus the ns_capable test prevent are the combination
that prevent a process gaining privileges in the current user namespace.

> You can trivially bypass this by creating a temporary user ns.
> (If you're the owner of your own ns, then you can create a subsidiary
> ns, map yourself into it, then setns back -- you'll still be the
> owner.)

Nope.   Once you have capabilities in a user namespace you do not have
any capabilities in the parent user namespace.  Entering a user
namespace is a one way operation.

> unshare has a bug.  This code:

Interesting...

Looking at it this is a very small misfeature.

What is happening is that commit_creds is setting is making the task
undumpable because we changed the set of capabilities in struct cred.

This in turn results in pid_revalidate setting the owner of
of /proc/self/uid_map to GLOBAL_ROOT_UID.

>From the outside the permissions on /proc/self/uid_map look like:
-rw-r--r-- 1 root root 0 2012-12-13 12:43 /proc/30530/uid_map

Then since /proc/self/uid_map uses the default permission function
and the test program below is not run as root the read-write open
of uid_map fails.

> Also, I'm entirely unconvinced that the owner of a userns should
> automatically have all caps (in the cap_capable sense) on a userns if
> the task is inside that ns.   What's wrong with just using normal
> caps?

You have said it strangely. But I absolutely agree with you.

There is a bug in cap_capable that snuck in when kuids made it possible
for an owner to exist in multiple user namespaces.

cap_capable is only supposed to grant capabilities without looking at
the capability bitmap to the owner of the user namespace when the user
namespace in question is a child of the current user namespace.

When cap_capable was written the owner had to live in the parent user
namespace so the fact that it was a child user namespace was
automatically guaranteed.  kuids broke that assumption, and cap_capable
broke when I chaned it to use kuids.

That is an embarrassing brown paper bag bug.

I am going to step back and see if I can figure out how to fix
cap_capable.  There is the "targ_ns != &init_user_ns" test before we
grant the owner of a user namespace permission so the effects are
limited to what you can do in a child user namespace.

> (Of course, the fact that caps don't usefully inherit is an issue --
> there's a loooong thread going on right now about that.)  But doing this
> enshrines root-has-caps even farther into ABI.  At least please make
> this depend on !SECURE_NOROOT.

With the bug fixed there is nothing new here.  On creation of a user
namespace of course the initial process with have all capabilities just
like init starts out with all capabilities.

The owner of a user namespace when outside of a user namespace also
has all capabilities over a user namespace, and has since March of 2011.

When entering a user namespace what happens is that the owner permanently
drops whatever privileges the owner has to their previous user namespace
and their implied capabilites get put into the capability bitmask.

This has absolutely nothing to do with SECURE_NOROOT and the case where
becomming root gains privileges.  What does happen from a global
perspective is that there are some capabilities that a process has that
it will not be able to drop (capabilities over the user namespaces it
owns), and if the owner of the process has created user namespaces.

Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-13 22:01   ` Eric W. Biederman
@ 2012-12-13 22:39     ` Eric W. Biederman
  2012-12-13 22:43       ` Linus Torvalds
                         ` (2 more replies)
  2012-12-13 23:02     ` [GIT PULL] user namespace and namespace infrastructure changes for 3.8 Andy Lutomirski
  1 sibling, 3 replies; 27+ messages in thread
From: Eric W. Biederman @ 2012-12-13 22:39 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Linus Torvalds, containers, Linux Kernel Mailing List,
	Andy Lutomirski, linux-security-module


Andy Lutomirski pointed out that the current behavior of allowing the
owner of a user namespace to have all caps when that owner is not in a
parent user namespace is wrong.

This is a bug introduced by the kuid conversion which made it possible
for the owner of a user namespace to live in a child user namespace.  I
goofed and totally missed this implication.

Serge and can you please take a look and see if my corrected cap_capable
reads correctly to you.

Andy or anyone else that wants to give me a second eyeball and double
check me on this I would appreciate it.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

---

diff --git a/security/commoncap.c b/security/commoncap.c
index 6dbae46..4639f44 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -70,37 +70,44 @@ int cap_netlink_send(struct sock *sk, struct sk_buff *skb)
  *
  * NOTE WELL: cap_has_capability() cannot be used like the kernel's capable()
  * and has_capability() functions.  That is, it has the reverse semantics:
  * cap_has_capability() returns 0 when a task has a capability, but the
  * kernel's capable() and has_capability() returns 1 for this case.
  */
 int cap_capable(const struct cred *cred, struct user_namespace *targ_ns,
 		int cap, int audit)
 {
 	for (;;) {
-		/* The owner of the user namespace has all caps. */
-		if (targ_ns != &init_user_ns && uid_eq(targ_ns->owner, cred->euid))
-			return 0;
+		struct user_namespace *parent_ns;
 
 		/* Do we have the necessary capabilities? */
 		if (targ_ns == cred->user_ns)
 			return cap_raised(cred->cap_effective, cap) ? 0 : -EPERM;
 
 		/* Have we tried all of the parent namespaces? */
 		if (targ_ns == &init_user_ns)
 			return -EPERM;
 
+		parent_ns = targ_ns->parent;
+
+		/* 
+		 * The owner of the user namespace in the parent user
+		 * namespace has all caps.
+		 */
+		if ((parent_ns == cred->user_ns) && uid_eq(targ_ns->owner, cred->euid))
+			return 0;
+
 		/*
-		 *If you have a capability in a parent user ns, then you have
+		 * If you have a capability in a parent user ns, then you have
 		 * it over all children user namespaces as well.
 		 */
-		targ_ns = targ_ns->parent;
+		targ_ns = parent_ns;
 	}
 
 	/* We never get here */
 }
 
 /**
  * cap_settime - Determine whether the current process may set the system clock
  * @ts: The time to set
  * @tz: The timezone to set
  *

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-13 22:39     ` [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps Eric W. Biederman
@ 2012-12-13 22:43       ` Linus Torvalds
  2012-12-13 22:55         ` Eric W. Biederman
  2012-12-13 23:21       ` Andy Lutomirski
  2012-12-14  3:28       ` [RFC][PATCH] " Serge E. Hallyn
  2 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2012-12-13 22:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, containers, Linux Kernel Mailing List,
	Andy Lutomirski, LSM List

On Thu, Dec 13, 2012 at 2:39 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Andy Lutomirski pointed out that the current behavior of allowing the
> owner of a user namespace to have all caps when that owner is not in a
> parent user namespace is wrong.
>
> This is a bug introduced by the kuid conversion which made it possible
> for the owner of a user namespace to live in a child user namespace.  I
> goofed and totally missed this implication.

Hmm. Shouldn't this be cc: stable if it was introduced in the kuid
conversion? Or is it only an issue with your new namespace tree (which
I haven't pulled yet)?

              Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-13 22:43       ` Linus Torvalds
@ 2012-12-13 22:55         ` Eric W. Biederman
  0 siblings, 0 replies; 27+ messages in thread
From: Eric W. Biederman @ 2012-12-13 22:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Serge E. Hallyn, containers, Linux Kernel Mailing List,
	Andy Lutomirski, LSM List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Dec 13, 2012 at 2:39 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> Andy Lutomirski pointed out that the current behavior of allowing the
>> owner of a user namespace to have all caps when that owner is not in a
>> parent user namespace is wrong.
>>
>> This is a bug introduced by the kuid conversion which made it possible
>> for the owner of a user namespace to live in a child user namespace.  I
>> goofed and totally missed this implication.
>
> Hmm. Shouldn't this be cc: stable if it was introduced in the kuid
> conversion? Or is it only an issue with your new namespace tree (which
> I haven't pulled yet)?

It should be CC stable.

I think I have fixed the bug I am hoping to get a second pair of eyeballs
before I send the patch officially.

The test for &init_user_ns keeps the bugs from affecting kernels with user
namespaces disabled.

The bug exists in 3.5 and 3.6 but barely matters because you can't
enable user namespaces without additional patches.

The bug exists in 3.7 but is should be of limited affect because
distributions are likely to prefer enabling nfs and fuse over user
namespaces.

I am going to step away for about an hour or so and then with hopefully
fresh eyes myself work to push the good version.  

Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
  2012-12-13 22:01   ` Eric W. Biederman
  2012-12-13 22:39     ` [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps Eric W. Biederman
@ 2012-12-13 23:02     ` Andy Lutomirski
  2012-12-14  4:11       ` Eric W. Biederman
  1 sibling, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2012-12-13 23:02 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linus Torvalds, containers, Linux Kernel Mailing List

On Thu, Dec 13, 2012 at 2:01 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
>
>> On 12/11/2012 01:17 PM, Eric W. Biederman wrote:
>>>
>>> Linus,
>>>
>>> Please pull the for-linus git tree from:
>>>
>>>    git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus
>>>
>>>    HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers for the namespace file descriptors.
>>>
>>>    This tree is against v3.7-rc3
>>
>> You've just allowed unprivileged users to create new pid namespaces,
>> etc, by creating a new userns, then creating a new pid namespace inside
>> that userns, then setns-ing from outside the userns into the pid ns.  Is
>> this intentional?  (The mount ns is okay -- it checks for CAP_CHROOT on
>> setns.)
>
> Absolutely.  My commit messages talk about this.  I allow creating other
> namespaces once inside a user namespace deliberately.  There is no
> reason I know of to ban creation of pid and other namespaces once you
> are inside of a user namespace.
>

I must be looking at the wrong commit message.

> But please also note the difference between capable and ns_capable.  Any
> security check that is capable() still requires priviliges in the
> initial user namespace.

Huh?

I'm talking about:

clone with CLONE_NEWUSER
 - child does unshare(CLONE_NEWPID)
 - parent does setfd(child's pid namespace)

Now the parent is running in the init userns with a different pid ns.
Setuid binaries will work but will see the unexpected pid ns.  With
mount namespaces, this would be Really Bad (tm).  With pid or ipc or
net, it's less obviously dangerous, but I'm not convinced it's safe.

I sort of think that setns on a *non*-userns should require
CAP_SYS_ADMIN in the current userns, at least if no_new_privs isn't
set.

>
>> In user_namespace.c:
>>
>>         /* Threaded many not enter a different user namespace */
>>         if (atomic_read(&current->mm->mm_users) > 1)
>>                 return -EINVAL;
>>
>> The comment has a typo.  Also, you're checking the wrong condition:
>> that's whether the vm is shared, not whether the thread group has more
>> than one member.
>
> Yes the comment should say.
>
> Threaded processes may not enter a different user namespace.
>
> As for the condition.  mm_users will equal one for a non-threaded
> process.  And mm_users is the check we use in unshare to detect if
> a threaded process calls unshare so I think the check seems perfectly
> reasonable.  Especially since the vm must have more than one member if
> there is more than one member in the thread group.

A non-threaded process can have mm_users == 2 with CLONE_VM.  I'm not
sure this is a problem, though.

>
>> In any case, why are threads special here?
>
> You know I don't think I stopped to think about it.   The combination
> of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user
> namespace support was merged in 2008.
>
> I do know that things can get really strange when you mix multiple
> namespaces in a process.  tkill of your own threads will stop working.
> Which access permissions should apply to files you mmap, file handles
> you have open, the core dumper etc.
>
> We do allow setresuid per thread so we might be able to cope
> with a process that mixes with user namespaces in different threads,
> but I would want a close review of things before we allow that kind of
> sharing.
>

Fair enough.

I'd personally be more concerned about shared signal handlers than
shared tgid, though.  The signal handler set has all kinds of weird
things like session id.

>> I think, although I haven't verified it, that these changes allow
>> CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain
>> CAP_MODULE): unshare the user namespace and then setfd yourself back.  I
>> think that setns should only grant caps when changing to a descendent
>> namespace.
>
> (See the end.  A significant bug in cap_capable slipped in about
>  3.5. cap_capable is only supposed to grant permissions to the owner
>  of a user namespace if it is a child user namespace).

[snipping lots of stuff]

If the intended semantics of cap_capable are, indeed:

If targ_ns is equals or is a descendent of cred->user_ns and the cap
is effective, return true.  If targ_ns is owned by cred->euid and
targ_ns is a descendent of cred->user_ns (but is not equal to
cred->user_ns), then return true.  Else return false

then I agree with you on almost everything you said.  I assumed that
the actual semantics were intended.

>
>> unshare has a bug.  This code:
>
> Interesting...
>
> Looking at it this is a very small misfeature.
>
> What is happening is that commit_creds is setting is making the task
> undumpable because we changed the set of capabilities in struct cred.
>
> This in turn results in pid_revalidate setting the owner of
> of /proc/self/uid_map to GLOBAL_ROOT_UID.
>
> From the outside the permissions on /proc/self/uid_map look like:
> -rw-r--r-- 1 root root 0 2012-12-13 12:43 /proc/30530/uid_map
>
> Then since /proc/self/uid_map uses the default permission function
> and the test program below is not run as root the read-write open
> of uid_map fails.

It's probably either worth fixing this or disabling unshare of userns.
 This makes it hard to use.  IMO non-dumpable tasks should still be
able to access the contents of /proc/self -- i.e. I'd call this a more
general bug.

But I'd also argue that unsharing userns shouldn't set non-dumpable --
cap_permitted increased, but the new capabilities are still logically
a subset of the old ones.

>
>> (Of course, the fact that caps don't usefully inherit is an issue --
>> there's a loooong thread going on right now about that.)  But doing this
>> enshrines root-has-caps even farther into ABI.  At least please make
>> this depend on !SECURE_NOROOT.
>
> With the bug fixed there is nothing new here.  On creation of a user
> namespace of course the initial process with have all capabilities just
> like init starts out with all capabilities.
>
> The owner of a user namespace when outside of a user namespace also
> has all capabilities over a user namespace, and has since March of 2011.
>
> When entering a user namespace what happens is that the owner permanently
> drops whatever privileges the owner has to their previous user namespace
> and their implied capabilites get put into the capability bitmask.
>
> This has absolutely nothing to do with SECURE_NOROOT and the case where
> becomming root gains privileges.  What does happen from a global
> perspective is that there are some capabilities that a process has that
> it will not be able to drop (capabilities over the user namespaces it
> owns), and if the owner of the process has created user namespaces.

I think I agree with you.  This would still be more useful with
capability inheritance, though :).


One more issue: the requirement that both upper and lower uids (etc.)
in the maps are in order is rather limiting.  I have no objection if
you only require upper ids to be monotonic, but currently there's no
way to may root outside to uid n (for n > 0) and some nonroot user
outside to uid 0.

Also, the fact that you can remap onto the unmapped id is a little
strange.  I haven't found any reason it would be a security bug, but
it's still odd.


--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-13 22:39     ` [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps Eric W. Biederman
  2012-12-13 22:43       ` Linus Torvalds
@ 2012-12-13 23:21       ` Andy Lutomirski
  2012-12-14  2:33         ` Eric W. Biederman
  2012-12-14  3:28       ` [RFC][PATCH] " Serge E. Hallyn
  2 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2012-12-13 23:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Linus Torvalds, containers,
	Linux Kernel Mailing List, linux-security-module

On Thu, Dec 13, 2012 at 2:39 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Andy Lutomirski pointed out that the current behavior of allowing the
> owner of a user namespace to have all caps when that owner is not in a
> parent user namespace is wrong.
>
> This is a bug introduced by the kuid conversion which made it possible
> for the owner of a user namespace to live in a child user namespace.  I
> goofed and totally missed this implication.
>
> Serge and can you please take a look and see if my corrected cap_capable
> reads correctly to you.
>
> Andy or anyone else that wants to give me a second eyeball and double
> check me on this I would appreciate it.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> ---
>
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 6dbae46..4639f44 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -70,37 +70,44 @@ int cap_netlink_send(struct sock *sk, struct sk_buff *skb)
>   *
>   * NOTE WELL: cap_has_capability() cannot be used like the kernel's capable()
>   * and has_capability() functions.  That is, it has the reverse semantics:
>   * cap_has_capability() returns 0 when a task has a capability, but the
>   * kernel's capable() and has_capability() returns 1 for this case.
>   */
>  int cap_capable(const struct cred *cred, struct user_namespace *targ_ns,
>                 int cap, int audit)
>  {
>         for (;;) {
> -               /* The owner of the user namespace has all caps. */
> -               if (targ_ns != &init_user_ns && uid_eq(targ_ns->owner, cred->euid))
> -                       return 0;
> +               struct user_namespace *parent_ns;
>
>                 /* Do we have the necessary capabilities? */
>                 if (targ_ns == cred->user_ns)
>                         return cap_raised(cred->cap_effective, cap) ? 0 : -EPERM;
>
>                 /* Have we tried all of the parent namespaces? */
>                 if (targ_ns == &init_user_ns)
>                         return -EPERM;
>
> +               parent_ns = targ_ns->parent;
> +
> +               /*
> +                * The owner of the user namespace in the parent user
> +                * namespace has all caps.
> +                */
> +               if ((parent_ns == cred->user_ns) && uid_eq(targ_ns->owner, cred->euid))
> +                       return 0;

This is confusing enough that I can't immediately tell whether it's
correct.  I think it's close but out of order.

Should this be transitive?  I.e. suppose uid 1 owns a child of
init_user_ns and uid 2 (mapped in the first ns as the identity) owns
an inner ns.  Does uid 2 in the root ns have all caps inside?  I'd say
no, but I don't have a great argument for that.  But uid 1 presumably
does have caps because it could enter the parent with setns, then
change uid, then enter the child.

How about (severely whitespace damaged):

int cap_capable(const struct cred *cred, struct user_namespace *targ_ns,
                int cap, int audit)
{
        struct user_namespace *here = targ_ns;

        /* Walk up the namespace hierarchy until we find our own namespace. */
        for (;;) {
                /* The owner of an ancestor namespace has all caps, if
that owner is in the parent ns. */
                if (cred->user_ns == here->parent &&
uid_eq(targ_ns->owner, cred->euid))
                        return 0;

                /* Do we have the necessary capabilities? */
                if (here == cred->user_ns)
                        return cap_raised(cred->cap_effective, cap) ?
0 : -EPERM;

                /* Have we tried all of the parent namespaces? */
                if (here == &init_user_ns)
                        return -EPERM;
                else
                        here = targ_ns->parent;
        }
}

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-13 23:21       ` Andy Lutomirski
@ 2012-12-14  2:33         ` Eric W. Biederman
  2012-12-14  2:36           ` Andy Lutomirski
  0 siblings, 1 reply; 27+ messages in thread
From: Eric W. Biederman @ 2012-12-14  2:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Serge E. Hallyn, Linus Torvalds, containers,
	Linux Kernel Mailing List, linux-security-module


Andy thank you for your review.

Andy Lutomirski <luto@amacapital.net> writes:
> This is confusing enough that I can't immediately tell whether it's
> correct.  I think it's close but out of order.

Yeah.  That is the trick. Figuring out how to write that code so it is
correct and obvious.

I have added a comment at the top and removed the extra variable I was
adding.

The order except for verifying a user_ns->parent is valid by checking
for targ_ns == &init_user_ns doesn't make a difference.  Because
cred->user_ns can only be one of targ_ns or targ_ns->parent.

> Should this be transitive?

Yes.
>  I.e. suppose uid 1 owns a child of
> init_user_ns and uid 2 (mapped in the first ns as the identity) owns
> an inner ns.  Does uid 2 in the root ns have all caps inside?  I'd say
> no, but I don't have a great argument for that. 

I also say no.  It is more code and it doesn't fit my nice small
definition.  You have to be the owner and you have to be in the parent
of the target user namespace.  Being able to remember the rules in
your head is important.

> But uid 1 presumably
> does have caps because it could enter the parent with setns, then
> change uid, then enter the child.

Yes. uid 1 does have caps.

> How about (severely whitespace damaged):

You know that makes the termination condition a bit clearer.  But it
looses the nice place to put a comment when we loop again.  This loop
is just subtle enough that I want to preserve my comments.

I think I must have put -EPERM towards the end for the same reason to
make it clear that is the termination condition.

In practice I think it is important to have the cap_raised case first,
as that is the common case, and if we can be clear and still test that
case first it means the code will be faster.  With my reordering it is
obvious that nothing strange happens in the initial user namespace with
the owner test after the exit when we are the initial user namespace.

> int cap_capable(const struct cred *cred, struct user_namespace *targ_ns,
>                 int cap, int audit)
> {
>         struct user_namespace *here = targ_ns;
>
>         /* Walk up the namespace hierarchy until we find our own namespace. */
>         for (;;) {
>                 /* The owner of an ancestor namespace has all caps, if
> that owner is in the parent ns. */
>                 if (cred->user_ns == here->parent &&
> uid_eq(targ_ns->owner, cred->euid))
>                         return 0;

This would have needed a check that (here != &init_user_ns).  Because
the init_user_ns does not have a parent.

>                 /* Do we have the necessary capabilities? */
>                 if (here == cred->user_ns)
>                         return cap_raised(cred->cap_effective, cap) ?
> 0 : -EPERM;
>
>                 /* Have we tried all of the parent namespaces? */
>                 if (here == &init_user_ns)
>                         return -EPERM;
>                 else
>                         here = targ_ns->parent;
>         }
> }


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-14  2:33         ` Eric W. Biederman
@ 2012-12-14  2:36           ` Andy Lutomirski
  2012-12-14  3:20             ` [PATCH] " Eric W. Biederman
  0 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2012-12-14  2:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Linus Torvalds, containers,
	Linux Kernel Mailing List, linux-security-module

On Thu, Dec 13, 2012 at 6:33 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Andy thank you for your review.
>
> Andy Lutomirski <luto@amacapital.net> writes:
>> This is confusing enough that I can't immediately tell whether it's
>> correct.  I think it's close but out of order.
>
> Yeah.  That is the trick. Figuring out how to write that code so it is
> correct and obvious.
>
> I have added a comment at the top and removed the extra variable I was
> adding.
>
> The order except for verifying a user_ns->parent is valid by checking
> for targ_ns == &init_user_ns doesn't make a difference.  Because
> cred->user_ns can only be one of targ_ns or targ_ns->parent.
>
>> Should this be transitive?
>
> Yes.
>>  I.e. suppose uid 1 owns a child of
>> init_user_ns and uid 2 (mapped in the first ns as the identity) owns
>> an inner ns.  Does uid 2 in the root ns have all caps inside?  I'd say
>> no, but I don't have a great argument for that.
>
> I also say no.  It is more code and it doesn't fit my nice small
> definition.  You have to be the owner and you have to be in the parent
> of the target user namespace.  Being able to remember the rules in
> your head is important.
>
>> But uid 1 presumably
>> does have caps because it could enter the parent with setns, then
>> change uid, then enter the child.
>
> Yes. uid 1 does have caps.
>
>> How about (severely whitespace damaged):
>
> You know that makes the termination condition a bit clearer.  But it
> looses the nice place to put a comment when we loop again.  This loop
> is just subtle enough that I want to preserve my comments.
>
> I think I must have put -EPERM towards the end for the same reason to
> make it clear that is the termination condition.
>
> In practice I think it is important to have the cap_raised case first,
> as that is the common case, and if we can be clear and still test that
> case first it means the code will be faster.  With my reordering it is
> obvious that nothing strange happens in the initial user namespace with
> the owner test after the exit when we are the initial user namespace.

Ah.  You are correct about the ordering.  I read it slightly wrong.

I'd still suggest using a variable like "here" instead of "targ_ns".
The latter is confusing because it changes on the second and later
iterations.

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-14  2:36           ` Andy Lutomirski
@ 2012-12-14  3:20             ` Eric W. Biederman
  0 siblings, 0 replies; 27+ messages in thread
From: Eric W. Biederman @ 2012-12-14  3:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Serge E. Hallyn, containers, Linux Kernel Mailing List,
	linux-security-module, Andy Lutomirski


Andy Lutomirski pointed out that the current behavior of allowing the
owner of a user namespace to have all caps when that owner is not in a
parent user namespace is wrong.  Add a test to ensure the owner of a user
namespace is in the parent of the user namespace to fix this bug.

Thankfully this bug did not apply to the initial user namespace, keeping
the mischief that can be caused by this bug quite small.

This is bug was introduced in v3.5 by commit 783291e6900
"Simplify the user_namespace by making userns->creator a kuid."

The bug made it possible for the owner of a user namespace to be
present in a child user namespace.  Since the owner of a user nameapce
is granted all capabilities it became possible for users in a
grandchild user namespace to have all privilges over their parent user
namspace.

Reorder the checks in cap_capable.  This should make the common case
faster and make it clear that nothing magic happens in the initial
user namespace.  The reordering is safe because cred->user_ns
can only be in targ_ns or targ_ns->parent but not both.

Add a comment a the top of the loop to make the logic of
the code clear.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 security/commoncap.c |   19 ++++++++++++++-----
 1 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index 6dbae46..ded41a0 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -76,11 +76,12 @@ int cap_netlink_send(struct sock *sk, struct sk_buff *skb)
 int cap_capable(const struct cred *cred, struct user_namespace *targ_ns,
 		int cap, int audit)
 {
-	for (;;) {
-		/* The owner of the user namespace has all caps. */
-		if (targ_ns != &init_user_ns && uid_eq(targ_ns->owner, cred->euid))
-			return 0;
 
+	/* See if cred has the capability in the target user namespace
+	 * by examining the target user namespace and all of the target
+	 * user namespace's parents.
+	 */
+	for (;;) {
 		/* Do we have the necessary capabilities? */
 		if (targ_ns == cred->user_ns)
 			return cap_raised(cred->cap_effective, cap) ? 0 : -EPERM;
@@ -89,8 +90,16 @@ int cap_capable(const struct cred *cred, struct user_namespace *targ_ns,
 		if (targ_ns == &init_user_ns)
 			return -EPERM;
 
+		/* 
+		 * The owner of the user namespace in the parent of the
+		 * user namespace has all caps.
+		 */
+		if ((targ_ns->parent == cred->user_ns) &&
+		     uid_eq(targ_ns->owner, cred->euid))
+			return 0;
+
 		/*
-		 *If you have a capability in a parent user ns, then you have
+		 * If you have a capability in a parent user ns, then you have
 		 * it over all children user namespaces as well.
 		 */
 		targ_ns = targ_ns->parent;
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-13 22:39     ` [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps Eric W. Biederman
  2012-12-13 22:43       ` Linus Torvalds
  2012-12-13 23:21       ` Andy Lutomirski
@ 2012-12-14  3:28       ` Serge E. Hallyn
  2012-12-14  3:32         ` Eric W. Biederman
  2 siblings, 1 reply; 27+ messages in thread
From: Serge E. Hallyn @ 2012-12-14  3:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Linus Torvalds, containers,
	Linux Kernel Mailing List, Andy Lutomirski,
	linux-security-module

Quoting Eric W. Biederman (ebiederm@xmission.com):
> 
> Andy Lutomirski pointed out that the current behavior of allowing the
> owner of a user namespace to have all caps when that owner is not in a
> parent user namespace is wrong.

To make sure I understand right, the issue is when a uid is mapped
into multiple namespaces, i.e. uid 1000 in ns1 may own ns2, but uid
1000 in ns3 does not?

> This is a bug introduced by the kuid conversion which made it possible
> for the owner of a user namespace to live in a child user namespace.  I
> goofed and totally missed this implication.
> 
> Serge and can you please take a look and see if my corrected cap_capable
> reads correctly to you.
> 
> Andy or anyone else that wants to give me a second eyeball and double
> check me on this I would appreciate it.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>

> ---
> 
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 6dbae46..4639f44 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -70,37 +70,44 @@ int cap_netlink_send(struct sock *sk, struct sk_buff *skb)
>   *
>   * NOTE WELL: cap_has_capability() cannot be used like the kernel's capable()
>   * and has_capability() functions.  That is, it has the reverse semantics:
>   * cap_has_capability() returns 0 when a task has a capability, but the
>   * kernel's capable() and has_capability() returns 1 for this case.
>   */
>  int cap_capable(const struct cred *cred, struct user_namespace *targ_ns,
>  		int cap, int audit)
>  {
>  	for (;;) {
> -		/* The owner of the user namespace has all caps. */
> -		if (targ_ns != &init_user_ns && uid_eq(targ_ns->owner, cred->euid))
> -			return 0;
> +		struct user_namespace *parent_ns;
>  
>  		/* Do we have the necessary capabilities? */
>  		if (targ_ns == cred->user_ns)
>  			return cap_raised(cred->cap_effective, cap) ? 0 : -EPERM;
>  
>  		/* Have we tried all of the parent namespaces? */
>  		if (targ_ns == &init_user_ns)
>  			return -EPERM;
>  
> +		parent_ns = targ_ns->parent;
> +
> +		/* 
> +		 * The owner of the user namespace in the parent user
> +		 * namespace has all caps.
> +		 */
> +		if ((parent_ns == cred->user_ns) && uid_eq(targ_ns->owner, cred->euid))
> +			return 0;
> +
>  		/*
> -		 *If you have a capability in a parent user ns, then you have
> +		 * If you have a capability in a parent user ns, then you have
>  		 * it over all children user namespaces as well.
>  		 */
> -		targ_ns = targ_ns->parent;
> +		targ_ns = parent_ns;
>  	}
>  
>  	/* We never get here */
>  }
>  
>  /**
>   * cap_settime - Determine whether the current process may set the system clock
>   * @ts: The time to set
>   * @tz: The timezone to set
>   *

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-14  3:28       ` [RFC][PATCH] " Serge E. Hallyn
@ 2012-12-14  3:32         ` Eric W. Biederman
  2012-12-14 15:26           ` Serge E. Hallyn
  0 siblings, 1 reply; 27+ messages in thread
From: Eric W. Biederman @ 2012-12-14  3:32 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Linus Torvalds, containers, Linux Kernel Mailing List,
	Andy Lutomirski, linux-security-module

"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> 
>> Andy Lutomirski pointed out that the current behavior of allowing the
>> owner of a user namespace to have all caps when that owner is not in a
>> parent user namespace is wrong.
>
> To make sure I understand right, the issue is when a uid is mapped
> into multiple namespaces.

Yes.

i.e. uid 1000 in ns1 may own ns2, but uid 1000 in ns3 does not?

I am not certain of your example.

The simple case is:

init_user_ns:
     child_user_ns1 (owned by uid == 0 [in all user namespaces])
           child_user_ns2 (owned by uid == 0 [ in all user namespaces])


root (uid == 0) in child_user_ns2 has all rights over anything in
child_user_ns1.


Thank you for looking.

Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
  2012-12-13 23:02     ` [GIT PULL] user namespace and namespace infrastructure changes for 3.8 Andy Lutomirski
@ 2012-12-14  4:11       ` Eric W. Biederman
  2012-12-14  5:34         ` Andy Lutomirski
  0 siblings, 1 reply; 27+ messages in thread
From: Eric W. Biederman @ 2012-12-14  4:11 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Linus Torvalds, containers, Linux Kernel Mailing List

Andy Lutomirski <luto@amacapital.net> writes:

> On Thu, Dec 13, 2012 at 2:01 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>>
>>> On 12/11/2012 01:17 PM, Eric W. Biederman wrote:
>
>> But please also note the difference between capable and ns_capable.  Any
>> security check that is capable() still requires priviliges in the
>> initial user namespace.
>
> Huh?
>
> I'm talking about:
>
> clone with CLONE_NEWUSER
>  - child does unshare(CLONE_NEWPID)
>  - parent does setfd(child's pid namespace)
>
> Now the parent is running in the init userns with a different pid ns.
> Setuid binaries will work but will see the unexpected pid ns.  With
> mount namespaces, this would be Really Bad (tm).  With pid or ipc or
> net, it's less obviously dangerous, but I'm not convinced it's safe.

That isn't safe.  That is a sneaky bug in my tree that I overlooked. :(

> I sort of think that setns on a *non*-userns should require
> CAP_SYS_ADMIN in the current userns, at least if no_new_privs isn't
> set.

Yes.  CAP_SYS_ADMIN in your current user namespace should make
setns as safe as it currently is before my patches.

That is just a matter of adding a couple nsown_capable(CAP_SYS_ADMIN)
permission checks.

Right now I test for nsown_capable(CAP_SYS_CHROOT) for the mount
namespace, which is probably sufficient to prevent those kinds of
shenanigans but I am going to add a nsown_capable(CAP_SYS_ADMIN) for
good measure.

> A non-threaded process can have mm_users == 2 with CLONE_VM.  I'm not
> sure this is a problem, though.

No it isn't.  I said threads because they are the easy concept not
because that covers all possible corner cases.

>>> In any case, why are threads special here?
>>
>> You know I don't think I stopped to think about it.   The combination
>> of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user
>> namespace support was merged in 2008.
>>
>> I do know that things can get really strange when you mix multiple
>> namespaces in a process.  tkill of your own threads will stop working.
>> Which access permissions should apply to files you mmap, file handles
>> you have open, the core dumper etc.
>>
>> We do allow setresuid per thread so we might be able to cope
>> with a process that mixes with user namespaces in different threads,
>> but I would want a close review of things before we allow that kind of
>> sharing.
>>
>
> Fair enough.
>
> I'd personally be more concerned about shared signal handlers than
> shared tgid, though.  The signal handler set has all kinds of weird
> things like session id.

CLONE_THREAD implies CLONE_VM and CLONE_SIGNAL,
and in practice mm_users > 1 protects against all of those cases.

So I was really thinking all of those cases.


>> (See the end.  A significant bug in cap_capable slipped in about
>>  3.5. cap_capable is only supposed to grant permissions to the owner
>>  of a user namespace if it is a child user namespace).
>
> [snipping lots of stuff]
>
> If the intended semantics of cap_capable are, indeed:
>
> If targ_ns is equals or is a descendent of cred->user_ns and the cap
> is effective, return true.  If targ_ns is owned by cred->euid and
> targ_ns is a descendent of cred->user_ns (but is not equal to
> cred->user_ns), then return true.  Else return false
>
> then I agree with you on almost everything you said.  I assumed that
> the actual semantics were intended.

Good.

>>> unshare has a bug.  This code:
>>
>> Interesting...
>>
>> Looking at it this is a very small misfeature.
>>
>> What is happening is that commit_creds is setting is making the task
>> undumpable because we changed the set of capabilities in struct cred.
>>
>> This in turn results in pid_revalidate setting the owner of
>> of /proc/self/uid_map to GLOBAL_ROOT_UID.
>>
>> From the outside the permissions on /proc/self/uid_map look like:
>> -rw-r--r-- 1 root root 0 2012-12-13 12:43 /proc/30530/uid_map
>>
>> Then since /proc/self/uid_map uses the default permission function
>> and the test program below is not run as root the read-write open
>> of uid_map fails.
>
> It's probably either worth fixing this or disabling unshare of userns.
>  This makes it hard to use.  IMO non-dumpable tasks should still be
> able to access the contents of /proc/self -- i.e. I'd call this a more
> general bug.
>
> But I'd also argue that unsharing userns shouldn't set non-dumpable --
> cap_permitted increased, but the new capabilities are still logically
> a subset of the old ones.

Agreed.  Setting dumpable is the bug.

I am going to sleep on it but the code in commit_creds should probably
read:

	/* dumpability changes */
	if (!uid_eq(old->euid, new->euid) ||
	    !gid_eq(old->egid, new->egid) ||
	    !uid_eq(old->fsuid, new->fsuid) ||
	    !gid_eq(old->fsgid, new->fsgid) ||
	    ((old->user_ns == new->user_ns) &&
	     !cap_issubset(new->cap_permitted, old->cap_permitted))) {
		if (task->mm)
			set_dumpable(task->mm, suid_dumpable);
		task->pdeath_signal = 0;
		smp_wmb();
	}

Which is correct assuming a user_ns change is always to a more nested
user namespace.

> One more issue: the requirement that both upper and lower uids (etc.)
> in the maps are in order is rather limiting.  I have no objection if
> you only require upper ids to be monotonic, but currently there's no
> way to may root outside to uid n (for n > 0) and some nonroot user
> outside to uid 0.

There is.  You may set up to 5 (extents).  You just have to use a second
extent for the non-contiguous bits.  My reader is lazy and you have to
set all of the extents with a single write, so you may have missed the
ability to set more than one extent.

> Also, the fact that you can remap onto the unmapped id is a little
> strange.  I haven't found any reason it would be a security bug, but
> it's still odd.

Yes the unmapped id is odd.  Including the fact it can be set with a
sysctl.

The world has such lovely round edges.

Eric


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
  2012-12-14  4:11       ` Eric W. Biederman
@ 2012-12-14  5:34         ` Andy Lutomirski
  2012-12-14 17:48           ` Eric W. Biederman
  0 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2012-12-14  5:34 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linus Torvalds, containers, Linux Kernel Mailing List

On Thu, Dec 13, 2012 at 8:11 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
>
>> One more issue: the requirement that both upper and lower uids (etc.)
>> in the maps are in order is rather limiting.  I have no objection if
>> you only require upper ids to be monotonic, but currently there's no
>> way to may root outside to uid n (for n > 0) and some nonroot user
>> outside to uid 0.
>
> There is.  You may set up to 5 (extents).  You just have to use a second
> extent for the non-contiguous bits.  My reader is lazy and you have to
> set all of the extents with a single write, so you may have missed the
> ability to set more than one extent.
>

If I'm wrong, I'll happily eat my words.  Both:

0 1 1
1 0 1

and

1 0 1
0 1 1

are rejected, unless I totally messed up.

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-14  3:32         ` Eric W. Biederman
@ 2012-12-14 15:26           ` Serge E. Hallyn
  2012-12-14 15:47             ` Eric W. Biederman
  0 siblings, 1 reply; 27+ messages in thread
From: Serge E. Hallyn @ 2012-12-14 15:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Linus Torvalds, containers,
	Linux Kernel Mailing List, Andy Lutomirski,
	linux-security-module

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> 
> >> Andy Lutomirski pointed out that the current behavior of allowing the
> >> owner of a user namespace to have all caps when that owner is not in a
> >> parent user namespace is wrong.
> >
> > To make sure I understand right, the issue is when a uid is mapped
> > into multiple namespaces.
> 
> Yes.
> 
> i.e. uid 1000 in ns1 may own ns2, but uid 1000 in ns3 does not?
> 
> I am not certain of your example.
> 
> The simple case is:
> 
> init_user_ns:
>      child_user_ns1 (owned by uid == 0 [in all user namespaces])
>            child_user_ns2 (owned by uid == 0 [ in all user namespaces])
> 
> 
> root (uid == 0) in child_user_ns2 has all rights over anything in
> child_user_ns1.

Well that is only if there was no mapping.  (since we're comparing
kuids, not uid_ts).  right?  If you didn't map uid 0 in child_user_ns2
to another id in the parent ns, you weren't all *that* serious about
isolating the ns.

The case I was thinking is

  init_user_ns:  [0-uidmax]
      child_user_ns1  [100000-199999]
      child_user_ns2  [100000-199999]
        child_user_ns3  [200000-299999]

with unfortunate mappings  - ns1 and ns2 should have had nonoverlapping
ranges, but in any case now uid 1000 in ns1 can exert privilege over
ns3.  Again, uids comparisons will succeed for file access anyway, so
ns1 can 0wn ns2 and ns3 other ways.

Heck I'm starting to think the bug is a feature - surely given the
mappings above I meant for ns1 and ns2 to bleed privilege to each
other?

> Thank you for looking.
> 
> Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-14 15:26           ` Serge E. Hallyn
@ 2012-12-14 15:47             ` Eric W. Biederman
  2012-12-14 16:15               ` Serge E. Hallyn
  0 siblings, 1 reply; 27+ messages in thread
From: Eric W. Biederman @ 2012-12-14 15:47 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Linus Torvalds, containers, Linux Kernel Mailing List,
	Andy Lutomirski, linux-security-module

"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>> 
>> > Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >> 
>> >> Andy Lutomirski pointed out that the current behavior of allowing the
>> >> owner of a user namespace to have all caps when that owner is not in a
>> >> parent user namespace is wrong.
>> >
>> > To make sure I understand right, the issue is when a uid is mapped
>> > into multiple namespaces.
>> 
>> Yes.
>> 
>> i.e. uid 1000 in ns1 may own ns2, but uid 1000 in ns3 does not?
>> 
>> I am not certain of your example.
>> 
>> The simple case is:
>> 
>> init_user_ns:
>>      child_user_ns1 (owned by uid == 0 [in all user namespaces])
>>            child_user_ns2 (owned by uid == 0 [ in all user namespaces])
>> 
>> 
>> root (uid == 0) in child_user_ns2 has all rights over anything in
>> child_user_ns1.
>
> Well that is only if there was no mapping.  (since we're comparing
> kuids, not uid_ts).  right?  If you didn't map uid 0 in child_user_ns2
> to another id in the parent ns, you weren't all *that* serious about
> isolating the ns.
>
> The case I was thinking is
>
>   init_user_ns:  [0-uidmax]
>       child_user_ns1  [100000-199999]
>       child_user_ns2  [100000-199999]
>         child_user_ns3  [200000-299999]
>
> with unfortunate mappings  - ns1 and ns2 should have had nonoverlapping
> ranges, but in any case now uid 1000 in ns1 can exert privilege over
> ns3.  Again, uids comparisons will succeed for file access anyway, so
> ns1 can 0wn ns2 and ns3 other ways.

Yes yours is the more realistic scenario.  Mine was simplified to be clear.

> Heck I'm starting to think the bug is a feature - surely given the
> mappings above I meant for ns1 and ns2 to bleed privilege to each
> other?

The serious problem is that privileges can bleed up. A user in 
ns3 can wind up owning ns2 or ns1.  Which totally defeats the permission
model.  You have CAP_DAC_OVERRIDE so you don't even need access to files
you own, etc, etc.

Or the more fun version as root:
	/* Drop all privs */
	pid = clone(CLONE_NEWUSER);
	if (pid == 0) {
		/* Have all privs! Bahaha */
	}

Which makes dropping capabilies a joke.

And since up to through 3.7 the only user that can create a user namespace
is root that is a very realistic scenario.  It doesn't work against the
initial user namespace thankfully, but in v3.7 it works against every
other user namespace.

To keep user namespaces safe we need to maintain the invariant that a
child user namespace can not get capapabilities in it's parent user
namespace.

Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-14 15:47             ` Eric W. Biederman
@ 2012-12-14 16:15               ` Serge E. Hallyn
  2012-12-14 18:12                 ` Eric W. Biederman
  0 siblings, 1 reply; 27+ messages in thread
From: Serge E. Hallyn @ 2012-12-14 16:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Linus Torvalds, containers,
	Linux Kernel Mailing List, Andy Lutomirski,
	linux-security-module

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> "Serge E. Hallyn" <serge@hallyn.com> writes:
> >> 
> >> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> >> 
> >> >> Andy Lutomirski pointed out that the current behavior of allowing the
> >> >> owner of a user namespace to have all caps when that owner is not in a
> >> >> parent user namespace is wrong.
> >> >
> >> > To make sure I understand right, the issue is when a uid is mapped
> >> > into multiple namespaces.
> >> 
> >> Yes.
> >> 
> >> i.e. uid 1000 in ns1 may own ns2, but uid 1000 in ns3 does not?
> >> 
> >> I am not certain of your example.
> >> 
> >> The simple case is:
> >> 
> >> init_user_ns:
> >>      child_user_ns1 (owned by uid == 0 [in all user namespaces])
> >>            child_user_ns2 (owned by uid == 0 [ in all user namespaces])
> >> 
> >> 
> >> root (uid == 0) in child_user_ns2 has all rights over anything in
> >> child_user_ns1.
> >
> > Well that is only if there was no mapping.  (since we're comparing
> > kuids, not uid_ts).  right?  If you didn't map uid 0 in child_user_ns2
> > to another id in the parent ns, you weren't all *that* serious about
> > isolating the ns.
> >
> > The case I was thinking is
> >
> >   init_user_ns:  [0-uidmax]
> >       child_user_ns1  [100000-199999]
> >       child_user_ns2  [100000-199999]
> >         child_user_ns3  [200000-299999]

Wait is my example above possible?  Or does child_user_ns3's range need
to be a subset of child_user_ns2's?

In which case it would be

       child_user_ns1  [100000-199999]
       child_user_ns2  [100000-199999]
         child_user_ns3  [120000-129999]

> > with unfortunate mappings  - ns1 and ns2 should have had nonoverlapping
> > ranges, but in any case now uid 1000 in ns1 can exert privilege over
> > ns3.  Again, uids comparisons will succeed for file access anyway, so
> > ns1 can 0wn ns2 and ns3 other ways.
> 
> Yes yours is the more realistic scenario.  Mine was simplified to be clear.
> 
> > Heck I'm starting to think the bug is a feature - surely given the
> > mappings above I meant for ns1 and ns2 to bleed privilege to each
> > other?
> 
> The serious problem is that privileges can bleed up. A user in 
> ns3 can wind up owning ns2 or ns1.  Which totally defeats the permission
> model.  You have CAP_DAC_OVERRIDE so you don't even need access to files
> you own, etc, etc.

Would that not require intervention from the init_user_ns?  In my
example above (let's add that ns2 is owned by kuid.uid=1000 in
init_user_ns), root in child_user_ns2 cannot map kuid.val=0 or
kuid.val=1000 into ns3 because 0 and 1000 are not in the range
100000-199999.  So there is no uid in child_user_ns3 which is able
to spoof uid=0 in child_user_ns1.

-serge

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
  2012-12-14  5:34         ` Andy Lutomirski
@ 2012-12-14 17:48           ` Eric W. Biederman
  0 siblings, 0 replies; 27+ messages in thread
From: Eric W. Biederman @ 2012-12-14 17:48 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Linus Torvalds, containers, Linux Kernel Mailing List

Andy Lutomirski <luto@amacapital.net> writes:

> On Thu, Dec 13, 2012 at 8:11 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>>
>>> One more issue: the requirement that both upper and lower uids (etc.)
>>> in the maps are in order is rather limiting.  I have no objection if
>>> you only require upper ids to be monotonic, but currently there's no
>>> way to may root outside to uid n (for n > 0) and some nonroot user
>>> outside to uid 0.
>>
>> There is.  You may set up to 5 (extents).  You just have to use a second
>> extent for the non-contiguous bits.  My reader is lazy and you have to
>> set all of the extents with a single write, so you may have missed the
>> ability to set more than one extent.
>>
>
> If I'm wrong, I'll happily eat my words.  Both:
>
> 0 1 1
> 1 0 1
>
> and
>
> 1 0 1
> 0 1 1
>
> are rejected, unless I totally messed up.

Duh.  You are right.  

It is this check:

		/* For now only accept extents that are strictly in order */
		if (last &&
		    (((last->first + last->count) > extent->first) ||
		     ((last->lower_first + last->count) > extent->lower_first)))
			goto out;

Fundamentally every value mapped must be distinct.  Aka the direction of
the mapping must be reversible without loss of information.  

Ensuring all of the values were increasing in the extents was just a
lame way of ensuring that the same value was not mapped twice in either
the upper or lower ranges.

That check can most certainly be relaxed (patches welcome).  But that
probably isn't 3.8 material as that is feature work.

Not having bumped into this limitation myself I'm not certain the value
in removing this check.  But there is no good reason not to replace the
current check with a more general one either.

So your example should work, and that it doesn't is a misfeature.

Eric


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-14 16:15               ` Serge E. Hallyn
@ 2012-12-14 18:12                 ` Eric W. Biederman
  2012-12-14 18:43                   ` Linus Torvalds
  2012-12-14 20:29                   ` Serge E. Hallyn
  0 siblings, 2 replies; 27+ messages in thread
From: Eric W. Biederman @ 2012-12-14 18:12 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Linus Torvalds, containers, Linux Kernel Mailing List,
	Andy Lutomirski, linux-security-module

"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>> 
>> > Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >> "Serge E. Hallyn" <serge@hallyn.com> writes:
>> >> 
>> >> > Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >> >> 
>> >> >> Andy Lutomirski pointed out that the current behavior of allowing the
>> >> >> owner of a user namespace to have all caps when that owner is not in a
>> >> >> parent user namespace is wrong.
>> >> >
>> >> > To make sure I understand right, the issue is when a uid is mapped
>> >> > into multiple namespaces.
>> >> 
>> >> Yes.
>> >> 
>> >> i.e. uid 1000 in ns1 may own ns2, but uid 1000 in ns3 does not?
>> >> 
>> >> I am not certain of your example.
>> >> 
>> >> The simple case is:
>> >> 
>> >> init_user_ns:
>> >>      child_user_ns1 (owned by uid == 0 [in all user namespaces])
>> >>            child_user_ns2 (owned by uid == 0 [ in all user namespaces])
>> >> 
>> >> 
>> >> root (uid == 0) in child_user_ns2 has all rights over anything in
>> >> child_user_ns1.
>> >
>> > Well that is only if there was no mapping.  (since we're comparing
>> > kuids, not uid_ts).  right?  If you didn't map uid 0 in child_user_ns2
>> > to another id in the parent ns, you weren't all *that* serious about
>> > isolating the ns.
>> >
>> > The case I was thinking is
>> >
>> >   init_user_ns:  [0-uidmax]
>> >       child_user_ns1  [100000-199999]
>> >       child_user_ns2  [100000-199999]
>> >         child_user_ns3  [200000-299999]
>
> Wait is my example above possible?  Or does child_user_ns3's range need
> to be a subset of child_user_ns2's?
>
> In which case it would be
>
>        child_user_ns1  [100000-199999]
>        child_user_ns2  [100000-199999]
>          child_user_ns3  [120000-129999]
>

Yes.  You have to nest uids.

>> > with unfortunate mappings  - ns1 and ns2 should have had nonoverlapping
>> > ranges, but in any case now uid 1000 in ns1 can exert privilege over
>> > ns3.  Again, uids comparisons will succeed for file access anyway, so
>> > ns1 can 0wn ns2 and ns3 other ways.
>> 
>> Yes yours is the more realistic scenario.  Mine was simplified to be clear.
>> 
>> > Heck I'm starting to think the bug is a feature - surely given the
>> > mappings above I meant for ns1 and ns2 to bleed privilege to each
>> > other?
>> 
>> The serious problem is that privileges can bleed up. A user in 
>> ns3 can wind up owning ns2 or ns1.  Which totally defeats the permission
>> model.  You have CAP_DAC_OVERRIDE so you don't even need access to files
>> you own, etc, etc.
>
> Would that not require intervention from the init_user_ns?  In my
> example above (let's add that ns2 is owned by kuid.uid=1000 in
> init_user_ns), root in child_user_ns2 cannot map kuid.val=0 or
> kuid.val=1000 into ns3 because 0 and 1000 are not in the range
> 100000-199999.  So there is no uid in child_user_ns3 which is able
> to spoof uid=0 in child_user_ns1.

Right.  It does require having the uid of the owner of ns1 or ns2 in
ns3.  So you have to explicitly allow it.

What I don't see is any point in allowing something like that.


After taking a second look I just realized that this is completely
unexploitable with the code that is currently merged.  As creating
a grand child user namespace is competelely impossible.  Creating
a user namespace is requires capable(CAP_SYS_ADMIN) which is never
present in anything but the initial user namespace.


That said I think the current semantics of cap_capable are completely
fatal to reasoning about user namespaces.

A child user namespace having capabilities against processes in it's
parent seems totally bizarre and pretty dangerous from a capabilities
standpoint.

That said Serge I think I have lost track of the point of your question.

Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-14 18:12                 ` Eric W. Biederman
@ 2012-12-14 18:43                   ` Linus Torvalds
  2012-12-14 18:47                     ` Andy Lutomirski
                                       ` (2 more replies)
  2012-12-14 20:29                   ` Serge E. Hallyn
  1 sibling, 3 replies; 27+ messages in thread
From: Linus Torvalds @ 2012-12-14 18:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, containers, Linux Kernel Mailing List,
	Andy Lutomirski, LSM List

On Fri, Dec 14, 2012 at 10:12 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> That said Serge I think I have lost track of the point of your question.

.. and I'm a bit unsure what I should do about this all. Including
pulling the pull request that actually can make this all matter.

Hmm? Any consensus?

            Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-14 18:43                   ` Linus Torvalds
@ 2012-12-14 18:47                     ` Andy Lutomirski
  2012-12-14 20:50                     ` Serge E. Hallyn
  2012-12-14 21:43                     ` Eric W. Biederman
  2 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2012-12-14 18:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Serge E. Hallyn, containers,
	Linux Kernel Mailing List, LSM List

On Fri, Dec 14, 2012 at 10:43 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Dec 14, 2012 at 10:12 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> That said Serge I think I have lost track of the point of your question.
>
> .. and I'm a bit unsure what I should do about this all. Including
> pulling the pull request that actually can make this all matter.
>
> Hmm? Any consensus?

I think that, if Eric submits a newer version that renames the loop
variable for added comprehensibility, I'm okay with it.

Changing the semantics to a more expansive version like Serge was
talking about later on wouldn't break anything.  But I don't think
there's any reason to change it.

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-14 18:12                 ` Eric W. Biederman
  2012-12-14 18:43                   ` Linus Torvalds
@ 2012-12-14 20:29                   ` Serge E. Hallyn
  2012-12-14 22:32                     ` Eric W. Biederman
  1 sibling, 1 reply; 27+ messages in thread
From: Serge E. Hallyn @ 2012-12-14 20:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Linus Torvalds, containers,
	Linux Kernel Mailing List, Andy Lutomirski,
	linux-security-module

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serge@hallyn.com> writes:

Note: I acked your patch before and still don't object to it.

> > In which case it would be
> >
> >        child_user_ns1  [100000-199999]
> >        child_user_ns2  [100000-199999]
> >          child_user_ns3  [120000-129999]
> >
> 
> Yes.  You have to nest uids.
> 
> >> > with unfortunate mappings  - ns1 and ns2 should have had nonoverlapping
> >> > ranges, but in any case now uid 1000 in ns1 can exert privilege over
> >> > ns3.  Again, uids comparisons will succeed for file access anyway, so
> >> > ns1 can 0wn ns2 and ns3 other ways.
> >> 
> >> Yes yours is the more realistic scenario.  Mine was simplified to be clear.
> >> 
> >> > Heck I'm starting to think the bug is a feature - surely given the
> >> > mappings above I meant for ns1 and ns2 to bleed privilege to each
> >> > other?
> >> 
> >> The serious problem is that privileges can bleed up. A user in 
> >> ns3 can wind up owning ns2 or ns1.  Which totally defeats the permission
> >> model.  You have CAP_DAC_OVERRIDE so you don't even need access to files
> >> you own, etc, etc.
> >
> > Would that not require intervention from the init_user_ns?  In my
> > example above (let's add that ns2 is owned by kuid.uid=1000 in
> > init_user_ns), root in child_user_ns2 cannot map kuid.val=0 or
> > kuid.val=1000 into ns3 because 0 and 1000 are not in the range
> > 100000-199999.  So there is no uid in child_user_ns3 which is able
> > to spoof uid=0 in child_user_ns1.
> 
> Right.  It does require having the uid of the owner of ns1 or ns2 in
> ns3.  So you have to explicitly allow it.
> 
> What I don't see is any point in allowing something like that.

The point isn't specifically to allow that, but rather to not muddle the
kuids with the user namespace *.

If I clone two tasks in separate user namespace but give them the same
uid mappings, how should the uids between those namespaces be related?
The way you're doing it now they will equate for DAC checks but not
for privilege checks.  It feels ad-hoc and flaky.

> A child user namespace having capabilities against processes in it's
> parent seems totally bizarre and pretty dangerous from a capabilities
> standpoint.

How would it have them against its parent?

> That said Serge I think I have lost track of the point of your question.

Look at it this way.  You've introduced kuids which allow "uids" to be
disassociated from the user namespace * in kernel.  You have the uid
mapping rules which enforce some sanity (requiring nesting).

If the parent ns has re-used an active uid of its own to give to a child
namespace, then any DAC-governed objects (files in particular) will be
owned by that uid in the child user ns.  I'm not sure - does this also
apply to /proc/$$/fd/* access?

-serge

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-14 18:43                   ` Linus Torvalds
  2012-12-14 18:47                     ` Andy Lutomirski
@ 2012-12-14 20:50                     ` Serge E. Hallyn
  2012-12-14 21:43                     ` Eric W. Biederman
  2 siblings, 0 replies; 27+ messages in thread
From: Serge E. Hallyn @ 2012-12-14 20:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Serge E. Hallyn, containers,
	Linux Kernel Mailing List, Andy Lutomirski, LSM List

Quoting Linus Torvalds (torvalds@linux-foundation.org):
> On Fri, Dec 14, 2012 at 10:12 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > That said Serge I think I have lost track of the point of your question.
> 
> .. and I'm a bit unsure what I should do about this all. Including
> pulling the pull request that actually can make this all matter.

Sorry I didn't mean to complicate this.

I did ack the patch and we can cull the cc list for continued discussion.

In practical terms, the only thing the patch prevent is having two
separate tasks each clone a new user ns with the same uid mapping, and
having consistent relationships between the same uids between the
namespaces.  It's worth it to prevent (or while we consider) the case
Andy and Eric bring up.

> Hmm? Any consensus?

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>

-serge

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-14 18:43                   ` Linus Torvalds
  2012-12-14 18:47                     ` Andy Lutomirski
  2012-12-14 20:50                     ` Serge E. Hallyn
@ 2012-12-14 21:43                     ` Eric W. Biederman
  2 siblings, 0 replies; 27+ messages in thread
From: Eric W. Biederman @ 2012-12-14 21:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Serge E. Hallyn, containers, Linux Kernel Mailing List,
	Andy Lutomirski, LSM List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, Dec 14, 2012 at 10:12 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> That said Serge I think I have lost track of the point of your question.
>
> .. and I'm a bit unsure what I should do about this all. Including
> pulling the pull request that actually can make this all matter.
>
> Hmm? Any consensus?

It looks like we have consensus (baring the color of the shed)
of what the code should look like for v3.8.

>From the most embarrassingly timed, but most useful review by Andy I have
4 fixes queued up in my development tree.

Fixing cap_capable to test for the parent namespace.
Fixing setns to require nsown_capable(CAP_SYS_ADMIN)
--
Fixing commit_creds to not clear task dumpable unnecessarily.
Fixing a typo in the description.

What I would like to do is to do is what I would if this was not the
middle of the merge window with changes like this.

Toss those patches out for last round of review.

Possibly toss the last two patches if there are any problems because
they are not necessary.

Put the patches in my for-next branch and have them sit in linux-next
for a day or three. 

Send you an updated pull request.


I am recovering from a cold so I am running slower than I would like
this week and would really rather not rush getting these patches out.

What I don't want to be is so cautious and careful that you decide to
pass on my pull request.  The code is harmless with user namespaces
disabled.  The code has been baking for a long time, some of it for much
too long and it is as solid as I think it will get out before being
merged.  Nor is the code complex Andy managed to dig and figure it all
out in about a day.

Linus does that work for you?

Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-14 20:29                   ` Serge E. Hallyn
@ 2012-12-14 22:32                     ` Eric W. Biederman
  2012-12-15  0:14                       ` Serge E. Hallyn
  0 siblings, 1 reply; 27+ messages in thread
From: Eric W. Biederman @ 2012-12-14 22:32 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Linus Torvalds, containers, Linux Kernel Mailing List,
	Andy Lutomirski, linux-security-module

"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):

>> A child user namespace having capabilities against processes in it's
>> parent seems totally bizarre and pretty dangerous from a capabilities
>> standpoint.
>
> How would it have them against its parent?

init_user_ns
   userns a --- created by kuid 1
     userns b -- created by kuid 2
        process c in userns b with kuid 1

Serge read the first permisison check in common_cap. 
Think what happens in the above example.

For the rest I understand your concern.

Serge please read and look at the patches I have posted to fix
the issues Andy found with the user namespace tree.  Especially
the fix to commit_creds.

After you have looked at the patches to fix the issues I will
be happy to discuss things further with you.

Eric


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps.
  2012-12-14 22:32                     ` Eric W. Biederman
@ 2012-12-15  0:14                       ` Serge E. Hallyn
  0 siblings, 0 replies; 27+ messages in thread
From: Serge E. Hallyn @ 2012-12-15  0:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Linus Torvalds, containers,
	Linux Kernel Mailing List, Andy Lutomirski,
	linux-security-module

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> 
> >> A child user namespace having capabilities against processes in it's
> >> parent seems totally bizarre and pretty dangerous from a capabilities
> >> standpoint.
> >
> > How would it have them against its parent?
> 
> init_user_ns
>    userns a --- created by kuid 1

Now a mapping needs to be set up (by a task with CAP_SYS_ADMIN in
init_user_ns) which allows kuids 1 and 2 to be used by userns a.
Otherwise (if no mapping is set up) userns a only has the overlowuid.

Realistically only kuids over 100000 (let's say) would used.  i.e.
kuids 100,000-199,999 would map to container uids 0-99,999.

>      userns b -- created by kuid 2

Now a mapping needs to be set up (by a task with CAP_SYS_ADMIN in
userns a) which allows kuids 1 and 2 to be used by userns b.

If userns had been mapped with kuids 100,000-199,999 mapping to
uids 0-99999, then only kuids in that range could be mapped into
userns b.

>         process c in userns b with kuid 1
>
> Serge read the first permisison check in common_cap. 
> Think what happens in the above example.
> 
> For the rest I understand your concern.

Ok.  Then we can discuss my concern later (after the new year).

> Serge please read and look at the patches I have posted to fix
> the issues Andy found with the user namespace tree.  Especially
> the fix to commit_creds.

The setns fixes were IMO the most important - and interesting - ones :)

Thanks, Andy!

> After you have looked at the patches to fix the issues I will
> be happy to discuss things further with you.

Thanks,
-serge

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2012-12-15  0:09 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-11 21:17 [GIT PULL] user namespace and namespace infrastructure changes for 3.8 Eric W. Biederman
2012-12-13 19:24 ` Andy Lutomirski
2012-12-13 22:01   ` Eric W. Biederman
2012-12-13 22:39     ` [RFC][PATCH] Fix cap_capable to only allow owners in the parent user namespace to have caps Eric W. Biederman
2012-12-13 22:43       ` Linus Torvalds
2012-12-13 22:55         ` Eric W. Biederman
2012-12-13 23:21       ` Andy Lutomirski
2012-12-14  2:33         ` Eric W. Biederman
2012-12-14  2:36           ` Andy Lutomirski
2012-12-14  3:20             ` [PATCH] " Eric W. Biederman
2012-12-14  3:28       ` [RFC][PATCH] " Serge E. Hallyn
2012-12-14  3:32         ` Eric W. Biederman
2012-12-14 15:26           ` Serge E. Hallyn
2012-12-14 15:47             ` Eric W. Biederman
2012-12-14 16:15               ` Serge E. Hallyn
2012-12-14 18:12                 ` Eric W. Biederman
2012-12-14 18:43                   ` Linus Torvalds
2012-12-14 18:47                     ` Andy Lutomirski
2012-12-14 20:50                     ` Serge E. Hallyn
2012-12-14 21:43                     ` Eric W. Biederman
2012-12-14 20:29                   ` Serge E. Hallyn
2012-12-14 22:32                     ` Eric W. Biederman
2012-12-15  0:14                       ` Serge E. Hallyn
2012-12-13 23:02     ` [GIT PULL] user namespace and namespace infrastructure changes for 3.8 Andy Lutomirski
2012-12-14  4:11       ` Eric W. Biederman
2012-12-14  5:34         ` Andy Lutomirski
2012-12-14 17:48           ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).