* user namespaces v3: continue targetting capabilities
@ 2011-09-02 19:56 Serge Hallyn
  2011-09-02 19:56   ` (unknown), Serge Hallyn
                   ` (11 more replies)
  0 siblings, 12 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap

This was last sent Jul 26, and incorporates feedback from that thread.
The last patch, 0015-make-kernel-signal.c-user-ns-safe-v2.patch, is new,
so could stand extra scrutiny.

This patchset is a basis for Eric's set which allows assigning a
filesystem to a user namespace
(http://git.kernel.org/?p=linux/kernel/git/ebiederm/linux-userns-devel.git),
which is the last hurdle to starting to employ user namespaces to help
constrain root in a container.  So if there is no more major feedback,
I'd love to see this get a spin in -mm so we can proceed with that.

[ v2 intro message: ]

Here is a set of patches to continue targeting capabilities
where appropriate.  This set goes about as far as is possible
without making the VFS user-namespace aware, that is, without
having the VFS provide a namespaced view of userids (e.g.
init_user_ns sees file owner 500, while a child user ns sees
file owner 0 or 1000).  (There are a few other things, like
siginfos, which can be addressed before we address the VFS.)

With this set applied, you can create and configure veth netdevs
if your user namespace owns your network namespace (and you are
privileged), but not otherwise.
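
The veth rule above can be pictured with a small userspace model.  This
is only a sketch, not kernel code: every toy_* name below is made up for
illustration, the capability bit is a stand-in, and the model assumes
that a capability held in a user namespace also applies to user
namespaces created beneath it.  It ignores the special treatment of a
namespace's creator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAP_NET_ADMIN (1u << 0)	/* stand-in bit, not the kernel's value */

/* Hypothetical model types; the kernel's real structures differ. */
struct toy_user_ns {
	const struct toy_user_ns *parent;	/* NULL for the initial namespace */
};

struct toy_net_ns {
	const struct toy_user_ns *user_ns;	/* user namespace owning this netns */
};

struct toy_task {
	const struct toy_user_ns *user_ns;	/* namespace the task lives in */
	unsigned int effective;			/* capability bits raised there */
	const struct toy_net_ns *net_ns;
};

/*
 * Targeted check: the capability counts only if the task's user namespace
 * is the target namespace or one of the target's ancestors.
 */
bool toy_ns_capable(const struct toy_task *t,
		    const struct toy_user_ns *target, unsigned int cap)
{
	const struct toy_user_ns *ns;

	for (ns = target; ns != NULL; ns = ns->parent)
		if (ns == t->user_ns)
			return (t->effective & cap) != 0;
	return false;
}

/* The check is aimed at the user namespace that owns the task's netns. */
bool toy_can_configure_veth(const struct toy_task *t)
{
	return toy_ns_capable(t, t->net_ns->user_ns, TOY_CAP_NET_ADMIN);
}

/* Walks through the two cases; call it from main() to run the checks. */
void toy_veth_demo(void)
{
	struct toy_user_ns init_ns = { NULL };
	struct toy_user_ns child_ns = { &init_ns };
	struct toy_net_ns host_net = { &init_ns };
	struct toy_net_ns private_net = { &child_ns };

	/* Privileged in the child userns, but still in the host's netns. */
	struct toy_task shared = { &child_ns, TOY_CAP_NET_ADMIN, &host_net };
	/* Privileged in the child userns, with a netns that userns owns. */
	struct toy_task isolated = { &child_ns, TOY_CAP_NET_ADMIN, &private_net };

	assert(!toy_can_configure_veth(&shared));
	assert(toy_can_configure_veth(&isolated));
}
```

In the model, the only thing that differs between the two tasks is which
user namespace owns their network namespace, which is the distinction
the paragraph above draws.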

Some simple testcases can be found at
https://code.launchpad.net/~serge-hallyn/+junk/usernstests with
packages at
https://launchpad.net/~serge-hallyn/+archive/userns-natty

Feedback very much appreciated.

* (no subject)
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
@ 2011-09-02 19:56   ` Serge Hallyn
  2011-09-02 19:56 ` Serge Hallyn
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge Hallyn

GIT: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
GIT: [PATCH 02/15] user ns: setns: move capable checks into per-ns attach
GIT: [PATCH 03/15] keyctl: check capabilities against key's user_ns
GIT: [PATCH 04/15] user_ns: convert fs/attr.c to targeted capabilities
GIT: [PATCH 05/15] userns: clamp down users of cap_raised
GIT: [PATCH 06/15] user namespace: make each net (net_ns) belong to a
GIT: [PATCH 07/15] user namespace: use net->user_ns for some capable
GIT: [PATCH 08/15] af_netlink.c: make netlink_capable userns-aware
GIT: [PATCH 09/15] user ns: convert ipv6 to targeted capabilities
GIT: [PATCH 10/15] net/core/scm.c: target capable() calls to user_ns
GIT: [PATCH 11/15] userns: make some net-sysfs capable calls targeted
GIT: [PATCH 12/15] user_ns: target af_key capability check
GIT: [PATCH 13/15] userns: net: make many network capable calls targeted
GIT: [PATCH 14/15] net: pass user_ns to cap_netlink_recv()
GIT: [PATCH 15/15] make kernel/signal.c user ns safe (v2)

* (no subject)
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
  2011-09-02 19:56   ` (unknown), Serge Hallyn
@ 2011-09-02 19:56 ` Serge Hallyn
       [not found]   ` <1314993400-6910-3-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
  2011-09-02 23:49   ` Eric W. Biederman
  2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap



* [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
  2011-09-02 19:56   ` (unknown), Serge Hallyn
  2011-09-02 19:56 ` Serge Hallyn
@ 2011-09-02 19:56 ` Serge Hallyn
  2011-09-07 22:50   ` Andrew Morton
                     ` (2 more replies)
  2011-09-02 19:56 ` [PATCH 07/15] user namespace: use net->user_ns for some capable calls under net/ Serge Hallyn
                   ` (8 subsequent siblings)
  11 siblings, 3 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge E. Hallyn, Serge E. Hallyn

From: "Serge E. Hallyn" <serge@hallyn.com>

Quoting David Howells (dhowells@redhat.com):
> Randy Dunlap <rdunlap@xenotime.net> wrote:
>
> > > +Any task in or resource belonging to the initial user namespace will, to this
> > > +new task, appear to belong to UID and GID -1 - which is usually known as
> >
> > that extra hyphen is confusing.  how about:
> >
> >                               to UID and GID -1, which is
>
> 'which are'.
>
> David

This will hold some info about the design.  Currently it contains
future todos, issues and questions.

Changelog:
   Jul 26: incorporate feedback from David Howells.
   Jul 29: incorporate feedback from Randy Dunlap.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
---
 Documentation/namespaces/user_namespace.txt |  107 +++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/namespaces/user_namespace.txt

diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
new file mode 100644
index 0000000..b0bc480
--- /dev/null
+++ b/Documentation/namespaces/user_namespace.txt
@@ -0,0 +1,107 @@
+Description
+===========
+
+Traditionally, each task is owned by a user ID (UID) and belongs to one or more
+groups (GID).  Both are simple numeric IDs, though userspace usually translates
+them to names.  The user namespace allows tasks to have different views of the
+UIDs and GIDs associated with tasks and other resources.  (See 'UID mapping'
+below for more.)
+
+The user namespace is a simple hierarchical one.  The system starts with all
+tasks belonging to the initial user namespace.  A task creates a new user
+namespace by passing the CLONE_NEWUSER flag to clone(2).  This requires the
+creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities,
+but it does not need to be running as root.  The clone(2) call will result in a
+new task which to itself appears to be running as UID and GID 0, but to its
+creator seems to have the creator's credentials.
+
+To this new task, any resource belonging to the initial user namespace will
+appear to belong to user and group 'nobody', which are UID and GID -1.
+Permission to open such files will be granted according to world access
+permissions.  UID comparisons and group membership checks will return false,
+and privilege will be denied.
+
+When a task belonging to (for example) userid 500 in the initial user namespace
+creates a new user namespace, even though the new task will see itself as
+belonging to UID 0, any task in the initial user namespace will see it as
+belonging to UID 500.  Therefore, UID 500 in the initial user namespace will be
+able to kill the new task.  Files created by the new user will (eventually) be
+seen by tasks in its own user namespace as belonging to UID 0, but to tasks in
+the initial user namespace as belonging to UID 500.
+
+Note that this userid mapping for the VFS is not yet implemented, though the
+lkml and containers mailing list archives will show several previous
+prototypes.  In the end, those got hung up waiting for the concept of targeted
+capabilities to be developed, which, thanks to the insight of Eric Biederman,
+has finally happened.
+
+Relationship between the User namespace and other namespaces
+============================================================
+
+Other namespaces, such as UTS and network, are owned by a user namespace.  When
+such a namespace is created, it is assigned to the user namespace of the task
+by which it was created.  Therefore, attempts to exercise privilege to
+resources in, for instance, a particular network namespace, can be properly
+validated by checking whether the caller has the needed privilege (e.g.
+CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace.
+This is done using the ns_capable() function.
+
+As an example, if a new task is cloned with a private user namespace but
+no private network namespace, then the task's network namespace is owned
+by the parent user namespace.  The new task has no privilege to the
+parent user namespace, so it will not be able to create or configure
+network devices.  If, instead, the task were cloned with both private
+user and network namespaces, then the private network namespace is owned
+by the private user namespace, and so root in the new user namespace
+will have privilege targeted to the network namespace.  It will be able
+to create and configure network devices.
+
+UID Mapping
+===========
+The current plan (see 'flexible UID mapping' at
+https://wiki.ubuntu.com/UserNamespace) is:
+
+The UID/GID stored on disk will be that in the init_user_ns.  Most likely
+UID/GID in other namespaces will be stored in xattrs.  But Eric was advocating
+(a few years ago) leaving the details up to filesystems while providing a lib/
+stock implementation.  See the thread around here:
+http://www.mail-archive.com/devel@openvz.org/msg09331.html
+
+
+Working notes
+=============
+Capability checks for actions related to syslog must be against the
+init_user_ns until syslog is containerized.
+
+Same is true for reboot and power, control groups, devices, and time.
+
+Perf actions (kernel/event/core.c for instance) will always be constrained to
+init_user_ns.
+
+Q:
+Is accounting considered properly containerized with respect to pidns?  (it
+appears to be).  If so, then we can change the capable() check in
+kernel/acct.c to 'ns_capable(current_pid_ns()->user_ns, CAP_PACCT)'
+
+Q:
+For things like nice and schedaffinity, we could allow root in a container to
+control those, and leave only cgroups to constrain the container.  I'm not sure
+whether that is right, or whether it violates admin expectations.
+
+I deferred some of commoncap.c.  I'm punting on xattr stuff as they take
+dentries, not inodes.
+
+For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of
+them) target the capability checks at the user_ns owning the tty.  That will
+have to wait until we get userns owning files straightened out.
+
+We need to figure out how to label devices.  Should we just toss a user_ns
+right into struct device?
+
+capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless
+LSMs are some day containerized, which has a near-zero chance of happening.
+
+inode_owner_or_capable() should probably take optional ns and cap parameters.
+If cap is 0, then CAP_FOWNER is checked.  If ns is NULL, we derive the ns from
+the inode.  But if ns is provided, then callers who need to derive
+inode_userns(inode) anyway can save a few cycles.
-- 
1.7.5.4
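
The UID views described in the documentation above (the new task sees
itself as UID 0, the initial namespace sees it as the creator's UID 500,
and initial-namespace resources look like UID -1 from inside) can be
modelled in a few lines of userspace C.  This is a sketch under stated
assumptions, not the kernel's mapping code: the toy_* names are
hypothetical, and the model assumes every UID inside a namespace is seen
from the parent as the UID of the namespace's creator.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int toy_uid_t;
#define TOY_UID_NOBODY ((toy_uid_t)-1)	/* the "UID -1" of the text */

/* Hypothetical model types; the kernel's real structures differ. */
struct toy_user_ns {
	const struct toy_user_ns *parent;	/* NULL for the initial namespace */
	toy_uid_t creator_uid;			/* creator's UID as seen in parent */
};

struct toy_owner {
	const struct toy_user_ns *ns;	/* namespace the task or file belongs to */
	toy_uid_t uid;			/* its UID as seen inside that namespace */
};

/* Return the UID that a task in 'viewer' sees for 'owner'. */
toy_uid_t toy_view_uid(const struct toy_user_ns *viewer, struct toy_owner owner)
{
	const struct toy_user_ns *ns = owner.ns;
	toy_uid_t uid = owner.uid;

	/*
	 * Climb from the owner's namespace toward the initial one.  Each
	 * step up replaces the UID with that of the namespace's creator.
	 */
	while (ns != viewer) {
		if (ns->parent == NULL)
			return TOY_UID_NOBODY;	/* viewer is not an ancestor */
		uid = ns->creator_uid;
		ns = ns->parent;
	}
	return uid;
}

/* The UID 500 example from the text; call it from main() to run it. */
void toy_uid_demo(void)
{
	struct toy_user_ns init_ns = { NULL, 0 };
	struct toy_user_ns child_ns = { &init_ns, 500 };
	struct toy_owner new_task = { &child_ns, 0 };
	struct toy_owner host_file = { &init_ns, 500 };

	assert(toy_view_uid(&child_ns, new_task) == 0);	/* sees itself as root */
	assert(toy_view_uid(&init_ns, new_task) == 500);	/* creator's view */
	assert(toy_view_uid(&child_ns, host_file) == TOY_UID_NOBODY);
}
```

The VFS part of this mapping is, as the text notes, not implemented by
this patchset; the model only restates the intended views.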


* [PATCH 02/15] user ns: setns: move capable checks into per-ns attach helper
@ 2011-09-02 19:56     ` Serge Hallyn
  0 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge E. Hallyn, Serge E. Hallyn

From: "Serge E. Hallyn" <serge@hallyn.com>

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 ipc/namespace.c          |    7 +++++++
 kernel/fork.c            |    5 +++++
 kernel/nsproxy.c         |   11 ++++++++---
 kernel/utsname.c         |    7 +++++++
 net/core/net_namespace.c |    7 +++++++
 5 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/ipc/namespace.c b/ipc/namespace.c
index ce0a647..a0a7609 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -163,6 +163,13 @@ static void ipcns_put(void *ns)
 
 static int ipcns_install(struct nsproxy *nsproxy, void *ns)
 {
+#if 0
+	struct ipc_namespace *newns = ns;
+	if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN))
+#else
+	if (!capable(CAP_SYS_ADMIN))
+#endif
+		return -EPERM;
 	/* Ditch state from the old ipc namespace */
 	exit_sem(current);
 	put_ipc_ns(nsproxy->ipc_ns);
diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..ca712f5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1489,8 +1489,13 @@ long do_fork(unsigned long clone_flags,
 		/* hopefully this check will go away when userns support is
 		 * complete
 		 */
+#if 0
+		if (!nsown_capable(CAP_SYS_ADMIN) || !nsown_capable(CAP_SETUID) ||
+				!nsown_capable(CAP_SETGID))
+#else
 		if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
 				!capable(CAP_SETGID))
+#endif
 			return -EPERM;
 	}
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 9aeab4b..e274577 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -134,7 +134,11 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 				CLONE_NEWPID | CLONE_NEWNET)))
 		return 0;
 
+#if 0
+	if (!nsown_capable(CAP_SYS_ADMIN)) {
+#else
 	if (!capable(CAP_SYS_ADMIN)) {
+#endif
 		err = -EPERM;
 		goto out;
 	}
@@ -191,7 +195,11 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 			       CLONE_NEWNET)))
 		return 0;
 
+#if 0
+	if (!nsown_capable(CAP_SYS_ADMIN))
+#else
 	if (!capable(CAP_SYS_ADMIN))
+#endif
 		return -EPERM;
 
 	*new_nsp = create_new_namespaces(unshare_flags, current,
@@ -241,9 +249,6 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
 	struct file *file;
 	int err;
 
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
-
 	file = proc_ns_fget(fd);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
diff --git a/kernel/utsname.c b/kernel/utsname.c
index bff131b..4638a54 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -104,6 +104,13 @@ static void utsns_put(void *ns)
 
 static int utsns_install(struct nsproxy *nsproxy, void *ns)
 {
+#if 0
+	struct uts_namespace *newns = ns;
+	if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN))
+#else
+	if (!capable(CAP_SYS_ADMIN))
+#endif
+		return -EPERM;
 	get_uts_ns(ns);
 	put_uts_ns(nsproxy->uts_ns);
 	nsproxy->uts_ns = ns;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 5bbdbf0..6f6698d 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -620,6 +620,13 @@ static void netns_put(void *ns)
 
 static int netns_install(struct nsproxy *nsproxy, void *ns)
 {
+#if 0
+	struct net *net = ns;
+	if (!ns_capable(net->user_ns, CAP_SYS_ADMIN))
+#else
+	if (!capable(CAP_SYS_ADMIN))
+#endif
+		return -EPERM;
 	put_net(nsproxy->net_ns);
 	nsproxy->net_ns = get_net(ns);
 	return 0;
-- 
1.7.5.4
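
The shape of the change in this patch, with the privilege check moving
out of the generic setns() entry point and into each namespace type's
attach helper, can be sketched in userspace C.  All toy_* names are
hypothetical and the single boolean stands in for a real capability
check; this is a model of the structure, not of the kernel's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a caller; only the one privilege bit matters here. */
struct toy_task {
	bool cap_sys_admin;
};

/* Hypothetical per-namespace-type operations table. */
struct toy_ns_ops {
	const char *name;
	int (*install)(const struct toy_task *task);
};

/* Each attach helper performs its own privilege check. */
int toy_utsns_install(const struct toy_task *task)
{
	if (!task->cap_sys_admin)
		return -EPERM;
	/* ... a real helper would switch the UTS namespace here ... */
	return 0;
}

int toy_netns_install(const struct toy_task *task)
{
	if (!task->cap_sys_admin)
		return -EPERM;
	/* ... a real helper would switch the network namespace here ... */
	return 0;
}

const struct toy_ns_ops toy_uts_ops = { "uts", toy_utsns_install };
const struct toy_ns_ops toy_net_ops = { "net", toy_netns_install };

/* The generic entry point only dispatches; it has no check of its own. */
int toy_setns(const struct toy_task *task, const struct toy_ns_ops *ops)
{
	return ops->install(task);
}

/* Exercises both helpers; call it from main() to run the checks. */
void toy_setns_demo(void)
{
	struct toy_task admin = { true };
	struct toy_task user = { false };

	assert(toy_setns(&admin, &toy_uts_ops) == 0);
	assert(toy_setns(&admin, &toy_net_ops) == 0);
	assert(toy_setns(&user, &toy_uts_ops) == -EPERM);
	assert(toy_setns(&user, &toy_net_ops) == -EPERM);
}
```

Keeping the check in the helper is what would later let each namespace
type aim it at the user namespace owning the namespace being attached,
as the "#if 0" branches in the diff above suggest.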


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 02/15] user ns: setns: move capable checks into per-ns attach helper
@ 2011-09-02 19:56     ` Serge Hallyn
  0 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

From: "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 ipc/namespace.c          |    7 +++++++
 kernel/fork.c            |    5 +++++
 kernel/nsproxy.c         |   11 ++++++++---
 kernel/utsname.c         |    7 +++++++
 net/core/net_namespace.c |    7 +++++++
 5 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/ipc/namespace.c b/ipc/namespace.c
index ce0a647..a0a7609 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -163,6 +163,13 @@ static void ipcns_put(void *ns)
 
 static int ipcns_install(struct nsproxy *nsproxy, void *ns)
 {
+#if 0
+	struct ipc_namespace *newns = ns;
+	if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN))
+#else
+	if (!capable(CAP_SYS_ADMIN))
+#endif
+		return -1;
 	/* Ditch state from the old ipc namespace */
 	exit_sem(current);
 	put_ipc_ns(nsproxy->ipc_ns);
diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..ca712f5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1489,8 +1489,13 @@ long do_fork(unsigned long clone_flags,
 		/* hopefully this check will go away when userns support is
 		 * complete
 		 */
+#if 0
+		if (!nsown_capable(CAP_SYS_ADMIN) || !nsown_capable(CAP_SETUID) ||
+				!nsown_capable(CAP_SETGID))
+#else
 		if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
 				!capable(CAP_SETGID))
+#endif
 			return -EPERM;
 	}
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 9aeab4b..e274577 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -134,7 +134,11 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 				CLONE_NEWPID | CLONE_NEWNET)))
 		return 0;
 
+#if 0
+	if (!nsown_capable(CAP_SYS_ADMIN)) {
+#else
 	if (!capable(CAP_SYS_ADMIN)) {
+#endif
 		err = -EPERM;
 		goto out;
 	}
@@ -191,7 +195,11 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 			       CLONE_NEWNET)))
 		return 0;
 
+#if 0
+	if (!nsown_capable(CAP_SYS_ADMIN))
+#else
 	if (!capable(CAP_SYS_ADMIN))
+#endif
 		return -EPERM;
 
 	*new_nsp = create_new_namespaces(unshare_flags, current,
@@ -241,9 +249,6 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
 	struct file *file;
 	int err;
 
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
-
 	file = proc_ns_fget(fd);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
diff --git a/kernel/utsname.c b/kernel/utsname.c
index bff131b..4638a54 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -104,6 +104,13 @@ static void utsns_put(void *ns)
 
 static int utsns_install(struct nsproxy *nsproxy, void *ns)
 {
+#if 0
+	struct uts_namespace *newns = ns;
+	if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN))
+#else
+	if (!capable(CAP_SYS_ADMIN))
+#endif
+		return -1;
 	get_uts_ns(ns);
 	put_uts_ns(nsproxy->uts_ns);
 	nsproxy->uts_ns = ns;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 5bbdbf0..6f6698d 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -620,6 +620,13 @@ static void netns_put(void *ns)
 
 static int netns_install(struct nsproxy *nsproxy, void *ns)
 {
+#if 0
+	struct net *net = ns;
+	if (!ns_capable(net->user_ns, CAP_SYS_ADMIN))
+#else
+	if (!capable(CAP_SYS_ADMIN))
+#endif
+		return -1;
 	put_net(nsproxy->net_ns);
 	nsproxy->net_ns = get_net(ns);
 	return 0;
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 03/15] keyctl: check capabilities against key's user_ns
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
  2011-09-02 19:56   ` (unknown), Serge Hallyn
@ 2011-09-02 19:56     ` Serge Hallyn
  2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn
                       ` (9 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

At the moment a task should only be able to get keys belonging to its
own user_ns anyway, so nsown_capable() would also work.  There is no
advantage to that, though, and checking against the key's user_ns is
clearer.

changelog: jun 6:
	compile fix: keyctl.c (key_user, not key, has user_ns)

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Acked-by: David Howells <dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
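A small userspace sketch of what "checking against the key's user_ns"
means.  The types and the model_ns_capable() walk are assumptions made
for illustration: the walk only handles "same namespace or an ancestor
of it" and leaves out other rules the real check may apply.  None of
this is the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the kernel's definitions. */
#define MODEL_CAP_SYS_ADMIN (1u << 0)

struct model_user_ns {
	const struct model_user_ns *parent;	/* NULL for the initial ns */
};

struct model_task {
	const struct model_user_ns *user_ns;	/* namespace the task is in */
	unsigned int caps;			/* capabilities held there */
	unsigned int fsuid;
};

struct model_key {
	const struct model_user_ns *owner_ns;	/* stands in for key->user->user_ns */
	unsigned int uid;
	unsigned int perm;
};

/*
 * Simplified rule: the task is privileged toward @target only if @target
 * is the task's own namespace or a descendant of it, and the task holds
 * @cap in its own namespace.
 */
static bool model_ns_capable(const struct model_task *task,
			     const struct model_user_ns *target,
			     unsigned int cap)
{
	const struct model_user_ns *ns;

	assert(task != NULL && target != NULL);
	for (ns = target; ns != NULL; ns = ns->parent) {
		if (ns == task->user_ns)
			return (task->caps & cap) != 0;
	}
	return false;
}

/* Same shape as the keyctl_setperm_key() test in the diff below. */
static int model_setperm(const struct model_task *task,
			 struct model_key *key, unsigned int perm)
{
	if (model_ns_capable(task, key->owner_ns, MODEL_CAP_SYS_ADMIN) ||
	    key->uid == task->fsuid) {
		key->perm = perm;
		return 0;
	}
	return -EACCES;
}
```

In this model, root inside a child namespace holds CAP_SYS_ADMIN there
but is still refused on a key owned by the initial namespace, which is
the behaviour the targeted check is after.
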
 security/keys/keyctl.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index eca5191..fa7d420 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -745,7 +745,7 @@ long keyctl_chown_key(key_serial_t id, uid_t uid, gid_t gid)
 	ret = -EACCES;
 	down_write(&key->sem);
 
-	if (!capable(CAP_SYS_ADMIN)) {
+	if (!ns_capable(key->user->user_ns, CAP_SYS_ADMIN)) {
 		/* only the sysadmin can chown a key to some other UID */
 		if (uid != (uid_t) -1 && key->uid != uid)
 			goto error_put;
@@ -852,7 +852,8 @@ long keyctl_setperm_key(key_serial_t id, key_perm_t perm)
 	down_write(&key->sem);
 
 	/* if we're not the sysadmin, we can only change a key that we own */
-	if (capable(CAP_SYS_ADMIN) || key->uid == current_fsuid()) {
+	if (ns_capable(key->user->user_ns, CAP_SYS_ADMIN) ||
+	    key->uid == current_fsuid()) {
 		key->perm = perm;
 		ret = 0;
 	}
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 03/15] keyctl: check capabilities against key's user_ns
@ 2011-09-02 19:56     ` Serge Hallyn
  0 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge E. Hallyn

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

At the moment a task should only be able to get keys belonging to its
own user_ns anyway, so nsown_capable() would also work.  There is no
advantage to that, though, and checking against the key's user_ns is
clearer.

changelog: jun 6:
	compile fix: keyctl.c (key_user, not key, has user_ns)

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 security/keys/keyctl.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index eca5191..fa7d420 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -745,7 +745,7 @@ long keyctl_chown_key(key_serial_t id, uid_t uid, gid_t gid)
 	ret = -EACCES;
 	down_write(&key->sem);
 
-	if (!capable(CAP_SYS_ADMIN)) {
+	if (!ns_capable(key->user->user_ns, CAP_SYS_ADMIN)) {
 		/* only the sysadmin can chown a key to some other UID */
 		if (uid != (uid_t) -1 && key->uid != uid)
 			goto error_put;
@@ -852,7 +852,8 @@ long keyctl_setperm_key(key_serial_t id, key_perm_t perm)
 	down_write(&key->sem);
 
 	/* if we're not the sysadmin, we can only change a key that we own */
-	if (capable(CAP_SYS_ADMIN) || key->uid == current_fsuid()) {
+	if (ns_capable(key->user->user_ns, CAP_SYS_ADMIN) ||
+	    key->uid == current_fsuid()) {
 		key->perm = perm;
 		ret = 0;
 	}
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 04/15] user_ns: convert fs/attr.c to targeted capabilities
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
  2011-09-02 19:56   ` (unknown), Serge Hallyn
@ 2011-09-02 19:56     ` Serge Hallyn
  2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn
                       ` (9 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
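The ATTR_UID test in the diff below is a three-way "or" that can be
hard to read.  Here is a userspace sketch of the same boolean logic,
restated positively.  The struct and field names are invented for
illustration, and the two booleans stand in for the results of
inode_userns(inode) == current_user_ns() and ns_capable(ns, CAP_CHOWN).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inputs to the check; not a kernel structure. */
struct model_chown_request {
	bool same_user_ns;		/* inode's user_ns is the caller's */
	unsigned int fsuid;		/* caller's fsuid */
	unsigned int inode_uid;		/* current owner of the inode */
	unsigned int new_uid;		/* requested owner (attr->ia_uid) */
	bool cap_chown_in_inode_ns;	/* CAP_CHOWN toward the inode's ns */
};

static int model_chown_ok(const struct model_chown_request *req)
{
	bool owner_noop;

	assert(req != NULL);
	/*
	 * Without CAP_CHOWN the only permitted "chown" is the owner, in the
	 * same user namespace as the inode, setting the uid it already has.
	 */
	owner_noop = req->same_user_ns &&
		     req->fsuid == req->inode_uid &&
		     req->new_uid == req->inode_uid;

	if (!owner_noop && !req->cap_chown_in_inode_ns)
		return -EPERM;
	return 0;
}
```

The new same_user_ns term matters because uid values are only
comparable inside one user namespace: a matching number from another
namespace no longer counts as ownership.
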
 fs/attr.c |   20 +++++++++++++-------
 1 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 538e279..e0cf46a 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -29,6 +29,7 @@
 int inode_change_ok(const struct inode *inode, struct iattr *attr)
 {
 	unsigned int ia_valid = attr->ia_valid;
+	struct user_namespace *ns;
 
 	/*
 	 * First check size constraints.  These can't be overriden using
@@ -44,26 +45,28 @@ int inode_change_ok(const struct inode *inode, struct iattr *attr)
 	if (ia_valid & ATTR_FORCE)
 		return 0;
 
+	ns = inode_userns(inode);
 	/* Make sure a caller can chown. */
 	if ((ia_valid & ATTR_UID) &&
-	    (current_fsuid() != inode->i_uid ||
-	     attr->ia_uid != inode->i_uid) && !capable(CAP_CHOWN))
+	    (ns != current_user_ns() || current_fsuid() != inode->i_uid ||
+	     attr->ia_uid != inode->i_uid) && !ns_capable(ns, CAP_CHOWN))
 		return -EPERM;
 
 	/* Make sure caller can chgrp. */
 	if ((ia_valid & ATTR_GID) &&
-	    (current_fsuid() != inode->i_uid ||
+	    (ns != current_user_ns() || current_fsuid() != inode->i_uid ||
 	    (!in_group_p(attr->ia_gid) && attr->ia_gid != inode->i_gid)) &&
-	    !capable(CAP_CHOWN))
+	    !ns_capable(ns, CAP_CHOWN))
 		return -EPERM;
 
 	/* Make sure a caller can chmod. */
 	if (ia_valid & ATTR_MODE) {
+		gid_t gid = (ia_valid & ATTR_GID) ? attr->ia_gid : inode->i_gid;
 		if (!inode_owner_or_capable(inode))
 			return -EPERM;
 		/* Also check the setgid bit! */
-		if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
-				inode->i_gid) && !capable(CAP_FSETID))
+		if ((ns != current_user_ns() || !in_group_p(gid)) &&
+		    !ns_capable(ns, CAP_FSETID))
 			attr->ia_mode &= ~S_ISGID;
 	}
 
@@ -154,9 +157,12 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
 						inode->i_sb->s_time_gran);
 	if (ia_valid & ATTR_MODE) {
 		umode_t mode = attr->ia_mode;
+		struct user_namespace *ns = inode_userns(inode);
 
-		if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
+		if ((ns != current_user_ns() || !in_group_p(inode->i_gid)) &&
+		    !ns_capable(ns, CAP_FSETID))
 			mode &= ~S_ISGID;
+
 		inode->i_mode = mode;
 	}
 }
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 04/15] user_ns: convert fs/attr.c to targeted capabilities
@ 2011-09-02 19:56     ` Serge Hallyn
  0 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge E. Hallyn

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/attr.c |   20 +++++++++++++-------
 1 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 538e279..e0cf46a 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -29,6 +29,7 @@
 int inode_change_ok(const struct inode *inode, struct iattr *attr)
 {
 	unsigned int ia_valid = attr->ia_valid;
+	struct user_namespace *ns;
 
 	/*
 	 * First check size constraints.  These can't be overriden using
@@ -44,26 +45,28 @@ int inode_change_ok(const struct inode *inode, struct iattr *attr)
 	if (ia_valid & ATTR_FORCE)
 		return 0;
 
+	ns = inode_userns(inode);
 	/* Make sure a caller can chown. */
 	if ((ia_valid & ATTR_UID) &&
-	    (current_fsuid() != inode->i_uid ||
-	     attr->ia_uid != inode->i_uid) && !capable(CAP_CHOWN))
+	    (ns != current_user_ns() || current_fsuid() != inode->i_uid ||
+	     attr->ia_uid != inode->i_uid) && !ns_capable(ns, CAP_CHOWN))
 		return -EPERM;
 
 	/* Make sure caller can chgrp. */
 	if ((ia_valid & ATTR_GID) &&
-	    (current_fsuid() != inode->i_uid ||
+	    (ns != current_user_ns() || current_fsuid() != inode->i_uid ||
 	    (!in_group_p(attr->ia_gid) && attr->ia_gid != inode->i_gid)) &&
-	    !capable(CAP_CHOWN))
+	    !ns_capable(ns, CAP_CHOWN))
 		return -EPERM;
 
 	/* Make sure a caller can chmod. */
 	if (ia_valid & ATTR_MODE) {
+		gid_t gid = (ia_valid & ATTR_GID) ? attr->ia_gid : inode->i_gid;
 		if (!inode_owner_or_capable(inode))
 			return -EPERM;
 		/* Also check the setgid bit! */
-		if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
-				inode->i_gid) && !capable(CAP_FSETID))
+		if ((ns != current_user_ns() || !in_group_p(gid)) &&
+		    !ns_capable(ns, CAP_FSETID))
 			attr->ia_mode &= ~S_ISGID;
 	}
 
@@ -154,9 +157,12 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
 						inode->i_sb->s_time_gran);
 	if (ia_valid & ATTR_MODE) {
 		umode_t mode = attr->ia_mode;
+		struct user_namespace *ns = inode_userns(inode);
 
-		if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
+		if ((ns != current_user_ns() || !in_group_p(inode->i_gid)) &&
+		    !ns_capable(ns, CAP_FSETID))
 			mode &= ~S_ISGID;
+
 		inode->i_mode = mode;
 	}
 }
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 05/15] userns: clamp down users of cap_raised
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
  2011-09-02 19:56   ` (unknown), Serge Hallyn
@ 2011-09-02 19:56     ` Serge Hallyn
  2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn
                       ` (9 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

A few modules use cap_raised(current_cap(), cap) to authorize actions,
but cap_raised() only tests the caller's capability set and ignores
which user namespace the caller is in, while these actions require
privilege in the initial user namespace.  Refuse privilege if the
caller is not in init_user_ns.

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
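A userspace sketch of the two-step check each callback ends up with:
first the namespace test, then the raw capability bit.  The names
(model_user_ns, model_init_user_ns, model_cn_allowed) are invented for
illustration and are not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the kernel's definitions. */
#define MODEL_CAP_SYS_ADMIN (1u << 0)

struct model_user_ns {
	const char *name;
};

/* Stands in for init_user_ns. */
static const struct model_user_ns model_init_user_ns = { "init" };

static bool model_cn_allowed(const struct model_user_ns *caller_ns,
			     unsigned int caller_caps)
{
	assert(caller_ns != NULL);
	/* A capability bit set in a child namespace must not count here. */
	if (caller_ns != &model_init_user_ns)
		return false;
	/* Equivalent of the existing cap_raised() test. */
	return (caller_caps & MODEL_CAP_SYS_ADMIN) != 0;
}
```

The order of the two tests does not change the result; putting the
namespace test first just mirrors the diff below.
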
 drivers/block/drbd/drbd_nl.c           |    5 +++++
 drivers/md/dm-log-userspace-transfer.c |    3 +++
 drivers/staging/pohmelfs/config.c      |    3 +++
 drivers/video/uvesafb.c                |    3 +++
 4 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 0feab26..9a87a14 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -2297,6 +2297,11 @@ static void drbd_connector_callback(struct cn_msg *req, struct netlink_skb_parms
 		return;
 	}
 
+	if (current_user_ns() != &init_user_ns) {
+		retcode = ERR_PERM;
+		goto fail;
+	}
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN)) {
 		retcode = ERR_PERM;
 		goto fail;
diff --git a/drivers/md/dm-log-userspace-transfer.c b/drivers/md/dm-log-userspace-transfer.c
index 1f23e04..140ca81 100644
--- a/drivers/md/dm-log-userspace-transfer.c
+++ b/drivers/md/dm-log-userspace-transfer.c
@@ -134,6 +134,9 @@ static void cn_ulog_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
 {
 	struct dm_ulog_request *tfr = (struct dm_ulog_request *)(msg + 1);
 
+	if (current_user_ns() != &init_user_ns)
+		return;
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN))
 		return;
 
diff --git a/drivers/staging/pohmelfs/config.c b/drivers/staging/pohmelfs/config.c
index b6c42cb..cd259d0 100644
--- a/drivers/staging/pohmelfs/config.c
+++ b/drivers/staging/pohmelfs/config.c
@@ -525,6 +525,9 @@ static void pohmelfs_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *n
 {
 	int err;
 
+	if (current_user_ns() != &init_user_ns)
+		return;
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN))
 		return;
 
diff --git a/drivers/video/uvesafb.c b/drivers/video/uvesafb.c
index 7f8472c..71dab8e 100644
--- a/drivers/video/uvesafb.c
+++ b/drivers/video/uvesafb.c
@@ -73,6 +73,9 @@ static void uvesafb_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *ns
 	struct uvesafb_task *utask;
 	struct uvesafb_ktask *task;
 
+	if (current_user_ns() != &init_user_ns)
+		return;
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN))
 		return;
 
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 05/15] userns: clamp down users of cap_raised
@ 2011-09-02 19:56     ` Serge Hallyn
  0 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge E. Hallyn

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

A few modules use cap_raised(current_cap(), cap) to authorize actions,
but cap_raised() only tests the caller's capability set and ignores
which user namespace the caller is in, while these actions require
privilege in the initial user namespace.  Refuse privilege if the
caller is not in init_user_ns.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/block/drbd/drbd_nl.c           |    5 +++++
 drivers/md/dm-log-userspace-transfer.c |    3 +++
 drivers/staging/pohmelfs/config.c      |    3 +++
 drivers/video/uvesafb.c                |    3 +++
 4 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 0feab26..9a87a14 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -2297,6 +2297,11 @@ static void drbd_connector_callback(struct cn_msg *req, struct netlink_skb_parms
 		return;
 	}
 
+	if (current_user_ns() != &init_user_ns) {
+		retcode = ERR_PERM;
+		goto fail;
+	}
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN)) {
 		retcode = ERR_PERM;
 		goto fail;
diff --git a/drivers/md/dm-log-userspace-transfer.c b/drivers/md/dm-log-userspace-transfer.c
index 1f23e04..140ca81 100644
--- a/drivers/md/dm-log-userspace-transfer.c
+++ b/drivers/md/dm-log-userspace-transfer.c
@@ -134,6 +134,9 @@ static void cn_ulog_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
 {
 	struct dm_ulog_request *tfr = (struct dm_ulog_request *)(msg + 1);
 
+	if (current_user_ns() != &init_user_ns)
+		return;
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN))
 		return;
 
diff --git a/drivers/staging/pohmelfs/config.c b/drivers/staging/pohmelfs/config.c
index b6c42cb..cd259d0 100644
--- a/drivers/staging/pohmelfs/config.c
+++ b/drivers/staging/pohmelfs/config.c
@@ -525,6 +525,9 @@ static void pohmelfs_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *n
 {
 	int err;
 
+	if (current_user_ns() != &init_user_ns)
+		return;
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN))
 		return;
 
diff --git a/drivers/video/uvesafb.c b/drivers/video/uvesafb.c
index 7f8472c..71dab8e 100644
--- a/drivers/video/uvesafb.c
+++ b/drivers/video/uvesafb.c
@@ -73,6 +73,9 @@ static void uvesafb_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *ns
 	struct uvesafb_task *utask;
 	struct uvesafb_ktask *task;
 
+	if (current_user_ns() != &init_user_ns)
+		return;
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN))
 		return;
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 06/15] user namespace: make each net (net_ns) belong to a user_ns
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
  2011-09-02 19:56   ` (unknown), Serge Hallyn
@ 2011-09-02 19:56     ` Serge Hallyn
  2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn
                       ` (9 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

This way we can target capabilities at the user_ns which created the
net ns.

Changelog:
   jul 8: nsproxy: don't assign netns->userns if not cloning.

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
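A userspace sketch of the ownership and reference counting this patch
sets up: a new net namespace takes a reference on the user_ns of the
task that created it and drops it when freed.  All names below are
invented for illustration, and a plain int stands in for the kernel's
reference counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; these are not the kernel's definitions. */
struct model_user_ns {
	int refcount;
};

struct model_net {
	struct model_user_ns *user_ns;	/* namespace that created this net */
};

static struct model_user_ns *model_get_user_ns(struct model_user_ns *ns)
{
	assert(ns != NULL);
	ns->refcount++;
	return ns;
}

static void model_put_user_ns(struct model_user_ns *ns)
{
	assert(ns != NULL && ns->refcount > 0);
	ns->refcount--;
}

/* Creation pins the creator's user_ns for the lifetime of the net. */
static struct model_net *model_net_create(struct model_user_ns *creator_ns)
{
	struct model_net *net = calloc(1, sizeof(*net));

	if (net == NULL)
		return NULL;
	net->user_ns = model_get_user_ns(creator_ns);
	return net;
}

/* Freeing the net releases the reference taken at creation. */
static void model_net_free(struct model_net *net)
{
	assert(net != NULL);
	model_put_user_ns(net->user_ns);
	free(net);
}
```

With the owner recorded on the net, a later check can ask for
privilege toward net->user_ns rather than toward the initial
namespace.
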
 include/net/net_namespace.h |    2 ++
 kernel/nsproxy.c            |    2 ++
 net/core/net_namespace.c    |    3 +++
 3 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 3bb6fa0..d91fe5f 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -29,6 +29,7 @@ struct ctl_table_header;
 struct net_generic;
 struct sock;
 struct netns_ipvs;
+struct user_namespace;
 
 
 #define NETDEV_HASHBITS    8
@@ -101,6 +102,7 @@ struct net {
 	struct netns_xfrm	xfrm;
 #endif
 	struct netns_ipvs	*ipvs;
+	struct user_namespace	*user_ns;
 };
 
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index e274577..752b477 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -95,6 +95,8 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		err = PTR_ERR(new_nsp->net_ns);
 		goto out_net;
 	}
+	if (flags & CLONE_NEWNET)
+		new_nsp->net_ns->user_ns = get_user_ns(task_cred_xxx(tsk, user_ns));
 
 	return new_nsp;
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6f6698d..5ca95cc 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -10,6 +10,7 @@
 #include <linux/nsproxy.h>
 #include <linux/proc_fs.h>
 #include <linux/file.h>
+#include <linux/user_namespace.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
@@ -209,6 +210,7 @@ static void net_free(struct net *net)
 	}
 #endif
 	kfree(net->gen);
+	put_user_ns(net->user_ns);
 	kmem_cache_free(net_cachep, net);
 }
 
@@ -389,6 +391,7 @@ static int __init net_ns_init(void)
 	rcu_assign_pointer(init_net.gen, ng);
 
 	mutex_lock(&net_mutex);
+	init_net.user_ns = &init_user_ns;
 	if (setup_net(&init_net))
 		panic("Could not setup the initial network namespace");
 
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 06/15] user namespace: make each net (net_ns) belong to a user_ns
@ 2011-09-02 19:56     ` Serge Hallyn
  0 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge E. Hallyn

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

This way we can target capabilities at the user_ns which created the
net ns.

Changelog:
   jul 8: nsproxy: don't assign netns->userns if not cloning.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 include/net/net_namespace.h |    2 ++
 kernel/nsproxy.c            |    2 ++
 net/core/net_namespace.c    |    3 +++
 3 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 3bb6fa0..d91fe5f 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -29,6 +29,7 @@ struct ctl_table_header;
 struct net_generic;
 struct sock;
 struct netns_ipvs;
+struct user_namespace;
 
 
 #define NETDEV_HASHBITS    8
@@ -101,6 +102,7 @@ struct net {
 	struct netns_xfrm	xfrm;
 #endif
 	struct netns_ipvs	*ipvs;
+	struct user_namespace	*user_ns;
 };
 
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index e274577..752b477 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -95,6 +95,8 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		err = PTR_ERR(new_nsp->net_ns);
 		goto out_net;
 	}
+	if (flags & CLONE_NEWNET)
+		new_nsp->net_ns->user_ns = get_user_ns(task_cred_xxx(tsk, user_ns));
 
 	return new_nsp;
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6f6698d..5ca95cc 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -10,6 +10,7 @@
 #include <linux/nsproxy.h>
 #include <linux/proc_fs.h>
 #include <linux/file.h>
+#include <linux/user_namespace.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
@@ -209,6 +210,7 @@ static void net_free(struct net *net)
 	}
 #endif
 	kfree(net->gen);
+	put_user_ns(net->user_ns);
 	kmem_cache_free(net_cachep, net);
 }
 
@@ -389,6 +391,7 @@ static int __init net_ns_init(void)
 	rcu_assign_pointer(init_net.gen, ng);
 
 	mutex_lock(&net_mutex);
+	init_net.user_ns = &init_user_ns;
 	if (setup_net(&init_net))
 		panic("Could not setup the initial network namespace");
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 06/15] user namespace: make each net (net_ns) belong to a user_ns
@ 2011-09-02 19:56     ` Serge Hallyn
  0 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

This way we can target capabilities at the user_ns which created the
net ns.

Changelog:
   jul 8: nsproxy: don't assign netns->userns if not cloning.

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 include/net/net_namespace.h |    2 ++
 kernel/nsproxy.c            |    2 ++
 net/core/net_namespace.c    |    3 +++
 3 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 3bb6fa0..d91fe5f 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -29,6 +29,7 @@ struct ctl_table_header;
 struct net_generic;
 struct sock;
 struct netns_ipvs;
+struct user_namespace;
 
 
 #define NETDEV_HASHBITS    8
@@ -101,6 +102,7 @@ struct net {
 	struct netns_xfrm	xfrm;
 #endif
 	struct netns_ipvs	*ipvs;
+	struct user_namespace	*user_ns;
 };
 
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index e274577..752b477 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -95,6 +95,8 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		err = PTR_ERR(new_nsp->net_ns);
 		goto out_net;
 	}
+	if (flags & CLONE_NEWNET)
+		new_nsp->net_ns->user_ns = get_user_ns(task_cred_xxx(tsk, user_ns));
 
 	return new_nsp;
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6f6698d..5ca95cc 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -10,6 +10,7 @@
 #include <linux/nsproxy.h>
 #include <linux/proc_fs.h>
 #include <linux/file.h>
+#include <linux/user_namespace.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
@@ -209,6 +210,7 @@ static void net_free(struct net *net)
 	}
 #endif
 	kfree(net->gen);
+	put_user_ns(net->user_ns);
 	kmem_cache_free(net_cachep, net);
 }
 
@@ -389,6 +391,7 @@ static int __init net_ns_init(void)
 	rcu_assign_pointer(init_net.gen, ng);
 
 	mutex_lock(&net_mutex);
+	init_net.user_ns = &init_user_ns;
 	if (setup_net(&init_net))
 		panic("Could not setup the initial network namespace");
 
-- 
1.7.5.4
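
Illustrative sketch, not part of the patch: the ownership rule set up
here (a net namespace pins the user namespace of the task that created
it, and drops that reference in net_free()) can be modelled in plain
userspace C. The toy_ structures and functions are hypothetical
stand-ins for struct net, struct user_namespace, get_user_ns() and
put_user_ns(); the counter is a plain int rather than a kref, and no
locking is modelled.

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for struct user_namespace. */
struct toy_user_ns {
	int refcount;
};

/* Hypothetical stand-in for struct net; only the new field is modelled. */
struct toy_net {
	struct toy_user_ns *user_ns;	/* owning user namespace, pinned */
};

/* Models get_user_ns(): take a reference, hand back the same pointer. */
struct toy_user_ns *toy_get_user_ns(struct toy_user_ns *ns)
{
	if (ns)
		ns->refcount++;
	return ns;
}

/* Models put_user_ns(): drop one reference. */
void toy_put_user_ns(struct toy_user_ns *ns)
{
	if (ns)
		ns->refcount--;
}

/* Models the CLONE_NEWNET branch added to create_new_namespaces(). */
void toy_net_create(struct toy_net *net, struct toy_user_ns *creator_ns)
{
	net->user_ns = toy_get_user_ns(creator_ns);
}

/* Models the put_user_ns() call added to net_free(). */
void toy_net_free(struct toy_net *net)
{
	toy_put_user_ns(net->user_ns);
	net->user_ns = NULL;
}
```

Creating a toy_net against a namespace whose count is 1 raises it to 2,
and toy_net_free() brings it back to 1, which is the get/put pairing the
nsproxy.c and net_namespace.c hunks establish.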


* [PATCH 07/15] user namespace: use net->user_ns for some capable calls under net/
       [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2011-09-02 19:56     ` Serge Hallyn
@ 2011-09-02 19:56   ` Serge Hallyn
  2011-09-02 19:56   ` [PATCH 08/15] af_netlink.c: make netlink_capable userns-aware Serge Hallyn
                     ` (7 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w
  Cc: Serge Hallyn

From: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

This is a partial conversion that shows how the net->user_ns field added
by the previous patch is expected to be used.

Changelog:
  6/28/11: fix typo in net/core/sock.c
  7/08/11: don't target capability which authorizes module loading

Signed-off-by: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 net/core/dev.c  |    4 ++--
 net/core/sock.c |   14 ++++++++------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 17d67b5..6ae955f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5014,7 +5014,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	case SIOCGMIIPHY:
 	case SIOCGMIIREG:
 	case SIOCSIFNAME:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 		dev_load(net, ifr.ifr_name);
 		rtnl_lock();
@@ -5053,7 +5053,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	case SIOCBRADDIF:
 	case SIOCBRDELIF:
 	case SIOCSHWTSTAMP:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 		/* fall through */
 	case SIOCBONDSLAVEINFOQUERY:
diff --git a/net/core/sock.c b/net/core/sock.c
index bc745d0..0f31675 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -420,7 +420,7 @@ static int sock_bindtodevice(struct sock *sk, char __user *optval, int optlen)
 
 	/* Sorry... */
 	ret = -EPERM;
-	if (!capable(CAP_NET_RAW))
+	if (!ns_capable(net->user_ns, CAP_NET_RAW))
 		goto out;
 
 	ret = -EINVAL;
@@ -488,6 +488,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 	int valbool;
 	struct linger ling;
 	int ret = 0;
+	struct net *net = sock_net(sk);
 
 	/*
 	 *	Options without arguments
@@ -508,7 +509,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 
 	switch (optname) {
 	case SO_DEBUG:
-		if (val && !capable(CAP_NET_ADMIN))
+		if (val && !ns_capable(net->user_ns, CAP_NET_ADMIN))
 			ret = -EACCES;
 		else
 			sock_valbool_flag(sk, SOCK_DBG, valbool);
@@ -551,7 +552,7 @@ set_sndbuf:
 		break;
 
 	case SO_SNDBUFFORCE:
-		if (!capable(CAP_NET_ADMIN)) {
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) {
 			ret = -EPERM;
 			break;
 		}
@@ -589,7 +590,7 @@ set_rcvbuf:
 		break;
 
 	case SO_RCVBUFFORCE:
-		if (!capable(CAP_NET_ADMIN)) {
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) {
 			ret = -EPERM;
 			break;
 		}
@@ -612,7 +613,8 @@ set_rcvbuf:
 		break;
 
 	case SO_PRIORITY:
-		if ((val >= 0 && val <= 6) || capable(CAP_NET_ADMIN))
+		if ((val >= 0 && val <= 6) ||
+		     ns_capable(net->user_ns, CAP_NET_ADMIN))
 			sk->sk_priority = val;
 		else
 			ret = -EPERM;
@@ -729,7 +731,7 @@ set_rcvbuf:
 			clear_bit(SOCK_PASSSEC, &sock->flags);
 		break;
 	case SO_MARK:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			ret = -EPERM;
 		else
 			sk->sk_mark = val;
-- 
1.7.5.4


* [PATCH 07/15] user namespace: use net->user_ns for some capable calls under net/
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
                   ` (2 preceding siblings ...)
  2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn
@ 2011-09-02 19:56 ` Serge Hallyn
  2011-09-02 19:56 ` [PATCH 08/15] af_netlink.c: make netlink_capable userns-aware Serge Hallyn
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge Hallyn

From: Serge Hallyn <serge.hallyn@ubuntu.com>

This is a partial conversion that shows how the net->user_ns field added
by the previous patch is expected to be used.

Changelog:
  6/28/11: fix typo in net/core/sock.c
  7/08/11: don't target capability which authorizes module loading

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 net/core/dev.c  |    4 ++--
 net/core/sock.c |   14 ++++++++------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 17d67b5..6ae955f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5014,7 +5014,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	case SIOCGMIIPHY:
 	case SIOCGMIIREG:
 	case SIOCSIFNAME:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 		dev_load(net, ifr.ifr_name);
 		rtnl_lock();
@@ -5053,7 +5053,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	case SIOCBRADDIF:
 	case SIOCBRDELIF:
 	case SIOCSHWTSTAMP:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 		/* fall through */
 	case SIOCBONDSLAVEINFOQUERY:
diff --git a/net/core/sock.c b/net/core/sock.c
index bc745d0..0f31675 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -420,7 +420,7 @@ static int sock_bindtodevice(struct sock *sk, char __user *optval, int optlen)
 
 	/* Sorry... */
 	ret = -EPERM;
-	if (!capable(CAP_NET_RAW))
+	if (!ns_capable(net->user_ns, CAP_NET_RAW))
 		goto out;
 
 	ret = -EINVAL;
@@ -488,6 +488,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 	int valbool;
 	struct linger ling;
 	int ret = 0;
+	struct net *net = sock_net(sk);
 
 	/*
 	 *	Options without arguments
@@ -508,7 +509,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 
 	switch (optname) {
 	case SO_DEBUG:
-		if (val && !capable(CAP_NET_ADMIN))
+		if (val && !ns_capable(net->user_ns, CAP_NET_ADMIN))
 			ret = -EACCES;
 		else
 			sock_valbool_flag(sk, SOCK_DBG, valbool);
@@ -551,7 +552,7 @@ set_sndbuf:
 		break;
 
 	case SO_SNDBUFFORCE:
-		if (!capable(CAP_NET_ADMIN)) {
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) {
 			ret = -EPERM;
 			break;
 		}
@@ -589,7 +590,7 @@ set_rcvbuf:
 		break;
 
 	case SO_RCVBUFFORCE:
-		if (!capable(CAP_NET_ADMIN)) {
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) {
 			ret = -EPERM;
 			break;
 		}
@@ -612,7 +613,8 @@ set_rcvbuf:
 		break;
 
 	case SO_PRIORITY:
-		if ((val >= 0 && val <= 6) || capable(CAP_NET_ADMIN))
+		if ((val >= 0 && val <= 6) ||
+		     ns_capable(net->user_ns, CAP_NET_ADMIN))
 			sk->sk_priority = val;
 		else
 			ret = -EPERM;
@@ -729,7 +731,7 @@ set_rcvbuf:
 			clear_bit(SOCK_PASSSEC, &sock->flags);
 		break;
 	case SO_MARK:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			ret = -EPERM;
 		else
 			sk->sk_mark = val;
-- 
1.7.5.4
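
Illustrative sketch, not part of the patch: the difference between
capable() and ns_capable(net->user_ns, ...) can be modelled in
userspace. The model assumes the targeted-capability rule as understood
for kernels of this period: the creator of a non-initial user namespace
holds every capability in it, a task is checked against its effective
set when the target is its own namespace, and a capability held in an
ancestor namespace also applies to its descendants. All toy_ names and
the creator_uid shortcut are hypothetical; this is not the kernel's
cap_capable().

```c
#include <stdbool.h>
#include <stddef.h>

/* Capability numbers as used by Linux; the TOY_ prefix marks the model. */
#define TOY_CAP_NET_ADMIN 12
#define TOY_CAP_NET_RAW   13

/* Hypothetical stand-in for struct user_namespace. */
struct toy_user_ns {
	struct toy_user_ns *parent;	/* NULL for the initial namespace */
	int creator_uid;		/* uid, in the parent, that created it */
};

/* Hypothetical stand-in for the caller's credentials. */
struct toy_cred {
	struct toy_user_ns *user_ns;	/* namespace the task lives in */
	int uid;
	unsigned long cap_effective;	/* bitmask of effective capabilities */
};

/* Does cred hold cap with respect to the target user namespace? */
bool toy_ns_capable(const struct toy_cred *cred,
		    const struct toy_user_ns *target, int cap)
{
	for (; target; target = target->parent) {
		/* The creator of a non-initial namespace has all caps in it. */
		if (target->parent && cred->user_ns == target->parent &&
		    cred->uid == target->creator_uid)
			return true;

		/* Same namespace: the effective set decides. */
		if (cred->user_ns == target)
			return (cred->cap_effective >> cap) & 1UL;

		/* Otherwise retry one level up: ancestors cover descendants. */
	}
	return false;
}
```

With a child namespace created under the initial one, root inside the
child passes the check against the child but fails it against the
initial namespace. That is the effect of replacing capable() with
ns_capable(net->user_ns, ...) in the hunks above: container root can
administer a net namespace its user namespace owns, but not the host's.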



* [PATCH 08/15] af_netlink.c: make netlink_capable userns-aware
       [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2011-09-02 19:56   ` [PATCH 07/15] user namespace: use net->user_ns for some capable calls under net/ Serge Hallyn
@ 2011-09-02 19:56   ` Serge Hallyn
  2011-09-02 19:56   ` [PATCH 09/15] user ns: convert ipv6 to targeted capabilities Serge Hallyn
                     ` (6 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w
  Cc: Eric Dumazet

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

netlink_capable() should check for CAP_NET_ADMIN against the user
namespace that owns the network namespace of the socket in question.

Changelog:
  Per Eric Dumazet's advice, use sock_net(sk) instead of #ifdef.

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 net/netlink/af_netlink.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 0a4db02..3cc0bbe 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -580,8 +580,9 @@ retry:
 
 static inline int netlink_capable(struct socket *sock, unsigned int flag)
 {
-	return (nl_table[sock->sk->sk_protocol].nl_nonroot & flag) ||
-	       capable(CAP_NET_ADMIN);
+	if (nl_table[sock->sk->sk_protocol].nl_nonroot & flag)
+		return 1;
+	return ns_capable(sock_net(sock->sk)->user_ns, CAP_NET_ADMIN);
 }
 
 static void
-- 
1.7.5.4


* [PATCH 08/15] af_netlink.c: make netlink_capable userns-aware
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
                   ` (3 preceding siblings ...)
  2011-09-02 19:56 ` [PATCH 07/15] user namespace: use net->user_ns for some capable calls under net/ Serge Hallyn
@ 2011-09-02 19:56 ` Serge Hallyn
       [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge E. Hallyn, Eric Dumazet

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

netlink_capable() should check for CAP_NET_ADMIN against the user
namespace that owns the network namespace of the socket in question.

Changelog:
  Per Eric Dumazet's advice, use sock_net(sk) instead of #ifdef.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/netlink/af_netlink.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 0a4db02..3cc0bbe 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -580,8 +580,9 @@ retry:
 
 static inline int netlink_capable(struct socket *sock, unsigned int flag)
 {
-	return (nl_table[sock->sk->sk_protocol].nl_nonroot & flag) ||
-	       capable(CAP_NET_ADMIN);
+	if (nl_table[sock->sk->sk_protocol].nl_nonroot & flag)
+		return 1;
+	return ns_capable(sock_net(sock->sk)->user_ns, CAP_NET_ADMIN);
 }
 
 static void
-- 
1.7.5.4
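
Illustrative sketch, not part of the patch: the order of the two tests
in the rewritten netlink_capable() can be modelled in userspace. The
toy_ names and flag values are hypothetical, and the result of the
ns_capable() call is passed in as a plain bool so the example stays
self-contained.

```c
#include <stdbool.h>

/* Hypothetical per-protocol flags, modelled on the nl_nonroot bits. */
#define TOY_NL_NONROOT_RECV 0x1
#define TOY_NL_NONROOT_SEND 0x2

/* Hypothetical stand-in for one nl_table[] entry. */
struct toy_nl_proto {
	unsigned int nl_nonroot;	/* operations open to unprivileged users */
};

/*
 * Models netlink_capable(): a protocol that opts the operation out of
 * privilege checks short-circuits to success; otherwise the answer is
 * whatever the targeted CAP_NET_ADMIN check against the owning user
 * namespace returned (supplied here by the caller).
 */
bool toy_netlink_capable(const struct toy_nl_proto *proto, unsigned int flag,
			 bool net_admin_in_owner_ns)
{
	if (proto->nl_nonroot & flag)
		return true;
	return net_admin_in_owner_ns;
}
```

A protocol that sets only the send bit lets any caller send, while
receiving on it still depends on the capability check.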



* [PATCH 09/15] user ns: convert ipv6 to targeted capabilities
       [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
                     ` (8 preceding siblings ...)
  2011-09-02 19:56   ` [PATCH 08/15] af_netlink.c: make netlink_capable userns-aware Serge Hallyn
@ 2011-09-02 19:56   ` Serge Hallyn
  2011-09-02 19:56   ` [PATCH 10/15] net/core/scm.c: target capable() calls to user_ns owning the net_ns Serge Hallyn
                     ` (5 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 net/ipv6/addrconf.c             |    4 ++--
 net/ipv6/af_inet6.c             |    6 ++++--
 net/ipv6/datagram.c             |    6 +++---
 net/ipv6/ip6_flowlabel.c        |   24 ++++++++++++++----------
 net/ipv6/ip6_tunnel.c           |    4 ++--
 net/ipv6/ip6mr.c                |    2 +-
 net/ipv6/ipv6_sockglue.c        |    7 ++++---
 net/ipv6/netfilter/ip6_tables.c |    8 ++++----
 net/ipv6/route.c                |    2 +-
 net/ipv6/sit.c                  |   10 +++++-----
 10 files changed, 40 insertions(+), 33 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f012ebd..871e5cf 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2230,7 +2230,7 @@ int addrconf_add_ifaddr(struct net *net, void __user *arg)
 	struct in6_ifreq ireq;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (copy_from_user(&ireq, arg, sizeof(struct in6_ifreq)))
@@ -2249,7 +2249,7 @@ int addrconf_del_ifaddr(struct net *net, void __user *arg)
 	struct in6_ifreq ireq;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (copy_from_user(&ireq, arg, sizeof(struct in6_ifreq)))
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 3b5669a..1854ffe 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -160,7 +160,8 @@ lookup_protocol:
 	}
 
 	err = -EPERM;
-	if (sock->type == SOCK_RAW && !kern && !capable(CAP_NET_RAW))
+	if (sock->type == SOCK_RAW && !kern &&
+	    !ns_capable(net->user_ns, CAP_NET_RAW))
 		goto out_rcu_unlock;
 
 	sock->ops = answer->ops;
@@ -281,7 +282,8 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		return -EINVAL;
 
 	snum = ntohs(addr->sin6_port);
-	if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
+	if (snum && snum < PROT_SOCK &&
+	    !ns_capable(sock_net(sk)->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
 	lock_sock(sk);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 9ef1831..33b1b0f 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -701,7 +701,7 @@ int datagram_send_ctl(struct net *net,
 				err = -EINVAL;
 				goto exit_f;
 			}
-			if (!capable(CAP_NET_RAW)) {
+			if (!ns_capable(net->user_ns, CAP_NET_RAW)) {
 				err = -EPERM;
 				goto exit_f;
 			}
@@ -721,7 +721,7 @@ int datagram_send_ctl(struct net *net,
 				err = -EINVAL;
 				goto exit_f;
 			}
-			if (!capable(CAP_NET_RAW)) {
+			if (!ns_capable(net->user_ns, CAP_NET_RAW)) {
 				err = -EPERM;
 				goto exit_f;
 			}
@@ -746,7 +746,7 @@ int datagram_send_ctl(struct net *net,
 				err = -EINVAL;
 				goto exit_f;
 			}
-			if (!capable(CAP_NET_RAW)) {
+			if (!ns_capable(net->user_ns, CAP_NET_RAW)) {
 				err = -EPERM;
 				goto exit_f;
 			}
diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
index f3caf1b..4726c02 100644
--- a/net/ipv6/ip6_flowlabel.c
+++ b/net/ipv6/ip6_flowlabel.c
@@ -294,21 +294,22 @@ struct ipv6_txoptions *fl6_merge_options(struct ipv6_txoptions * opt_space,
 	return opt_space;
 }
 
-static unsigned long check_linger(unsigned long ttl)
+static unsigned long check_linger(unsigned long ttl, struct user_namespace *ns)
 {
 	if (ttl < FL_MIN_LINGER)
 		return FL_MIN_LINGER*HZ;
-	if (ttl > FL_MAX_LINGER && !capable(CAP_NET_ADMIN))
+	if (ttl > FL_MAX_LINGER && !ns_capable(ns, CAP_NET_ADMIN))
 		return 0;
 	return ttl*HZ;
 }
 
-static int fl6_renew(struct ip6_flowlabel *fl, unsigned long linger, unsigned long expires)
+static int fl6_renew(struct ip6_flowlabel *fl, unsigned long linger,
+		     unsigned long expires, struct user_namespace *ns)
 {
-	linger = check_linger(linger);
+	linger = check_linger(linger, ns);
 	if (!linger)
 		return -EPERM;
-	expires = check_linger(expires);
+	expires = check_linger(expires, ns);
 	if (!expires)
 		return -EPERM;
 	fl->lastuse = jiffies;
@@ -375,7 +376,7 @@ fl_create(struct net *net, struct in6_flowlabel_req *freq, char __user *optval,
 
 	fl->fl_net = hold_net(net);
 	fl->expires = jiffies;
-	err = fl6_renew(fl, freq->flr_linger, freq->flr_expires);
+	err = fl6_renew(fl, freq->flr_linger, freq->flr_expires, net->user_ns);
 	if (err)
 		goto done;
 	fl->share = freq->flr_share;
@@ -425,7 +426,7 @@ static int mem_check(struct sock *sk)
 	if (room <= 0 ||
 	    ((count >= FL_MAX_PER_SOCK ||
 	      (count > 0 && room < FL_MAX_SIZE/2) || room < FL_MAX_SIZE/4) &&
-	     !capable(CAP_NET_ADMIN)))
+	     !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
 		return -ENOBUFS;
 
 	return 0;
@@ -507,17 +508,20 @@ int ipv6_flowlabel_opt(struct sock *sk, char __user *optval, int optlen)
 		read_lock_bh(&ip6_sk_fl_lock);
 		for (sfl = np->ipv6_fl_list; sfl; sfl = sfl->next) {
 			if (sfl->fl->label == freq.flr_label) {
-				err = fl6_renew(sfl->fl, freq.flr_linger, freq.flr_expires);
+				err = fl6_renew(sfl->fl, freq.flr_linger, freq.flr_expires,
+						net->user_ns);
 				read_unlock_bh(&ip6_sk_fl_lock);
 				return err;
 			}
 		}
 		read_unlock_bh(&ip6_sk_fl_lock);
 
-		if (freq.flr_share == IPV6_FL_S_NONE && capable(CAP_NET_ADMIN)) {
+		if (freq.flr_share == IPV6_FL_S_NONE &&
+		    ns_capable(net->user_ns, CAP_NET_ADMIN)) {
 			fl = fl_lookup(net, freq.flr_label);
 			if (fl) {
-				err = fl6_renew(fl, freq.flr_linger, freq.flr_expires);
+				err = fl6_renew(fl, freq.flr_linger, freq.flr_expires,
+						net->user_ns);
 				fl_release(fl);
 				return err;
 			}
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 0bc9888..c430d69 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1269,7 +1269,7 @@ ip6_tnl_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCADDTUNNEL:
 	case SIOCCHGTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			break;
 		err = -EFAULT;
 		if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof (p)))
@@ -1304,7 +1304,7 @@ ip6_tnl_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 		break;
 	case SIOCDELTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			break;
 
 		if (dev == ip6n->fb_tnl_dev) {
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 705c828..1649ccd 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -1582,7 +1582,7 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, uns
 		return -ENOENT;
 
 	if (optname != MRT6_INIT) {
-		if (sk != mrt->mroute6_sk && !capable(CAP_NET_ADMIN))
+		if (sk != mrt->mroute6_sk && !ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EACCES;
 	}
 
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 147ede38..485e181 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -343,7 +343,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 		break;
 
 	case IPV6_TRANSPARENT:
-		if (!capable(CAP_NET_ADMIN)) {
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) {
 			retv = -EPERM;
 			break;
 		}
@@ -381,7 +381,8 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 
 		/* hop-by-hop / destination options are privileged option */
 		retv = -EPERM;
-		if (optname != IPV6_RTHDR && !capable(CAP_NET_RAW))
+		if (optname != IPV6_RTHDR &&
+		    !ns_capable(net->user_ns, CAP_NET_RAW))
 			break;
 
 		opt = ipv6_renew_options(sk, np->opt, optname,
@@ -725,7 +726,7 @@ done:
 	case IPV6_IPSEC_POLICY:
 	case IPV6_XFRM_POLICY:
 		retv = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		retv = xfrm_user_policy(sk, optname, optval, optlen);
 		break;
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 94874b0..7fce7d8 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -1869,7 +1869,7 @@ compat_do_ip6t_set_ctl(struct sock *sk, int cmd, void __user *user,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1984,7 +1984,7 @@ compat_do_ip6t_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -2006,7 +2006,7 @@ do_ip6t_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -2031,7 +2031,7 @@ do_ip6t_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 9e69eb0..f00c18d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1938,7 +1938,7 @@ int ipv6_route_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	switch(cmd) {
 	case SIOCADDRT:		/* Add a route */
 	case SIOCDELRT:		/* Delete a route */
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 		err = copy_from_user(&rtmsg, arg,
 				     sizeof(struct in6_rtmsg));
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 00b15ac..7438711 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -308,7 +308,7 @@ static int ipip6_tunnel_get_prl(struct ip_tunnel *t,
 	/* For simple GET or for root users,
 	 * we try harder to allocate.
 	 */
-	kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ?
+	kp = (cmax <= 1 || ns_capable(dev_net(t->dev)->user_ns, CAP_NET_ADMIN)) ?
 		kcalloc(cmax, sizeof(*kp), GFP_KERNEL) :
 		NULL;
 
@@ -929,7 +929,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCADDTUNNEL:
 	case SIOCCHGTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		err = -EFAULT;
@@ -988,7 +988,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 
 	case SIOCDELTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		if (dev == sitn->fb_tunnel_dev) {
@@ -1021,7 +1021,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCDELPRL:
 	case SIOCCHGPRL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			goto done;
 		err = -EINVAL;
 		if (dev == sitn->fb_tunnel_dev)
@@ -1050,7 +1050,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCCHG6RD:
 	case SIOCDEL6RD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		err = -EFAULT;
-- 
1.7.5.4
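
Illustrative sketch, not part of the patch: the ip6_flowlabel.c hunks
thread a user namespace through check_linger() and fl6_renew() so the
CAP_NET_ADMIN override for long lifetimes is judged against the owner of
the net namespace. A userspace model of that control flow follows. The
toy_ names are hypothetical, the limits of 6 and 60 seconds and the tick
rate of 100 are assumptions chosen for the example, and the capability
check is reduced to a bool argument.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumed example values; the kernel's own constants may differ. */
#define TOY_FL_MIN_LINGER 6	/* seconds */
#define TOY_FL_MAX_LINGER 60	/* seconds */
#define TOY_HZ            100	/* ticks per second */

/*
 * Models check_linger(): clamp short lifetimes up to the minimum,
 * refuse over-long ones (return 0) unless the caller has CAP_NET_ADMIN
 * in the relevant user namespace, and convert seconds to ticks.
 */
unsigned long toy_check_linger(unsigned long ttl, bool net_admin_in_ns)
{
	if (ttl < TOY_FL_MIN_LINGER)
		return TOY_FL_MIN_LINGER * TOY_HZ;
	if (ttl > TOY_FL_MAX_LINGER && !net_admin_in_ns)
		return 0;
	return ttl * TOY_HZ;
}

/* Models the permission part of fl6_renew(): both values must pass. */
int toy_fl6_renew(unsigned long linger, unsigned long expires,
		  bool net_admin_in_ns)
{
	if (!toy_check_linger(linger, net_admin_in_ns))
		return -EPERM;
	if (!toy_check_linger(expires, net_admin_in_ns))
		return -EPERM;
	return 0;
}
```

Passing the namespace down as a parameter, rather than looking it up
inside check_linger(), keeps the helper free of any socket or net
pointer; each caller supplies the namespace that matches its context.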


* [PATCH 09/15] user ns: convert ipv6 to targeted capabilities
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
                   ` (5 preceding siblings ...)
       [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
@ 2011-09-02 19:56 ` Serge Hallyn
  2011-09-02 19:56 ` [PATCH 10/15] net/core/scm.c: target capable() calls to user_ns owning the net_ns Serge Hallyn
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge E. Hallyn

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 net/ipv6/addrconf.c             |    4 ++--
 net/ipv6/af_inet6.c             |    6 ++++--
 net/ipv6/datagram.c             |    6 +++---
 net/ipv6/ip6_flowlabel.c        |   24 ++++++++++++++----------
 net/ipv6/ip6_tunnel.c           |    4 ++--
 net/ipv6/ip6mr.c                |    2 +-
 net/ipv6/ipv6_sockglue.c        |    7 ++++---
 net/ipv6/netfilter/ip6_tables.c |    8 ++++----
 net/ipv6/route.c                |    2 +-
 net/ipv6/sit.c                  |   10 +++++-----
 10 files changed, 40 insertions(+), 33 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f012ebd..871e5cf 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2230,7 +2230,7 @@ int addrconf_add_ifaddr(struct net *net, void __user *arg)
 	struct in6_ifreq ireq;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (copy_from_user(&ireq, arg, sizeof(struct in6_ifreq)))
@@ -2249,7 +2249,7 @@ int addrconf_del_ifaddr(struct net *net, void __user *arg)
 	struct in6_ifreq ireq;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (copy_from_user(&ireq, arg, sizeof(struct in6_ifreq)))
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 3b5669a..1854ffe 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -160,7 +160,8 @@ lookup_protocol:
 	}
 
 	err = -EPERM;
-	if (sock->type == SOCK_RAW && !kern && !capable(CAP_NET_RAW))
+	if (sock->type == SOCK_RAW && !kern &&
+	    !ns_capable(net->user_ns, CAP_NET_RAW))
 		goto out_rcu_unlock;
 
 	sock->ops = answer->ops;
@@ -281,7 +282,8 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		return -EINVAL;
 
 	snum = ntohs(addr->sin6_port);
-	if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
+	if (snum && snum < PROT_SOCK &&
+	    !ns_capable(sock_net(sk)->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
 	lock_sock(sk);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 9ef1831..33b1b0f 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -701,7 +701,7 @@ int datagram_send_ctl(struct net *net,
 				err = -EINVAL;
 				goto exit_f;
 			}
-			if (!capable(CAP_NET_RAW)) {
+			if (!ns_capable(net->user_ns, CAP_NET_RAW)) {
 				err = -EPERM;
 				goto exit_f;
 			}
@@ -721,7 +721,7 @@ int datagram_send_ctl(struct net *net,
 				err = -EINVAL;
 				goto exit_f;
 			}
-			if (!capable(CAP_NET_RAW)) {
+			if (!ns_capable(net->user_ns, CAP_NET_RAW)) {
 				err = -EPERM;
 				goto exit_f;
 			}
@@ -746,7 +746,7 @@ int datagram_send_ctl(struct net *net,
 				err = -EINVAL;
 				goto exit_f;
 			}
-			if (!capable(CAP_NET_RAW)) {
+			if (!ns_capable(net->user_ns, CAP_NET_RAW)) {
 				err = -EPERM;
 				goto exit_f;
 			}
diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
index f3caf1b..4726c02 100644
--- a/net/ipv6/ip6_flowlabel.c
+++ b/net/ipv6/ip6_flowlabel.c
@@ -294,21 +294,22 @@ struct ipv6_txoptions *fl6_merge_options(struct ipv6_txoptions * opt_space,
 	return opt_space;
 }
 
-static unsigned long check_linger(unsigned long ttl)
+static unsigned long check_linger(unsigned long ttl, struct user_namespace *ns)
 {
 	if (ttl < FL_MIN_LINGER)
 		return FL_MIN_LINGER*HZ;
-	if (ttl > FL_MAX_LINGER && !capable(CAP_NET_ADMIN))
+	if (ttl > FL_MAX_LINGER && !ns_capable(ns, CAP_NET_ADMIN))
 		return 0;
 	return ttl*HZ;
 }
 
-static int fl6_renew(struct ip6_flowlabel *fl, unsigned long linger, unsigned long expires)
+static int fl6_renew(struct ip6_flowlabel *fl, unsigned long linger,
+		     unsigned long expires, struct user_namespace *ns)
 {
-	linger = check_linger(linger);
+	linger = check_linger(linger, ns);
 	if (!linger)
 		return -EPERM;
-	expires = check_linger(expires);
+	expires = check_linger(expires, ns);
 	if (!expires)
 		return -EPERM;
 	fl->lastuse = jiffies;
@@ -375,7 +376,7 @@ fl_create(struct net *net, struct in6_flowlabel_req *freq, char __user *optval,
 
 	fl->fl_net = hold_net(net);
 	fl->expires = jiffies;
-	err = fl6_renew(fl, freq->flr_linger, freq->flr_expires);
+	err = fl6_renew(fl, freq->flr_linger, freq->flr_expires, net->user_ns);
 	if (err)
 		goto done;
 	fl->share = freq->flr_share;
@@ -425,7 +426,7 @@ static int mem_check(struct sock *sk)
 	if (room <= 0 ||
 	    ((count >= FL_MAX_PER_SOCK ||
 	      (count > 0 && room < FL_MAX_SIZE/2) || room < FL_MAX_SIZE/4) &&
-	     !capable(CAP_NET_ADMIN)))
+	     !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
 		return -ENOBUFS;
 
 	return 0;
@@ -507,17 +508,20 @@ int ipv6_flowlabel_opt(struct sock *sk, char __user *optval, int optlen)
 		read_lock_bh(&ip6_sk_fl_lock);
 		for (sfl = np->ipv6_fl_list; sfl; sfl = sfl->next) {
 			if (sfl->fl->label == freq.flr_label) {
-				err = fl6_renew(sfl->fl, freq.flr_linger, freq.flr_expires);
+				err = fl6_renew(sfl->fl, freq.flr_linger, freq.flr_expires,
+						net->user_ns);
 				read_unlock_bh(&ip6_sk_fl_lock);
 				return err;
 			}
 		}
 		read_unlock_bh(&ip6_sk_fl_lock);
 
-		if (freq.flr_share == IPV6_FL_S_NONE && capable(CAP_NET_ADMIN)) {
+		if (freq.flr_share == IPV6_FL_S_NONE &&
+		    ns_capable(net->user_ns, CAP_NET_ADMIN)) {
 			fl = fl_lookup(net, freq.flr_label);
 			if (fl) {
-				err = fl6_renew(fl, freq.flr_linger, freq.flr_expires);
+				err = fl6_renew(fl, freq.flr_linger, freq.flr_expires,
+						net->user_ns);
 				fl_release(fl);
 				return err;
 			}
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 0bc9888..c430d69 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1269,7 +1269,7 @@ ip6_tnl_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCADDTUNNEL:
 	case SIOCCHGTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			break;
 		err = -EFAULT;
 		if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof (p)))
@@ -1304,7 +1304,7 @@ ip6_tnl_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 		break;
 	case SIOCDELTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			break;
 
 		if (dev == ip6n->fb_tnl_dev) {
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 705c828..1649ccd 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -1582,7 +1582,7 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, uns
 		return -ENOENT;
 
 	if (optname != MRT6_INIT) {
-		if (sk != mrt->mroute6_sk && !capable(CAP_NET_ADMIN))
+		if (sk != mrt->mroute6_sk && !ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EACCES;
 	}
 
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 147ede38..485e181 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -343,7 +343,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 		break;
 
 	case IPV6_TRANSPARENT:
-		if (!capable(CAP_NET_ADMIN)) {
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) {
 			retv = -EPERM;
 			break;
 		}
@@ -381,7 +381,8 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 
 		/* hop-by-hop / destination options are privileged option */
 		retv = -EPERM;
-		if (optname != IPV6_RTHDR && !capable(CAP_NET_RAW))
+		if (optname != IPV6_RTHDR &&
+		    !ns_capable(net->user_ns, CAP_NET_RAW))
 			break;
 
 		opt = ipv6_renew_options(sk, np->opt, optname,
@@ -725,7 +726,7 @@ done:
 	case IPV6_IPSEC_POLICY:
 	case IPV6_XFRM_POLICY:
 		retv = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		retv = xfrm_user_policy(sk, optname, optval, optlen);
 		break;
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 94874b0..7fce7d8 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -1869,7 +1869,7 @@ compat_do_ip6t_set_ctl(struct sock *sk, int cmd, void __user *user,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1984,7 +1984,7 @@ compat_do_ip6t_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -2006,7 +2006,7 @@ do_ip6t_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -2031,7 +2031,7 @@ do_ip6t_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 9e69eb0..f00c18d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1938,7 +1938,7 @@ int ipv6_route_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	switch(cmd) {
 	case SIOCADDRT:		/* Add a route */
 	case SIOCDELRT:		/* Delete a route */
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 		err = copy_from_user(&rtmsg, arg,
 				     sizeof(struct in6_rtmsg));
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 00b15ac..7438711 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -308,7 +308,7 @@ static int ipip6_tunnel_get_prl(struct ip_tunnel *t,
 	/* For simple GET or for root users,
 	 * we try harder to allocate.
 	 */
-	kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ?
+	kp = (cmax <= 1 || ns_capable(dev_net(t->dev)->user_ns, CAP_NET_ADMIN)) ?
 		kcalloc(cmax, sizeof(*kp), GFP_KERNEL) :
 		NULL;
 
@@ -929,7 +929,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCADDTUNNEL:
 	case SIOCCHGTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		err = -EFAULT;
@@ -988,7 +988,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 
 	case SIOCDELTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		if (dev == sitn->fb_tunnel_dev) {
@@ -1021,7 +1021,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCDELPRL:
 	case SIOCCHGPRL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			goto done;
 		err = -EINVAL;
 		if (dev == sitn->fb_tunnel_dev)
@@ -1050,7 +1050,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCCHG6RD:
 	case SIOCDEL6RD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		err = -EFAULT;
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 10/15] net/core/scm.c: target capable() calls to user_ns owning the net_ns
       [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
                     ` (9 preceding siblings ...)
  2011-09-02 19:56   ` [PATCH 09/15] user ns: convert ipv6 to targeted capabilities Serge Hallyn
@ 2011-09-02 19:56   ` Serge Hallyn
  2011-09-02 19:56     ` Serge Hallyn
                     ` (4 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

The uid/gid comparisons don't have to be pulled out into separate
helpers.  Doing so just seemed easier to prove correct.

Changelog:
   mark struct cred arg const

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
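For readers following along, the uidequiv() logic in the diff below can
be modelled in userspace.  This is only a sketch under stated
assumptions: struct model_userns, struct model_cred, model_uidequiv()
and the has_cap_setuid flag are made-up stand-ins (the flag plays the
role of ns_capable(ns, CAP_SETUID) succeeding), not kernel code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct user_namespace; only identity matters. */
struct model_userns {
	int id;
};

/* Hypothetical stand-in for the sender's struct cred. */
struct model_cred {
	const struct model_userns *user_ns;
	unsigned int uid, euid, suid;
};

/*
 * Same shape as uidequiv() in the patch: the claimed uid is compared
 * against the sender's uids only when the sender's credentials live in
 * the target user namespace.  In every other case only the targeted
 * capability (modelled here by has_cap_setuid) can authorize the claim.
 */
static bool model_uidequiv(const struct model_cred *src,
			   unsigned int claimed_uid,
			   const struct model_userns *ns,
			   bool has_cap_setuid)
{
	if (src->user_ns == ns &&
	    (src->uid == claimed_uid || src->euid == claimed_uid ||
	     src->suid == claimed_uid))
		return true;
	return has_cap_setuid;
}
```

In this model a matching uid only counts inside the namespace that owns
the socket's network namespace; a numerically equal uid from a different
user namespace is never enough without the capability.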
 net/core/scm.c |   41 ++++++++++++++++++++++++++++++++++-------
 1 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/net/core/scm.c b/net/core/scm.c
index 811b53f..4f376bf 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -43,17 +43,44 @@
  *	setu(g)id.
  */
 
-static __inline__ int scm_check_creds(struct ucred *creds)
+static __inline__ bool uidequiv(const struct cred *src, struct ucred *tgt,
+			       struct user_namespace *ns)
+{
+	if (src->user_ns != ns)
+		goto check_capable;
+	if (src->uid == tgt->uid || src->euid == tgt->uid ||
+	    src->suid == tgt->uid)
+		return true;
+check_capable:
+	if (ns_capable(ns, CAP_SETUID))
+		return true;
+	return false;
+}
+
+static __inline__ bool gidequiv(const struct cred *src, struct ucred *tgt,
+			       struct user_namespace *ns)
+{
+	if (src->user_ns != ns)
+		goto check_capable;
+	if (src->gid == tgt->gid || src->egid == tgt->gid ||
+	    src->sgid == tgt->gid)
+		return true;
+check_capable:
+	if (ns_capable(ns, CAP_SETGID))
+		return true;
+	return false;
+}
+
+static __inline__ int scm_check_creds(struct ucred *creds, struct socket *sock)
 {
 	const struct cred *cred = current_cred();
+	struct user_namespace *ns = sock_net(sock->sk)->user_ns;
 
-	if ((creds->pid == task_tgid_vnr(current) || capable(CAP_SYS_ADMIN)) &&
-	    ((creds->uid == cred->uid   || creds->uid == cred->euid ||
-	      creds->uid == cred->suid) || capable(CAP_SETUID)) &&
-	    ((creds->gid == cred->gid   || creds->gid == cred->egid ||
-	      creds->gid == cred->sgid) || capable(CAP_SETGID))) {
+	if ((creds->pid == task_tgid_vnr(current) || ns_capable(ns, CAP_SYS_ADMIN)) &&
+	     uidequiv(cred, creds, ns) && gidequiv(cred, creds, ns)) {
 	       return 0;
 	}
+
 	return -EPERM;
 }
 
@@ -169,7 +196,7 @@ int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p)
 			if (cmsg->cmsg_len != CMSG_LEN(sizeof(struct ucred)))
 				goto error;
 			memcpy(&p->creds, CMSG_DATA(cmsg), sizeof(struct ucred));
-			err = scm_check_creds(&p->creds);
+			err = scm_check_creds(&p->creds, sock);
 			if (err)
 				goto error;
 
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 11/15] userns: make some net-sysfs capable calls targeted
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
  2011-09-02 19:56   ` (unknown), Serge Hallyn
@ 2011-09-02 19:56     ` Serge Hallyn
  2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn
                       ` (9 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Changelog: jul 1: fix compilation errors (net_device != net)

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 net/core/net-sysfs.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 1683e5d..876915b 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -76,7 +76,7 @@ static ssize_t netdev_store(struct device *dev, struct device_attribute *attr,
 	unsigned long new;
 	int ret = -EINVAL;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(net)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	new = simple_strtoul(buf, &endp, 0);
@@ -261,7 +261,7 @@ static ssize_t store_ifalias(struct device *dev, struct device_attribute *attr,
 	size_t count = len;
 	ssize_t ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(netdev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	/* ignore trailing newline */
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 12/15] user_ns: target af_key capability check
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
  2011-09-02 19:56   ` (unknown), Serge Hallyn
@ 2011-09-02 19:56     ` Serge Hallyn
  2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn
                       ` (9 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

This presumes that af_key really is complete wrt network namespaces.
Looking at the code, it appears to be.

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 net/key/af_key.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/key/af_key.c b/net/key/af_key.c
index 1e733e9..1f90f4e 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -141,7 +141,7 @@ static int pfkey_create(struct net *net, struct socket *sock, int protocol,
 	struct sock *sk;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 	if (sock->type != SOCK_RAW)
 		return -ESOCKTNOSUPPORT;
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 13/15] userns: net: make many network capable calls targeted
       [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
                     ` (12 preceding siblings ...)
  2011-09-02 19:56     ` Serge Hallyn
@ 2011-09-02 19:56   ` Serge Hallyn
  2011-09-02 19:56   ` [PATCH 14/15] net: pass user_ns to cap_netlink_recv() Serge Hallyn
  2011-09-02 19:56   ` [PATCH 15/15] make kernel/signal.c user ns safe (v2) Serge Hallyn
  15 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

When privilege is protecting a namespaced network resource, having
the required privilege targeted toward the user namespace which owns
the resource suffices.

As with other patches, a big concern here is that we cleanly separate
the cases where privilege protects a network resource from the cases
where privilege leads to laxer constraints on input and, subsequently,
the ability to corrupt, crash, or own the host kernel.

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
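The rule described above can be illustrated with a small userspace
model.  This is a sketch, not the kernel implementation: the model_*
types and functions are hypothetical, and the model deliberately
ignores the rule by which the creator of a user namespace is privileged
over it.  It only shows why checking against the user namespace that
owns the network namespace differs from a global capable() check.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct user_namespace. */
struct model_userns {
	int id;
};

/* Hypothetical stand-in for struct net: records its owning user ns. */
struct model_net {
	const struct model_userns *user_ns;
};

/* Hypothetical task: the flag means CAP_NET_ADMIN in its own user ns. */
struct model_task {
	const struct model_userns *user_ns;
	bool cap_net_admin;
};

/* Simplified targeted check: privilege counts only in the task's own ns. */
static bool model_ns_capable(const struct model_task *t,
			     const struct model_userns *target)
{
	return t->user_ns == target && t->cap_net_admin;
}

/* Models the converted call sites: ns_capable(net->user_ns, CAP_NET_ADMIN). */
static bool model_may_configure(const struct model_task *t,
				const struct model_net *net)
{
	return model_ns_capable(t, net->user_ns);
}
```

Under this model, root in a container may configure devices in a
network namespace owned by the container's user namespace, but the same
capability does not reach the host's network namespace.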
 net/8021q/vlan.c                  |   12 ++++++------
 net/bridge/br_ioctl.c             |   22 +++++++++++-----------
 net/bridge/br_sysfs_br.c          |    8 ++++----
 net/bridge/br_sysfs_if.c          |    2 +-
 net/bridge/netfilter/ebtables.c   |    8 ++++----
 net/core/ethtool.c                |    2 +-
 net/ipv4/arp.c                    |    2 +-
 net/ipv4/devinet.c                |    4 ++--
 net/ipv4/fib_frontend.c           |    2 +-
 net/ipv4/ip_options.c             |    6 +++---
 net/ipv4/ip_sockglue.c            |    4 ++--
 net/ipv4/ipip.c                   |    4 ++--
 net/ipv4/ipmr.c                   |    2 +-
 net/ipv4/netfilter/arp_tables.c   |    8 ++++----
 net/ipv4/netfilter/ip_tables.c    |    8 ++++----
 net/netfilter/ipset/ip_set_core.c |    2 +-
 net/netfilter/ipvs/ip_vs_ctl.c    |    4 ++--
 net/packet/af_packet.c            |    2 +-
 18 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 8970ba1..7d12f63 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -558,7 +558,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 	switch (args.cmd) {
 	case SET_VLAN_INGRESS_PRIORITY_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		vlan_dev_set_ingress_priority(dev,
 					      args.u.skb_priority,
@@ -568,7 +568,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 
 	case SET_VLAN_EGRESS_PRIORITY_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		err = vlan_dev_set_egress_priority(dev,
 						   args.u.skb_priority,
@@ -577,7 +577,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 
 	case SET_VLAN_FLAG_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		err = vlan_dev_change_flags(dev,
 					    args.vlan_qos ? args.u.flag : 0,
@@ -586,7 +586,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 
 	case SET_VLAN_NAME_TYPE_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		if ((args.u.name_type >= 0) &&
 		    (args.u.name_type < VLAN_NAME_TYPE_HIGHEST)) {
@@ -602,14 +602,14 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 
 	case ADD_VLAN_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		err = register_vlan_device(dev, args.u.VID);
 		break;
 
 	case DEL_VLAN_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		unregister_vlan_dev(dev, NULL);
 		err = 0;
diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
index 7222fe1..c82f9cb 100644
--- a/net/bridge/br_ioctl.c
+++ b/net/bridge/br_ioctl.c
@@ -88,7 +88,7 @@ static int add_del_if(struct net_bridge *br, int ifindex, int isadd)
 	struct net_device *dev;
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	dev = __dev_get_by_index(dev_net(br->dev), ifindex);
@@ -178,25 +178,25 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 	}
 
 	case BRCTL_SET_BRIDGE_FORWARD_DELAY:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		return br_set_forward_delay(br, args[1]);
 
 	case BRCTL_SET_BRIDGE_HELLO_TIME:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		return br_set_hello_time(br, args[1]);
 
 	case BRCTL_SET_BRIDGE_MAX_AGE:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		return br_set_max_age(br, args[1]);
 
 	case BRCTL_SET_AGEING_TIME:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		br->ageing_time = clock_t_to_jiffies(args[1]);
@@ -236,14 +236,14 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 	}
 
 	case BRCTL_SET_BRIDGE_STP_STATE:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		br_stp_set_enabled(br, args[1]);
 		return 0;
 
 	case BRCTL_SET_BRIDGE_PRIORITY:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		spin_lock_bh(&br->lock);
@@ -256,7 +256,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 		struct net_bridge_port *p;
 		int ret;
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		spin_lock_bh(&br->lock);
@@ -273,7 +273,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 		struct net_bridge_port *p;
 		int ret;
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		spin_lock_bh(&br->lock);
@@ -330,7 +330,7 @@ static int old_deviceless(struct net *net, void __user *uarg)
 	{
 		char buf[IFNAMSIZ];
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		if (copy_from_user(buf, (void __user *)args[1], IFNAMSIZ))
@@ -360,7 +360,7 @@ int br_ioctl_deviceless_stub(struct net *net, unsigned int cmd, void __user *uar
 	{
 		char buf[IFNAMSIZ];
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		if (copy_from_user(buf, uarg, IFNAMSIZ))
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index 68b893e..7f4fa3a 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -36,7 +36,7 @@ static ssize_t store_bridge_parm(struct device *d,
 	unsigned long val;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	val = simple_strtoul(buf, &endp, 0);
@@ -132,7 +132,7 @@ static ssize_t store_stp_state(struct device *d,
 	char *endp;
 	unsigned long val;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	val = simple_strtoul(buf, &endp, 0);
@@ -267,7 +267,7 @@ static ssize_t store_group_addr(struct device *d,
 	unsigned new_addr[6];
 	int i;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (sscanf(buf, "%x:%x:%x:%x:%x:%x",
@@ -304,7 +304,7 @@ static ssize_t store_flush(struct device *d,
 {
 	struct net_bridge *br = to_bridge(d);
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	br_fdb_flush(br);
diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index 6229b62..9cb4d2e 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -209,7 +209,7 @@ static ssize_t brport_store(struct kobject * kobj,
 	char *endp;
 	unsigned long val;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(p->br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	val = simple_strtoul(buf, &endp, 0);
diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 5864cc4..cc1198b 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -1463,7 +1463,7 @@ static int do_ebt_set_ctl(struct sock *sk,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch(cmd) {
@@ -1485,7 +1485,7 @@ static int do_ebt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 	struct ebt_replace tmp;
 	struct ebt_table *t;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (copy_from_user(&tmp, user, sizeof(tmp)))
@@ -2276,7 +2276,7 @@ static int compat_do_ebt_set_ctl(struct sock *sk,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -2299,7 +2299,7 @@ static int compat_do_ebt_get_ctl(struct sock *sk, int cmd,
 	struct compat_ebt_replace tmp;
 	struct ebt_table *t;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	/* try real handler in case userland supplied needed padding */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 6cdba5f..56878bf 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1676,7 +1676,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
 	case ETHTOOL_GFEATURES:
 		break;
 	default:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 	}
 
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 96a164a..023ad24 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -1175,7 +1175,7 @@ int arp_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	switch (cmd) {
 	case SIOCDARP:
 	case SIOCSARP:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 	case SIOCGARP:
 		err = copy_from_user(&r, arg, sizeof(struct arpreq));
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index bc19bd0..93b5b0b 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -728,7 +728,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 
 	case SIOCSIFFLAGS:
 		ret = -EACCES;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			goto out;
 		break;
 	case SIOCSIFADDR:	/* Set interface address (and family) */
@@ -736,7 +736,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	case SIOCSIFDSTADDR:	/* Set the destination address */
 	case SIOCSIFNETMASK: 	/* Set the netmask for the interface */
 		ret = -EACCES;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			goto out;
 		ret = -EINVAL;
 		if (sin->sin_family != AF_INET)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 92fc5f6..8f34a07 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -437,7 +437,7 @@ int ip_rt_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	switch (cmd) {
 	case SIOCADDRT:		/* Add a route */
 	case SIOCDELRT:		/* Delete a route */
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		if (copy_from_user(&rt, arg, sizeof(rt)))
diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c
index ec93335..21df700 100644
--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -396,7 +396,7 @@ int ip_options_compile(struct net *net,
 					optptr[2] += 8;
 					break;
 				      default:
-					if (!skb && !capable(CAP_NET_RAW)) {
+					if (!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) {
 						pp_ptr = optptr + 3;
 						goto error;
 					}
@@ -432,7 +432,7 @@ int ip_options_compile(struct net *net,
 				opt->router_alert = optptr - iph;
 			break;
 		      case IPOPT_CIPSO:
-			if ((!skb && !capable(CAP_NET_RAW)) || opt->cipso) {
+			if ((!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) || opt->cipso) {
 				pp_ptr = optptr;
 				goto error;
 			}
@@ -445,7 +445,7 @@ int ip_options_compile(struct net *net,
 		      case IPOPT_SEC:
 		      case IPOPT_SID:
 		      default:
-			if (!skb && !capable(CAP_NET_RAW)) {
+			if (!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) {
 				pp_ptr = optptr;
 				goto error;
 			}
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 8905e92..6408507 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -955,13 +955,13 @@ mc_msf_out:
 	case IP_IPSEC_POLICY:
 	case IP_XFRM_POLICY:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 			break;
 		err = xfrm_user_policy(sk, optname, optval, optlen);
 		break;
 
 	case IP_TRANSPARENT:
-		if (!capable(CAP_NET_ADMIN)) {
+		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
 			err = -EPERM;
 			break;
 		}
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 378b20b..6725832 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -629,7 +629,7 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCADDTUNNEL:
 	case SIOCCHGTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		err = -EFAULT;
@@ -689,7 +689,7 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 
 	case SIOCDELTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		if (dev == ipn->fb_tunnel_dev) {
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 58e8791..309aa0c 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1204,7 +1204,7 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
 
 	if (optname != MRT_INIT) {
 		if (sk != rcu_dereference_raw(mrt->mroute_sk) &&
-		    !capable(CAP_NET_ADMIN))
+		    !ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EACCES;
 	}
 
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index fd7a3f6..acc908f 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -1534,7 +1534,7 @@ static int compat_do_arpt_set_ctl(struct sock *sk, int cmd, void __user *user,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1678,7 +1678,7 @@ static int compat_do_arpt_get_ctl(struct sock *sk, int cmd, void __user *user,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1699,7 +1699,7 @@ static int do_arpt_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1723,7 +1723,7 @@ static int do_arpt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 24e556e..72f2cde 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -1847,7 +1847,7 @@ compat_do_ipt_set_ctl(struct sock *sk,	int cmd, void __user *user,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1962,7 +1962,7 @@ compat_do_ipt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1984,7 +1984,7 @@ do_ipt_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -2009,7 +2009,7 @@ do_ipt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index d7e86ef..38d69a5 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -1596,7 +1596,7 @@ ip_set_sockfn_get(struct sock *sk, int optval, void __user *user, int *len)
 	void *data;
 	int copylen = *len, ret = 0;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 	if (optval != SO_IP_SET)
 		return -EBADF;
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 2b771dc..db224ef 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2284,7 +2284,7 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
 	struct ip_vs_dest_user *udest_compat;
 	struct ip_vs_dest_user_kern udest;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_SET_MAX)
@@ -2566,7 +2566,7 @@ do_ip_vs_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 	struct netns_ipvs *ipvs = net_ipvs(net);
 
 	BUG_ON(!net);
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_GET_MAX)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index c698cec..c2e6bb6 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1793,7 +1793,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 	__be16 proto = (__force __be16)protocol; /* weird, but documented */
 	int err;
 
-	if (!capable(CAP_NET_RAW))
+	if (!ns_capable(net->user_ns, CAP_NET_RAW))
 		return -EPERM;
 	if (sock->type != SOCK_DGRAM && sock->type != SOCK_RAW &&
 	    sock->type != SOCK_PACKET)
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 13/15] userns: net: make many network capable calls targeted
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
                   ` (7 preceding siblings ...)
  2011-09-02 19:56 ` [PATCH 10/15] net/core/scm.c: target capable() calls to user_ns owning the net_ns Serge Hallyn
@ 2011-09-02 19:56 ` Serge Hallyn
  2011-09-02 19:56 ` [PATCH 14/15] net: pass user_ns to cap_netlink_recv() Serge Hallyn
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge E. Hallyn

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

When privilege protects a namespaced network resource, having the
required privilege targeted toward the user namespace which owns the
resource suffices.

As with other patches, a big concern here is that we cleanly separate
the cases where privilege protects a network resource from the cases
where privilege can lead to laxer constraints on input and, subsequently,
the ability to corrupt, crash, or own the host kernel.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
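Illustrative note, not part of the patch: the rule in the description
above can be modelled in userspace roughly as follows.  This is only a
sketch under stated assumptions.  Every toy_* name is hypothetical, the
namespace hierarchy is reduced to a parent pointer, and the kernel's
extra rule that the creator of a user namespace holds all capabilities
in it is left out.  It shows why a capability held in a child user
namespace suffices for a network namespace owned by that child, but not
for one owned by the initial user namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel objects; all names here are hypothetical. */
struct toy_user_ns {
	struct toy_user_ns *parent;	/* NULL for the initial namespace */
};

struct toy_net {
	struct toy_user_ns *user_ns;	/* user namespace owning this netns */
};

struct toy_cred {
	struct toy_user_ns *user_ns;	/* namespace the task runs in */
	unsigned long cap_effective;	/* bitmask of raised capabilities */
};

#define TOY_CAP_NET_ADMIN 12	/* same number as CAP_NET_ADMIN */

/*
 * Simplified model of a targeted capability check: walk from the
 * target namespace up through its ancestors.  If the task's own
 * namespace is found on that path, its capability bits apply.
 */
static bool toy_ns_capable(const struct toy_cred *cred,
			   const struct toy_user_ns *target, int cap)
{
	for (; target; target = target->parent) {
		if (target == cred->user_ns)
			return (cred->cap_effective >> cap) & 1UL;
	}
	return false;
}

/* Small self-check covering host root and container root. */
void toy_demo(void)
{
	struct toy_user_ns init_ns = { .parent = NULL };
	struct toy_user_ns child_ns = { .parent = &init_ns };
	struct toy_net host_net = { .user_ns = &init_ns };
	struct toy_net cont_net = { .user_ns = &child_ns };
	struct toy_cred host_root = { &init_ns, 1UL << TOY_CAP_NET_ADMIN };
	struct toy_cred cont_root = { &child_ns, 1UL << TOY_CAP_NET_ADMIN };

	/* Container root may administer the netns its user ns owns... */
	assert(toy_ns_capable(&cont_root, cont_net.user_ns, TOY_CAP_NET_ADMIN));
	/* ...but not the host netns, owned by the initial user ns. */
	assert(!toy_ns_capable(&cont_root, host_net.user_ns, TOY_CAP_NET_ADMIN));
	/* Privilege in a parent user ns carries over to its children. */
	assert(toy_ns_capable(&host_root, cont_net.user_ns, TOY_CAP_NET_ADMIN));
	assert(toy_ns_capable(&host_root, host_net.user_ns, TOY_CAP_NET_ADMIN));
}
```

In the diff below, the target namespace is always reached through the
object being protected: net->user_ns, sock_net(sk)->user_ns or
dev_net(dev)->user_ns.
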
 net/8021q/vlan.c                  |   12 ++++++------
 net/bridge/br_ioctl.c             |   22 +++++++++++-----------
 net/bridge/br_sysfs_br.c          |    8 ++++----
 net/bridge/br_sysfs_if.c          |    2 +-
 net/bridge/netfilter/ebtables.c   |    8 ++++----
 net/core/ethtool.c                |    2 +-
 net/ipv4/arp.c                    |    2 +-
 net/ipv4/devinet.c                |    4 ++--
 net/ipv4/fib_frontend.c           |    2 +-
 net/ipv4/ip_options.c             |    6 +++---
 net/ipv4/ip_sockglue.c            |    4 ++--
 net/ipv4/ipip.c                   |    4 ++--
 net/ipv4/ipmr.c                   |    2 +-
 net/ipv4/netfilter/arp_tables.c   |    8 ++++----
 net/ipv4/netfilter/ip_tables.c    |    8 ++++----
 net/netfilter/ipset/ip_set_core.c |    2 +-
 net/netfilter/ipvs/ip_vs_ctl.c    |    4 ++--
 net/packet/af_packet.c            |    2 +-
 18 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 8970ba1..7d12f63 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -558,7 +558,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 	switch (args.cmd) {
 	case SET_VLAN_INGRESS_PRIORITY_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		vlan_dev_set_ingress_priority(dev,
 					      args.u.skb_priority,
@@ -568,7 +568,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 
 	case SET_VLAN_EGRESS_PRIORITY_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		err = vlan_dev_set_egress_priority(dev,
 						   args.u.skb_priority,
@@ -577,7 +577,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 
 	case SET_VLAN_FLAG_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		err = vlan_dev_change_flags(dev,
 					    args.vlan_qos ? args.u.flag : 0,
@@ -586,7 +586,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 
 	case SET_VLAN_NAME_TYPE_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		if ((args.u.name_type >= 0) &&
 		    (args.u.name_type < VLAN_NAME_TYPE_HIGHEST)) {
@@ -602,14 +602,14 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 
 	case ADD_VLAN_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		err = register_vlan_device(dev, args.u.VID);
 		break;
 
 	case DEL_VLAN_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		unregister_vlan_dev(dev, NULL);
 		err = 0;
diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
index 7222fe1..c82f9cb 100644
--- a/net/bridge/br_ioctl.c
+++ b/net/bridge/br_ioctl.c
@@ -88,7 +88,7 @@ static int add_del_if(struct net_bridge *br, int ifindex, int isadd)
 	struct net_device *dev;
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	dev = __dev_get_by_index(dev_net(br->dev), ifindex);
@@ -178,25 +178,25 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 	}
 
 	case BRCTL_SET_BRIDGE_FORWARD_DELAY:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		return br_set_forward_delay(br, args[1]);
 
 	case BRCTL_SET_BRIDGE_HELLO_TIME:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		return br_set_hello_time(br, args[1]);
 
 	case BRCTL_SET_BRIDGE_MAX_AGE:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		return br_set_max_age(br, args[1]);
 
 	case BRCTL_SET_AGEING_TIME:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		br->ageing_time = clock_t_to_jiffies(args[1]);
@@ -236,14 +236,14 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 	}
 
 	case BRCTL_SET_BRIDGE_STP_STATE:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		br_stp_set_enabled(br, args[1]);
 		return 0;
 
 	case BRCTL_SET_BRIDGE_PRIORITY:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		spin_lock_bh(&br->lock);
@@ -256,7 +256,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 		struct net_bridge_port *p;
 		int ret;
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		spin_lock_bh(&br->lock);
@@ -273,7 +273,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 		struct net_bridge_port *p;
 		int ret;
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		spin_lock_bh(&br->lock);
@@ -330,7 +330,7 @@ static int old_deviceless(struct net *net, void __user *uarg)
 	{
 		char buf[IFNAMSIZ];
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		if (copy_from_user(buf, (void __user *)args[1], IFNAMSIZ))
@@ -360,7 +360,7 @@ int br_ioctl_deviceless_stub(struct net *net, unsigned int cmd, void __user *uar
 	{
 		char buf[IFNAMSIZ];
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		if (copy_from_user(buf, uarg, IFNAMSIZ))
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index 68b893e..7f4fa3a 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -36,7 +36,7 @@ static ssize_t store_bridge_parm(struct device *d,
 	unsigned long val;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	val = simple_strtoul(buf, &endp, 0);
@@ -132,7 +132,7 @@ static ssize_t store_stp_state(struct device *d,
 	char *endp;
 	unsigned long val;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	val = simple_strtoul(buf, &endp, 0);
@@ -267,7 +267,7 @@ static ssize_t store_group_addr(struct device *d,
 	unsigned new_addr[6];
 	int i;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (sscanf(buf, "%x:%x:%x:%x:%x:%x",
@@ -304,7 +304,7 @@ static ssize_t store_flush(struct device *d,
 {
 	struct net_bridge *br = to_bridge(d);
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	br_fdb_flush(br);
diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index 6229b62..9cb4d2e 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -209,7 +209,7 @@ static ssize_t brport_store(struct kobject * kobj,
 	char *endp;
 	unsigned long val;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(p->br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	val = simple_strtoul(buf, &endp, 0);
diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 5864cc4..cc1198b 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -1463,7 +1463,7 @@ static int do_ebt_set_ctl(struct sock *sk,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch(cmd) {
@@ -1485,7 +1485,7 @@ static int do_ebt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 	struct ebt_replace tmp;
 	struct ebt_table *t;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (copy_from_user(&tmp, user, sizeof(tmp)))
@@ -2276,7 +2276,7 @@ static int compat_do_ebt_set_ctl(struct sock *sk,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -2299,7 +2299,7 @@ static int compat_do_ebt_get_ctl(struct sock *sk, int cmd,
 	struct compat_ebt_replace tmp;
 	struct ebt_table *t;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	/* try real handler in case userland supplied needed padding */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 6cdba5f..56878bf 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1676,7 +1676,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
 	case ETHTOOL_GFEATURES:
 		break;
 	default:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 	}
 
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 96a164a..023ad24 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -1175,7 +1175,7 @@ int arp_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	switch (cmd) {
 	case SIOCDARP:
 	case SIOCSARP:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 	case SIOCGARP:
 		err = copy_from_user(&r, arg, sizeof(struct arpreq));
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index bc19bd0..93b5b0b 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -728,7 +728,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 
 	case SIOCSIFFLAGS:
 		ret = -EACCES;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			goto out;
 		break;
 	case SIOCSIFADDR:	/* Set interface address (and family) */
@@ -736,7 +736,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	case SIOCSIFDSTADDR:	/* Set the destination address */
 	case SIOCSIFNETMASK: 	/* Set the netmask for the interface */
 		ret = -EACCES;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			goto out;
 		ret = -EINVAL;
 		if (sin->sin_family != AF_INET)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 92fc5f6..8f34a07 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -437,7 +437,7 @@ int ip_rt_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	switch (cmd) {
 	case SIOCADDRT:		/* Add a route */
 	case SIOCDELRT:		/* Delete a route */
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		if (copy_from_user(&rt, arg, sizeof(rt)))
diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c
index ec93335..21df700 100644
--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -396,7 +396,7 @@ int ip_options_compile(struct net *net,
 					optptr[2] += 8;
 					break;
 				      default:
-					if (!skb && !capable(CAP_NET_RAW)) {
+					if (!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) {
 						pp_ptr = optptr + 3;
 						goto error;
 					}
@@ -432,7 +432,7 @@ int ip_options_compile(struct net *net,
 				opt->router_alert = optptr - iph;
 			break;
 		      case IPOPT_CIPSO:
-			if ((!skb && !capable(CAP_NET_RAW)) || opt->cipso) {
+			if ((!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) || opt->cipso) {
 				pp_ptr = optptr;
 				goto error;
 			}
@@ -445,7 +445,7 @@ int ip_options_compile(struct net *net,
 		      case IPOPT_SEC:
 		      case IPOPT_SID:
 		      default:
-			if (!skb && !capable(CAP_NET_RAW)) {
+			if (!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) {
 				pp_ptr = optptr;
 				goto error;
 			}
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 8905e92..6408507 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -955,13 +955,13 @@ mc_msf_out:
 	case IP_IPSEC_POLICY:
 	case IP_XFRM_POLICY:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 			break;
 		err = xfrm_user_policy(sk, optname, optval, optlen);
 		break;
 
 	case IP_TRANSPARENT:
-		if (!capable(CAP_NET_ADMIN)) {
+		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
 			err = -EPERM;
 			break;
 		}
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 378b20b..6725832 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -629,7 +629,7 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCADDTUNNEL:
 	case SIOCCHGTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		err = -EFAULT;
@@ -689,7 +689,7 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 
 	case SIOCDELTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		if (dev == ipn->fb_tunnel_dev) {
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 58e8791..309aa0c 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1204,7 +1204,7 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
 
 	if (optname != MRT_INIT) {
 		if (sk != rcu_dereference_raw(mrt->mroute_sk) &&
-		    !capable(CAP_NET_ADMIN))
+		    !ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EACCES;
 	}
 
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index fd7a3f6..acc908f 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -1534,7 +1534,7 @@ static int compat_do_arpt_set_ctl(struct sock *sk, int cmd, void __user *user,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1678,7 +1678,7 @@ static int compat_do_arpt_get_ctl(struct sock *sk, int cmd, void __user *user,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1699,7 +1699,7 @@ static int do_arpt_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1723,7 +1723,7 @@ static int do_arpt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 24e556e..72f2cde 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -1847,7 +1847,7 @@ compat_do_ipt_set_ctl(struct sock *sk,	int cmd, void __user *user,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1962,7 +1962,7 @@ compat_do_ipt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1984,7 +1984,7 @@ do_ipt_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -2009,7 +2009,7 @@ do_ipt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index d7e86ef..38d69a5 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -1596,7 +1596,7 @@ ip_set_sockfn_get(struct sock *sk, int optval, void __user *user, int *len)
 	void *data;
 	int copylen = *len, ret = 0;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 	if (optval != SO_IP_SET)
 		return -EBADF;
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 2b771dc..db224ef 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2284,7 +2284,7 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
 	struct ip_vs_dest_user *udest_compat;
 	struct ip_vs_dest_user_kern udest;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_SET_MAX)
@@ -2566,7 +2566,7 @@ do_ip_vs_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 	struct netns_ipvs *ipvs = net_ipvs(net);
 
 	BUG_ON(!net);
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_GET_MAX)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index c698cec..c2e6bb6 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1793,7 +1793,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 	__be16 proto = (__force __be16)protocol; /* weird, but documented */
 	int err;
 
-	if (!capable(CAP_NET_RAW))
+	if (!ns_capable(net->user_ns, CAP_NET_RAW))
 		return -EPERM;
 	if (sock->type != SOCK_DGRAM && sock->type != SOCK_RAW &&
 	    sock->type != SOCK_PACKET)
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 14/15] net: pass user_ns to cap_netlink_recv()
       [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
                     ` (13 preceding siblings ...)
  2011-09-02 19:56   ` [PATCH 13/15] userns: net: make many network capable calls targeted Serge Hallyn
@ 2011-09-02 19:56   ` Serge Hallyn
  2011-09-02 19:56   ` [PATCH 15/15] make kernel/signal.c user ns safe (v2) Serge Hallyn
  15 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

and make cap_netlink_recv() userns-aware

cap_netlink_recv() was granting privilege if a capability was in
current_cap(), regardless of the user namespace.  Fix that by
targeting the capability check against the user namespace which
owns the skb.

The user namespace is passed in by the caller because sock_net() is a
static inline defined in net/sock.h, which we don't want to #include
where cap_netlink_recv() is defined (security/commoncap.c).

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
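Illustrative note, not part of the patch: a userspace sketch of the
behaviour change described above, under stated assumptions.  All toy_*
names are hypothetical.  In particular, the sender's namespace and
capability bits are stored on the toy skb purely for convenience,
whereas the description above says the real check consults
current_cap().  The caller is loosely modelled on the rtnetlink hunk in
this patch, where requests of kind 2 skip the check; it looks up the
owning user namespace itself and hands it to the hook.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins; every name and field here is hypothetical. */
struct toy_user_ns { struct toy_user_ns *parent; };
struct toy_net { struct toy_user_ns *user_ns; };
struct toy_sock { struct toy_net *net; };
struct toy_skb {
	struct toy_sock *sk;
	struct toy_user_ns *sender_ns;	/* namespace of the sending task */
	unsigned long sender_caps;	/* sender's effective capabilities */
};

#define TOY_CAP_NET_ADMIN 12	/* same number as CAP_NET_ADMIN */

/* Old behaviour: only the capability bit matters, not where it is held. */
static int toy_netlink_recv_old(const struct toy_skb *skb, int cap)
{
	return ((skb->sender_caps >> cap) & 1UL) ? 0 : -EPERM;
}

/* New behaviour: the bit must be held in ns or in one of its ancestors. */
static int toy_netlink_recv(const struct toy_skb *skb, int cap,
			    const struct toy_user_ns *ns)
{
	for (; ns; ns = ns->parent) {
		if (ns == skb->sender_ns)
			return ((skb->sender_caps >> cap) & 1UL) ? 0 : -EPERM;
	}
	return -EPERM;
}

/* Caller derives the namespace from the socket's netns and passes it in. */
static int toy_rtnetlink_rcv(const struct toy_skb *skb, int kind)
{
	if (kind != 2 &&
	    toy_netlink_recv(skb, TOY_CAP_NET_ADMIN, skb->sk->net->user_ns))
		return -EPERM;
	return 0;
}

/* Container root sending on a socket that lives in the host netns. */
void toy_netlink_demo(void)
{
	struct toy_user_ns init_ns = { .parent = NULL };
	struct toy_user_ns child_ns = { .parent = &init_ns };
	struct toy_net host_net = { .user_ns = &init_ns };
	struct toy_sock host_sk = { .net = &host_net };
	struct toy_skb skb = {
		.sk = &host_sk,
		.sender_ns = &child_ns,
		.sender_caps = 1UL << TOY_CAP_NET_ADMIN,
	};

	/* The untargeted check would have let this through. */
	assert(toy_netlink_recv_old(&skb, TOY_CAP_NET_ADMIN) == 0);
	/* The targeted check refuses a modifying request... */
	assert(toy_rtnetlink_rcv(&skb, 0) == -EPERM);
	/* ...while a kind 2 request is not subject to the check. */
	assert(toy_rtnetlink_rcv(&skb, 2) == 0);
}
```

Keeping the lookup in the caller means the hook itself only needs a
struct user_namespace pointer and no socket headers.
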
 drivers/scsi/scsi_netlink.c     |    3 ++-
 include/linux/security.h        |   14 +++++++++-----
 kernel/audit.c                  |    6 ++++--
 net/core/rtnetlink.c            |    3 ++-
 net/decnet/netfilter/dn_rtmsg.c |    3 ++-
 net/ipv4/netfilter/ip_queue.c   |    3 ++-
 net/ipv6/netfilter/ip6_queue.c  |    3 ++-
 net/netfilter/nfnetlink.c       |    2 +-
 net/netlink/genetlink.c         |    2 +-
 net/xfrm/xfrm_user.c            |    2 +-
 security/commoncap.c            |    6 ++----
 security/security.c             |    4 ++--
 security/selinux/hooks.c        |    5 +++--
 13 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/drivers/scsi/scsi_netlink.c b/drivers/scsi/scsi_netlink.c
index 26a8a45..0aa2e57 100644
--- a/drivers/scsi/scsi_netlink.c
+++ b/drivers/scsi/scsi_netlink.c
@@ -111,7 +111,8 @@ scsi_nl_rcv_msg(struct sk_buff *skb)
 			goto next_msg;
 		}
 
-		if (security_netlink_recv(skb, CAP_SYS_ADMIN)) {
+		if (security_netlink_recv(skb, CAP_SYS_ADMIN,
+					  sock_net(skb->sk)->user_ns)) {
 			err = -EPERM;
 			goto next_msg;
 		}
diff --git a/include/linux/security.h b/include/linux/security.h
index ebd2a53..cfa1f47 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -95,7 +95,8 @@ struct xfrm_user_sec_ctx;
 struct seq_file;
 
 extern int cap_netlink_send(struct sock *sk, struct sk_buff *skb);
-extern int cap_netlink_recv(struct sk_buff *skb, int cap);
+extern int cap_netlink_recv(struct sk_buff *skb, int cap,
+			    struct user_namespace *ns);
 
 void reset_security_ops(void);
 
@@ -797,6 +798,7 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  *	@skb.
  *	@skb contains the sk_buff structure for the netlink message.
  *	@cap indicates the capability required
+ *	@ns is the user namespace which owns skb
  *	Return 0 if permission is granted.
  *
  * Security hooks for Unix domain networking.
@@ -1557,7 +1559,8 @@ struct security_operations {
 			  struct sembuf *sops, unsigned nsops, int alter);
 
 	int (*netlink_send) (struct sock *sk, struct sk_buff *skb);
-	int (*netlink_recv) (struct sk_buff *skb, int cap);
+	int (*netlink_recv) (struct sk_buff *skb, int cap,
+			     struct user_namespace *ns);
 
 	void (*d_instantiate) (struct dentry *dentry, struct inode *inode);
 
@@ -1806,7 +1809,7 @@ void security_d_instantiate(struct dentry *dentry, struct inode *inode);
 int security_getprocattr(struct task_struct *p, char *name, char **value);
 int security_setprocattr(struct task_struct *p, char *name, void *value, size_t size);
 int security_netlink_send(struct sock *sk, struct sk_buff *skb);
-int security_netlink_recv(struct sk_buff *skb, int cap);
+int security_netlink_recv(struct sk_buff *skb, int cap, struct user_namespace *ns);
 int security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen);
 int security_secctx_to_secid(const char *secdata, u32 seclen, u32 *secid);
 void security_release_secctx(char *secdata, u32 seclen);
@@ -2498,9 +2501,10 @@ static inline int security_netlink_send(struct sock *sk, struct sk_buff *skb)
 	return cap_netlink_send(sk, skb);
 }
 
-static inline int security_netlink_recv(struct sk_buff *skb, int cap)
+static inline int security_netlink_recv(struct sk_buff *skb, int cap,
+					struct user_namespace *ns)
 {
-	return cap_netlink_recv(skb, cap);
+	return cap_netlink_recv(skb, cap, ns);
 }
 
 static inline int security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen)
diff --git a/kernel/audit.c b/kernel/audit.c
index 0a1355c..48144c4 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -601,13 +601,15 @@ static int audit_netlink_ok(struct sk_buff *skb, u16 msg_type)
 	case AUDIT_TTY_SET:
 	case AUDIT_TRIM:
 	case AUDIT_MAKE_EQUIV:
-		if (security_netlink_recv(skb, CAP_AUDIT_CONTROL))
+		if (security_netlink_recv(skb, CAP_AUDIT_CONTROL,
+					  sock_net(skb->sk)->user_ns))
 			err = -EPERM;
 		break;
 	case AUDIT_USER:
 	case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG:
 	case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2:
-		if (security_netlink_recv(skb, CAP_AUDIT_WRITE))
+		if (security_netlink_recv(skb, CAP_AUDIT_WRITE,
+					  sock_net(skb->sk)->user_ns))
 			err = -EPERM;
 		break;
 	default:  /* bad msg */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 99d9e95..4a444de 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1931,7 +1931,8 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	sz_idx = type>>2;
 	kind = type&3;
 
-	if (kind != 2 && security_netlink_recv(skb, CAP_NET_ADMIN))
+	if (kind != 2 && security_netlink_recv(skb, CAP_NET_ADMIN,
+					       net->user_ns))
 		return -EPERM;
 
 	if (kind == 2 && nlh->nlmsg_flags&NLM_F_DUMP) {
diff --git a/net/decnet/netfilter/dn_rtmsg.c b/net/decnet/netfilter/dn_rtmsg.c
index 69975e0..2d052ab 100644
--- a/net/decnet/netfilter/dn_rtmsg.c
+++ b/net/decnet/netfilter/dn_rtmsg.c
@@ -108,7 +108,8 @@ static inline void dnrmg_receive_user_skb(struct sk_buff *skb)
 	if (nlh->nlmsg_len < sizeof(*nlh) || skb->len < nlh->nlmsg_len)
 		return;
 
-	if (security_netlink_recv(skb, CAP_NET_ADMIN))
+	if (security_netlink_recv(skb, CAP_NET_ADMIN,
+	    sock_net(skb->sk)->user_ns))
 		RCV_SKB_FAIL(-EPERM);
 
 	/* Eventually we might send routing messages too */
diff --git a/net/ipv4/netfilter/ip_queue.c b/net/ipv4/netfilter/ip_queue.c
index 5c9b9d9..51d7c52 100644
--- a/net/ipv4/netfilter/ip_queue.c
+++ b/net/ipv4/netfilter/ip_queue.c
@@ -432,7 +432,8 @@ __ipq_rcv_skb(struct sk_buff *skb)
 	if (type <= IPQM_BASE)
 		return;
 
-	if (security_netlink_recv(skb, CAP_NET_ADMIN))
+	if (security_netlink_recv(skb, CAP_NET_ADMIN,
+				  sock_net(skb->sk)->user_ns))
 		RCV_SKB_FAIL(-EPERM);
 
 	spin_lock_bh(&queue_lock);
diff --git a/net/ipv6/netfilter/ip6_queue.c b/net/ipv6/netfilter/ip6_queue.c
index 2493948..8206bf3 100644
--- a/net/ipv6/netfilter/ip6_queue.c
+++ b/net/ipv6/netfilter/ip6_queue.c
@@ -433,7 +433,8 @@ __ipq_rcv_skb(struct sk_buff *skb)
 	if (type <= IPQM_BASE)
 		return;
 
-	if (security_netlink_recv(skb, CAP_NET_ADMIN))
+	if (security_netlink_recv(skb, CAP_NET_ADMIN,
+				  sock_net(skb->sk)->user_ns))
 		RCV_SKB_FAIL(-EPERM);
 
 	spin_lock_bh(&queue_lock);
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 1905976..bcaff9d 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -130,7 +130,7 @@ static int nfnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	const struct nfnetlink_subsystem *ss;
 	int type, err;
 
-	if (security_netlink_recv(skb, CAP_NET_ADMIN))
+	if (security_netlink_recv(skb, CAP_NET_ADMIN, net->user_ns))
 		return -EPERM;
 
 	/* All the messages must at least contain nfgenmsg */
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 482fa57..00a101c 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -516,7 +516,7 @@ static int genl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 		return -EOPNOTSUPP;
 
 	if ((ops->flags & GENL_ADMIN_PERM) &&
-	    security_netlink_recv(skb, CAP_NET_ADMIN))
+	    security_netlink_recv(skb, CAP_NET_ADMIN, net->user_ns))
 		return -EPERM;
 
 	if (nlh->nlmsg_flags & NLM_F_DUMP) {
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 0256b8a..1808e1e 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -2290,7 +2290,7 @@ static int xfrm_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	link = &xfrm_dispatch[type];
 
 	/* All operations require privileges, even GET */
-	if (security_netlink_recv(skb, CAP_NET_ADMIN))
+	if (security_netlink_recv(skb, CAP_NET_ADMIN, net->user_ns))
 		return -EPERM;
 
 	if ((type == (XFRM_MSG_GETSA - XFRM_MSG_BASE) ||
diff --git a/security/commoncap.c b/security/commoncap.c
index a93b3b7..1e48e6a 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -56,11 +56,9 @@ int cap_netlink_send(struct sock *sk, struct sk_buff *skb)
 	return 0;
 }
 
-int cap_netlink_recv(struct sk_buff *skb, int cap)
+int cap_netlink_recv(struct sk_buff *skb, int cap, struct user_namespace *ns)
 {
-	if (!cap_raised(current_cap(), cap))
-		return -EPERM;
-	return 0;
+	return security_capable(ns, current_cred(), cap);
 }
 EXPORT_SYMBOL(cap_netlink_recv);
 
diff --git a/security/security.c b/security/security.c
index 0e4fccf..0a1453e 100644
--- a/security/security.c
+++ b/security/security.c
@@ -941,9 +941,9 @@ int security_netlink_send(struct sock *sk, struct sk_buff *skb)
 	return security_ops->netlink_send(sk, skb);
 }
 
-int security_netlink_recv(struct sk_buff *skb, int cap)
+int security_netlink_recv(struct sk_buff *skb, int cap, struct user_namespace *ns)
 {
-	return security_ops->netlink_recv(skb, cap);
+	return security_ops->netlink_recv(skb, cap, ns);
 }
 EXPORT_SYMBOL(security_netlink_recv);
 
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 266a229..fe290bb 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4723,13 +4723,14 @@ static int selinux_netlink_send(struct sock *sk, struct sk_buff *skb)
 	return selinux_nlmsg_perm(sk, skb);
 }
 
-static int selinux_netlink_recv(struct sk_buff *skb, int capability)
+static int selinux_netlink_recv(struct sk_buff *skb, int capability,
+				struct user_namespace *ns)
 {
 	int err;
 	struct common_audit_data ad;
 	u32 sid;
 
-	err = cap_netlink_recv(skb, capability);
+	err = cap_netlink_recv(skb, capability, ns);
 	if (err)
 		return err;
 
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 14/15] net: pass user_ns to cap_netlink_recv()
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
                   ` (8 preceding siblings ...)
  2011-09-02 19:56 ` [PATCH 13/15] userns: net: make many network capable calls targeted Serge Hallyn
@ 2011-09-02 19:56 ` Serge Hallyn
  2011-09-02 19:56 ` [PATCH 15/15] make kernel/signal.c user ns safe (v2) Serge Hallyn
  2011-09-13 14:43 ` user namespaces v3: continue targetting capabilities Serge E. Hallyn
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge E. Hallyn

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

and make cap_netlink_recv() userns-aware

cap_netlink_recv() granted privilege whenever the capability was in
current_cap(), regardless of the user namespace.  Fix that by
targeting the capability check at the user namespace which owns the
skb.

The caller passes the user namespace in, because sock_net() is a static
inline defined in net/sock.h, which we don't want to #include where
cap_netlink_recv() is defined (security/commoncap.c).

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/scsi/scsi_netlink.c     |    3 ++-
 include/linux/security.h        |   14 +++++++++-----
 kernel/audit.c                  |    6 ++++--
 net/core/rtnetlink.c            |    3 ++-
 net/decnet/netfilter/dn_rtmsg.c |    3 ++-
 net/ipv4/netfilter/ip_queue.c   |    3 ++-
 net/ipv6/netfilter/ip6_queue.c  |    3 ++-
 net/netfilter/nfnetlink.c       |    2 +-
 net/netlink/genetlink.c         |    2 +-
 net/xfrm/xfrm_user.c            |    2 +-
 security/commoncap.c            |    6 ++----
 security/security.c             |    4 ++--
 security/selinux/hooks.c        |    5 +++--
 13 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/drivers/scsi/scsi_netlink.c b/drivers/scsi/scsi_netlink.c
index 26a8a45..0aa2e57 100644
--- a/drivers/scsi/scsi_netlink.c
+++ b/drivers/scsi/scsi_netlink.c
@@ -111,7 +111,8 @@ scsi_nl_rcv_msg(struct sk_buff *skb)
 			goto next_msg;
 		}
 
-		if (security_netlink_recv(skb, CAP_SYS_ADMIN)) {
+		if (security_netlink_recv(skb, CAP_SYS_ADMIN,
+					  sock_net(skb->sk)->user_ns)) {
 			err = -EPERM;
 			goto next_msg;
 		}
diff --git a/include/linux/security.h b/include/linux/security.h
index ebd2a53..cfa1f47 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -95,7 +95,8 @@ struct xfrm_user_sec_ctx;
 struct seq_file;
 
 extern int cap_netlink_send(struct sock *sk, struct sk_buff *skb);
-extern int cap_netlink_recv(struct sk_buff *skb, int cap);
+extern int cap_netlink_recv(struct sk_buff *skb, int cap,
+			    struct user_namespace *ns);
 
 void reset_security_ops(void);
 
@@ -797,6 +798,7 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  *	@skb.
  *	@skb contains the sk_buff structure for the netlink message.
  *	@cap indicates the capability required
+ *	@ns is the user namespace which owns skb
  *	Return 0 if permission is granted.
  *
  * Security hooks for Unix domain networking.
@@ -1557,7 +1559,8 @@ struct security_operations {
 			  struct sembuf *sops, unsigned nsops, int alter);
 
 	int (*netlink_send) (struct sock *sk, struct sk_buff *skb);
-	int (*netlink_recv) (struct sk_buff *skb, int cap);
+	int (*netlink_recv) (struct sk_buff *skb, int cap,
+			     struct user_namespace *ns);
 
 	void (*d_instantiate) (struct dentry *dentry, struct inode *inode);
 
@@ -1806,7 +1809,7 @@ void security_d_instantiate(struct dentry *dentry, struct inode *inode);
 int security_getprocattr(struct task_struct *p, char *name, char **value);
 int security_setprocattr(struct task_struct *p, char *name, void *value, size_t size);
 int security_netlink_send(struct sock *sk, struct sk_buff *skb);
-int security_netlink_recv(struct sk_buff *skb, int cap);
+int security_netlink_recv(struct sk_buff *skb, int cap, struct user_namespace *ns);
 int security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen);
 int security_secctx_to_secid(const char *secdata, u32 seclen, u32 *secid);
 void security_release_secctx(char *secdata, u32 seclen);
@@ -2498,9 +2501,10 @@ static inline int security_netlink_send(struct sock *sk, struct sk_buff *skb)
 	return cap_netlink_send(sk, skb);
 }
 
-static inline int security_netlink_recv(struct sk_buff *skb, int cap)
+static inline int security_netlink_recv(struct sk_buff *skb, int cap,
+					struct user_namespace *ns)
 {
-	return cap_netlink_recv(skb, cap);
+	return cap_netlink_recv(skb, cap, ns);
 }
 
 static inline int security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen)
diff --git a/kernel/audit.c b/kernel/audit.c
index 0a1355c..48144c4 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -601,13 +601,15 @@ static int audit_netlink_ok(struct sk_buff *skb, u16 msg_type)
 	case AUDIT_TTY_SET:
 	case AUDIT_TRIM:
 	case AUDIT_MAKE_EQUIV:
-		if (security_netlink_recv(skb, CAP_AUDIT_CONTROL))
+		if (security_netlink_recv(skb, CAP_AUDIT_CONTROL,
+					  sock_net(skb->sk)->user_ns))
 			err = -EPERM;
 		break;
 	case AUDIT_USER:
 	case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG:
 	case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2:
-		if (security_netlink_recv(skb, CAP_AUDIT_WRITE))
+		if (security_netlink_recv(skb, CAP_AUDIT_WRITE,
+					  sock_net(skb->sk)->user_ns))
 			err = -EPERM;
 		break;
 	default:  /* bad msg */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 99d9e95..4a444de 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1931,7 +1931,8 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	sz_idx = type>>2;
 	kind = type&3;
 
-	if (kind != 2 && security_netlink_recv(skb, CAP_NET_ADMIN))
+	if (kind != 2 && security_netlink_recv(skb, CAP_NET_ADMIN,
+					       net->user_ns))
 		return -EPERM;
 
 	if (kind == 2 && nlh->nlmsg_flags&NLM_F_DUMP) {
diff --git a/net/decnet/netfilter/dn_rtmsg.c b/net/decnet/netfilter/dn_rtmsg.c
index 69975e0..2d052ab 100644
--- a/net/decnet/netfilter/dn_rtmsg.c
+++ b/net/decnet/netfilter/dn_rtmsg.c
@@ -108,7 +108,8 @@ static inline void dnrmg_receive_user_skb(struct sk_buff *skb)
 	if (nlh->nlmsg_len < sizeof(*nlh) || skb->len < nlh->nlmsg_len)
 		return;
 
-	if (security_netlink_recv(skb, CAP_NET_ADMIN))
+	if (security_netlink_recv(skb, CAP_NET_ADMIN,
+	    sock_net(skb->sk)->user_ns))
 		RCV_SKB_FAIL(-EPERM);
 
 	/* Eventually we might send routing messages too */
diff --git a/net/ipv4/netfilter/ip_queue.c b/net/ipv4/netfilter/ip_queue.c
index 5c9b9d9..51d7c52 100644
--- a/net/ipv4/netfilter/ip_queue.c
+++ b/net/ipv4/netfilter/ip_queue.c
@@ -432,7 +432,8 @@ __ipq_rcv_skb(struct sk_buff *skb)
 	if (type <= IPQM_BASE)
 		return;
 
-	if (security_netlink_recv(skb, CAP_NET_ADMIN))
+	if (security_netlink_recv(skb, CAP_NET_ADMIN,
+				  sock_net(skb->sk)->user_ns))
 		RCV_SKB_FAIL(-EPERM);
 
 	spin_lock_bh(&queue_lock);
diff --git a/net/ipv6/netfilter/ip6_queue.c b/net/ipv6/netfilter/ip6_queue.c
index 2493948..8206bf3 100644
--- a/net/ipv6/netfilter/ip6_queue.c
+++ b/net/ipv6/netfilter/ip6_queue.c
@@ -433,7 +433,8 @@ __ipq_rcv_skb(struct sk_buff *skb)
 	if (type <= IPQM_BASE)
 		return;
 
-	if (security_netlink_recv(skb, CAP_NET_ADMIN))
+	if (security_netlink_recv(skb, CAP_NET_ADMIN,
+				  sock_net(skb->sk)->user_ns))
 		RCV_SKB_FAIL(-EPERM);
 
 	spin_lock_bh(&queue_lock);
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 1905976..bcaff9d 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -130,7 +130,7 @@ static int nfnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	const struct nfnetlink_subsystem *ss;
 	int type, err;
 
-	if (security_netlink_recv(skb, CAP_NET_ADMIN))
+	if (security_netlink_recv(skb, CAP_NET_ADMIN, net->user_ns))
 		return -EPERM;
 
 	/* All the messages must at least contain nfgenmsg */
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 482fa57..00a101c 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -516,7 +516,7 @@ static int genl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 		return -EOPNOTSUPP;
 
 	if ((ops->flags & GENL_ADMIN_PERM) &&
-	    security_netlink_recv(skb, CAP_NET_ADMIN))
+	    security_netlink_recv(skb, CAP_NET_ADMIN, net->user_ns))
 		return -EPERM;
 
 	if (nlh->nlmsg_flags & NLM_F_DUMP) {
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 0256b8a..1808e1e 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -2290,7 +2290,7 @@ static int xfrm_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	link = &xfrm_dispatch[type];
 
 	/* All operations require privileges, even GET */
-	if (security_netlink_recv(skb, CAP_NET_ADMIN))
+	if (security_netlink_recv(skb, CAP_NET_ADMIN, net->user_ns))
 		return -EPERM;
 
 	if ((type == (XFRM_MSG_GETSA - XFRM_MSG_BASE) ||
diff --git a/security/commoncap.c b/security/commoncap.c
index a93b3b7..1e48e6a 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -56,11 +56,9 @@ int cap_netlink_send(struct sock *sk, struct sk_buff *skb)
 	return 0;
 }
 
-int cap_netlink_recv(struct sk_buff *skb, int cap)
+int cap_netlink_recv(struct sk_buff *skb, int cap, struct user_namespace *ns)
 {
-	if (!cap_raised(current_cap(), cap))
-		return -EPERM;
-	return 0;
+	return security_capable(ns, current_cred(), cap);
 }
 EXPORT_SYMBOL(cap_netlink_recv);
 
diff --git a/security/security.c b/security/security.c
index 0e4fccf..0a1453e 100644
--- a/security/security.c
+++ b/security/security.c
@@ -941,9 +941,9 @@ int security_netlink_send(struct sock *sk, struct sk_buff *skb)
 	return security_ops->netlink_send(sk, skb);
 }
 
-int security_netlink_recv(struct sk_buff *skb, int cap)
+int security_netlink_recv(struct sk_buff *skb, int cap, struct user_namespace *ns)
 {
-	return security_ops->netlink_recv(skb, cap);
+	return security_ops->netlink_recv(skb, cap, ns);
 }
 EXPORT_SYMBOL(security_netlink_recv);
 
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 266a229..fe290bb 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4723,13 +4723,14 @@ static int selinux_netlink_send(struct sock *sk, struct sk_buff *skb)
 	return selinux_nlmsg_perm(sk, skb);
 }
 
-static int selinux_netlink_recv(struct sk_buff *skb, int capability)
+static int selinux_netlink_recv(struct sk_buff *skb, int capability,
+				struct user_namespace *ns)
 {
 	int err;
 	struct common_audit_data ad;
 	u32 sid;
 
-	err = cap_netlink_recv(skb, capability);
+	err = cap_netlink_recv(skb, capability, ns);
 	if (err)
 		return err;
 
-- 
1.7.5.4
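
For readers following along, here is a rough userspace model of the
difference between the old untargeted check and the targeted one this
patch introduces.  It is only a sketch: the toy_* types and functions
are made up for illustration, and the rules it encodes (a capability
held in an ancestor namespace applies to its descendants, and the
creator of a namespace is privileged over it) are my assumptions about
targeted capabilities, not code taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins for kernel objects.  Every name here is hypothetical. */
struct toy_user_ns {
	struct toy_user_ns *parent;	/* NULL for the initial namespace */
	unsigned int creator_uid;	/* uid, in parent, of the creator */
};

struct toy_cred {
	struct toy_user_ns *ns;		/* namespace the task runs in */
	unsigned int uid;		/* uid as seen inside ns */
	unsigned long caps;		/* bitmask of raised capabilities */
};

#define TOY_CAP_NET_ADMIN 12	/* same number as CAP_NET_ADMIN */

int toy_cap_raised(const struct toy_cred *c, int cap)
{
	assert(cap >= 0 && cap < 32);
	return (c->caps >> cap) & 1UL;
}

/* Old rule: only the caller's own capability set is consulted. */
int toy_recv_untargeted(const struct toy_cred *c, int cap)
{
	return toy_cap_raised(c, cap) ? 0 : -EPERM;
}

/* New rule: the capability must hold toward the target namespace. */
int toy_recv_targeted(const struct toy_cred *c, int cap,
		      const struct toy_user_ns *target)
{
	const struct toy_user_ns *ns;

	for (ns = target; ns != NULL; ns = ns->parent) {
		/* Assumed rule: a namespace's creator holds all caps over it. */
		if (ns->parent != NULL && ns->parent == c->ns &&
		    ns->creator_uid == c->uid)
			return 0;
		/* Reached the caller's own namespace: consult its cap set. */
		if (ns == c->ns)
			return toy_cap_raised(c, cap) ? 0 : -EPERM;
	}
	/* The caller's namespace is neither the target nor an ancestor. */
	return -EPERM;
}
```

Take a container root whose user namespace was created by uid 500 and
who has the capability raised.  The untargeted check passes for any
netlink socket.  The targeted check fails against the initial user
namespace and passes against the container's own.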


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 15/15] make kernel/signal.c user ns safe (v2)
       [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
                     ` (14 preceding siblings ...)
  2011-09-02 19:56   ` [PATCH 14/15] net: pass user_ns to cap_netlink_recv() Serge Hallyn
@ 2011-09-02 19:56   ` Serge Hallyn
  15 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

From: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Signed-off-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
---
 kernel/signal.c |   26 ++++++++++++++++++++++----
 1 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 291c970..c07b970 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -27,6 +27,7 @@
 #include <linux/capability.h>
 #include <linux/freezer.h>
 #include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
 #include <linux/nsproxy.h>
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -1073,7 +1074,8 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 			q->info.si_code = SI_USER;
 			q->info.si_pid = task_tgid_nr_ns(current,
 							task_active_pid_ns(t));
-			q->info.si_uid = current_uid();
+			q->info.si_uid = user_ns_map_uid(task_cred_xxx(t, user_ns),
+					current_cred(), current_uid());
 			break;
 		case (unsigned long) SEND_SIG_PRIV:
 			q->info.si_signo = sig;
@@ -1363,6 +1365,11 @@ int kill_pid_info_as_uid(int sig, struct siginfo *info, struct pid *pid,
 		goto out_unlock;
 	}
 	pcred = __task_cred(p);
+	/*
+	 * this is called (only) from drivers/usb/core/devio.c.
+	 * Do we need to add user_ns to urb and usb_device, so
+	 * we can pass them in here?
+	 */
 	if (si_fromuser(info) &&
 	    euid != pcred->suid && euid != pcred->uid &&
 	    uid  != pcred->suid && uid  != pcred->uid) {
@@ -1618,7 +1625,8 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	 */
 	rcu_read_lock();
 	info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns);
-	info.si_uid = __task_cred(tsk)->uid;
+	info.si_uid = user_ns_map_uid(task_cred_xxx(tsk->parent, user_ns),
+				      __task_cred(tsk), __task_cred(tsk)->uid);
 	rcu_read_unlock();
 
 	info.si_utime = cputime_to_clock_t(cputime_add(tsk->utime,
@@ -1688,6 +1696,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	unsigned long flags;
 	struct task_struct *parent;
 	struct sighand_struct *sighand;
+	const struct cred *cred;
 
 	if (for_ptracer) {
 		parent = tsk->parent;
@@ -1703,7 +1712,9 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	 */
 	rcu_read_lock();
 	info.si_pid = task_pid_nr_ns(tsk, parent->nsproxy->pid_ns);
-	info.si_uid = __task_cred(tsk)->uid;
+	cred = __task_cred(tsk);
+	info.si_uid = user_ns_map_uid(task_cred_xxx(parent, user_ns),
+				cred, cred->uid);
 	rcu_read_unlock();
 
 	info.si_utime = cputime_to_clock_t(tsk->utime);
@@ -2122,7 +2133,10 @@ static int ptrace_signal(int signr, siginfo_t *info,
 		info->si_errno = 0;
 		info->si_code = SI_USER;
 		info->si_pid = task_pid_vnr(current->parent);
-		info->si_uid = task_uid(current->parent);
+		/* we can cache cred if this performs poorly */
+		info->si_uid = user_ns_map_uid(current_user_ns(),
+			__task_cred(current->parent),
+			task_uid(current->parent));
 	}
 
 	/* If the (new) signal is now blocked, requeue it.  */
@@ -2552,6 +2566,10 @@ SYSCALL_DEFINE2(rt_sigpending, sigset_t __user *, set, size_t, sigsetsize)
 
 #ifndef HAVE_ARCH_COPY_SIGINFO_TO_USER
 
+/*
+ * send_signal has converted the sender's uid to the receiver
+ * task user namespace, so no need to convert here
+ */
 int copy_siginfo_to_user(siginfo_t __user *to, siginfo_t *from)
 {
 	int err;
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 15/15] make kernel/signal.c user ns safe (v2)
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
                   ` (9 preceding siblings ...)
  2011-09-02 19:56 ` [PATCH 14/15] net: pass user_ns to cap_netlink_recv() Serge Hallyn
@ 2011-09-02 19:56 ` Serge Hallyn
  2011-09-13 14:43 ` user namespaces v3: continue targetting capabilities Serge E. Hallyn
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge Hallyn

From: Serge Hallyn <serge.hallyn@canonical.com>

Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
---
 kernel/signal.c |   26 ++++++++++++++++++++++----
 1 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 291c970..c07b970 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -27,6 +27,7 @@
 #include <linux/capability.h>
 #include <linux/freezer.h>
 #include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
 #include <linux/nsproxy.h>
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -1073,7 +1074,8 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 			q->info.si_code = SI_USER;
 			q->info.si_pid = task_tgid_nr_ns(current,
 							task_active_pid_ns(t));
-			q->info.si_uid = current_uid();
+			q->info.si_uid = user_ns_map_uid(task_cred_xxx(t, user_ns),
+					current_cred(), current_uid());
 			break;
 		case (unsigned long) SEND_SIG_PRIV:
 			q->info.si_signo = sig;
@@ -1363,6 +1365,11 @@ int kill_pid_info_as_uid(int sig, struct siginfo *info, struct pid *pid,
 		goto out_unlock;
 	}
 	pcred = __task_cred(p);
+	/*
+	 * this is called (only) from drivers/usb/core/devio.c.
+	 * Do we need to add user_ns to urb and usb_device, so
+	 * we can pass them in here?
+	 */
 	if (si_fromuser(info) &&
 	    euid != pcred->suid && euid != pcred->uid &&
 	    uid  != pcred->suid && uid  != pcred->uid) {
@@ -1618,7 +1625,8 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	 */
 	rcu_read_lock();
 	info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns);
-	info.si_uid = __task_cred(tsk)->uid;
+	info.si_uid = user_ns_map_uid(task_cred_xxx(tsk->parent, user_ns),
+				      __task_cred(tsk), __task_cred(tsk)->uid);
 	rcu_read_unlock();
 
 	info.si_utime = cputime_to_clock_t(cputime_add(tsk->utime,
@@ -1688,6 +1696,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	unsigned long flags;
 	struct task_struct *parent;
 	struct sighand_struct *sighand;
+	const struct cred *cred;
 
 	if (for_ptracer) {
 		parent = tsk->parent;
@@ -1703,7 +1712,9 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	 */
 	rcu_read_lock();
 	info.si_pid = task_pid_nr_ns(tsk, parent->nsproxy->pid_ns);
-	info.si_uid = __task_cred(tsk)->uid;
+	cred = __task_cred(tsk);
+	info.si_uid = user_ns_map_uid(task_cred_xxx(parent, user_ns),
+				cred, cred->uid);
 	rcu_read_unlock();
 
 	info.si_utime = cputime_to_clock_t(tsk->utime);
@@ -2122,7 +2133,10 @@ static int ptrace_signal(int signr, siginfo_t *info,
 		info->si_errno = 0;
 		info->si_code = SI_USER;
 		info->si_pid = task_pid_vnr(current->parent);
-		info->si_uid = task_uid(current->parent);
+		/* we can cache cred if this performs poorly */
+		info->si_uid = user_ns_map_uid(current_user_ns(),
+			__task_cred(current->parent),
+			task_uid(current->parent));
 	}
 
 	/* If the (new) signal is now blocked, requeue it.  */
@@ -2552,6 +2566,10 @@ SYSCALL_DEFINE2(rt_sigpending, sigset_t __user *, set, size_t, sigsetsize)
 
 #ifndef HAVE_ARCH_COPY_SIGINFO_TO_USER
 
+/*
+ * send_signal has converted the sender's uid to the receiver
+ * task user namespace, so no need to convert here
+ */
 int copy_siginfo_to_user(siginfo_t __user *to, siginfo_t *from)
 {
 	int err;
-- 
1.7.5.4
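
To make the si_uid conversion above a bit more concrete, here is a small
userspace model of mapping a sender's uid into the receiver's user
namespace.  It is a sketch only: the model_* names are hypothetical, and
the mapping rules (same namespace keeps the uid, the creator of the
target namespace or of one of its ancestors shows up as 0, everything
else becomes the overflow uid) are my assumption of how a helper like
user_ns_map_uid() could behave, not something this patch spells out.

```c
#include <stddef.h>

#define MODEL_OVERFLOW_UID 65534u	/* the kernel's default overflowuid */

/* Hypothetical stand-in for struct user_namespace. */
struct model_user_ns {
	const struct model_user_ns *parent;	/* NULL for the initial ns */
	unsigned int creator_uid;		/* uid, in parent, of the creator */
};

/* Hypothetical stand-in for the sender's credentials. */
struct model_cred {
	const struct model_user_ns *ns;	/* namespace the sender runs in */
	unsigned int uid;		/* uid as seen inside ns */
};

/* Return the uid that a receiver living in `to` should see for `cred`. */
unsigned int model_map_uid(const struct model_user_ns *to,
			   const struct model_cred *cred, unsigned int uid)
{
	const struct model_user_ns *ns;

	/* Sender and receiver share a namespace: nothing to translate. */
	if (to == cred->ns)
		return uid;

	/* Did the sender create `to` or one of its ancestors? */
	for (ns = to; ns->parent != NULL; ns = ns->parent) {
		if (ns->parent == cred->ns && ns->creator_uid == cred->uid)
			return 0;
	}

	/* No useful relationship, so no meaningful mapping. */
	return MODEL_OVERFLOW_UID;
}
```

Under these assumptions, uid 500 in the initial namespace signalling a
task in the namespace it created is reported as uid 0, while an
unrelated uid 1000 is reported as 65534.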


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* missing [PATCH 01/15]
       [not found]   ` <1314993400-6910-3-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
@ 2011-09-02 23:49     ` Eric W. Biederman
  0 siblings, 0 replies; 69+ messages in thread
From: Eric W. Biederman @ 2011-09-02 23:49 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, rdunlap-/UHa2rfvQTnk1uMJSBkQmQ



Was this blank email supposed to be patch 01/15?

Eric

^ permalink raw reply	[flat|nested] 69+ messages in thread

* missing [PATCH 01/15]
  2011-09-02 19:56 ` Serge Hallyn
       [not found]   ` <1314993400-6910-3-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
@ 2011-09-02 23:49   ` Eric W. Biederman
       [not found]     ` <m11uvyld2d.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2011-09-03  1:09     ` Serge E. Hallyn
  1 sibling, 2 replies; 69+ messages in thread
From: Eric W. Biederman @ 2011-09-02 23:49 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: akpm, segooon, linux-kernel, netdev, containers, dhowells, rdunlap



Was this blank email supposed to be patch 01/15?

Eric





^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: missing [PATCH 01/15]
       [not found]     ` <m11uvyld2d.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2011-09-03  1:09       ` Serge E. Hallyn
  0 siblings, 0 replies; 69+ messages in thread
From: Serge E. Hallyn @ 2011-09-03  1:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, rdunlap-/UHa2rfvQTnk1uMJSBkQmQ

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> 
> 
> Was this blank email supposed to be patch 01/15?

Nope, that was <grr> a git-send-email misfire.  Sorry about that.  Patch
#1 did go through, here: https://lkml.org/lkml/2011/9/2/314

I'm appending it here again too for easier feedback.

thanks,
-serge

 Subject: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)

Quoting David Howells (dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org):
> Randy Dunlap <rdunlap-/UHa2rfvQTnk1uMJSBkQmQ@public.gmane.org> wrote:
>
> > > +Any task in or resource belonging to the initial user namespace will, to this
> > > +new task, appear to belong to UID and GID -1 - which is usually known as
> >
> > that extra hyphen is confusing.  how about:
> >
> >                               to UID and GID -1, which is
>
> 'which are'.
>
> David

This will hold some info about the design.  Currently it contains
future todos, issues and questions.

Changelog:
   jul 26: incorporate feedback from David Howells.
   jul 29: incorporate feedback from Randy Dunlap.

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: David Howells <dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Randy Dunlap <rdunlap-/UHa2rfvQTnk1uMJSBkQmQ@public.gmane.org>
---
 Documentation/namespaces/user_namespace.txt |  107 +++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/namespaces/user_namespace.txt

diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
new file mode 100644
index 0000000..b0bc480
--- /dev/null
+++ b/Documentation/namespaces/user_namespace.txt
@@ -0,0 +1,107 @@
+Description
+===========
+
+Traditionally, each task is owned by a user ID (UID) and belongs to one or more
+groups (GID).  Both are simple numeric IDs, though userspace usually translates
+them to names.  The user namespace allows tasks to have different views of the
+UIDs and GIDs associated with tasks and other resources.  (See 'UID mapping'
+below for more.)
+
+The user namespace is a simple hierarchical one.  The system starts with all
+tasks belonging to the initial user namespace.  A task creates a new user
+namespace by passing the CLONE_NEWUSER flag to clone(2).  This requires the
+creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities,
+but it does not need to be running as root.  The clone(2) call will result in a
+new task which to itself appears to be running as UID and GID 0, but to its
+creator seems to have the creator's credentials.
+
+To this new task, any resource belonging to the initial user namespace will
+appear to belong to user and group 'nobody', which are UID and GID -1.
+Permission to open such files will be granted according to world access
+permissions.  UID comparisons and group membership checks will return false,
+and privilege will be denied.
+
+When a task belonging to (for example) userid 500 in the initial user namespace
+creates a new user namespace, even though the new task will see itself as
+belonging to UID 0, any task in the initial user namespace will see it as
+belonging to UID 500.  Therefore, UID 500 in the initial user namespace will be
+able to kill the new task.  Files created by the new user will (eventually) be
+seen by tasks in its own user namespace as belonging to UID 0, but to tasks in
+the initial user namespace as belonging to UID 500.
+
+Note that this userid mapping for the VFS is not yet implemented, though the
+lkml and containers mailing list archives will show several previous
+prototypes.  In the end, those got hung up waiting for the concept of targeted
+capabilities to be developed, which, thanks to the insight of Eric Biederman,
+has now happened.
+
+Relationship between the User namespace and other namespaces
+============================================================
+
+Other namespaces, such as UTS and network, are owned by a user namespace.  When
+such a namespace is created, it is assigned to the user namespace of the task
+by which it was created.  Therefore, attempts to exercise privilege to
+resources in, for instance, a particular network namespace, can be properly
+validated by checking whether the caller has the needed privilege (e.g.
+CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace.
+This is done using the ns_capable() function.
+
+As an example, if a new task is cloned with a private user namespace but
+no private network namespace, then the task's network namespace is owned
+by the parent user namespace.  The new task has no privilege to the
+parent user namespace, so it will not be able to create or configure
+network devices.  If, instead, the task were cloned with both private
+user and network namespaces, then the private network namespace is owned
+by the private user namespace, and so root in the new user namespace
+will have privilege targeted to the network namespace.  It will be able
+to create and configure network devices.
+
+UID Mapping
+===========
+The current plan (see 'flexible UID mapping' at
+https://wiki.ubuntu.com/UserNamespace) is:
+
+The UID/GID stored on disk will be that in the init_user_ns.  Most likely
+UID/GID in other namespaces will be stored in xattrs.  But Eric was advocating
+(a few years ago) leaving the details up to filesystems while providing a lib/
+stock implementation.  See the thread around here:
+http://www.mail-archive.com/devel@openvz.org/msg09331.html
+
+
+Working notes
+=============
+Capability checks for actions related to syslog must be against the
+init_user_ns until syslog is containerized.
+
+Same is true for reboot and power, control groups, devices, and time.
+
+Perf actions (kernel/event/core.c for instance) will always be constrained to
+init_user_ns.
+
+Q:
+Is accounting considered properly containerized with respect to pidns?  (it
+appears to be).  If so, then we can change the capable() check in
+kernel/acct.c to 'ns_capable(current_pid_ns()->user_ns, CAP_PACCT)'
+
+Q:
+For things like nice and schedaffinity, we could allow root in a container to
+control those, and leave only cgroups to constrain the container.  I'm not sure
+whether that is right, or whether it violates admin expectations.
+
+I deferred some of commoncap.c.  I'm punting on xattr stuff as they take
+dentries, not inodes.
+
+For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of
+them) target the capability checks at the user_ns owning the tty.  That will
+have to wait until we get userns owning files straightened out.
+
+We need to figure out how to label devices.  Should we just toss a user_ns
+right into struct device?
+
+capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless
+LSMs are some day containerized, which has a near-zero chance of happening.
+
+inode_owner_or_capable() should probably take an optional ns and cap parameter.
+If cap is 0, then CAP_FOWNER is checked.  If ns is NULL, we derive the ns from
+inode.  But if ns is provided, then callers who need to derive
+inode_userns(inode) anyway can save a few cycles.
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: missing [PATCH 01/15]
  2011-09-02 23:49   ` Eric W. Biederman
       [not found]     ` <m11uvyld2d.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2011-09-03  1:09     ` Serge E. Hallyn
  1 sibling, 0 replies; 69+ messages in thread
From: Serge E. Hallyn @ 2011-09-03  1:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: akpm, segooon, linux-kernel, netdev, containers, dhowells, rdunlap

Quoting Eric W. Biederman (ebiederm@xmission.com):
> 
> 
> Was this blank email supposed to be patch 01/15?

Nope, that was <grr> a git-send-email misfire.  Sorry about that.  The
patch #1 did go through, here: https://lkml.org/lkml/2011/9/2/314

I'm appending it here again too for easier feedback.

thanks,
-serge

 Subject: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)

Quoting David Howells (dhowells@redhat.com):
> Randy Dunlap <rdunlap@xenotime.net> wrote:
>
> > > +Any task in or resource belonging to the initial user namespace will, to this
> > > +new task, appear to belong to UID and GID -1 - which is usually known as
> >
> > that extra hyphen is confusing.  how about:
> >
> >                               to UID and GID -1, which is
>
> 'which are'.
>
> David

This will hold some info about the design.  Currently it contains
future todos, issues and questions.

Changelog:
   jul 26: incorporate feedback from David Howells.
   jul 29: incorporate feedback from Randy Dunlap.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
---
 Documentation/namespaces/user_namespace.txt |  107 +++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/namespaces/user_namespace.txt

diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
new file mode 100644
index 0000000..b0bc480
--- /dev/null
+++ b/Documentation/namespaces/user_namespace.txt
@@ -0,0 +1,107 @@
+Description
+===========
+
+Traditionally, each task is owned by a user ID (UID) and belongs to one or more
+groups (GID).  Both are simple numeric IDs, though userspace usually translates
+them to names.  The user namespace allows tasks to have different views of the
+UIDs and GIDs associated with tasks and other resources.  (See 'UID mapping'
+below for more.)
+
+The user namespace is a simple hierarchical one.  The system starts with all
+tasks belonging to the initial user namespace.  A task creates a new user
+namespace by passing the CLONE_NEWUSER flag to clone(2).  This requires the
+creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities,
+but it does not need to be running as root.  The clone(2) call will result in a
+new task which to itself appears to be running as UID and GID 0, but to its
+creator seems to have the creator's credentials.
+
+To this new task, any resource belonging to the initial user namespace will
+appear to belong to user and group 'nobody', which are UID and GID -1.
+Permission to open such files will be granted according to world access
+permissions.  UID comparisons and group membership checks will return false,
+and privilege will be denied.
+
+When a task belonging to (for example) userid 500 in the initial user namespace
+creates a new user namespace, even though the new task will see itself as
+belonging to UID 0, any task in the initial user namespace will see it as
+belonging to UID 500.  Therefore, UID 500 in the initial user namespace will be
+able to kill the new task.  Files created by the new user will (eventually) be
+seen by tasks in its own user namespace as belonging to UID 0, but to tasks in
+the initial user namespace as belonging to UID 500.
+
+Note that this userid mapping for the VFS is not yet implemented, though the
+lkml and containers mailing list archives will show several previous
+prototypes.  In the end, those got hung up waiting on the concept of targeted
+capabilities to be developed, which, thanks to the insight of Eric Biederman,
+they finally did.
+
+Relationship between the User namespace and other namespaces
+============================================================
+
+Other namespaces, such as UTS and network, are owned by a user namespace.  When
+such a namespace is created, it is assigned to the user namespace of the task
+by which it was created.  Therefore, attempts to exercise privilege to
+resources in, for instance, a particular network namespace, can be properly
+validated by checking whether the caller has the needed privilege (i.e.
+CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace.
+This is done using the ns_capable() function.
+
+As an example, if a new task is cloned with a private user namespace but
+no private network namespace, then the task's network namespace is owned
+by the parent user namespace.  The new task has no privilege to the
+parent user namespace, so it will not be able to create or configure
+network devices.  If, instead, the task were cloned with both private
+user and network namespaces, then the private network namespace is owned
+by the private user namespace, and so root in the new user namespace
+will have privilege targeted to the network namespace.  It will be able
+to create and configure network devices.
+
+UID Mapping
+===========
+The current plan (see 'flexible UID mapping' at
+https://wiki.ubuntu.com/UserNamespace) is:
+
+The UID/GID stored on disk will be that in the init_user_ns.  Most likely
+UID/GID in other namespaces will be stored in xattrs.  But Eric was advocating
+(a few years ago) leaving the details up to filesystems while providing a lib/
+stock implementation.  See the thread around here:
+http://www.mail-archive.com/devel@openvz.org/msg09331.html
+
+
+Working notes
+=============
+Capability checks for actions related to syslog must be against the
+init_user_ns until syslog is containerized.
+
+Same is true for reboot and power, control groups, devices, and time.
+
+Perf actions (kernel/event/core.c for instance) will always be constrained to
+init_user_ns.
+
+Q:
+Is accounting considered properly containerized with respect to pidns?  (it
+appears to be).  If so, then we can change the capable() check in
+kernel/acct.c to 'ns_capable(current_pid_ns()->user_ns, CAP_PACCT)'
+
+Q:
+For things like nice and schedaffinity, we could allow root in a container to
+control those, and leave only cgroups to constrain the container.  I'm not sure
+whether that is right, or whether it violates admin expectations.
+
+I deferred some of commoncap.c.  I'm punting on xattr stuff as they take
+dentries, not inodes.
+
+For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of
+them) target the capability checks at the user_ns owning the tty.  That will
+have to wait until we get userns owning files straightened out.
+
+We need to figure out how to label devices.  Should we just toss a user_ns
+right into struct device?
+
+capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless
+LSMs are some day containerized, which has a near-zero chance of happening.
+
+inode_owner_or_capable() should probably take an optional ns and cap parameter.
+If cap is 0, then CAP_FOWNER is checked.  If ns is NULL, we derive the ns from
+inode.  But if ns is provided, then callers who need to derive
+inode_userns(inode) anyway can save a few cycles.
-- 
1.7.5.4
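
For readers who want the UID views from the Description section above in
executable form, here is a minimal userspace sketch.  It is only a model of
the documented behaviour, not kernel code: struct uidview_ns, struct
uidview_owner and uidview_uid() are hypothetical names, and the sketch assumes
the simple scheme in which every ID in a child namespace appears to its
ancestors as the creator of that namespace.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; these are NOT the kernel's structures. */
#define UIDVIEW_NOBODY (-1)	/* the 'nobody' ID mentioned in the text */

struct uidview_ns {
	const struct uidview_ns *parent;	/* NULL for the initial namespace */
	int creator_uid;			/* creator's UID as seen in the parent */
};

struct uidview_owner {
	const struct uidview_ns *ns;	/* namespace the task or file belongs to */
	int uid;			/* UID inside that namespace */
};

/*
 * Return the UID that a viewer in 'viewer_ns' sees for 'owner'.
 * Assumption: one level up, everything in a namespace appears as the
 * creator of that namespace.  Anything that does not live in viewer_ns or
 * one of its descendants appears as UIDVIEW_NOBODY.
 */
int uidview_uid(const struct uidview_ns *viewer_ns,
		const struct uidview_owner *owner)
{
	int uid;

	assert(viewer_ns != NULL && owner != NULL);
	uid = owner->uid;
	for (const struct uidview_ns *ns = owner->ns; ns != NULL; ns = ns->parent) {
		if (ns == viewer_ns)
			return uid;
		uid = ns->creator_uid;	/* step up: owner now looks like the creator */
	}
	return UIDVIEW_NOBODY;
}
```

With an initial namespace and a child namespace created by UID 500, a task
that is UID 0 in the child is reported as 0 from the child and as 500 from
the initial namespace, while a file owned by UID 500 in the initial namespace
is reported as -1 from the child.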


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 02/15] user ns: setns: move capable checks into per-ns attach helper
  2011-09-02 19:56     ` Serge Hallyn
  (?)
  (?)
@ 2011-09-04  1:51     ` Matt Helsley
       [not found]       ` <20110904015140.GB32295-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2011-09-09 14:56       ` Serge E. Hallyn
  -1 siblings, 2 replies; 69+ messages in thread
From: Matt Helsley @ 2011-09-04  1:51 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap

On Fri, Sep 02, 2011 at 07:56:27PM +0000, Serge Hallyn wrote:
> From: "Serge E. Hallyn" <serge@hallyn.com>

I was confused about this patch until I realized that you're not
simply "moving" the capability checks but "distributing" them. Then
you're showing that you'll soon change some to nsown_capable() or
ns_capable() using the strange cpp pattern in the snippet below.

At least I think that's what you intended. A commit message would
help :).

Cheers,
	-Matt Helsley

> 
> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  ipc/namespace.c          |    7 +++++++
>  kernel/fork.c            |    5 +++++
>  kernel/nsproxy.c         |   11 ++++++++---
>  kernel/utsname.c         |    7 +++++++
>  net/core/net_namespace.c |    7 +++++++
>  5 files changed, 34 insertions(+), 3 deletions(-)
> 
> diff --git a/ipc/namespace.c b/ipc/namespace.c
> index ce0a647..a0a7609 100644
> --- a/ipc/namespace.c
> +++ b/ipc/namespace.c
> @@ -163,6 +163,13 @@ static void ipcns_put(void *ns)
> 
>  static int ipcns_install(struct nsproxy *nsproxy, void *ns)
>  {
> +#if 0
> +	struct ipc_namespace *newns = ns;
> +	if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN))
> +#else
> +	if (!capable(CAP_SYS_ADMIN))
> +#endif
> +		return -1;
>  	/* Ditch state from the old ipc namespace */
>  	exit_sem(current);
>  	put_ipc_ns(nsproxy->ipc_ns);
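
To make the "distributing" point concrete, here is a small userspace sketch
of the shape I think the patch is going for: the generic attach path stops
checking privilege itself and each namespace type's install helper does its
own check.  Everything here (struct model_caller, struct model_ns_ops,
model_setns() and the two helpers) is a hypothetical stand-in, not the kernel
code, and the boolean field merely stands in for capable(CAP_SYS_ADMIN).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are NOT the kernel's structures. */
struct model_caller {
	bool cap_sys_admin;	/* stands in for capable(CAP_SYS_ADMIN) */
};

struct model_ns_ops {
	const char *name;
	int (*install)(const struct model_caller *caller);
};

/*
 * Each helper owns its privilege check, so it can later be switched to a
 * check targeted at the user namespace that owns this particular namespace.
 */
int model_ipcns_install(const struct model_caller *caller)
{
	if (!caller->cap_sys_admin)
		return -EPERM;
	/* ... ditch state from the old ipc namespace, install the new one ... */
	return 0;
}

int model_utsns_install(const struct model_caller *caller)
{
	if (!caller->cap_sys_admin)
		return -EPERM;
	/* ... install the new uts namespace ... */
	return 0;
}

/* The generic attach path only dispatches; it no longer checks privilege. */
int model_setns(const struct model_ns_ops *ops,
		const struct model_caller *caller)
{
	assert(ops != NULL && ops->install != NULL && caller != NULL);
	return ops->install(caller);
}
```

The sketch returns -EPERM where the quoted hunk returns -1; on Linux EPERM is
1, so the two values are the same, but the named constant is easier to read.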

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
  2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn
@ 2011-09-07 22:50   ` Andrew Morton
       [not found]     ` <20110907155024.42e3fe27.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2011-09-09 13:10     ` Serge E. Hallyn
       [not found]   ` <1314993400-6910-4-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
  2011-09-26 19:17     ` [kernel-hardening] " Vasiliy Kulikov
  2 siblings, 2 replies; 69+ messages in thread
From: Andrew Morton @ 2011-09-07 22:50 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: segooon, linux-kernel, netdev, containers, dhowells, ebiederm,
	rdunlap, Serge E. Hallyn

On Fri,  2 Sep 2011 19:56:26 +0000
Serge Hallyn <serge@hallyn.com> wrote:

> +Note that this userid mapping for the VFS is not yet implemented, though the
> +lkml and containers mailing list archives will show several previous
> +prototypes.  In the end, those got hung up waiting on the concept of targeted
> +capabilities to be developed, which, thanks to the insight of Eric Biederman,
> +they finally did.

not-yet-implemented things worry me.  When can we expect this to
happen, and how big and ugly will it be?

I'm not seeing many (any) reviewed-by's on these patches.  I could get
down and stare at them myself, but that wouldn't be very useful.  This
work goes pretty deep and is quite security-affecting.  And network-affecting.
Can you round up some suitable people and get the reviewing and testing happening
please?


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
  2011-09-07 22:50   ` Andrew Morton
       [not found]     ` <20110907155024.42e3fe27.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2011-09-09 13:10     ` Serge E. Hallyn
  1 sibling, 0 replies; 69+ messages in thread
From: Serge E. Hallyn @ 2011-09-09 13:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Serge Hallyn, segooon, linux-kernel, netdev, containers,
	dhowells, ebiederm, rdunlap

Quoting Andrew Morton (akpm@linux-foundation.org):
> On Fri,  2 Sep 2011 19:56:26 +0000
> Serge Hallyn <serge@hallyn.com> wrote:
> 
> > +Note that this userid mapping for the VFS is not yet implemented, though the
> > +lkml and containers mailing list archives will show several previous
> > +prototypes.  In the end, those got hung up waiting on the concept of targeted
> > +capabilities to be developed, which, thanks to the insight of Eric Biederman,
> > +they finally did.
> 
> not-yet-implemented things worry me.  When can we expect this to
> happen, and how big and ugly will it be?

Hi Andrew,

We did a proof of concept of the simplest version of this in early August
(see git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-userns-devel.git)
which actually was very un-scary.  So technically we could push it at the
same time as this set, but I thought that might just be too much for
review in one cycle.  That set (Eric's) is the very simplest approach
which tags an entire filesystem with a user namespace.

We would also want to pursue the more baroque approach, where filesystems
themselves are user-namespace aware.  I did an approach like that in
2008, see
https://lists.linux-foundation.org/pipermail/containers/2008-August/012679.html
It again is very do-able without being ugly, but, importantly, user
namespaces are usable for containers without that.  For starters, we only
need /proc and /sys to be user namespace aware (since they must allow
access from multiple namespaces), and that is simple as they are not
persistent.

So I believe that this is the last scary patchset, and that user
namespaces could actually be usable by the end of the year!

> I'm not seeing many (any) reviewed-by's on these patches.  I could get
> down and stare at them myself, but that wouldn't be very useful.  This
> work goes pretty deep and is quite security-affecting.  And network-affecting.
> Can you round up some suitable people and get the reviewing and testing happening
> please?

Will try.  Unfortunately I missed my chance to beg and bribe people in
person at plumbers :(

thanks,
-serge

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 02/15] user ns: setns: move capable checks into per-ns attach helper
  2011-09-04  1:51     ` Matt Helsley
       [not found]       ` <20110904015140.GB32295-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-09-09 14:56       ` Serge E. Hallyn
  1 sibling, 0 replies; 69+ messages in thread
From: Serge E. Hallyn @ 2011-09-09 14:56 UTC (permalink / raw)
  To: Matt Helsley
  Cc: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap

Quoting Matt Helsley (matthltc@us.ibm.com):
> On Fri, Sep 02, 2011 at 07:56:27PM +0000, Serge Hallyn wrote:
> > From: "Serge E. Hallyn" <serge@hallyn.com>
> 
> I was confused about this patch until I realized that you're not
> simply "moving" the capability checks but "distributing" them. Then
> you're showing that you'll soon change some to nsown_capable() or
> ns_capable() using the strange cpp pattern in the snippet below.
> 
> At least I think that's what you intended. A commit message would
> help :).

Yes, sorry - Eric convinced me several times to be more conservative in
the patch, and I failed to fix the commit msg when squashing the
resulting patches.  How about the following:

======

user ns: update capable calls when cloning and attaching namespaces

Distribute the capable() checks at ns attach into the namespace-specific
attach handler.

Note where the capable() checks at both namespace clone and attach will
be changed to targeted checks, but don't actually make that change
yet.  Until that trigger is pulled, you must have
the capabilities targeted toward the initial user namespace in order to
do any of these actions, meaning that a task in a child user namespace
cannot do them.  Once we pull the trigger, a task in a child user
namespace will be able to clone new namespaces if it is privileged in
its own user namespace, and attach to existing namespaces to which it
has privilege.

======
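
As a rough illustration of what "targeted" means in the message above, here
is a userspace model of a targeted capability check.  It is a sketch under
stated assumptions, not the kernel's ns_capable(): struct model_user_ns,
struct model_task and model_ns_capable() are hypothetical names, the model
assumes that a capability held in an ancestor user namespace also applies to
its descendants, and it deliberately leaves out any special treatment of the
namespace's creator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are NOT the kernel's structures. */
#define MODEL_CAP_NET_ADMIN 12	/* same number as CAP_NET_ADMIN, used as a bit index */

struct model_user_ns {
	const struct model_user_ns *parent;	/* NULL for the initial namespace */
};

struct model_task {
	const struct model_user_ns *user_ns;	/* namespace the task lives in */
	unsigned int effective;			/* effective capability bits */
};

/*
 * Does task 't' hold capability 'cap' targeted at namespace 'target'?
 * Walk from the target up through its ancestors.  If we reach the task's
 * own namespace, the task's effective set decides.  If we run out of
 * ancestors first, the target is not the task's namespace or a descendant
 * of it, so the answer is no.
 */
bool model_ns_capable(const struct model_task *t,
		      const struct model_user_ns *target, int cap)
{
	assert(t != NULL && cap >= 0 && cap < 32);
	for (const struct model_user_ns *ns = target; ns != NULL; ns = ns->parent) {
		if (ns == t->user_ns)
			return ((t->effective >> cap) & 1u) != 0;
	}
	return false;
}
```

In this model, root in a child user namespace passes the check for resources
owned by its own namespace but fails it for resources owned by the initial
namespace, which is the behaviour the commit message describes for the time
after the checks become targeted.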

Thanks for taking a look, Matt!

-serge

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: user namespaces v3: continue targetting capabilities
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
                   ` (10 preceding siblings ...)
  2011-09-02 19:56 ` [PATCH 15/15] make kernel/signal.c user ns safe (v2) Serge Hallyn
@ 2011-09-13 14:43 ` Serge E. Hallyn
  11 siblings, 0 replies; 69+ messages in thread
From: Serge E. Hallyn @ 2011-09-13 14:43 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap

I did a bit of basic performance testing - just running unixbench
and doing a kernel compile (without profiling) with and without
this patchset, with USER_NS enabled for both.  I could find no
meaningful impact.  Kernel compile timings without the patchset:

473.01user 32.48system 9:05.44elapsed 92%CPU (0avgtext+0avgdata 430752maxresident)k
112736inputs+576936outputs (8major+22057422minor)pagefaults 0swaps
473.78user 33.12system 9:06.14elapsed 92%CPU (0avgtext+0avgdata 430752maxresident)k
116656inputs+576936outputs (12major+22056621minor)pagefaults 0swaps

and with:
474.09user 31.62system 9:05.70elapsed 92%CPU (0avgtext+0avgdata 430752maxresident)k
112648inputs+576936outputs (7major+22056909minor)pagefaults 0swaps
472.54user 33.26system 9:05.43elapsed 92%CPU (0avgtext+0avgdata 430608maxresident)k
116656inputs+576936outputs (12major+22058358minor)pagefaults 0swaps

I'll append the full unixbench outputs below, but index score without
the patchset was 1594.3, and with the patchset was 1597.4.
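
For anyone wanting to check the arithmetic, a small sketch follows.  The
per-test index values in the tables below are consistent with result divided
by baseline, times ten; index_value() and percent_change() are just
illustrative helper names, not part of unixbench.

```c
#include <assert.h>

/* Per-test index: result relative to the baseline, scaled by 10. */
double index_value(double baseline, double result)
{
	assert(baseline > 0.0);
	return result / baseline * 10.0;
}

/* Relative change from 'before' to 'after', in percent. */
double percent_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

For example, index_value(116700.0, 28147322.1) reproduces the Dhrystone index
of about 2411.9, and percent_change(1594.3, 1597.4) comes to roughly 0.19
percent, i.e. the two overall scores differ by about a fifth of a percent.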

thanks,
-serge

=====================================================================
unixbench without patchset:
=====================================================================
   BYTE UNIX Benchmarks (Version 5.1.3)

   System: marula: GNU/Linux
   OS: GNU/Linux -- 3.0.0-11-server -- #17-Ubuntu SMP Fri Sep 9 19:31:36 UTC 2011
   Machine: x86_64 (x86_64)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   CPU 0: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz (4800.3 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   02:43:44 up  3:00,  1 user,  load average: 0.05, 0.03, 0.03; runlevel 2

------------------------------------------------------------------------
Benchmark Run: Mon Sep 12 2011 02:43:44 - 03:11:55
1 CPU in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       28147322.1 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     3289.7 MWIPS (10.0 s, 7 samples)
Execl Throughput                               4557.5 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks       1145450.6 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          312941.7 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       1969030.8 KBps  (30.0 s, 2 samples)
Pipe Throughput                             2080076.5 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 331910.6 lps   (10.0 s, 7 samples)
Process Creation                              14921.7 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   6989.7 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    913.9 lpm   (60.0 s, 2 samples)
System Call Overhead                        3453367.4 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   28147322.1   2411.9
Double-Precision Whetstone                       55.0       3289.7    598.1
Execl Throughput                                 43.0       4557.5   1059.9
File Copy 1024 bufsize 2000 maxblocks          3960.0    1145450.6   2892.6
File Copy 256 bufsize 500 maxblocks            1655.0     312941.7   1890.9
File Copy 4096 bufsize 8000 maxblocks          5800.0    1969030.8   3394.9
Pipe Throughput                               12440.0    2080076.5   1672.1
Pipe-based Context Switching                   4000.0     331910.6    829.8
Process Creation                                126.0      14921.7   1184.3
Shell Scripts (1 concurrent)                     42.4       6989.7   1648.5
Shell Scripts (8 concurrent)                      6.0        913.9   1523.2
System Call Overhead                          15000.0    3453367.4   2302.2
                                                                   ========
System Benchmarks Index Score                                        1594.3

=====================================================================
unixbench with patchset:
=====================================================================
   BYTE UNIX Benchmarks (Version 5.1.3)

   System: marula: GNU/Linux
   OS: GNU/Linux -- 3.0.0-11-server -- #17userns1 SMP Mon Sep 12 13:42:40 UTC 2011
   Machine: x86_64 (x86_64)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   CPU 0: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz (4799.6 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   12:42:07 up 8 min,  1 user,  load average: 0.00, 0.01, 0.02; runlevel 2

------------------------------------------------------------------------
Benchmark Run: Mon Sep 12 2011 12:42:07 - 13:10:19
1 CPU in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       28232156.4 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     3290.0 MWIPS (10.0 s, 7 samples)
Execl Throughput                               4553.7 lps   (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks       1142317.5 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          317068.8 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       1956611.4 KBps  (30.0 s, 2 samples)
Pipe Throughput                             2086728.8 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 343275.1 lps   (10.0 s, 7 samples)
Process Creation                              14718.6 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   6989.0 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    913.6 lpm   (60.0 s, 2 samples)
System Call Overhead                        3434956.0 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   28232156.4   2419.2
Double-Precision Whetstone                       55.0       3290.0    598.2
Execl Throughput                                 43.0       4553.7   1059.0
File Copy 1024 bufsize 2000 maxblocks          3960.0    1142317.5   2884.6
File Copy 256 bufsize 500 maxblocks            1655.0     317068.8   1915.8
File Copy 4096 bufsize 8000 maxblocks          5800.0    1956611.4   3373.5
Pipe Throughput                               12440.0    2086728.8   1677.4
Pipe-based Context Switching                   4000.0     343275.1    858.2
Process Creation                                126.0      14718.6   1168.1
Shell Scripts (1 concurrent)                     42.4       6989.0   1648.3
Shell Scripts (8 concurrent)                      6.0        913.6   1522.6
System Call Overhead                          15000.0    3434956.0   2290.0
                                                                   ========
System Benchmarks Index Score                                        1597.4
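
For anyone comparing the two runs: judging from the numbers above, each
INDEX value is RESULT / BASELINE * 10, and the final score matches the
geometric mean of the twelve indices.  Below is a minimal C sketch that
recomputes the score of the with-patchset run.  The function names are
made up for illustration, and the n-th root is found by bisection only so
that the sketch does not need libm.

```c
#include <assert.h>
#include <stddef.h>

/* Index for one test: result scaled against the baseline, times 10. */
static double ub_index(double baseline, double result)
{
	assert(baseline > 0.0);
	return result / baseline * 10.0;
}

/* n-th root of a positive value by bisection (no libm required). */
static double nth_root(double value, int n)
{
	double lo = 0.0, hi = value > 1.0 ? value : 1.0;

	assert(value > 0.0 && n > 0);
	for (int iter = 0; iter < 200; iter++) {
		double mid = 0.5 * (lo + hi), p = 1.0;

		/* mid^n may overflow to +inf, which still compares correctly */
		for (int k = 0; k < n; k++)
			p *= mid;
		if (p < value)
			lo = mid;
		else
			hi = mid;
	}
	return 0.5 * (lo + hi);
}

/* Overall score: geometric mean of the per-test indices. */
static double ub_score(const double *baseline, const double *result, size_t n)
{
	double product = 1.0;

	for (size_t i = 0; i < n; i++)
		product *= ub_index(baseline[i], result[i]);
	return nth_root(product, (int)n);
}

/* BASELINE and RESULT columns of the with-patchset table above. */
static const double baseline[] = {
	116700.0, 55.0, 43.0, 3960.0, 1655.0, 5800.0,
	12440.0, 4000.0, 126.0, 42.4, 6.0, 15000.0,
};
static const double with_patchset[] = {
	28232156.4, 3290.0, 4553.7, 1142317.5, 317068.8, 1956611.4,
	2086728.8, 343275.1, 14718.6, 6989.0, 913.6, 3434956.0,
};
```

Calling ub_score(baseline, with_patchset, 12) should land on the reported
1597.4 to within rounding.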


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
  2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn
@ 2011-09-26 19:17     ` Vasiliy Kulikov
       [not found]   ` <1314993400-6910-4-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
  2011-09-26 19:17     ` [kernel-hardening] " Vasiliy Kulikov
  2 siblings, 0 replies; 69+ messages in thread
From: Vasiliy Kulikov @ 2011-09-26 19:17 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: akpm, linux-kernel, netdev, containers, dhowells, ebiederm,
	rdunlap, Serge E. Hallyn, kernel-hardening

(cc'ed kernel-hardening)


Hi Serge,

I haven't studied the patches deeply yet (sorry!), but I have some
long-term questions about the technique in general.  I couldn't find
answers to these questions in the documentation.

First, the patches by design expose much kernel code to unprivileged
userspace processes.  This code doesn't expect malformed data (e.g. VFS,
specific filesystems, block layer, char drivers, sysadmin part of LSMs,
etc. etc.).  By relaxing permission rules you greatly increase attack
surface of the kernel from unprivileged users.  Are you (or somebody
else) planning to audit this code?

Also, will it be possible to somehow restrict what specific kernel
facilities are accessible from users (IOW, what root emulation
limitations are in effect)?  It would be useful both for the sysadmin,
who might not want to allow users to do such things, and from the
security POV in the sense of attack surface reduction.

The patches explicitly enable some features for users on a whitelist
basis.  That is possible for simple cases, but what are you going
to do with multiplexing functions where there is a permission check
before the actual multiplexing (FS, networking drivers, etc.)?  Are you
going to do the same thing net_namespace does, i.e. for each multiplexed
entity create a bool ->ns_aware which is false by default for all
"untrusted"/unprepared protocols and true for audited/prepared
protocols?  Or do you have something else in mind?
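
To make the question concrete, here is a minimal userspace sketch of the
opt-in pattern, assuming a hypothetical per-handler ns_aware flag.  All
struct and function names below are invented for illustration and are not
kernel APIs.  The permission check in front of the multiplexer lets a
container's root through only for handlers that have been marked as
audited; everything else still needs privilege in the initial namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical caller model: where does this task hold the capability? */
struct caller {
	bool cap_in_init_ns;	/* what an untargeted capable() would test */
	bool cap_in_own_ns;	/* what a targeted ns_capable() would test */
};

/* Hypothetical multiplexed entity, e.g. one protocol behind a socket call. */
struct proto_handler {
	const char *name;
	bool ns_aware;		/* stays false until the handler is audited */
};

static const struct proto_handler handlers[] = {
	{ "audited_proto", true  },
	{ "legacy_proto",  false },
};

/* Look a handler up by name; returns NULL if it is not registered. */
static const struct proto_handler *find_handler(const char *name)
{
	for (size_t i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++)
		if (strcmp(handlers[i].name, name) == 0)
			return &handlers[i];
	return NULL;
}

/* Permission check performed before dispatching to the handler. */
static bool may_use(const struct caller *c, const struct proto_handler *h)
{
	assert(c && h);
	if (c->cap_in_init_ns)
		return true;	/* privileged in init: may use everything */
	/* Otherwise only audited handlers honour namespace-local privilege. */
	return h->ns_aware && c->cap_in_own_ns;
}
```

With this layout a new or unaudited handler is denied to container root
by default, and enabling it is a one-line change per handler.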

Thanks,

On Fri, Sep 02, 2011 at 19:56 +0000, Serge Hallyn wrote:
> From: "Serge E. Hallyn" <serge@hallyn.com>
> 
> Quoting David Howells (dhowells@redhat.com):
> > Randy Dunlap <rdunlap@xenotime.net> wrote:
> >
> > > > +Any task in or resource belonging to the initial user namespace will, to this
> > > > +new task, appear to belong to UID and GID -1 - which is usually known as
> > >
> > > that extra hyphen is confusing.  how about:
> > >
> > >                               to UID and GID -1, which is
> >
> > 'which are'.
> >
> > David
> 
> This will hold some info about the design.  Currently it contains
> future todos, issues and questions.
> 
> Changelog:
>    jul 26: incorporate feedback from David Howells.
>    jul 29: incorporate feedback from Randy Dunlap.
> 
> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Randy Dunlap <rdunlap@xenotime.net>
> ---
>  Documentation/namespaces/user_namespace.txt |  107 +++++++++++++++++++++++++++
>  1 files changed, 107 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/namespaces/user_namespace.txt
> 
> diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
> new file mode 100644
> index 0000000..b0bc480
> --- /dev/null
> +++ b/Documentation/namespaces/user_namespace.txt
> @@ -0,0 +1,107 @@
> +Description
> +===========
> +
> +Traditionally, each task is owned by a user ID (UID) and belongs to one or more
> +groups (GID).  Both are simple numeric IDs, though userspace usually translates
> +them to names.  The user namespace allows tasks to have different views of the
> +UIDs and GIDs associated with tasks and other resources.  (See 'UID mapping'
> +below for more.)
> +
> +The user namespace is a simple hierarchical one.  The system starts with all
> +tasks belonging to the initial user namespace.  A task creates a new user
> +namespace by passing the CLONE_NEWUSER flag to clone(2).  This requires the
> +creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities,
> +but it does not need to be running as root.  The clone(2) call will result in a
> +new task which to itself appears to be running as UID and GID 0, but to its
> +creator seems to have the creator's credentials.
> +
> +To this new task, any resource belonging to the initial user namespace will
> +appear to belong to user and group 'nobody', which are UID and GID -1.
> +Permission to open such files will be granted according to world access
> +permissions.  UID comparisons and group membership checks will return false,
> +and privilege will be denied.
> +
> +When a task belonging to (for example) userid 500 in the initial user namespace
> +creates a new user namespace, even though the new task will see itself as
> +belonging to UID 0, any task in the initial user namespace will see it as
> +belonging to UID 500.  Therefore, UID 500 in the initial user namespace will be
> +able to kill the new task.  Files created by the new user will (eventually) be
> +seen by tasks in its own user namespace as belonging to UID 0, but to tasks in
> +the initial user namespace as belonging to UID 500.
> +
> +Note that this userid mapping for the VFS is not yet implemented, though the
> +lkml and containers mailing list archives will show several previous
> +prototypes.  In the end, those got hung up waiting on the concept of targeted
> +capabilities to be developed, which, thanks to the insight of Eric Biederman,
> +they finally did.
> +
> +Relationship between the User namespace and other namespaces
> +============================================================
> +
> +Other namespaces, such as UTS and network, are owned by a user namespace.  When
> +such a namespace is created, it is assigned to the user namespace of the task
> +by which it was created.  Therefore, attempts to exercise privilege to
> +resources in, for instance, a particular network namespace, can be properly
> +validated by checking whether the caller has the needed privilege (i.e.
> +CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace.
> +This is done using the ns_capable() function.
> +
> +As an example, if a new task is cloned with a private user namespace but
> +no private network namespace, then the task's network namespace is owned
> +by the parent user namespace.  The new task has no privilege to the
> +parent user namespace, so it will not be able to create or configure
> +network devices.  If, instead, the task were cloned with both private
> +user and network namespaces, then the private network namespace is owned
> +by the private user namespace, and so root in the new user namespace
> +will have privilege targeted to the network namespace.  It will be able
> +to create and configure network devices.
> +
> +UID Mapping
> +===========
> +The current plan (see 'flexible UID mapping' at
> +https://wiki.ubuntu.com/UserNamespace) is:
> +
> +The UID/GID stored on disk will be that in the init_user_ns.  Most likely
> +UID/GID in other namespaces will be stored in xattrs.  But Eric was advocating
> +(a few years ago) leaving the details up to filesystems while providing a lib/
> +stock implementation.  See the thread around here:
> +http://www.mail-archive.com/devel@openvz.org/msg09331.html
> +
> +
> +Working notes
> +=============
> +Capability checks for actions related to syslog must be against the
> +init_user_ns until syslog is containerized.
> +
> +Same is true for reboot and power, control groups, devices, and time.
> +
> +Perf actions (kernel/event/core.c for instance) will always be constrained to
> +init_user_ns.
> +
> +Q:
> +Is accounting considered properly containerized with respect to pidns?  (it
> +appears to be).  If so, then we can change the capable() check in
> +kernel/acct.c to 'ns_capable(current_pid_ns()->user_ns, CAP_PACCT)'
> +
> +Q:
> +For things like nice and schedaffinity, we could allow root in a container to
> +control those, and leave only cgroups to constrain the container.  I'm not sure
> +whether that is right, or whether it violates admin expectations.
> +
> +I deferred some of commoncap.c.  I'm punting on xattr stuff as they take
> +dentries, not inodes.
> +
> +For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of
> +them) target the capability checks at the user_ns owning the tty.  That will
> +have to wait until we get userns owning files straightened out.
> +
> +We need to figure out how to label devices.  Should we just toss a user_ns
> +right into struct device?
> +
> +capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless
> +some day LSMs were to be containerized, near zero chance.
> +
> +inode_owner_or_capable() should probably take an optional ns and cap parameter.
> +If cap is 0, then CAP_FOWNER is checked.  If ns is NULL, we derive the ns from
> +inode.  But if ns is provided, then callers who need to derive
> +inode_userns(inode) anyway can save a few cycles.
> -- 
> 1.7.5.4

-- 
Vasiliy

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
  2011-09-26 19:17     ` [kernel-hardening] " Vasiliy Kulikov
@ 2011-09-27 13:21       ` Serge E. Hallyn
  -1 siblings, 0 replies; 69+ messages in thread
From: Serge E. Hallyn @ 2011-09-27 13:21 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: Serge Hallyn, akpm, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap, kernel-hardening

Quoting Vasiliy Kulikov (segoon@openwall.com):
> (cc'ed kernel-hardening)
> 
> 
> Hi Serge,
> 
> I haven't studied the patches deeply yet (sorry!), but I have some
> long-term questions about the technique in general.  I couldn't find
> answers to these questions in the documentation.

Great - thanks for your time, Vasiliy.

There is documentation at https://wiki.ubuntu.com/UserNamespace,
and I was adding a Documentation/namespaces/user_namespace.txt file
(which hasn't gone in yet) which you can see here:
https://lkml.org/lkml/2011/7/26/351

But those don't answer your questions sufficiently.

> First, the patches by design expose much kernel code to unprivileged
> userspace processes.  This code doesn't expect malformed data (e.g. VFS,
> specific filesystems, block layer, char drivers, sysadmin part of LSMs,
> etc. etc.).  By relaxing permission rules you greatly increase attack
> surface of the kernel from unprivileged users.  Are you (or somebody
> else) planning to audit this code?

I had wanted to (but didn't) propose a discussion at ksummit about how
best to approach the filesystem code.  That's not even just for user
namespaces - patches have been floated in the past to make mount an
unprivileged operation depending on the FS and the user's permission
over the device and target.  So I don't know if a combination of auditing
and fuzzing is the way to go, or what, and wanted to get input from
some people who are more knowledgeable on that topic than me.

You're right about other kernel code as well.

I'll certainly join in this effort, but don't want to go blindly
charging in without some advice/guidance about the best way to do
this and, if others are interested, coordinate it.

We can start by looking through all code which is currently under
ns_capable(), and analyzing that.  But what tools do we have
available to perform the analysis?

Do you think a kernel summit discussion (I suppose, given the late
timing, a beer BoF) would be beneficial?  (I wouldn't be there.)

> Also, will it be possible to somehow restrict what specific kernel
> facilities are accessible from users (IOW, what root emulation
> limitations are in effect)?  It would be useful both for the sysadmin,
> who might not want to allow users to do such things, and from the
> security POV in the sense of attack surface reduction.

You're probably thinking along different lines, but this is why I've
been wanting seccomp2 to get pushed through.  So that we can deny a
container the syscalls we know it won't need, especially newer ones,
to reduce the attack surface available to it.

> The patches explicitly enable some features for users on a whitelist
> basis.  That is possible for simple cases, but what are you going
> to do with multiplexing functions where there is a permission check
> before the actual multiplexing (FS, networking drivers, etc.)?  Are you
> going to do the same thing net_namespace does, i.e. for each multiplexed
> entity create a bool ->ns_aware which is false by default for all
> "untrusted"/unprepared protocols and true for audited/prepared
> protocols?  Or do you have something else in mind?

Ah, I typed the bottom paragraph before realizing what you were actually
asking.  The filesystems are a good example.  In the unprivileged mounts
patchsets, for instance, a flag was added to each filesystem indicating
if it was safe for unprivileged mounting (turned off for all real block
filesystems :).  For targeted capabilities, my goal would be simply to
make sure that each non-netns-aware entity does an (untargeted) capable()
check.  Without pointing to a specific example it's hard to say what I
will do.  It depends on how the code was previously laid out, and what
the maintainer of that subsystem prefers.

The way we're approaching it right now is that by default everything
stays 'capable(X)', so that a non-init user namespace doesn't get the
privileges.  While some of my patchsets this summer didn't follow this,
Eric reminded me that we should first clamp down on the user namespaces
as much as possible, and relax permissions in child namespaces later.
So the small (1-2 patch sized) sets I've been sending the last few
weeks are just trying to fix existing inadequate userid or capability
checks.
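
The difference between the two checks can be modelled in a few lines of
userspace C.  This is only a sketch under simplifying assumptions: the
structs are stand-ins for the kernel's, the rule that the creator of a
namespace holds all capabilities over it is left out, and
model_capable() / model_ns_capable() are illustrative names, not the
kernel functions themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP_NET_ADMIN 12	/* same value as in <linux/capability.h> */

struct user_ns {
	struct user_ns *parent;		/* NULL for the initial namespace */
};

struct task {
	struct user_ns *ns;		/* user namespace the task lives in */
	unsigned long long effective;	/* bitmask of effective capabilities */
};

struct net_ns {
	struct user_ns *owner;		/* user namespace that created it */
};

static struct user_ns init_user_ns = { NULL };

/* Targeted check: does t hold cap with respect to the target namespace? */
static bool model_ns_capable(const struct task *t,
			     const struct user_ns *target, int cap)
{
	assert(t && target);
	/*
	 * A capability held in a namespace also applies to its
	 * descendants, so walk from the target up towards init.
	 */
	for (; target; target = target->parent)
		if (target == t->ns)
			return (t->effective >> cap) & 1;
	return false;	/* t's namespace is neither the target nor an ancestor */
}

/* Untargeted check: always measured against the initial namespace. */
static bool model_capable(const struct task *t, int cap)
{
	return model_ns_capable(t, &init_user_ns, cap);
}
```

For a task that is root in a child user namespace, model_capable() fails,
model_ns_capable() against a network namespace owned by the parent fails,
and the same check against a network namespace owned by the child
succeeds.  That matches the veth example in the documentation patch.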

-serge

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
  2011-09-27 13:21       ` [kernel-hardening] " Serge E. Hallyn
@ 2011-09-27 15:56         ` Vasiliy Kulikov
  -1 siblings, 0 replies; 69+ messages in thread
From: Vasiliy Kulikov @ 2011-09-27 15:56 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Serge Hallyn, akpm, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap, kernel-hardening

On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote:
> > First, the patches by design expose much kernel code to unprivileged
> > userspace processes.  This code doesn't expect malformed data (e.g. VFS,
> > specific filesystems, block layer, char drivers, sysadmin part of LSMs,
> > etc. etc.).  By relaxing permission rules you greatly increase attack
> > surface of the kernel from unprivileged users.  Are you (or somebody
> > else) planning to audit this code?
> 
> I had wanted to (but didn't) propose a discussion at ksummit about how
> best to approach the filesystem code.  That's not even just for user
> namespaces - patches have been floated in the past to make mount an
> unprivileged operation depending on the FS and the user's permission
> over the device and target.

This is a dangerous operation by itself.  AFAICS, this is the reason why
e.g. FUSE doesn't expose a user's mount points to other users, or even
to root.  The problems range from violating rules such as the existence
of a single "." and ".." in each directory to filenames containing /,
\000 and things like `, ", ', \.


>  So I don't know if a combination of auditing
> and fuzzing is the way to go,

Maybe a combination of both.  There are no generic recommendations; it
always depends on the subsystem, the property being checked, and the
auditor.


> > Also, will it be possible to somehow restrict what specific kernel
> > facilities are accessible from users (IOW, what root emulation
> > limitations are in effect)?  It would be useful both for the sysadmin,
> > who might not want to allow users to do such things, and from the
> > security POV in the sense of attack surface reduction.
> 
> You're probably thinking along different lines, but this is why I've
> been wanting seccomp2 to get pushed through.  So that we can deny a
> container the syscalls we know it won't need, especially newer ones,
> to reduce the attack surface available to it.

This dependency greatly complicates things.

First, there is a big disagreement between Will and Ingo about what
needs seccompv2 should serve.  Will wants to reduce the kernel attack
surface by limiting the syscalls and syscall arguments available to a
user (a single task, btw).  Ingo wants to see a full-featured filtering
engine, which needs code changes all over the kernel.  Given the amount
of changes needed, it is unlikely to reduce the attack surface.

You probably don't want Will's version, as syscall filtering is a very
bad abstraction in your case.  user_namespaces likely need Ingo's
version of seccomp, as it would make it possible to filter e.g.
fs-specific events, but even if it is implemented, it will take a looong
time to become usable for your needs IMHO.


Also, I'm afraid that for _good_ user_namespace filtering the policy
definition will be too complicated (like an SELinux policy definition
for a non-trivial application) if it is implemented in event-filtering
terms.


> The way we're approaching it right now is that by default everything
> stays 'capable(X)', so that a non-init user namespace doesn't get the
> privileges.

Great.  I was not sure about it.


>  While some of my patchsets this summer didn't follow this,
> Eric reminded me that we should first clamp down on the user namespaces
> as much as possible, and relax permissions in child namespaces later.

I think it is the only sane way.


> So the small (1-2 patch sized) sets I've been sending the last few
> weeks are just trying to fix existing inadequate userid or capability
> checks.
> 
> -serge

Thanks,

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
  2011-09-27 15:56         ` [kernel-hardening] " Vasiliy Kulikov
@ 2011-10-01 17:00           ` Serge E. Hallyn
  -1 siblings, 0 replies; 69+ messages in thread
From: Serge E. Hallyn @ 2011-10-01 17:00 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: Serge Hallyn, akpm, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap, kernel-hardening

Quoting Vasiliy Kulikov (segoon@openwall.com):
> On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote:
> > > First, the patches by design expose much kernel code to unprivileged
> > > userspace processes.  This code doesn't expect malformed data (e.g. VFS,
> > > specific filesystems, block layer, char drivers, sysadmin part of LSMs,
> > > etc. etc.).  By relaxing permission rules you greatly increase attack
> > > surface of the kernel from unprivileged users.  Are you (or somebody
> > > else) planning to audit this code?
> > 
> > I had wanted to (but didn't) propose a discussion at ksummit about how
> > best to approach the filesystem code.  That's not even just for user
> > namespaces - patches have been floated in the past to make mount an
> > unprivileged operation depending on the FS and the user's permission
> > over the device and target.
> 
> This is a dangerous operation by itself.

Of course it is :)  And it's been a while since it has been brought up,
but it *was* quite well thought through and thoroughly discussed - see
e.g. https://lkml.org/lkml/2008/1/8/131

Oh, that's right.  In the end the reason it didn't go in had to do with
the ability for an unprivileged user to prevent a privileged user from
unmounting trees by leaving a busy mount in a hidden namespace.

Eric, in the past we didn't know what to do about that, but I wonder
if setns could be used in some clever way to solve it from userspace.

> AFAICS, this is the reason why
> e.g. FUSE doesn't pass user mount points to other users and even root.
> Beginning from violating some rules like existence of single "." and
> ".." in each directory and ending with filename charsets with /, \000
> and things like `, ", ', \ inside.
> 
> 
> >  So I don't know if a combination of auditing
> > and fuzzing is the way to go,
> 
> Maybe the combination of both.  There are no generic recommendations,
> it's always limited to the subsystem, checked property, and the
> auditor.

Ok, let me keep focusing on the tightening down right now, and then
before proceeding with relaxing, I'll start some analysis and discussion
of the code which is already under targeted (ns_capable) capability checks.
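
The difference between a global capable(X) check and a targeted
ns_capable(ns, X) check can be illustrated with a small toy model.  This
is a user-space Python sketch, not kernel code: the classes UserNamespace
and Task, the string capability names, and the rule "a capability applies
to the task's own namespace and its descendants" are simplifying
assumptions made for illustration, and the real kernel logic has more
cases.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass(eq=False)
class UserNamespace:
    """Toy user namespace: a name plus a pointer to its parent."""
    name: str
    parent: Optional["UserNamespace"] = None

    def is_same_or_descendant_of(self, other: "UserNamespace") -> bool:
        # Walk up the parent chain looking for `other`.
        ns: Optional[UserNamespace] = self
        while ns is not None:
            if ns is other:
                return True
            ns = ns.parent
        return False


# Hypothetical stand-in for the initial user namespace.
INIT_USER_NS = UserNamespace("init_user_ns")


@dataclass
class Task:
    """Toy task: the namespace it lives in and the capabilities it holds."""
    user_ns: UserNamespace
    caps: Set[str] = field(default_factory=set)


def ns_capable(task: Task, target_ns: UserNamespace, cap: str) -> bool:
    """Targeted check: does `task` hold `cap` over `target_ns`?"""
    return cap in task.caps and target_ns.is_same_or_descendant_of(task.user_ns)


def capable(task: Task, cap: str) -> bool:
    """Global check: only privilege over the initial namespace counts."""
    return ns_capable(task, INIT_USER_NS, cap)


container_ns = UserNamespace("container", parent=INIT_USER_NS)
host_root = Task(INIT_USER_NS, {"CAP_NET_ADMIN"})
container_root = Task(container_ns, {"CAP_NET_ADMIN"})

print(capable(host_root, "CAP_NET_ADMIN"))                        # True
print(capable(container_root, "CAP_NET_ADMIN"))                   # False
print(ns_capable(container_root, container_ns, "CAP_NET_ADMIN"))  # True
```

In this model, code left under capable(X) stays closed to root in a child
namespace, and only call sites deliberately converted to ns_capable() open
up to it, which matches the "clamp down first, relax later" order.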

> > > Also, will it be possible to somehow restrict what specific kernel
> > > facilities are accessible from users (IOW, what root emulation
> > > limitations are in action)?  It is useful from both points of sysadmin,
> > > who might not want to allow users to do such things, and from the
> > > security POV in sense of attack surface reduction.
> > 
> > You're probably thinking along different lines, but this is why I've
> > been wanting seccomp2 to get pushed through.  So that we can deny a
> > container the syscalls we know it won't need, especially newer ones,
> > to reduce the attack surface available to it.
> 
> This dependency greatly complicates the things.

IMO this is not a dependency for user namespaces though - it's only a
dependency for unprivileged user namespaces.  And we haven't seriously
discussed doing that yet precisely because we're nowhere near ready
(and frankly I don't know that it'll ever be sane).

> First, there is a big misunderstanding between Will and Ingo in what
> needs seccompv2 should serve.  Will wants to reduce kernel attack

I know I know :)

> surface by limiting syscalls and syscall arguments available to a user
> (a single task, btw).  Ingo wants to see a full featured filtering
> engine, which needs code changes all over the kernel.  Given the needed
> changes amounts, it will unlikely reduce attack surface.
> 
> You probably don't want Will's version as syscalls filtering is a very

It seems to me per-syscall filtering is a great start.  I'm not looking
to seccomp2 as an assurance against formerly privileged (and now only
privileged per-namespace) code which may have had previously overlooked
bugs.  I'm looking to seccomp2 as an assurance against bugs in newly
written syscalls or the compatibility layer.
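
As a rough illustration of per-syscall filtering, the sketch below models
an allow-list in plain Python.  It is not the seccomp interface: the
function make_syscall_filter, the ALLOW/DENY verdict strings, and the
particular set of allowed syscalls are all hypothetical and chosen only
to show the idea of denying everything a container is not known to need.

```python
from typing import Callable, FrozenSet

ALLOW = "ALLOW"
DENY = "DENY"


def make_syscall_filter(allowed: FrozenSet[str]) -> Callable[[str], str]:
    """Return a checker that allows only the syscalls named in `allowed`."""
    def check(syscall_name: str) -> str:
        # Default-deny: anything not explicitly listed is refused, so a
        # syscall added to the kernel later is denied automatically.
        return ALLOW if syscall_name in allowed else DENY
    return check


# Hypothetical minimal set for a simple container workload.
container_filter = make_syscall_filter(
    frozenset({"read", "write", "openat", "close", "exit_group"})
)

print(container_filter("read"))               # ALLOW
print(container_filter("open_by_handle_at"))  # DENY
```

The default-deny shape is what gives protection against bugs in newly
written syscalls: they are unreachable until someone adds them to the
list.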

> bad abstraction in your case.  user_namespaces likely need Ingo's
> version of seccomp as it will be possible to filter e.g. fs-specific
> events, but even if it is implemented, it will take a looong time for
> your needs IMHO.

Yes, I think that would just lead to exploits through bad policy.

> Also, I'm afraid for _good_ user_namespace filtering the policy
> definition will be too complicated (like SELinux policy definition for
> non-trivial applications) if it is implemented in events filtering
> terms.
> 
> 
> > The way we're approaching it right now is that by default everything
> > stays 'capable(X)', so that a non-init user namespace doesn't get the
> > privileges.
> 
> Great.  I was not sure about it.
> 
> 
> >  While some of my patchsets this summer didn't follow this,
> > Eric reminded me that we should first clamp down on the user namespaces
> > as much as possible, and relax permissions in child namespaces later.
> 
> I think it is the only sane way.

Yup.  I trust you and Eric will keep me in check if I get over-zealous :)

-serge

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
  2011-10-01 17:00           ` [kernel-hardening] " Serge E. Hallyn
@ 2011-10-03  1:46             ` Eric W. Biederman
  -1 siblings, 0 replies; 69+ messages in thread
From: Eric W. Biederman @ 2011-10-03  1:46 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Vasiliy Kulikov, Serge Hallyn, akpm, linux-kernel, netdev,
	containers, dhowells, rdunlap, kernel-hardening

"Serge E. Hallyn" <serge.hallyn@canonical.com> writes:

> Quoting Vasiliy Kulikov (segoon@openwall.com):
>> On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote:
>> > > First, the patches by design expose much kernel code to unprivileged
>> > > userspace processes.  This code doesn't expect malformed data (e.g. VFS,
>> > > specific filesystems, block layer, char drivers, sysadmin part of LSMs,
>> > > etc. etc.).  By relaxing permission rules you greatly increase attack
>> > > surface of the kernel from unprivileged users.  Are you (or somebody
>> > > else) planning to audit this code?

Well, in theory these patches do expose this code to unprivileged user
space in a way that increases the attack surface.  However, right now
there are a lot of cases where, because the kernel lacks a sufficient
mechanism, people are just given root privileges so they can get things
done.  Examples: Network Manager controlling the network stack as an
unprivileged user, and random filesystems on usb sticks being mounted
and unmounted automatically when the usb sticks are inserted and removed.

I completely agree that auditing and looking at the code is necessary.  I
think most of what will happen is that we will start directly supporting
how the kernel is actually used in the real world.  That should
actually reduce our level of vulnerability, because we give up the
delusion that large classes of operations don't need careful
attention because only root can perform them.  These are operations
for which user space authors turn around and write a suid binary, so
unprivileged user space performs them all day long.

>> > I had wanted to (but didn't) propose a discussion at ksummit about how
>> > best to approach the filesystem code.  That's not even just for user
>> > namespaces - patches have been floated in the past to make mount an
>> > unprivileged operation depending on the FS and the user's permission
>> > over the device and target.
>> 
>> This is a dangerous operation by itself.
>
> Of course it is :)  And it's been a while since it has been brought up,
> but it *was* quite well thought through and thoroughly discussed - see
> i.e. https://lkml.org/lkml/2008/1/8/131
>
> Oh, that's right.  In the end the reason it didn't go in had to do with
> the ability for an unprivileged user to prevent a privileged user from
> unmounting trees by leaving a busy mount in a hidden namespace.
>
> Eric, in the past we didn't know what to do about that, but I wonder
> if setns could be used in some clever way to solve it from userspace.

Oh.  That is a good objection.  I had not realized that unprivileged
mounts had that problem.

Still, the solution is straightforward.  If the concern is that an
unprivileged user can prevent a privileged user from unmounting trees,
we need to require that a forced unmount of the filesystem triggers a
revoke on all open files.  sysfs and proc already support revoke at the
per file level so we can safely remove modules, we just need to extend
that support to the forced unmount case.
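
The revoke-on-forced-unmount idea can be sketched with a toy model.  This
is plain Python, not VFS code: the Mount and OpenFile classes, and the
choice of EBUSY and EIO as error codes, are assumptions made for
illustration only.

```python
import errno


class OpenFile:
    """Toy open file handle that can be revoked by its mount."""

    def __init__(self, mount, path):
        self.mount = mount
        self.path = path
        self.revoked = False

    def read(self):
        # After a revoke, every further operation on the handle fails.
        if self.revoked:
            raise OSError(errno.EIO, "file was revoked")
        return f"data from {self.path}"


class Mount:
    """Toy mount that tracks the files opened on it."""

    def __init__(self, name):
        self.name = name
        self.mounted = True
        self.open_files = []

    def open(self, path):
        handle = OpenFile(self, path)
        self.open_files.append(handle)
        return handle

    def umount(self, force=False):
        # A plain unmount is refused while any file is still open.
        if self.open_files and not force:
            raise OSError(errno.EBUSY, "mount is busy")
        # A forced unmount revokes every open file, so a lingering
        # handle can no longer keep the filesystem pinned.
        for handle in self.open_files:
            handle.revoked = True
        self.open_files.clear()
        self.mounted = False


usb = Mount("usb-stick")
held = usb.open("/notes.txt")   # an unprivileged user keeps this open

try:
    usb.umount()
except OSError as e:
    print(errno.errorcode[e.errno])   # EBUSY

usb.umount(force=True)
try:
    held.read()
except OSError as e:
    print(errno.errorcode[e.errno])   # EIO
```

The point of the model is that the holder of an open file loses the
ability to block the unmount: the handle survives, but it is dead.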

This is a problem that actually needs to be solved for ordinary file
systems as well because of hot pluggable usb drives.  For filesystems
like ext4 it is more difficult because we need a solution that does
not sacrifice performance in the common case.  I was talking to
Ted Ts'o a bit about this at plumbers conf.  It happens that hot unplug
of usb devices with mounted filesystems is currently a never-ending
source of subtle bugs in the extN code.

The one implementation detail that sounds a bit tricky is what to do
about mount structures in mount namespaces when we forcibly unmount
a filesystem.  That could get a bit complicated but if that is the only
hang up I'm certain we can figure something out.

Eric

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
  2011-10-03  1:46             ` [kernel-hardening] " Eric W. Biederman
@ 2011-10-03 19:53               ` Eric W. Biederman
  -1 siblings, 0 replies; 69+ messages in thread
From: Eric W. Biederman @ 2011-10-03 19:53 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Vasiliy Kulikov, Serge Hallyn, akpm, linux-kernel, netdev,
	containers, dhowells, rdunlap, kernel-hardening

ebiederm@xmission.com (Eric W. Biederman) writes:

> "Serge E. Hallyn" <serge.hallyn@canonical.com> writes:
>
>> Quoting Vasiliy Kulikov (segoon@openwall.com):
>>> On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote:
>>> > > First, the patches by design expose much kernel code to unprivileged
>>> > > userspace processes.  This code doesn't expect malformed data (e.g. VFS,
>>> > > specific filesystems, block layer, char drivers, sysadmin part of LSMs,
>>> > > etc. etc.).  By relaxing permission rules you greatly increase attack
>>> > > surface of the kernel from unprivileged users.  Are you (or somebody
>>> > > else) planning to audit this code?
>
> Well in theory this codes does expose this code to unprivileged user
> space in a way that increases the attack surface.    However right now
> there are a lot of cases where because the kernel lacks a sufficient
> mechanism people are just given root privileges so they can get things
> done.  Network manager controlling the network stack as an unprivileged
> user.  Random filesystems on usb sticks being mounted and unmounted
> automatically when the usb sticks are inserted and removed.
>
> I completely agree that auditing and looking at the code is necessary I
> think most of what will happen is that we will start directly supporting
> how the kernel is actually used in the real world.  Which should
> actually reduce our level of vulnerability, because we give up the
> delusion that large classes of operations don't need careful
> attention because only root can perform them.   Operations which the
> user space authors turn around and write a suid binary for and
> unprivileged user space performs those operations all day long.
>
>>> > I had wanted to (but didn't) propose a discussion at ksummit about how
>>> > best to approach the filesystem code.  That's not even just for user
>>> > namespaces - patches have been floated in the past to make mount an
>>> > unprivileged operation depending on the FS and the user's permission
>>> > over the device and target.
>>> 
>>> This is a dangerous operation by itself.
>>
>> Of course it is :)  And it's been a while since it has been brought up,
>> but it *was* quite well thought through and thoroughly discussed - see
>> i.e. https://lkml.org/lkml/2008/1/8/131
>>
>> Oh, that's right.  In the end the reason it didn't go in had to do with
>> the ability for an unprivileged user to prevent a privileged user from
>> unmounting trees by leaving a busy mount in a hidden namespace.
>>
>> Eric, in the past we didn't know what to do about that, but I wonder
>> if setns could be used in some clever way to solve it from userspace.
>
> Oh.  That is a good objection.  I had not realized that unprivileged
> mounts had that problem.

I just re-read the discussion you are referring to and that wasn't
it.  Fuse already has something like a revoke in its umount -f
implementation.

Eric

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
  2011-10-03 19:53               ` [kernel-hardening] " Eric W. Biederman
@ 2011-10-03 20:04                 ` Serge E. Hallyn
  -1 siblings, 0 replies; 69+ messages in thread
From: Serge E. Hallyn @ 2011-10-03 20:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Vasiliy Kulikov, Serge Hallyn, akpm, linux-kernel, netdev,
	containers, dhowells, rdunlap, kernel-hardening

Quoting Eric W. Biederman (ebiederm@xmission.com):
> ebiederm@xmission.com (Eric W. Biederman) writes:
> 
> > "Serge E. Hallyn" <serge.hallyn@canonical.com> writes:
> >
> >> Quoting Vasiliy Kulikov (segoon@openwall.com):
> >>> On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote:
> >>> > > First, the patches by design expose much kernel code to unprivileged
> >>> > > userspace processes.  This code doesn't expect malformed data (e.g. VFS,
> >>> > > specific filesystems, block layer, char drivers, sysadmin part of LSMs,
> >>> > > etc. etc.).  By relaxing permission rules you greatly increase attack
> >>> > > surface of the kernel from unprivileged users.  Are you (or somebody
> >>> > > else) planning to audit this code?
> >
> > Well in theory this code does expose this code to unprivileged user
> > space in a way that increases the attack surface.  However right now
> > there are a lot of cases where because the kernel lacks a sufficient
> > mechanism people are just given root privileges so they can get things
> > done.  Network manager controlling the network stack as an unprivileged
> > user.  Random filesystems on usb sticks being mounted and unmounted
> > automatically when the usb sticks are inserted and removed.
> >
> > I completely agree that auditing and looking at the code is necessary.  I
> > think most of what will happen is that we will start directly supporting
> > how the kernel is actually used in the real world.  Which should
> > actually reduce our level of vulnerability, because we give up the
> > delusion that large classes of operations don't need careful
> > attention because only root can perform them.  These are operations
> > for which user space authors turn around and write a suid binary, so
> > unprivileged user space performs them all day long.
> >
> >>> > I had wanted to (but didn't) propose a discussion at ksummit about how
> >>> > best to approach the filesystem code.  That's not even just for user
> >>> > namespaces - patches have been floated in the past to make mount an
> >>> > unprivileged operation depending on the FS and the user's permission
> >>> > over the device and target.
> >>> 
> >>> This is a dangerous operation by itself.
> >>
> >> Of course it is :)  And it's been a while since it has been brought up,
> >> but it *was* quite well thought through and thoroughly discussed - see
> >> e.g. https://lkml.org/lkml/2008/1/8/131
> >>
> >> Oh, that's right.  In the end the reason it didn't go in had to do with
> >> the ability for an unprivileged user to prevent a privileged user from
> >> unmounting trees by leaving a busy mount in a hidden namespace.
> >>
> >> Eric, in the past we didn't know what to do about that, but I wonder
> >> if setns could be used in some clever way to solve it from userspace.
> >
> > Oh.  That is a good objection.  I had not realized that unprivileged
> > mounts had that problem.
> 
> I just re-read the discussion you are referring to and that wasn't

The one I linked was one discussion, but not the final one.

https://lkml.org/lkml/2008/10/6/72 is the one where the need for
revoke is brought up.

> it.  Fuse already has something like a revoke in its umount -f
> implementation.

I'll have to (haven't yet) take a look at it.

-serge


