* user namespaces v3: continue targetting capabilities
From: Serge Hallyn @ 2011-09-02 19:56 UTC
To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap

This was last sent Jul 26, and incorporates feedback from that thread. The
last patch, 0015-make-kernel-signal.c-user-ns-safe-v2.patch, is new, so it
could stand extra scrutiny.

This patchset is a basis for Eric's set, which allows assigning a filesystem
to a user namespace
(http://git.kernel.org/?p=linux/kernel/git/ebiederm/linux-userns-devel.git),
which is the last hurdle to starting to employ user namespaces to help
constrain root in a container. So if there is no more major feedback, I'd
love to see this get a spin in -mm so we can proceed with that.

[ v2 intro message: ]

Here is a set of patches to continue targeting capabilities where
appropriate. This set goes about as far as is possible without making the
VFS user namespace aware, meaning that the VFS can provide a namespaced view
of userids, i.e. init_user_ns sees file owner 500, while child user ns sees
file owner 0 or 1000. (There are a few other things, like siginfos, which
can be addressed before we address the VFS.)

With this set applied, you can create and configure veth netdevs if your user
namespace owns your network namespace (and you are privileged), but not
otherwise.

Some simple testcases can be found at
https://code.launchpad.net/~serge-hallyn/+junk/usernstests with packages at
https://launchpad.net/~serge-hallyn/+archive/userns-natty

Feedback very much appreciated.

* (no subject)
From: Serge Hallyn @ 2011-09-02 19:56 UTC
To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap
Cc: Serge Hallyn

GIT: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
GIT: [PATCH 02/15] user ns: setns: move capable checks into per-ns attach
GIT: [PATCH 03/15] keyctl: check capabilities against key's user_ns
GIT: [PATCH 04/15] user_ns: convert fs/attr.c to targeted capabilities
GIT: [PATCH 05/15] userns: clamp down users of cap_raised
GIT: [PATCH 06/15] user namespace: make each net (net_ns) belong to a
GIT: [PATCH 07/15] user namespace: use net->user_ns for some capable
GIT: [PATCH 08/15] af_netlink.c: make netlink_capable userns-aware
GIT: [PATCH 09/15] user ns: convert ipv6 to targeted capabilities
GIT: [PATCH 10/15] net/core/scm.c: target capable() calls to user_ns
GIT: [PATCH 11/15] userns: make some net-sysfs capable calls targeted
GIT: [PATCH 12/15] user_ns: target af_key capability check
GIT: [PATCH 13/15] userns: net: make many network capable calls targeted
GIT: [PATCH 14/15] net: pass user_ns to cap_netlink_recv()
GIT: [PATCH 15/15] make kernel/signal.c user ns safe (v2)

* (no subject)
From: Serge Hallyn @ 2011-09-02 19:56 UTC
To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap

* missing [PATCH 01/15]
From: Eric W. Biederman @ 2011-09-02 23:49 UTC
To: Serge Hallyn
Cc: akpm, segooon, linux-kernel, netdev, containers, dhowells, rdunlap

Was this blank email supposed to be patch 01/15?

Eric

* Re: missing [PATCH 01/15]
From: Serge E. Hallyn @ 2011-09-03 1:09 UTC
To: Eric W. Biederman
Cc: akpm, segooon, linux-kernel, netdev, containers, dhowells, rdunlap

Quoting Eric W. Biederman (ebiederm@xmission.com):
>
>
> Was this blank email supposed to be patch 01/15?

Nope, that was <grr> a git-send-email misfire. Sorry about that. The
patch #1 did go through, here: https://lkml.org/lkml/2011/9/2/314

I'm appending it here again too for easier feedback.

thanks,
-serge

Subject: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)

Quoting David Howells (dhowells@redhat.com):
> Randy Dunlap <rdunlap@xenotime.net> wrote:
>
> > > +Any task in or resource belonging to the initial user namespace will, to this
> > > +new task, appear to belong to UID and GID -1 - which is usually known as
> >
> > that extra hyphen is confusing. how about:
> >
> > to UID and GID -1, which is
>
> 'which are'.
>
> David

This will hold some info about the design. Currently it contains
future todos, issues and questions.

Changelog:
   jul 26: incorporate feedback from David Howells.
   jul 29: incorporate feedback from Randy Dunlap.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
---
 Documentation/namespaces/user_namespace.txt | 107 +++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/namespaces/user_namespace.txt

diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
new file mode 100644
index 0000000..b0bc480
--- /dev/null
+++ b/Documentation/namespaces/user_namespace.txt
@@ -0,0 +1,107 @@
+Description
+===========
+
+Traditionally, each task is owned by a user ID (UID) and belongs to one or more
+groups (GID). Both are simple numeric IDs, though userspace usually translates
+them to names. The user namespace allows tasks to have different views of the
+UIDs and GIDs associated with tasks and other resources. (See 'UID mapping'
+below for more.)
+
+The user namespace is a simple hierarchical one. The system starts with all
+tasks belonging to the initial user namespace. A task creates a new user
+namespace by passing the CLONE_NEWUSER flag to clone(2). This requires the
+creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities,
+but it does not need to be running as root. The clone(2) call will result in a
+new task which to itself appears to be running as UID and GID 0, but to its
+creator seems to have the creator's credentials.
+
+To this new task, any resource belonging to the initial user namespace will
+appear to belong to user and group 'nobody', which are UID and GID -1.
+Permission to open such files will be granted according to world access
+permissions. UID comparisons and group membership checks will return false,
+and privilege will be denied.
+
+When a task belonging to (for example) userid 500 in the initial user namespace
+creates a new user namespace, even though the new task will see itself as
+belonging to UID 0, any task in the initial user namespace will see it as
+belonging to UID 500. Therefore, UID 500 in the initial user namespace will be
+able to kill the new task. Files created by the new user will (eventually) be
+seen by tasks in its own user namespace as belonging to UID 0, but to tasks in
+the initial user namespace as belonging to UID 500.
+
+Note that this userid mapping for the VFS is not yet implemented, though the
+lkml and containers mailing list archives will show several previous
+prototypes. In the end, those got hung up waiting on the concept of targeted
+capabilities to be developed, which, thanks to the insight of Eric Biederman,
+they finally did.
+
+Relationship between the User namespace and other namespaces
+============================================================
+
+Other namespaces, such as UTS and network, are owned by a user namespace. When
+such a namespace is created, it is assigned to the user namespace of the task
+by which it was created. Therefore, attempts to exercise privilege to
+resources in, for instance, a particular network namespace, can be properly
+validated by checking whether the caller has the needed privilege (i.e.
+CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace.
+This is done using the ns_capable() function.
+
+As an example, if a new task is cloned with a private user namespace but
+no private network namespace, then the task's network namespace is owned
+by the parent user namespace. The new task has no privilege to the
+parent user namespace, so it will not be able to create or configure
+network devices. If, instead, the task were cloned with both private
+user and network namespaces, then the private network namespace is owned
+by the private user namespace, and so root in the new user namespace
+will have privilege targeted to the network namespace. It will be able
+to create and configure network devices.
+
+UID Mapping
+===========
+The current plan (see 'flexible UID mapping' at
+https://wiki.ubuntu.com/UserNamespace) is:
+
+The UID/GID stored on disk will be that in the init_user_ns. Most likely
+UID/GID in other namespaces will be stored in xattrs. But Eric was advocating
+(a few years ago) leaving the details up to filesystems while providing a lib/
+stock implementation. See the thread around here:
+http://www.mail-archive.com/devel@openvz.org/msg09331.html
+
+
+Working notes
+=============
+Capability checks for actions related to syslog must be against the
+init_user_ns until syslog is containerized.
+
+Same is true for reboot and power, control groups, devices, and time.
+
+Perf actions (kernel/event/core.c for instance) will always be constrained to
+init_user_ns.
+
+Q:
+Is accounting considered properly containerized with respect to pidns? (it
+appears to be). If so, then we can change the capable() check in
+kernel/acct.c to 'ns_capable(current_pid_ns()->user_ns, CAP_PACCT)'
+
+Q:
+For things like nice and schedaffinity, we could allow root in a container to
+control those, and leave only cgroups to constrain the container. I'm not sure
+whether that is right, or whether it violates admin expectations.
+
+I deferred some of commoncap.c. I'm punting on xattr stuff as they take
+dentries, not inodes.
+
+For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of
+them) target the capability checks at the user_ns owning the tty. That will
+have to wait until we get userns owning files straightened out.
+
+We need to figure out how to label devices. Should we just toss a user_ns
+right into struct device?
+
+capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless
+some day LSMs were to be containerized, near zero chance.
+
+inode_owner_or_capable() should probably take an optional ns and cap parameter.
+If cap is 0, then CAP_FOWNER is checked. If ns is NULL, we derive the ns from
+inode. But if ns is provided, then callers who need to derive
+inode_userns(inode) anyway can save a few cycles.
-- 
1.7.5.4

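[ Illustration, not part of the patch: the "different views of UIDs" rule in the Description section above can be sketched as a small userspace C model. Every name here (struct view_ns, struct view_task, view_uid_from, VIEW_NOBODY) is hypothetical and exists only for this sketch; none of it is kernel code. The rule for nested namespaces, where an ancestor sees the creator of the intermediate child namespace, is an assumption that goes beyond what the text states. ]

```c
#include <assert.h>
#include <stddef.h>

#define VIEW_NOBODY (-1)  /* UID/GID 'nobody' as described in the document */

/* Hypothetical model of a user namespace. */
struct view_ns {
    const struct view_ns *parent; /* NULL for the initial user namespace */
    int creator_uid;              /* creator's UID, as seen in the parent */
};

/* Hypothetical model of a task. */
struct view_task {
    const struct view_ns *ns;     /* namespace the task lives in */
    int uid;                      /* UID as the task itself sees it */
};

/*
 * Return the UID under which 'task' appears to a viewer in 'viewer_ns'.
 * - Same namespace: the task's own UID.
 * - Viewer in an ancestor namespace: the UID of whoever created the
 *   namespace directly below the viewer on the path to the task
 *   (assumption for the nested case).
 * - Anything else (e.g. a child looking at its parent): 'nobody'.
 */
int view_uid_from(const struct view_task *task, const struct view_ns *viewer_ns)
{
    const struct view_ns *ns;

    assert(task != NULL && viewer_ns != NULL);

    if (task->ns == viewer_ns)
        return task->uid;

    for (ns = task->ns; ns->parent != NULL; ns = ns->parent) {
        if (ns->parent == viewer_ns)
            return ns->creator_uid;
    }
    return VIEW_NOBODY;
}
```

[ With an initial namespace and a child created by UID 500, a task running as UID 0 in the child is reported as 0 from inside the child and as 500 from the initial namespace, while a UID 500 task in the initial namespace is reported as -1 from the child. ]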
* [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
From: Serge Hallyn @ 2011-09-02 19:56 UTC
To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap
Cc: Serge E. Hallyn, Serge E. Hallyn

From: "Serge E. Hallyn" <serge@hallyn.com>

Quoting David Howells (dhowells@redhat.com):
> Randy Dunlap <rdunlap@xenotime.net> wrote:
>
> > > +Any task in or resource belonging to the initial user namespace will, to this
> > > +new task, appear to belong to UID and GID -1 - which is usually known as
> >
> > that extra hyphen is confusing. how about:
> >
> > to UID and GID -1, which is
>
> 'which are'.
>
> David

This will hold some info about the design. Currently it contains
future todos, issues and questions.

Changelog:
   jul 26: incorporate feedback from David Howells.
   jul 29: incorporate feedback from Randy Dunlap.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
---
 Documentation/namespaces/user_namespace.txt | 107 +++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/namespaces/user_namespace.txt

diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
new file mode 100644
index 0000000..b0bc480
--- /dev/null
+++ b/Documentation/namespaces/user_namespace.txt
@@ -0,0 +1,107 @@
+Description
+===========
+
+Traditionally, each task is owned by a user ID (UID) and belongs to one or more
+groups (GID). Both are simple numeric IDs, though userspace usually translates
+them to names. The user namespace allows tasks to have different views of the
+UIDs and GIDs associated with tasks and other resources. (See 'UID mapping'
+below for more.)
+
+The user namespace is a simple hierarchical one. The system starts with all
+tasks belonging to the initial user namespace. A task creates a new user
+namespace by passing the CLONE_NEWUSER flag to clone(2). This requires the
+creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities,
+but it does not need to be running as root. The clone(2) call will result in a
+new task which to itself appears to be running as UID and GID 0, but to its
+creator seems to have the creator's credentials.
+
+To this new task, any resource belonging to the initial user namespace will
+appear to belong to user and group 'nobody', which are UID and GID -1.
+Permission to open such files will be granted according to world access
+permissions. UID comparisons and group membership checks will return false,
+and privilege will be denied.
+
+When a task belonging to (for example) userid 500 in the initial user namespace
+creates a new user namespace, even though the new task will see itself as
+belonging to UID 0, any task in the initial user namespace will see it as
+belonging to UID 500. Therefore, UID 500 in the initial user namespace will be
+able to kill the new task. Files created by the new user will (eventually) be
+seen by tasks in its own user namespace as belonging to UID 0, but to tasks in
+the initial user namespace as belonging to UID 500.
+
+Note that this userid mapping for the VFS is not yet implemented, though the
+lkml and containers mailing list archives will show several previous
+prototypes. In the end, those got hung up waiting on the concept of targeted
+capabilities to be developed, which, thanks to the insight of Eric Biederman,
+they finally did.
+
+Relationship between the User namespace and other namespaces
+============================================================
+
+Other namespaces, such as UTS and network, are owned by a user namespace. When
+such a namespace is created, it is assigned to the user namespace of the task
+by which it was created. Therefore, attempts to exercise privilege to
+resources in, for instance, a particular network namespace, can be properly
+validated by checking whether the caller has the needed privilege (i.e.
+CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace.
+This is done using the ns_capable() function.
+
+As an example, if a new task is cloned with a private user namespace but
+no private network namespace, then the task's network namespace is owned
+by the parent user namespace. The new task has no privilege to the
+parent user namespace, so it will not be able to create or configure
+network devices. If, instead, the task were cloned with both private
+user and network namespaces, then the private network namespace is owned
+by the private user namespace, and so root in the new user namespace
+will have privilege targeted to the network namespace. It will be able
+to create and configure network devices.
+
+UID Mapping
+===========
+The current plan (see 'flexible UID mapping' at
+https://wiki.ubuntu.com/UserNamespace) is:
+
+The UID/GID stored on disk will be that in the init_user_ns. Most likely
+UID/GID in other namespaces will be stored in xattrs. But Eric was advocating
+(a few years ago) leaving the details up to filesystems while providing a lib/
+stock implementation. See the thread around here:
+http://www.mail-archive.com/devel@openvz.org/msg09331.html
+
+
+Working notes
+=============
+Capability checks for actions related to syslog must be against the
+init_user_ns until syslog is containerized.
+
+Same is true for reboot and power, control groups, devices, and time.
+
+Perf actions (kernel/event/core.c for instance) will always be constrained to
+init_user_ns.
+
+Q:
+Is accounting considered properly containerized with respect to pidns? (it
+appears to be). If so, then we can change the capable() check in
+kernel/acct.c to 'ns_capable(current_pid_ns()->user_ns, CAP_PACCT)'
+
+Q:
+For things like nice and schedaffinity, we could allow root in a container to
+control those, and leave only cgroups to constrain the container. I'm not sure
+whether that is right, or whether it violates admin expectations.
+
+I deferred some of commoncap.c. I'm punting on xattr stuff as they take
+dentries, not inodes.
+
+For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of
+them) target the capability checks at the user_ns owning the tty. That will
+have to wait until we get userns owning files straightened out.
+
+We need to figure out how to label devices. Should we just toss a user_ns
+right into struct device?
+
+capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless
+some day LSMs were to be containerized, near zero chance.
+
+inode_owner_or_capable() should probably take an optional ns and cap parameter.
+If cap is 0, then CAP_FOWNER is checked. If ns is NULL, we derive the ns from
+inode. But if ns is provided, then callers who need to derive
+inode_userns(inode) anyway can save a few cycles.
-- 
1.7.5.4

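[ Illustration, not part of the patch: the ownership rule in "Relationship between the User namespace and other namespaces" can be modelled in userspace C roughly as below. This is a toy under stated assumptions, not the kernel's ns_capable(): all toy_* names, the bitmask capability set, and the TOY_CAP_NET_ADMIN bit number are made up for the sketch. It also assumes that a capability held in an ancestor user namespace carries over to descendant namespaces, which the text above does not spell out. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAP_NET_ADMIN 1  /* arbitrary bit index, for this sketch only */

/* Hypothetical user namespace: only the parent link matters here. */
struct toy_user_ns {
    const struct toy_user_ns *parent; /* NULL for the initial namespace */
};

/* Hypothetical network namespace, owned by the user namespace of its creator. */
struct toy_net {
    const struct toy_user_ns *user_ns;
};

/* Hypothetical task: its user namespace plus a bitmask of capabilities. */
struct toy_task {
    const struct toy_user_ns *user_ns;
    unsigned long caps;
};

/*
 * Does 'task' hold capability 'cap' targeted at user namespace 'target'?
 * Walk from the target up through its ancestors. If the task's own
 * namespace is on that path, the answer is whether the capability bit is
 * set. If the root is passed without meeting it, privilege is denied.
 */
bool toy_ns_capable(const struct toy_task *task,
                    const struct toy_user_ns *target, int cap)
{
    const struct toy_user_ns *ns;

    assert(task != NULL && target != NULL);

    for (ns = target; ns != NULL; ns = ns->parent) {
        if (ns == task->user_ns)
            return ((task->caps >> cap) & 1UL) != 0;
    }
    return false;
}
```

[ In this model a task cloned with only a private user namespace fails the check against the network namespace it inherited, since that one is owned by the parent user namespace. A task cloned with both private user and network namespaces passes it, which matches the veth example in the cover letter. ]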
* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
From: Andrew Morton @ 2011-09-07 22:50 UTC
To: Serge Hallyn
Cc: segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap, Serge E. Hallyn

On Fri, 2 Sep 2011 19:56:26 +0000
Serge Hallyn <serge@hallyn.com> wrote:

> +Note that this userid mapping for the VFS is not yet implemented, though the
> +lkml and containers mailing list archives will show several previous
> +prototypes. In the end, those got hung up waiting on the concept of targeted
> +capabilities to be developed, which, thanks to the insight of Eric Biederman,
> +they finally did.

not-yet-implemented things worry me. When can we expect this to
happen, and how big and ugly will it be?

I'm not seeing many (any) reviewed-by's on these patches. I could get
down and stare at them myself, but that wouldn't be very useful. This
work goes pretty deep and is quite security-affecting. And
network-affecting. Can you round up some suitable people and get the
reviewing and testing happening please?

[parent not found: <1314993400-6910-4-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
       [not found]   ` <1314993400-6910-4-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
@ 2011-09-07 22:50     ` Andrew Morton
  0 siblings, 0 replies; 69+ messages in thread
From: Andrew Morton @ 2011-09-07 22:50 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, rdunlap-/UHa2rfvQTnk1uMJSBkQmQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Fri, 2 Sep 2011 19:56:26 +0000
Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:

> +Note that this userid mapping for the VFS is not yet implemented, though the
> +lkml and containers mailing list archives will show several previous
> +prototypes. In the end, those got hung up waiting on the concept of targeted
> +capabilities to be developed, which, thanks to the insight of Eric Biederman,
> +they finally did.

not-yet-implemented things worry me. When can we expect this to
happen, and how big and ugly will it be?

I'm not seeing many (any) reviewed-by's on these patches. I could get
down and stare at them myself, but that wouldn't be very useful. This
work goes pretty deep and is quite security-affecting. And network-affecting.
Can you round up some suitable people and get the reviewing and testing happening
please?

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
  2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn
@ 2011-09-26 19:17     ` Vasiliy Kulikov
       [not found]   ` <1314993400-6910-4-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
  2011-09-26 19:17     ` [kernel-hardening] " Vasiliy Kulikov
  2 siblings, 0 replies; 69+ messages in thread
From: Vasiliy Kulikov @ 2011-09-26 19:17 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: akpm, linux-kernel, netdev, containers, dhowells, ebiederm,
	rdunlap, Serge E. Hallyn, kernel-hardening

(cc'ed kernel-hardening)


Hi Serge,

I haven't studied the patches deeply yet (sorry!), but I have some
long-term questions about the technique in general. I couldn't find
answers to the questions in the documentation.

First, the patches by design expose much kernel code to unprivileged
userspace processes. This code doesn't expect malformed data (e.g. VFS,
specific filesystems, block layer, char drivers, sysadmin part of LSMs,
etc. etc.). By relaxing permission rules you greatly increase the attack
surface of the kernel from unprivileged users. Are you (or somebody
else) planning to audit this code?

Also, will it be possible to somehow restrict what specific kernel
facilities are accessible from users (IOW, what root emulation
limitations are in action)? It is useful both from the point of view of
the sysadmin, who might not want to allow users to do such things, and
from the security POV in the sense of attack surface reduction.

The patches explicitly enable some features for users on a whitelist
basis. It's possible to do it for simple cases, but what are you going
to do with multiplexing functions where there is a permission check
before the actual multiplexing? FS, networking drivers, etc. Are you
going to do the same thing as net_namespace does? - For each multiplexed
entity create a bool ->ns_aware which is false by default for all
"untrusted"/not prepared protocols and is true for audited/prepared
protocols. Or perhaps you have something else in mind?

Thanks,

On Fri, Sep 02, 2011 at 19:56 +0000, Serge Hallyn wrote:
> From: "Serge E. Hallyn" <serge@hallyn.com>
>
> Quoting David Howells (dhowells@redhat.com):
> > Randy Dunlap <rdunlap@xenotime.net> wrote:
> >
> > > > +Any task in or resource belonging to the initial user namespace will, to this
> > > > +new task, appear to belong to UID and GID -1 - which is usually known as
> > >
> > > that extra hyphen is confusing. how about:
> > >
> > > to UID and GID -1, which is
> >
> > 'which are'.
> >
> > David
>
> This will hold some info about the design. Currently it contains
> future todos, issues and questions.
>
> Changelog:
>    jul 26: incorporate feedback from David Howells.
>    jul 29: incorporate feedback from Randy Dunlap.
>
> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Randy Dunlap <rdunlap@xenotime.net>
> ---
>  Documentation/namespaces/user_namespace.txt | 107 +++++++++++++++++++++++++++
>  1 files changed, 107 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/namespaces/user_namespace.txt
>
> diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
> new file mode 100644
> index 0000000..b0bc480
> --- /dev/null
> +++ b/Documentation/namespaces/user_namespace.txt
> @@ -0,0 +1,107 @@
> +Description
> +===========
> +
> +Traditionally, each task is owned by a user ID (UID) and belongs to one or more
> +groups (GID). Both are simple numeric IDs, though userspace usually translates
> +them to names. The user namespace allows tasks to have different views of the
> +UIDs and GIDs associated with tasks and other resources. (See 'UID mapping'
> +below for more.)
> +
> +The user namespace is a simple hierarchical one. The system starts with all
> +tasks belonging to the initial user namespace. A task creates a new user
> +namespace by passing the CLONE_NEWUSER flag to clone(2). This requires the
> +creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities,
> +but it does not need to be running as root. The clone(2) call will result in a
> +new task which to itself appears to be running as UID and GID 0, but to its
> +creator seems to have the creator's credentials.
> +
> +To this new task, any resource belonging to the initial user namespace will
> +appear to belong to user and group 'nobody', which are UID and GID -1.
> +Permission to open such files will be granted according to world access
> +permissions. UID comparisons and group membership checks will return false,
> +and privilege will be denied.
> +
> +When a task belonging to (for example) userid 500 in the initial user namespace
> +creates a new user namespace, even though the new task will see itself as
> +belonging to UID 0, any task in the initial user namespace will see it as
> +belonging to UID 500. Therefore, UID 500 in the initial user namespace will be
> +able to kill the new task. Files created by the new user will (eventually) be
> +seen by tasks in its own user namespace as belonging to UID 0, but to tasks in
> +the initial user namespace as belonging to UID 500.
> +
> +Note that this userid mapping for the VFS is not yet implemented, though the
> +lkml and containers mailing list archives will show several previous
> +prototypes. In the end, those got hung up waiting on the concept of targeted
> +capabilities to be developed, which, thanks to the insight of Eric Biederman,
> +they finally did.
> +
> +Relationship between the User namespace and other namespaces
> +============================================================
> +
> +Other namespaces, such as UTS and network, are owned by a user namespace. When
> +such a namespace is created, it is assigned to the user namespace of the task
> +by which it was created. Therefore, attempts to exercise privilege to
> +resources in, for instance, a particular network namespace, can be properly
> +validated by checking whether the caller has the needed privilege (i.e.
> +CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace.
> +This is done using the ns_capable() function.
> +
> +As an example, if a new task is cloned with a private user namespace but
> +no private network namespace, then the task's network namespace is owned
> +by the parent user namespace. The new task has no privilege to the
> +parent user namespace, so it will not be able to create or configure
> +network devices. If, instead, the task were cloned with both private
> +user and network namespaces, then the private network namespace is owned
> +by the private user namespace, and so root in the new user namespace
> +will have privilege targeted to the network namespace. It will be able
> +to create and configure network devices.
> +
> +UID Mapping
> +===========
> +The current plan (see 'flexible UID mapping' at
> +https://wiki.ubuntu.com/UserNamespace) is:
> +
> +The UID/GID stored on disk will be that in the init_user_ns. Most likely
> +UID/GID in other namespaces will be stored in xattrs. But Eric was advocating
> +(a few years ago) leaving the details up to filesystems while providing a lib/
> +stock implementation. See the thread around here:
> +http://www.mail-archive.com/devel@openvz.org/msg09331.html
> +
> +
> +Working notes
> +=============
> +Capability checks for actions related to syslog must be against the
> +init_user_ns until syslog is containerized.
> +
> +Same is true for reboot and power, control groups, devices, and time.
> +
> +Perf actions (kernel/event/core.c for instance) will always be constrained to
> +init_user_ns.
> +
> +Q:
> +Is accounting considered properly containerized with respect to pidns? (it
> +appears to be). If so, then we can change the capable() check in
> +kernel/acct.c to 'ns_capable(current_pid_ns()->user_ns, CAP_PACCT)'
> +
> +Q:
> +For things like nice and schedaffinity, we could allow root in a container to
> +control those, and leave only cgroups to constrain the container. I'm not sure
> +whether that is right, or whether it violates admin expectations.
> +
> +I deferred some of commoncap.c. I'm punting on xattr stuff as they take
> +dentries, not inodes.
> +
> +For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of
> +them) target the capability checks at the user_ns owning the tty. That will
> +have to wait until we get userns owning files straightened out.
> +
> +We need to figure out how to label devices. Should we just toss a user_ns
> +right into struct device?
> +
> +capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless
> +some day LSMs were to be containerized, near zero chance.
> +
> +inode_owner_or_capable() should probably take an optional ns and cap parameter.
> +If cap is 0, then CAP_FOWNER is checked. If ns is NULL, we derive the ns from
> +inode. But if ns is provided, then callers who need to derive
> +inode_userns(inode) anyway can save a few cycles.
> --
> 1.7.5.4

--
Vasiliy

^ permalink raw reply [flat|nested] 69+ messages in thread
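The targeted capability check described in the quoted documentation (a private network namespace owned by a private user namespace) can be illustrated with a small userspace model in C. This is only a sketch under stated assumptions, not the kernel's ns_capable(): the `model_*` structures and functions are hypothetical stand-ins, the creator-of-the-namespace rule is left out, and the only value borrowed from the kernel is the CAP_NET_ADMIN bit number (12).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit number of CAP_NET_ADMIN in <linux/capability.h>. */
#define MODEL_CAP_NET_ADMIN 12

/* Hypothetical stand-in for a user namespace: just a parent link. */
struct model_user_ns {
    const struct model_user_ns *parent;   /* NULL for the initial namespace */
};

/* Hypothetical stand-in for a task's credentials. */
struct model_task {
    const struct model_user_ns *user_ns;  /* namespace the task lives in */
    uint64_t cap_effective;               /* capabilities held in that namespace */
};

/* Hypothetical stand-in for a network namespace and its owner. */
struct model_net {
    const struct model_user_ns *owner;
};

/*
 * Model of a targeted check: the task is privileged toward `target` only if
 * the task's own namespace is `target` or one of its ancestors, and the
 * capability bit is raised. Privilege never flows from a child namespace up
 * to its parent.
 */
static bool model_ns_capable(const struct model_task *t,
                             const struct model_user_ns *target, int cap)
{
    assert(t != NULL && target != NULL && cap >= 0 && cap < 64);
    for (const struct model_user_ns *ns = target; ns != NULL; ns = ns->parent) {
        if (ns == t->user_ns)
            return ((t->cap_effective >> cap) & 1u) != 0;
    }
    return false;
}

/* Configuring a netdev needs CAP_NET_ADMIN targeted at the netns owner. */
static bool model_may_configure_netdev(const struct model_task *t,
                                       const struct model_net *net)
{
    return model_ns_capable(t, net->owner, MODEL_CAP_NET_ADMIN);
}
```

In this model, a task that is root in a child user namespace is refused on a network namespace owned by the initial user namespace, but allowed on one owned by its own user namespace, which mirrors the veth example in the documentation text.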
* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
  2011-09-26 19:17     ` [kernel-hardening] " Vasiliy Kulikov
@ 2011-09-27 13:21       ` Serge E. Hallyn
  -1 siblings, 0 replies; 69+ messages in thread
From: Serge E. Hallyn @ 2011-09-27 13:21 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: Serge Hallyn, akpm, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap, kernel-hardening

Quoting Vasiliy Kulikov (segoon@openwall.com):
> (cc'ed kernel-hardening)
>
>
> Hi Serge,
>
> I haven't studied the patches deeply yet (sorry!), but I have some
> long-term questions about the technique in general. I couldn't find
> answers to the questions in the documentation.

Great - thanks for your time, Vasiliy.

There is documentation at https://wiki.ubuntu.com/UserNamespace, and I
was adding a Documentation/namespaces/user_namespace.txt file (which
hasn't gone in yet) which you can see here:
https://lkml.org/lkml/2011/7/26/351

But those don't answer your questions sufficiently.

> First, the patches by design expose much kernel code to unprivileged
> userspace processes. This code doesn't expect malformed data (e.g. VFS,
> specific filesystems, block layer, char drivers, sysadmin part of LSMs,
> etc. etc.). By relaxing permission rules you greatly increase the attack
> surface of the kernel from unprivileged users. Are you (or somebody
> else) planning to audit this code?

I had wanted to (but didn't) propose a discussion at ksummit about how
best to approach the filesystem code. That's not even just for user
namespaces - patches have been floated in the past to make mount an
unprivileged operation depending on the FS and the user's permission
over the device and target. So I don't know if a combination of auditing
and fuzzing is the way to go, or what, and wanted to get input from some
people who are more knowledgeable on that topic than me.

You're right about other kernel code as well. I'll certainly join in
this effort, but don't want to go blindly charging in without some
advice/guidance about the best way to do this and, if others are
interested, coordinate it.

We can start by looking through all code which is currently under
ns_capable(), and analyzing that. But what tools do we have available
to perform the analysis?

Do you think a kernel summit discussion (I suppose, given the late
timing, a beer BoF) would be beneficial? (I wouldn't be there)

> Also, will it be possible to somehow restrict what specific kernel
> facilities are accessible from users (IOW, what root emulation
> limitations are in action)? It is useful both from the point of view of
> the sysadmin, who might not want to allow users to do such things, and
> from the security POV in the sense of attack surface reduction.

You're probably thinking along different lines, but this is why I've
been wanting seccomp2 to get pushed through. So that we can deny a
container the syscalls we know it won't need, especially newer ones,
to reduce the attack surface available to it.

> The patches explicitly enable some features for users on a whitelist
> basis. It's possible to do it for simple cases, but what are you going
> to do with multiplexing functions where there is a permission check
> before the actual multiplexing? FS, networking drivers, etc. Are you
> going to do the same thing as net_namespace does? - For each multiplexed
> entity create a bool ->ns_aware which is false by default for all
> "untrusted"/not prepared protocols and is true for audited/prepared
> protocols. Or perhaps you have something else in mind?

Ah, I typed the bottom paragraph before realizing what you were
actually asking.

The filesystems are a good example. In the unprivileged mounts
patchsets, for instance, a flag was added to each filesystem indicating
if it was safe for unprivileged mounting (turned off for all real block
filesystems :). For targeted capabilities, my goal would be simply to
make sure that each non-netns-aware entity does an (untargeted)
capable() check. Without pointing to a specific example it's hard to
say what I will do. It depends on how the code was previously laid out,
and what the maintainer of that subsystem prefers.

The way we're approaching it right now is that by default everything
stays 'capable(X)', so that a non-init user namespace doesn't get the
privileges. While some of my patchsets this summer didn't follow this,
Eric reminded me that we should first clamp down on the user namespaces
as much as possible, and relax permissions in child namespaces later.
So the small (1-2 patch sized) sets I've been sending the last few
weeks are just trying to fix existing inadequate userid or capability
checks.

-serge

^ permalink raw reply [flat|nested] 69+ messages in thread
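The default-deny, opt-in idea discussed in this exchange (Vasiliy's ->ns_aware bool, and the per-filesystem "safe for unprivileged mounting" flag Serge mentions) can be sketched as a small userspace model in C. Everything here is an assumption made for illustration: `struct mux_entry` and `mux_may_use()` are hypothetical names, and the two boolean arguments stand in for the results of an untargeted capable() check and a targeted ns_capable() check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of one multiplexed entity (a protocol, a filesystem
 * type, a driver). Zero-initialised entries are not namespace aware, so
 * anything that has not been audited stays restricted by default.
 */
struct mux_entry {
    const char *name;
    bool ns_aware;   /* set to true only after the entry has been audited */
};

/*
 * Decide whether a caller may use the entry.
 *   cap_in_init_ns:  the caller holds the capability in the initial user
 *                    namespace (models the untargeted capable() check)
 *   cap_in_owner_ns: the caller holds it in the user namespace that owns
 *                    the resource (models a targeted check)
 */
static bool mux_may_use(const struct mux_entry *e,
                        bool cap_in_init_ns, bool cap_in_owner_ns)
{
    assert(e != NULL);
    if (cap_in_init_ns)
        return true;                        /* privileged everywhere */
    return e->ns_aware && cap_in_owner_ns;  /* targeted path is opt-in only */
}
```

The design choice this models is the one stated in the reply: every entry behaves like plain capable(X) until someone explicitly flips the flag, so root in a child user namespace gains nothing from code that has not been prepared for it.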
* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) 2011-09-27 13:21 ` [kernel-hardening] " Serge E. Hallyn @ 2011-09-27 15:56 ` Vasiliy Kulikov -1 siblings, 0 replies; 69+ messages in thread From: Vasiliy Kulikov @ 2011-09-27 15:56 UTC (permalink / raw) To: Serge E. Hallyn Cc: Serge Hallyn, akpm, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap, kernel-hardening On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote: > > First, the patches by design expose much kernel code to unprivileged > > userspace processes. This code doesn't expect malformed data (e.g. VFS, > > specific filesystems, block layer, char drivers, sysadmin part of LSMs, > > etc. etc.). By relaxing permission rules you greatly increase attack > > surface of the kernel from unprivileged users. Are you (or somebody > > else) planning to audit this code? > > I had wanted to (but didn't) propose a discussion at ksummit about how > best to approach the filesystem code. That's not even just for user > namespaces - patches have been floated in the past to make mount an > unprivileged operation depending on the FS and the user's permission > over the device and target. This is a dangerous operation by itself. AFAICS, this is the reason why e.g. FUSE doesn't expose user mount points to other users, or even to root. The problems range from violating rules like the existence of a single "." and ".." in each directory to filenames containing /, \000 and things like `, ", ', \. > So I don't know if a combination of auditing > and fuzzing is the way to go, Maybe a combination of both. There are no generic recommendations; it always depends on the subsystem, the property being checked, and the auditor. > > Also, will it be possible to somehow restrict what specific kernel > > facilities are accessible from users (IOW, what root emulation > > limitations are in action)?
It is useful both from the point of view of a sysadmin, > > who might not want to allow users to do such things, and from the > > security POV in the sense of attack surface reduction. > > You're probably thinking along different lines, but this is why I've > been wanting seccomp2 to get pushed through. So that we can deny a > container the syscalls we know it won't need, especially newer ones, > to reduce the attack surface available to it. This dependency greatly complicates things. First, there is a big misunderstanding between Will and Ingo about what needs seccompv2 should serve. Will wants to reduce the kernel attack surface by limiting the syscalls and syscall arguments available to a user (a single task, btw). Ingo wants to see a full-featured filtering engine, which needs code changes all over the kernel. Given the amount of changes needed, it is unlikely to reduce the attack surface. You probably don't want Will's version as syscall filtering is a very bad abstraction in your case. user_namespaces likely need Ingo's version of seccomp as it will be possible to filter e.g. fs-specific events, but even if it is implemented, it will take a looong time for your needs IMHO. Also, I'm afraid for _good_ user_namespace filtering the policy definition will be too complicated (like SELinux policy definition for non-trivial applications) if it is implemented in event-filtering terms. > The way we're approaching it right now is that by default everything > stays 'capable(X)', so that a non-init user namespace doesn't get the > privileges. Great. I was not sure about it. > While some of my patchsets this summer didn't follow this, > Eric reminded me that we should first clamp down on the user namespaces > as much as possible, and relax permissions in child namespaces later. I think it is the only sane way. > So the small (1-2 patch sized) sets I've been sending the last few > weeks are just trying to fix existing inadequate userid or capability > checks.
> > -serge Thanks, -- Vasiliy Kulikov http://www.openwall.com - bringing security into open computing environments
* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) 2011-09-27 15:56 ` [kernel-hardening] " Vasiliy Kulikov @ 2011-10-01 17:00 ` Serge E. Hallyn -1 siblings, 0 replies; 69+ messages in thread From: Serge E. Hallyn @ 2011-10-01 17:00 UTC (permalink / raw) To: Vasiliy Kulikov Cc: Serge Hallyn, akpm, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap, kernel-hardening Quoting Vasiliy Kulikov (segoon@openwall.com): > On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote: > > > First, the patches by design expose much kernel code to unprivileged > > > userspace processes. This code doesn't expect malformed data (e.g. VFS, > > > specific filesystems, block layer, char drivers, sysadmin part of LSMs, > > > etc. etc.). By relaxing permission rules you greatly increase attack > > > surface of the kernel from unprivileged users. Are you (or somebody > > > else) planning to audit this code? > > > > I had wanted to (but didn't) propose a discussion at ksummit about how > > best to approach the filesystem code. That's not even just for user > > namespaces - patches have been floated in the past to make mount an > > unprivileged operation depending on the FS and the user's permission > > over the device and target. > > This is a dangerous operation by itself. Of course it is :) And it's been a while since it has been brought up, but it *was* quite well thought through and thoroughly discussed - see e.g. https://lkml.org/lkml/2008/1/8/131 Oh, that's right. In the end the reason it didn't go in had to do with the ability for an unprivileged user to prevent a privileged user from unmounting trees by leaving a busy mount in a hidden namespace. Eric, in the past we didn't know what to do about that, but I wonder if setns could be used in some clever way to solve it from userspace. > AFAICS, this is the reason why > e.g. FUSE doesn't expose user mount points to other users, or even to root.
> The problems range from violating rules like the existence of a single "." and > ".." in each directory to filenames containing /, \000 > and things like `, ", ', \. > > > > So I don't know if a combination of auditing > > and fuzzing is the way to go, > > Maybe a combination of both. There are no generic recommendations; > it always depends on the subsystem, the property being checked, and the > auditor. Ok, let me keep focusing on the tightening down right now, and then before proceeding with relaxing, I'll start some analysis and discussion of the code which is already under targeted (ns_capable) capability checks. > > > Also, will it be possible to somehow restrict what specific kernel > > > facilities are accessible from users (IOW, what root emulation > > > limitations are in action)? It is useful both from the point of view of a sysadmin, > > > who might not want to allow users to do such things, and from the > > > security POV in the sense of attack surface reduction. > > > > You're probably thinking along different lines, but this is why I've > > been wanting seccomp2 to get pushed through. So that we can deny a > > container the syscalls we know it won't need, especially newer ones, > > to reduce the attack surface available to it. > > This dependency greatly complicates things. IMO this is not a dependency for user namespaces though - it's only a dependency for unprivileged user namespaces. And we haven't seriously discussed doing that yet precisely because we're nowhere near ready (and frankly I don't know that it'll ever be sane). > First, there is a big misunderstanding between Will and Ingo about what > needs seccompv2 should serve. Will wants to reduce the kernel attack I know I know :) > surface by limiting the syscalls and syscall arguments available to a user > (a single task, btw). Ingo wants to see a full-featured filtering > engine, which needs code changes all over the kernel. Given the amount of > changes needed, it is unlikely to reduce the attack surface.
> > You probably don't want Will's version as syscall filtering is a very It seems to me per-syscall filtering is a great start. I'm not looking to seccomp2 as an assurance against formerly privileged (and now only privileged per-namespace) code which may have had previously overlooked bugs. I'm looking to seccomp2 as an assurance against bugs in newly written syscalls or the compatibility layer. > bad abstraction in your case. user_namespaces likely need Ingo's > version of seccomp as it will be possible to filter e.g. fs-specific > events, but even if it is implemented, it will take a looong time for > your needs IMHO. Yes, I think that would just lead to exploits through bad policy. > Also, I'm afraid for _good_ user_namespace filtering the policy > definition will be too complicated (like SELinux policy definition for > non-trivial applications) if it is implemented in event-filtering > terms. > > > > The way we're approaching it right now is that by default everything > > stays 'capable(X)', so that a non-init user namespace doesn't get the > > privileges. > > Great. I was not sure about it. > > > > While some of my patchsets this summer didn't follow this, > > Eric reminded me that we should first clamp down on the user namespaces > > as much as possible, and relax permissions in child namespaces later. > > I think it is the only sane way. Yup. I trust you and Eric will keep me in check if I get over-zealous :) -serge
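The point made above about denying a container "especially newer" syscalls can be illustrated with a small sketch. This is a toy comparison under stated assumptions, not any seccomp interface: the syscall sets, the function names and the placeholder "brand_new_syscall" are all made up for illustration. It only shows why a default-deny (allow-list) policy covers syscalls that did not exist when the policy was written, while a default-allow (deny-list) policy does not.

```python
"""Toy comparison of allow-list and deny-list syscall filtering.

Hypothetical illustration only: the syscall sets and function names are
made up for this sketch and do not correspond to any seccomp interface.
"""

# Syscalls the container is known to need (illustrative subset).
CONTAINER_ALLOW_LIST = frozenset(
    {"read", "write", "openat", "close", "exit_group"}
)

# Syscalls an admin explicitly decided to block (illustrative subset).
CONTAINER_DENY_LIST = frozenset({"kexec_load", "init_module"})


def allow_list_permits(syscall, allowed=CONTAINER_ALLOW_LIST):
    """Default deny: only syscalls named in the policy are let through."""
    return syscall in allowed


def deny_list_permits(syscall, denied=CONTAINER_DENY_LIST):
    """Default allow: anything the policy author did not anticipate passes."""
    return syscall not in denied


if __name__ == "__main__":
    # "brand_new_syscall" stands in for a syscall added after the policy
    # was written.
    for name in ("read", "kexec_load", "brand_new_syscall"):
        print(f"{name:18} allow-list={allow_list_permits(name)} "
              f"deny-list={deny_list_permits(name)}")
```

The sketch says nothing about filtering on syscall arguments or on finer-grained events, which is where the two seccomp proposals mentioned in the thread differ.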
* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) 2011-10-01 17:00 ` [kernel-hardening] " Serge E. Hallyn @ 2011-10-03 1:46 ` Eric W. Biederman -1 siblings, 0 replies; 69+ messages in thread From: Eric W. Biederman @ 2011-10-03 1:46 UTC (permalink / raw) To: Serge E. Hallyn Cc: Vasiliy Kulikov, Serge Hallyn, akpm, linux-kernel, netdev, containers, dhowells, rdunlap, kernel-hardening "Serge E. Hallyn" <serge.hallyn@canonical.com> writes: > Quoting Vasiliy Kulikov (segoon@openwall.com): >> On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote: >> > > First, the patches by design expose much kernel code to unprivileged >> > > userspace processes. This code doesn't expect malformed data (e.g. VFS, >> > > specific filesystems, block layer, char drivers, sysadmin part of LSMs, >> > > etc. etc.). By relaxing permission rules you greatly increase attack >> > > surface of the kernel from unprivileged users. Are you (or somebody >> > > else) planning to audit this code? Well, in theory this patchset does expose this code to unprivileged user space in a way that increases the attack surface. However, right now there are a lot of cases where, because the kernel lacks a sufficient mechanism, people are just given root privileges so that they can get things done. Network manager controlling the network stack as an unprivileged user. Random filesystems on USB sticks being mounted and unmounted automatically when the USB sticks are inserted and removed. I completely agree that auditing and looking at the code is necessary. I think most of what will happen is that we will start directly supporting how the kernel is actually used in the real world. Which should actually reduce our level of vulnerability, because we give up the delusion that large classes of operations don't need careful attention because only root can perform them.
These are operations for which user space authors turn around and write a suid binary, so unprivileged user space performs them all day long. >> > I had wanted to (but didn't) propose a discussion at ksummit about how >> > best to approach the filesystem code. That's not even just for user >> > namespaces - patches have been floated in the past to make mount an >> > unprivileged operation depending on the FS and the user's permission >> > over the device and target. >> >> This is a dangerous operation by itself. > > Of course it is :) And it's been a while since it has been brought up, > but it *was* quite well thought through and thoroughly discussed - see > e.g. https://lkml.org/lkml/2008/1/8/131 > > Oh, that's right. In the end the reason it didn't go in had to do with > the ability for an unprivileged user to prevent a privileged user from > unmounting trees by leaving a busy mount in a hidden namespace. > > Eric, in the past we didn't know what to do about that, but I wonder > if setns could be used in some clever way to solve it from userspace. Oh. That is a good objection. I had not realized that unprivileged mounts had that problem. Still, the solution is straightforward. If the concern is that an unprivileged user can prevent a privileged user from unmounting trees, we need to require that a forced unmount of the filesystem triggers a revoke on all open files. sysfs and proc already support revoke at the per-file level so that we can safely remove modules; we just need to extend that support to the forced unmount case. This is a problem that actually needs to be solved for ordinary filesystems as well, because of hot-pluggable USB drives. For filesystems like ext4 it is more difficult because we need a solution that does not sacrifice performance in the common case. I was talking to Ted Ts'o a bit about this at Plumbers conf. It happens that hot unplug of USB devices with mounted filesystems is currently a never-ending source of subtle bugs in the extN code.
The one implementation detail that sounds a bit tricky is what to do about mount structures in mount namespaces when we forcibly unmount a filesystem. That could get a bit complicated, but if that is the only hang-up I'm certain we can figure something out. Eric
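The "forced unmount triggers a revoke on all open files" idea from the message above can be sketched as a toy model. Everything in it is an assumption made for illustration: the Filesystem and OpenFile classes, the umount(force=...) signature, and the choice of EIO for I/O on a revoked handle are hypothetical and do not describe existing kernel behaviour. It does not address the open question raised above about mount structures in other mount namespaces.

```python
"""Toy model of a forced unmount that revokes open files.

Hypothetical illustration only: Filesystem, OpenFile and the choice of EIO
for revoked handles are assumptions of this sketch, not kernel behaviour.
"""
import errno


class OpenFile:
    def __init__(self, path):
        self.path = path
        self.revoked = False

    def read(self):
        # After a revoke, the handle stays valid but all I/O fails.
        if self.revoked:
            raise OSError(errno.EIO, "file was revoked by a forced unmount")
        return f"contents of {self.path}"


class Filesystem:
    def __init__(self, name):
        self.name = name
        self.mounted = True
        self.open_files = []

    def open(self, path):
        if not self.mounted:
            raise OSError(errno.ENOENT, "filesystem is not mounted")
        handle = OpenFile(path)
        self.open_files.append(handle)
        return handle

    def umount(self, force=False):
        # An ordinary unmount refuses to proceed while files are open,
        # which is what lets an unprivileged user pin the mount.
        if self.open_files and not force:
            raise OSError(errno.EBUSY, "filesystem is busy")
        # A forced unmount revokes every open file instead of failing.
        for handle in self.open_files:
            handle.revoked = True
        self.open_files.clear()
        self.mounted = False
```

In this model a holder of an open file can block an ordinary umount() with EBUSY, but cannot block umount(force=True); the holder's handle simply starts returning errors.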
* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) 2011-10-03 1:46 ` [kernel-hardening] " Eric W. Biederman @ 2011-10-03 19:53 ` Eric W. Biederman -1 siblings, 0 replies; 69+ messages in thread From: Eric W. Biederman @ 2011-10-03 19:53 UTC (permalink / raw) To: Serge E. Hallyn Cc: Vasiliy Kulikov, Serge Hallyn, akpm, linux-kernel, netdev, containers, dhowells, rdunlap, kernel-hardening ebiederm@xmission.com (Eric W. Biederman) writes: > "Serge E. Hallyn" <serge.hallyn@canonical.com> writes: > >> Quoting Vasiliy Kulikov (segoon@openwall.com): >>> On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote: >>> > > First, the patches by design expose much kernel code to unprivileged >>> > > userspace processes. This code doesn't expect malformed data (e.g. VFS, >>> > > specific filesystems, block layer, char drivers, sysadmin part of LSMs, >>> > > etc. etc.). By relaxing permission rules you greatly increase attack >>> > > surface of the kernel from unprivileged users. Are you (or somebody >>> > > else) planning to audit this code? > > Well, in theory this patchset does expose this code to unprivileged user > space in a way that increases the attack surface. However, right now > there are a lot of cases where, because the kernel lacks a sufficient > mechanism, people are just given root privileges so that they can get things > done. Network manager controlling the network stack as an unprivileged > user. Random filesystems on USB sticks being mounted and unmounted > automatically when the USB sticks are inserted and removed. > > I completely agree that auditing and looking at the code is necessary. I > think most of what will happen is that we will start directly supporting > how the kernel is actually used in the real world. Which should > actually reduce our level of vulnerability, because we give up the > delusion that large classes of operations don't need careful > attention because only root can perform them.
These are operations for which > user space authors turn around and write a suid binary, so > unprivileged user space performs them all day long. > >>> > I had wanted to (but didn't) propose a discussion at ksummit about how >>> > best to approach the filesystem code. That's not even just for user >>> > namespaces - patches have been floated in the past to make mount an >>> > unprivileged operation depending on the FS and the user's permission >>> > over the device and target. >>> >>> This is a dangerous operation by itself. >> >> Of course it is :) And it's been a while since it has been brought up, >> but it *was* quite well thought through and thoroughly discussed - see >> e.g. https://lkml.org/lkml/2008/1/8/131 >> >> Oh, that's right. In the end the reason it didn't go in had to do with >> the ability for an unprivileged user to prevent a privileged user from >> unmounting trees by leaving a busy mount in a hidden namespace. >> >> Eric, in the past we didn't know what to do about that, but I wonder >> if setns could be used in some clever way to solve it from userspace. > > Oh. That is a good objection. I had not realized that unprivileged > mounts had that problem. I just re-read the discussion you are referring to and that wasn't it. FUSE already has something like a revoke in its umount -f implementation. Eric
* Re: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) 2011-10-03 19:53 ` [kernel-hardening] " Eric W. Biederman @ 2011-10-03 20:04 ` Serge E. Hallyn -1 siblings, 0 replies; 69+ messages in thread From: Serge E. Hallyn @ 2011-10-03 20:04 UTC (permalink / raw) To: Eric W. Biederman Cc: Vasiliy Kulikov, Serge Hallyn, akpm, linux-kernel, netdev, containers, dhowells, rdunlap, kernel-hardening Quoting Eric W. Biederman (ebiederm@xmission.com): > ebiederm@xmission.com (Eric W. Biederman) writes: > > > "Serge E. Hallyn" <serge.hallyn@canonical.com> writes: > > > >> Quoting Vasiliy Kulikov (segoon@openwall.com): > >>> On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote: > >>> > > First, the patches by design expose much kernel code to unprivileged > >>> > > userspace processes. This code doesn't expect malformed data (e.g. VFS, > >>> > > specific filesystems, block layer, char drivers, sysadmin part of LSMs, > >>> > > etc. etc.). By relaxing permission rules you greatly increase attack > >>> > > surface of the kernel from unprivileged users. Are you (or somebody > >>> > > else) planning to audit this code? > > > > Well, in theory this does expose this code to unprivileged user > > space in a way that increases the attack surface. However right now > > there are a lot of cases where because the kernel lacks a sufficient > > mechanism people are just given root privileges so they can get things > > done. Network manager controlling the network stack as an unprivileged > > user. Random filesystems on usb sticks being mounted and unmounted > > automatically when the usb sticks are inserted and removed. > > > > I completely agree that auditing and looking at the code is necessary. I > > think most of what will happen is that we will start directly supporting > > how the kernel is actually used in the real world. 
Which should > > actually reduce our level of vulnerability, because we give up the > > delusion that large classes of operations don't need careful > > attention because only root can perform them. Operations which the > > user space authors turn around and write a suid binary for and > > unprivileged user space performs those operations all day long. > > > >>> > I had wanted to (but didn't) propose a discussion at ksummit about how > >>> > best to approach the filesystem code. That's not even just for user > >>> > namespaces - patches have been floated in the past to make mount an > >>> > unprivileged operation depending on the FS and the user's permission > >>> > over the device and target. > >>> > >>> This is a dangerous operation by itself. > >> > >> Of course it is :) And it's been a while since it has been brought up, > >> but it *was* quite well thought through and thoroughly discussed - see > >> e.g. https://lkml.org/lkml/2008/1/8/131 > >> > >> Oh, that's right. In the end the reason it didn't go in had to do with > >> the ability for an unprivileged user to prevent a privileged user from > >> unmounting trees by leaving a busy mount in a hidden namespace. > >> > >> Eric, in the past we didn't know what to do about that, but I wonder > >> if setns could be used in some clever way to solve it from userspace. > > > > Oh. That is a good objection. I had not realized that unprivileged > > mounts had that problem. > > I just re-read the discussion you are referring to and that wasn't The one I linked was one discussion, but not the final one. https://lkml.org/lkml/2008/10/6/72 is the one where the need for revoke is brought up. > it. Fuse already has something like a revoke in its umount -f > implementation. I'll have to (haven't yet) take a look at it. -serge ^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH 07/15] user namespace: use net->user_ns for some capable calls under net/ 2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn ` (2 preceding siblings ...) 2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn @ 2011-09-02 19:56 ` Serge Hallyn 2011-09-02 19:56 ` [PATCH 08/15] af_netlink.c: make netlink_capable userns-aware Serge Hallyn ` (7 subsequent siblings) 11 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap Cc: Serge Hallyn From: Serge Hallyn <serge.hallyn@ubuntu.com> Just a partial conversion to show how the previous patch is expected to be used. Changelog: 6/28/11: fix typo in net/core/sock.c 7/08/11: don't target capability which authorizes module loading Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Eric W. Biederman <ebiederm@xmission.com> --- net/core/dev.c | 4 ++-- net/core/sock.c | 14 ++++++++------ 2 files changed, 10 insertions(+), 8 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 17d67b5..6ae955f 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5014,7 +5014,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg) case SIOCGMIIPHY: case SIOCGMIIREG: case SIOCSIFNAME: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; dev_load(net, ifr.ifr_name); rtnl_lock(); @@ -5053,7 +5053,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg) case SIOCBRADDIF: case SIOCBRDELIF: case SIOCSHWTSTAMP: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; /* fall through */ case SIOCBONDSLAVEINFOQUERY: diff --git a/net/core/sock.c b/net/core/sock.c index bc745d0..0f31675 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -420,7 +420,7 @@ static int sock_bindtodevice(struct sock *sk, char __user *optval, 
int optlen) /* Sorry... */ ret = -EPERM; - if (!capable(CAP_NET_RAW)) + if (!ns_capable(net->user_ns, CAP_NET_RAW)) goto out; ret = -EINVAL; @@ -488,6 +488,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname, int valbool; struct linger ling; int ret = 0; + struct net *net = sock_net(sk); /* * Options without arguments @@ -508,7 +509,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname, switch (optname) { case SO_DEBUG: - if (val && !capable(CAP_NET_ADMIN)) + if (val && !ns_capable(net->user_ns, CAP_NET_ADMIN)) ret = -EACCES; else sock_valbool_flag(sk, SOCK_DBG, valbool); @@ -551,7 +552,7 @@ set_sndbuf: break; case SO_SNDBUFFORCE: - if (!capable(CAP_NET_ADMIN)) { + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) { ret = -EPERM; break; } @@ -589,7 +590,7 @@ set_rcvbuf: break; case SO_RCVBUFFORCE: - if (!capable(CAP_NET_ADMIN)) { + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) { ret = -EPERM; break; } @@ -612,7 +613,8 @@ set_rcvbuf: break; case SO_PRIORITY: - if ((val >= 0 && val <= 6) || capable(CAP_NET_ADMIN)) + if ((val >= 0 && val <= 6) || + ns_capable(net->user_ns, CAP_NET_ADMIN)) sk->sk_priority = val; else ret = -EPERM; @@ -729,7 +731,7 @@ set_rcvbuf: clear_bit(SOCK_PASSSEC, &sock->flags); break; case SO_MARK: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) ret = -EPERM; else sk->sk_mark = val; -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
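To illustrate the SO_PRIORITY hunk in the patch above: priorities 0 through 6 need no privilege, while any other value needs CAP_NET_ADMIN, which this patch checks against the user namespace owning the socket's network namespace. The following is only a minimal userspace sketch of that rule. model_set_priority() and its has_net_admin flag are hypothetical stand-ins for the sock_setsockopt() branch and for the result of ns_capable(net->user_ns, CAP_NET_ADMIN); they are not kernel interfaces.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the SO_PRIORITY branch; not kernel code.
 * has_net_admin stands in for ns_capable(net->user_ns, CAP_NET_ADMIN).
 */
int model_set_priority(int *sk_priority, int val, bool has_net_admin)
{
    /* Priorities 0..6 are open to everyone; anything else needs privilege. */
    if ((val >= 0 && val <= 6) || has_net_admin) {
        *sk_priority = val;
        return 0;
    }
    return -EPERM;
}
```

The only thing the patch changes in this branch is where the privilege comes from: the range test is untouched, and the capability is now evaluated relative to the namespace that owns the socket's network namespace rather than globally.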
* [PATCH 08/15] af_netlink.c: make netlink_capable userns-aware 2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn ` (3 preceding siblings ...) 2011-09-02 19:56 ` [PATCH 07/15] user namespace: use net->user_ns for some capable calls under net/ Serge Hallyn @ 2011-09-02 19:56 ` Serge Hallyn [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> ` (6 subsequent siblings) 11 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap Cc: Serge E. Hallyn, Eric Dumazet From: "Serge E. Hallyn" <serge.hallyn@canonical.com> netlink_capable should check for permissions against the user namespace owning the socket in question. Changelog: Per Eric Dumazet advice, use sock_net(sk) instead of #ifdef. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> --- net/netlink/af_netlink.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c index 0a4db02..3cc0bbe 100644 --- a/net/netlink/af_netlink.c +++ b/net/netlink/af_netlink.c @@ -580,8 +580,9 @@ retry: static inline int netlink_capable(struct socket *sock, unsigned int flag) { - return (nl_table[sock->sk->sk_protocol].nl_nonroot & flag) || - capable(CAP_NET_ADMIN); + if (nl_table[sock->sk->sk_protocol].nl_nonroot & flag) + return 1; + return ns_capable(sock_net(sock->sk)->user_ns, CAP_NET_ADMIN); } static void -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
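The control flow of the rewritten netlink_capable() above can be sketched in userspace as follows. This is an assumption-laden model, not the kernel function: struct model_nl_proto, the MODEL_NL_NONROOT_* constants and the owner_ns_net_admin flag are names invented for illustration, with the flag standing in for ns_capable(sock_net(sock->sk)->user_ns, CAP_NET_ADMIN).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the per-protocol "nonroot" permission bits. */
#define MODEL_NL_NONROOT_RECV 0x1
#define MODEL_NL_NONROOT_SEND 0x2

struct model_nl_proto {
    unsigned int nl_nonroot;    /* operations open to unprivileged users */
};

/*
 * Model of the two-step decision: a protocol may open an operation to
 * everyone; otherwise the caller needs CAP_NET_ADMIN in the user namespace
 * that owns the socket's network namespace (passed in here as a flag).
 */
bool model_netlink_capable(const struct model_nl_proto *proto,
                           unsigned int flag, bool owner_ns_net_admin)
{
    if (proto->nl_nonroot & flag)
        return true;
    return owner_ns_net_admin;
}
```

Splitting the original one-line expression into two statements, as the patch does, keeps the unprivileged fast path separate from the namespace-targeted capability check.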
* [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> 2011-09-02 19:56 ` (unknown), Serge Hallyn @ 2011-09-02 19:56 ` Serge Hallyn 2011-09-02 19:56 ` Serge Hallyn ` (13 subsequent siblings) 15 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w From: "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> Quoting David Howells (dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org): > Randy Dunlap <rdunlap-/UHa2rfvQTnk1uMJSBkQmQ@public.gmane.org> wrote: > > > > +Any task in or resource belonging to the initial user namespace will, to this > > > +new task, appear to belong to UID and GID -1 - which is usually known as > > > > that extra hyphen is confusing. how about: > > > > to UID and GID -1, which is > > 'which are'. > > David This will hold some info about the design. Currently it contains future todos, issues and questions. Changelog: jul 26: incorporate feedback from David Howells. jul 29: incorporate feedback from Randy Dunlap. Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Cc: Eric W. 
Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> Cc: David Howells <dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Randy Dunlap <rdunlap-/UHa2rfvQTnk1uMJSBkQmQ@public.gmane.org> --- Documentation/namespaces/user_namespace.txt | 107 +++++++++++++++++++++++++++ 1 files changed, 107 insertions(+), 0 deletions(-) create mode 100644 Documentation/namespaces/user_namespace.txt diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt new file mode 100644 index 0000000..b0bc480 --- /dev/null +++ b/Documentation/namespaces/user_namespace.txt @@ -0,0 +1,107 @@ +Description +=========== + +Traditionally, each task is owned by a user ID (UID) and belongs to one or more +groups (GID). Both are simple numeric IDs, though userspace usually translates +them to names. The user namespace allows tasks to have different views of the +UIDs and GIDs associated with tasks and other resources. (See 'UID mapping' +below for more.) + +The user namespace is a simple hierarchical one. The system starts with all +tasks belonging to the initial user namespace. A task creates a new user +namespace by passing the CLONE_NEWUSER flag to clone(2). This requires the +creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities, +but it does not need to be running as root. The clone(2) call will result in a +new task which to itself appears to be running as UID and GID 0, but to its +creator seems to have the creator's credentials. + +To this new task, any resource belonging to the initial user namespace will +appear to belong to user and group 'nobody', which are UID and GID -1. +Permission to open such files will be granted according to world access +permissions. UID comparisons and group membership checks will return false, +and privilege will be denied. 
+ +When a task belonging to (for example) userid 500 in the initial user namespace +creates a new user namespace, even though the new task will see itself as +belonging to UID 0, any task in the initial user namespace will see it as +belonging to UID 500. Therefore, UID 500 in the initial user namespace will be +able to kill the new task. Files created by the new user will (eventually) be +seen by tasks in its own user namespace as belonging to UID 0, but to tasks in +the initial user namespace as belonging to UID 500. + +Note that this userid mapping for the VFS is not yet implemented, though the +lkml and containers mailing list archives will show several previous +prototypes. In the end, those got hung up waiting on the concept of targeted +capabilities to be developed, which, thanks to the insight of Eric Biederman, +they finally did. + +Relationship between the User namespace and other namespaces +============================================================ + +Other namespaces, such as UTS and network, are owned by a user namespace. When +such a namespace is created, it is assigned to the user namespace of the task +by which it was created. Therefore, attempts to exercise privilege to +resources in, for instance, a particular network namespace, can be properly +validated by checking whether the caller has the needed privilege (e.g. +CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace. +This is done using the ns_capable() function. + +As an example, if a new task is cloned with a private user namespace but +no private network namespace, then the task's network namespace is owned +by the parent user namespace. The new task has no privilege to the +parent user namespace, so it will not be able to create or configure +network devices. 
If, instead, the task were cloned with both private +user and network namespaces, then the private network namespace is owned +by the private user namespace, and so root in the new user namespace +will have privilege targeted to the network namespace. It will be able +to create and configure network devices. + +UID Mapping +=========== +The current plan (see 'flexible UID mapping' at +https://wiki.ubuntu.com/UserNamespace) is: + +The UID/GID stored on disk will be that in the init_user_ns. Most likely +UID/GID in other namespaces will be stored in xattrs. But Eric was advocating +(a few years ago) leaving the details up to filesystems while providing a lib/ +stock implementation. See the thread around here: +http://www.mail-archive.com/devel-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org/msg09331.html + + +Working notes +============= +Capability checks for actions related to syslog must be against the +init_user_ns until syslog is containerized. + +Same is true for reboot and power, control groups, devices, and time. + +Perf actions (kernel/event/core.c for instance) will always be constrained to +init_user_ns. + +Q: +Is accounting considered properly containerized with respect to pidns? (it +appears to be). If so, then we can change the capable() check in +kernel/acct.c to 'ns_capable(current_pid_ns()->user_ns, CAP_PACCT)' + +Q: +For things like nice and schedaffinity, we could allow root in a container to +control those, and leave only cgroups to constrain the container. I'm not sure +whether that is right, or whether it violates admin expectations. + +I deferred some of commoncap.c. I'm punting on xattr stuff as they take +dentries, not inodes. + +For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of +them) target the capability checks at the user_ns owning the tty. That will +have to wait until we get userns owning files straightened out. + +We need to figure out how to label devices. Should we just toss a user_ns +right into struct device? 
+ +capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless +some day LSMs were to be containerized (near-zero chance). + +inode_owner_or_capable() should probably take optional ns and cap parameters. +If cap is 0, then CAP_FOWNER is checked. If ns is NULL, we derive the ns from +inode. But if ns is provided, then callers who need to derive +inode_userns(inode) anyway can save a few cycles. -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
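The ownership rule described in the "Relationship between the User namespace and other namespaces" section of the documentation patch above can be sketched as a small userspace model. Everything here is hypothetical and simplified: struct model_user_ns, struct model_net_ns, struct model_task and model_ns_capable() are illustrative names, not kernel types, and the walk only approximates what a targeted capability check has to decide (same namespace: look at the capability bits; creator of the target or of one of its ancestors: allowed; otherwise keep walking toward the initial namespace).

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CAP_NET_ADMIN 12  /* numeric value of CAP_NET_ADMIN on Linux */

/* Hypothetical model types; these are not the kernel's structures. */
struct model_user_ns {
    struct model_user_ns *parent;   /* NULL for the initial namespace */
    int creator_uid;                /* uid, in the parent, of the creator */
};

struct model_net_ns {
    struct model_user_ns *user_ns;  /* owner, like net->user_ns above */
};

struct model_task {
    struct model_user_ns *user_ns;  /* namespace the task lives in */
    int uid;                        /* uid within that namespace */
    unsigned long caps;             /* effective capability bitmask */
};

/* Does the task hold "cap" targeted at the user namespace "target"? */
bool model_ns_capable(const struct model_task *t,
                      const struct model_user_ns *target, int cap)
{
    const struct model_user_ns *ns;

    for (ns = target; ns != NULL; ns = ns->parent) {
        /* Same namespace: the ordinary capability bits decide. */
        if (ns == t->user_ns)
            return ((t->caps >> cap) & 1UL) != 0;
        /* The creator of a namespace is privileged over it. */
        if (ns->parent == t->user_ns && ns->creator_uid == t->uid)
            return true;
    }
    /* The target is not the task's namespace or a descendant of it. */
    return false;
}
```

With an initial namespace, a child namespace created by UID 500, and a task that is root with all capability bits set inside the child, the model denies CAP_NET_ADMIN over a network namespace owned by the initial namespace and grants it over one owned by the child, which matches the veth example in the documentation text. It also grants the UID 500 creator in the initial namespace privilege over the child.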
* [PATCH 02/15] user ns: setns: move capable checks into per-ns attach helper 2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn 2011-09-02 19:56 ` (unknown), Serge Hallyn @ 2011-09-02 19:56 ` Serge Hallyn 2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn ` (9 subsequent siblings) 11 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm From: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Eric W. Biederman <ebiederm@xmission.com> --- ipc/namespace.c | 7 +++++++ kernel/fork.c | 5 +++++ kernel/nsproxy.c | 11 ++++++++--- kernel/utsname.c | 7 +++++++ net/core/net_namespace.c | 7 +++++++ 5 files changed, 34 insertions(+), 3 deletions(-) diff --git a/ipc/namespace.c b/ipc/namespace.c index ce0a647..a0a7609 100644 --- a/ipc/namespace.c +++ b/ipc/namespace.c @@ -163,6 +163,13 @@ static void ipcns_put(void *ns) static int ipcns_install(struct nsproxy *nsproxy, void *ns) { +#if 0 + struct ipc_namespace *newns = ns; + if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN)) +#else + if (!capable(CAP_SYS_ADMIN)) +#endif + return -1; /* Ditch state from the old ipc namespace */ exit_sem(current); put_ipc_ns(nsproxy->ipc_ns); diff --git a/kernel/fork.c b/kernel/fork.c index 8e6b6f4..ca712f5 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1489,8 +1489,13 @@ long do_fork(unsigned long clone_flags, /* hopefully this check will go away when userns support is * complete */ +#if 0 + if (!nsown_capable(CAP_SYS_ADMIN) || !nsown_capable(CAP_SETUID) || + !nsown_capable(CAP_SETGID)) +#else if 
(!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) || !capable(CAP_SETGID)) +#endif return -EPERM; } diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c index 9aeab4b..e274577 100644 --- a/kernel/nsproxy.c +++ b/kernel/nsproxy.c @@ -134,7 +134,11 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk) CLONE_NEWPID | CLONE_NEWNET))) return 0; +#if 0 + if (!nsown_capable(CAP_SYS_ADMIN)) { +#else if (!capable(CAP_SYS_ADMIN)) { +#endif err = -EPERM; goto out; } @@ -191,7 +195,11 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags, CLONE_NEWNET))) return 0; +#if 0 + if (!nsown_capable(CAP_SYS_ADMIN)) +#else if (!capable(CAP_SYS_ADMIN)) +#endif return -EPERM; *new_nsp = create_new_namespaces(unshare_flags, current, @@ -241,9 +249,6 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype) struct file *file; int err; - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - file = proc_ns_fget(fd); if (IS_ERR(file)) return PTR_ERR(file); diff --git a/kernel/utsname.c b/kernel/utsname.c index bff131b..4638a54 100644 --- a/kernel/utsname.c +++ b/kernel/utsname.c @@ -104,6 +104,13 @@ static void utsns_put(void *ns) static int utsns_install(struct nsproxy *nsproxy, void *ns) { +#if 0 + struct uts_namespace *newns = ns; + if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN)) +#else + if (!capable(CAP_SYS_ADMIN)) +#endif + return -1; get_uts_ns(ns); put_uts_ns(nsproxy->uts_ns); nsproxy->uts_ns = ns; diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 5bbdbf0..6f6698d 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -620,6 +620,13 @@ static void netns_put(void *ns) static int netns_install(struct nsproxy *nsproxy, void *ns) { +#if 0 + struct net *net = ns; + if (!ns_capable(net->user_ns, CAP_SYS_ADMIN)) +#else + if (!capable(CAP_SYS_ADMIN)) +#endif + return -1; put_net(nsproxy->net_ns); nsproxy->net_ns = get_net(ns); return 0; -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
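The restructuring in this patch, where setns() stops doing one global capable(CAP_SYS_ADMIN) test and each namespace type's install helper performs its own check, can be sketched in userspace roughly as below. This is a hedged model of the intended end state (the "#if 0" branches), not the kernel code: struct model_caller, struct model_ns, model_install_targeted() and model_setns() are hypothetical names, and privilege is reduced to "the caller is admin in exactly one user namespace", ignoring inheritance from ancestor namespaces.

```c
#include <errno.h>

/* Hypothetical model: the caller is privileged in one user namespace id. */
struct model_caller {
    int admin_over;
};

/* Hypothetical model of a namespace object with its own install hook. */
struct model_ns {
    const char *kind;       /* "uts", "ipc", "net", ... */
    int owner_user_ns;      /* id of the owning user namespace */
    int (*install)(const struct model_caller *c, const struct model_ns *ns);
};

/* Per-namespace check, targeted at the namespace's owning user namespace. */
int model_install_targeted(const struct model_caller *c,
                           const struct model_ns *ns)
{
    if (c->admin_over != ns->owner_user_ns)
        return -EPERM;
    return 0;
}

/* The syscall path no longer checks privilege itself; the helper decides. */
int model_setns(const struct model_caller *c, const struct model_ns *ns)
{
    return ns->install(c, ns);
}
```

Delegating the decision is what lets each namespace type later switch from the global check to one targeted at its own owning user namespace without touching the syscall entry point again.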
* [PATCH 02/15] user ns: setns: move capable checks into per-ns attach helper @ 2011-09-02 19:56 ` Serge Hallyn 0 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w From: "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- ipc/namespace.c | 7 +++++++ kernel/fork.c | 5 +++++ kernel/nsproxy.c | 11 ++++++++--- kernel/utsname.c | 7 +++++++ net/core/net_namespace.c | 7 +++++++ 5 files changed, 34 insertions(+), 3 deletions(-) diff --git a/ipc/namespace.c b/ipc/namespace.c index ce0a647..a0a7609 100644 --- a/ipc/namespace.c +++ b/ipc/namespace.c @@ -163,6 +163,13 @@ static void ipcns_put(void *ns) static int ipcns_install(struct nsproxy *nsproxy, void *ns) { +#if 0 + struct ipc_namespace *newns = ns; + if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN)) +#else + if (!capable(CAP_SYS_ADMIN)) +#endif + return -1; /* Ditch state from the old ipc namespace */ exit_sem(current); put_ipc_ns(nsproxy->ipc_ns); diff --git a/kernel/fork.c b/kernel/fork.c index 8e6b6f4..ca712f5 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1489,8 +1489,13 @@ long do_fork(unsigned long clone_flags, /* hopefully this check will go away when userns support is * complete */ +#if 0 + if (!nsown_capable(CAP_SYS_ADMIN) || !nsown_capable(CAP_SETUID) || + !nsown_capable(CAP_SETGID)) +#else if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) || !capable(CAP_SETGID)) +#endif return -EPERM; } diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c index 9aeab4b..e274577 100644 --- a/kernel/nsproxy.c +++ b/kernel/nsproxy.c @@ -134,7 +134,11 @@ int 
copy_namespaces(unsigned long flags, struct task_struct *tsk) CLONE_NEWPID | CLONE_NEWNET))) return 0; +#if 0 + if (!nsown_capable(CAP_SYS_ADMIN)) { +#else if (!capable(CAP_SYS_ADMIN)) { +#endif err = -EPERM; goto out; } @@ -191,7 +195,11 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags, CLONE_NEWNET))) return 0; +#if 0 + if (!nsown_capable(CAP_SYS_ADMIN)) +#else if (!capable(CAP_SYS_ADMIN)) +#endif return -EPERM; *new_nsp = create_new_namespaces(unshare_flags, current, @@ -241,9 +249,6 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype) struct file *file; int err; - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - file = proc_ns_fget(fd); if (IS_ERR(file)) return PTR_ERR(file); diff --git a/kernel/utsname.c b/kernel/utsname.c index bff131b..4638a54 100644 --- a/kernel/utsname.c +++ b/kernel/utsname.c @@ -104,6 +104,13 @@ static void utsns_put(void *ns) static int utsns_install(struct nsproxy *nsproxy, void *ns) { +#if 0 + struct uts_namespace *newns = ns; + if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN)) +#else + if (!capable(CAP_SYS_ADMIN)) +#endif + return -1; get_uts_ns(ns); put_uts_ns(nsproxy->uts_ns); nsproxy->uts_ns = ns; diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 5bbdbf0..6f6698d 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -620,6 +620,13 @@ static void netns_put(void *ns) static int netns_install(struct nsproxy *nsproxy, void *ns) { +#if 0 + struct net *net = ns; + if (!ns_capable(net->user_ns, CAP_SYS_ADMIN)) +#else + if (capable(CAP_SYS_ADMIN)) +#endif + return -1; put_net(nsproxy->net_ns); nsproxy->net_ns = get_net(ns); return 0; -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH 02/15] user ns: setns: move capable checks into per-ns attach helper
From: Serge Hallyn @ 2011-09-02 19:56 UTC
To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap
Cc: Serge E. Hallyn

From: "Serge E. Hallyn" <serge@hallyn.com>

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 ipc/namespace.c          |    7 +++++++
 kernel/fork.c            |    5 +++++
 kernel/nsproxy.c         |   11 ++++++++---
 kernel/utsname.c         |    7 +++++++
 net/core/net_namespace.c |    7 +++++++
 5 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/ipc/namespace.c b/ipc/namespace.c
index ce0a647..a0a7609 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -163,6 +163,13 @@ static void ipcns_put(void *ns)
 
 static int ipcns_install(struct nsproxy *nsproxy, void *ns)
 {
+#if 0
+	struct ipc_namespace *newns = ns;
+	if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN))
+#else
+	if (!capable(CAP_SYS_ADMIN))
+#endif
+		return -1;
 	/* Ditch state from the old ipc namespace */
 	exit_sem(current);
 	put_ipc_ns(nsproxy->ipc_ns);
diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..ca712f5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1489,8 +1489,13 @@ long do_fork(unsigned long clone_flags,
 		/* hopefully this check will go away when userns support is
 		 * complete
 		 */
+#if 0
+		if (!nsown_capable(CAP_SYS_ADMIN) || !nsown_capable(CAP_SETUID) ||
+				!nsown_capable(CAP_SETGID))
+#else
 		if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
 				!capable(CAP_SETGID))
+#endif
 			return -EPERM;
 	}
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 9aeab4b..e274577 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -134,7 +134,11 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 				CLONE_NEWPID | CLONE_NEWNET)))
 		return 0;
 
+#if 0
+	if (!nsown_capable(CAP_SYS_ADMIN)) {
+#else
 	if (!capable(CAP_SYS_ADMIN)) {
+#endif
 		err = -EPERM;
 		goto out;
 	}
@@ -191,7 +195,11 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 			       CLONE_NEWNET)))
 		return 0;
 
+#if 0
+	if (!nsown_capable(CAP_SYS_ADMIN))
+#else
 	if (!capable(CAP_SYS_ADMIN))
+#endif
 		return -EPERM;
 
 	*new_nsp = create_new_namespaces(unshare_flags, current,
@@ -241,9 +249,6 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
 	struct file *file;
 	int err;
 
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
-
 	file = proc_ns_fget(fd);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
diff --git a/kernel/utsname.c b/kernel/utsname.c
index bff131b..4638a54 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -104,6 +104,13 @@ static void utsns_put(void *ns)
 
 static int utsns_install(struct nsproxy *nsproxy, void *ns)
 {
+#if 0
+	struct uts_namespace *newns = ns;
+	if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN))
+#else
+	if (!capable(CAP_SYS_ADMIN))
+#endif
+		return -1;
 	get_uts_ns(ns);
 	put_uts_ns(nsproxy->uts_ns);
 	nsproxy->uts_ns = ns;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 5bbdbf0..6f6698d 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -620,6 +620,13 @@ static void netns_put(void *ns)
 
 static int netns_install(struct nsproxy *nsproxy, void *ns)
 {
+#if 0
+	struct net *net = ns;
+	if (!ns_capable(net->user_ns, CAP_SYS_ADMIN))
+#else
+	if (!capable(CAP_SYS_ADMIN))
+#endif
+		return -1;
 	put_net(nsproxy->net_ns);
 	nsproxy->net_ns = get_net(ns);
 	return 0;
-- 
1.7.5.4
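
The shape of this change can be sketched in userspace C. This is only a model under stated assumptions, not kernel code: `model_task`, `model_ns_ops`, `model_setns` and the integer namespace ids are hypothetical names invented for illustration, and a single boolean stands in for capable(CAP_SYS_ADMIN). It shows the generic attach path no longer checking privilege itself, with each namespace type's install helper doing its own check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the calling task; not a kernel structure. */
struct model_task {
	bool cap_sys_admin;	/* models capable(CAP_SYS_ADMIN) */
	int uts_ns;		/* id of the uts namespace the task uses */
	int ipc_ns;		/* id of the ipc namespace the task uses */
};

/* One install handler per namespace type. */
struct model_ns_ops {
	int (*install)(struct model_task *t, int ns);
};

static int model_utsns_install(struct model_task *t, int ns)
{
	if (!t->cap_sys_admin)	/* the check now lives in the handler */
		return -1;
	t->uts_ns = ns;
	return 0;
}

static int model_ipcns_install(struct model_task *t, int ns)
{
	if (!t->cap_sys_admin)	/* each handler repeats its own check */
		return -1;
	t->ipc_ns = ns;
	return 0;
}

const struct model_ns_ops model_uts_ops = { .install = model_utsns_install };
const struct model_ns_ops model_ipc_ops = { .install = model_ipcns_install };

/* Models the attach path after the patch: no capability check of its own. */
int model_setns(struct model_task *t, const struct model_ns_ops *ops, int ns)
{
	assert(t != NULL && ops != NULL);
	return ops->install(t, ns);
}
```

Keeping the check inside each handler is what would later let a handler compare against the user namespace that owns the specific namespace being attached, which the generic path cannot know.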
* Re: [PATCH 02/15] user ns: setns: move capable checks into per-ns attach helper
From: Matt Helsley @ 2011-09-04 1:51 UTC
To: Serge Hallyn
Cc: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap

On Fri, Sep 02, 2011 at 07:56:27PM +0000, Serge Hallyn wrote:
> From: "Serge E. Hallyn" <serge@hallyn.com>

I was confused about this patch until I realized that you're not
simply "moving" the capability checks but "distributing" them. Then
you're showing that you'll soon change some to nsown_capable() or
ns_capable() using the strange cpp pattern in the snippet below.

At least I think that's what you intended. A commit message would
help :).

Cheers,
	-Matt Helsley

>
> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  ipc/namespace.c          |    7 +++++++
>  kernel/fork.c            |    5 +++++
>  kernel/nsproxy.c         |   11 ++++++++---
>  kernel/utsname.c         |    7 +++++++
>  net/core/net_namespace.c |    7 +++++++
>  5 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/ipc/namespace.c b/ipc/namespace.c
> index ce0a647..a0a7609 100644
> --- a/ipc/namespace.c
> +++ b/ipc/namespace.c
> @@ -163,6 +163,13 @@ static void ipcns_put(void *ns)
>
>  static int ipcns_install(struct nsproxy *nsproxy, void *ns)
>  {
> +#if 0
> +	struct ipc_namespace *newns = ns;
> +	if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN))
> +#else
> +	if (!capable(CAP_SYS_ADMIN))
> +#endif
> +		return -1;
>  	/* Ditch state from the old ipc namespace */
>  	exit_sem(current);
>  	put_ipc_ns(nsproxy->ipc_ns);
* Re: [PATCH 02/15] user ns: setns: move capable checks into per-ns attach helper
From: Serge E. Hallyn @ 2011-09-09 14:56 UTC
To: Matt Helsley
Cc: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap

Quoting Matt Helsley (matthltc@us.ibm.com):
> On Fri, Sep 02, 2011 at 07:56:27PM +0000, Serge Hallyn wrote:
> > From: "Serge E. Hallyn" <serge@hallyn.com>
>
> I was confused about this patch until I realized that you're not
> simply "moving" the capability checks but "distributing" them. Then
> you're showing that you'll soon change some to nsown_capable() or
> ns_capable() using the strange cpp pattern in the snippet below.
>
> At least I think that's what you intended. A commit message would
> help :).

Yes, sorry - Eric convinced me several times to be more conservative in
the patch, and I failed to fix the commit msg when squashing the
resulting patches. How about the following:

======
user ns: update capable calls when cloning and attaching namespaces

Distribute the capable() checks at ns attach into the namespace-specific
attach handler.

Note the fact that the capable() checks will be changed to targeted
checks at both namespace clone and attach methods, but don't actually
make that change yet. Until that trigger is pulled, you must have the
capabilities targeted toward the initial user namespace in order to do
any of these actions, meaning that a task in a child user namespace
cannot do them. Once we pull the trigger, a task in a child user
namespace will be able to clone new namespaces if it is privileged in
its own user namespace, and attach to existing namespaces to which it
has privilege.
======

Thanks for taking a look, Matt!

-serge
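
The difference between the untargeted and targeted checks described in that commit message can be modeled in a few lines of userspace C. This is a simplified sketch under stated assumptions, not the kernel's implementation: the `model_*` names are hypothetical, user namespaces are reduced to a parent pointer, and the rule that the creator of a user namespace holds all capabilities over it is deliberately left out. It treats capable() as a check targeted at the initial user namespace and nsown_capable() as a check targeted at the caller's own user namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CAP_SYS_ADMIN (1u << 0)

/* Hypothetical model: a user namespace only knows its parent. */
struct model_user_ns {
	struct model_user_ns *parent;	/* NULL for the initial namespace */
};

struct model_task {
	struct model_user_ns *user_ns;	/* namespace the task lives in */
	unsigned int caps;		/* capabilities raised in that ns */
};

/*
 * Targeted check: walk from the target namespace up to the root. If the
 * task's own namespace is the target or one of its ancestors, the task's
 * capability bits apply; otherwise the task has no privilege there.
 */
bool model_ns_capable(const struct model_task *t,
		      const struct model_user_ns *target, unsigned int cap)
{
	assert(t != NULL && target != NULL);
	for (const struct model_user_ns *ns = target; ns != NULL; ns = ns->parent) {
		if (ns == t->user_ns)
			return (t->caps & cap) != 0;
	}
	return false;
}

/* Untargeted check: always aimed at the initial user namespace. */
bool model_capable(const struct model_task *t,
		   const struct model_user_ns *init_ns, unsigned int cap)
{
	return model_ns_capable(t, init_ns, cap);
}

/* Check aimed at the caller's own user namespace. */
bool model_nsown_capable(const struct model_task *t, unsigned int cap)
{
	return model_ns_capable(t, t->user_ns, cap);
}
```

In this model a task that is root inside a child user namespace fails `model_capable()` but passes `model_nsown_capable()`, which is the before and after behaviour the proposed commit message describes.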
* [PATCH 03/15] keyctl: check capabilities against key's user_ns
From: Serge Hallyn @ 2011-09-02 19:56 UTC
To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap
Cc: Serge E. Hallyn

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

ATM, a task should only be able to get its own user_ns's keys
anyway, so nsown_capable should also work, but there is no
advantage to doing that, while using the key's user_ns is clearer.

changelog: jun 6:
	compile fix: keyctl.c (key_user, not key, has user_ns)

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 security/keys/keyctl.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index eca5191..fa7d420 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -745,7 +745,7 @@ long keyctl_chown_key(key_serial_t id, uid_t uid, gid_t gid)
 	ret = -EACCES;
 	down_write(&key->sem);
 
-	if (!capable(CAP_SYS_ADMIN)) {
+	if (!ns_capable(key->user->user_ns, CAP_SYS_ADMIN)) {
 		/* only the sysadmin can chown a key to some other UID */
 		if (uid != (uid_t) -1 && key->uid != uid)
 			goto error_put;
@@ -852,7 +852,8 @@ long keyctl_setperm_key(key_serial_t id, key_perm_t perm)
 	down_write(&key->sem);
 
 	/* if we're not the sysadmin, we can only change a key that we own */
-	if (capable(CAP_SYS_ADMIN) || key->uid == current_fsuid()) {
+	if (ns_capable(key->user->user_ns, CAP_SYS_ADMIN) ||
+	    key->uid == current_fsuid()) {
 		key->perm = perm;
 		ret = 0;
 	}
-- 
1.7.5.4
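
As a rough illustration of the setperm rule in this patch, here is a userspace model in C. Everything named `model_*` is a hypothetical stand-in rather than a keyring API, a boolean replaces CAP_SYS_ADMIN, the namespace walk is a simplification of ns_capable(), and the -1 failure value is illustrative only. The point it tries to show is that the privilege check is aimed at the user namespace the key belongs to, with key ownership as the fallback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; none of these are kernel structures. */
struct model_user_ns {
	struct model_user_ns *parent;	/* NULL for the initial namespace */
};

struct model_task {
	struct model_user_ns *user_ns;	/* namespace the task lives in */
	unsigned int fsuid;
	bool cap_sys_admin;		/* raised in the task's own user ns */
};

struct model_key {
	struct model_user_ns *user_ns;	/* namespace the key belongs to */
	unsigned int uid;		/* owner of the key */
	unsigned int perm;
};

/* Simplified targeted check: privileged in target or one of its ancestors. */
static bool model_ns_admin(const struct model_task *t,
			   const struct model_user_ns *target)
{
	for (const struct model_user_ns *ns = target; ns != NULL; ns = ns->parent) {
		if (ns == t->user_ns)
			return t->cap_sys_admin;
	}
	return false;
}

/* Models the rule: admin over the key's user ns, or owner of the key. */
int model_setperm_key(const struct model_task *t, struct model_key *key,
		      unsigned int perm)
{
	assert(t != NULL && key != NULL);
	if (model_ns_admin(t, key->user_ns) || key->uid == t->fsuid) {
		key->perm = perm;
		return 0;
	}
	return -1;	/* illustrative failure value, not the kernel's errno */
}
```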
* [PATCH 04/15] user_ns: convert fs/attr.c to targeted capabilities
From: Serge Hallyn @ 2011-09-02 19:56 UTC
To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap
Cc: Serge E. Hallyn

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/attr.c |   20 +++++++++++++-------
 1 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 538e279..e0cf46a 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -29,6 +29,7 @@
 int inode_change_ok(const struct inode *inode, struct iattr *attr)
 {
 	unsigned int ia_valid = attr->ia_valid;
+	struct user_namespace *ns;
 
 	/*
 	 * First check size constraints.  These can't be overriden using
@@ -44,26 +45,28 @@ int inode_change_ok(const struct inode *inode, struct iattr *attr)
 	if (ia_valid & ATTR_FORCE)
 		return 0;
 
+	ns = inode_userns(inode);
 	/* Make sure a caller can chown. */
 	if ((ia_valid & ATTR_UID) &&
-	    (current_fsuid() != inode->i_uid ||
-	     attr->ia_uid != inode->i_uid) && !capable(CAP_CHOWN))
+	    (ns != current_user_ns() || current_fsuid() != inode->i_uid ||
+	     attr->ia_uid != inode->i_uid) && !ns_capable(ns, CAP_CHOWN))
 		return -EPERM;
 
 	/* Make sure caller can chgrp. */
 	if ((ia_valid & ATTR_GID) &&
-	    (current_fsuid() != inode->i_uid ||
+	    (ns != current_user_ns() || current_fsuid() != inode->i_uid ||
 	    (!in_group_p(attr->ia_gid) && attr->ia_gid != inode->i_gid)) &&
-	    !capable(CAP_CHOWN))
+	    !ns_capable(ns, CAP_CHOWN))
 		return -EPERM;
 
 	/* Make sure a caller can chmod. */
 	if (ia_valid & ATTR_MODE) {
+		gid_t gid = (ia_valid & ATTR_GID) ? attr->ia_gid : inode->i_gid;
 		if (!inode_owner_or_capable(inode))
 			return -EPERM;
 		/* Also check the setgid bit! */
-		if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
-				inode->i_gid) && !capable(CAP_FSETID))
+		if ((ns != current_user_ns() || !in_group_p(gid)) &&
+		    !ns_capable(ns, CAP_FSETID))
 			attr->ia_mode &= ~S_ISGID;
 	}
 
@@ -154,9 +157,12 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
 					inode->i_sb->s_time_gran);
 	if (ia_valid & ATTR_MODE) {
 		umode_t mode = attr->ia_mode;
+		struct user_namespace *ns = inode_userns(inode);
 
-		if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
+		if ((ns != current_user_ns() || !in_group_p(inode->i_gid)) &&
+		    !ns_capable(ns, CAP_FSETID))
 			mode &= ~S_ISGID;
+
 		inode->i_mode = mode;
 	}
 }
-- 
1.7.5.4
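
The chown test in this patch packs several conditions into one expression, so a small userspace model may help when reading it. This is a sketch under assumptions, not fs/attr.c: the `model_*` names are hypothetical, a boolean replaces CAP_CHOWN, and the namespace walk is a simplified ns_capable(). It shows the added rule that a uid match only counts when the caller is in the inode's user namespace; in every other case the caller needs the capability targeted at that namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; none of these are kernel structures. */
struct model_user_ns {
	struct model_user_ns *parent;	/* NULL for the initial namespace */
};

struct model_task {
	struct model_user_ns *user_ns;	/* namespace the task lives in */
	unsigned int fsuid;
	bool cap_chown;			/* raised in the task's own user ns */
};

struct model_inode {
	struct model_user_ns *user_ns;	/* namespace owning the inode */
	unsigned int uid;
};

/* Simplified targeted check: privileged in target or one of its ancestors. */
static bool model_ns_cap_chown(const struct model_task *t,
			       const struct model_user_ns *target)
{
	for (const struct model_user_ns *ns = target; ns != NULL; ns = ns->parent) {
		if (ns == t->user_ns)
			return t->cap_chown;
	}
	return false;
}

/* Returns true when the modeled chown to new_uid would be permitted. */
bool model_may_chown(const struct model_task *t,
		     const struct model_inode *inode, unsigned int new_uid)
{
	assert(t != NULL && inode != NULL);
	/* A uid match is only meaningful within the same user namespace. */
	bool needs_cap = inode->user_ns != t->user_ns ||
			 t->fsuid != inode->uid ||
			 new_uid != inode->uid;

	return !needs_cap || model_ns_cap_chown(t, inode->user_ns);
}
```

Without the namespace comparison, a task in a child namespace whose fsuid happens to equal the inode's uid number would be treated as the owner, which is the case the first assertion on a "lookalike" task below is meant to rule out.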
* [PATCH 05/15] userns: clamp down users of cap_raised
From: Serge Hallyn @ 2011-09-02 19:56 UTC
To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap
Cc: Serge E. Hallyn

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

A few modules are using cap_raised(current_cap(), cap) to authorize
actions, but the privilege should be applicable against the initial
user namespace. Refuse privilege if the caller is not in init_user_ns.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/block/drbd/drbd_nl.c           |    5 +++++
 drivers/md/dm-log-userspace-transfer.c |    3 +++
 drivers/staging/pohmelfs/config.c      |    3 +++
 drivers/video/uvesafb.c                |    3 +++
 4 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 0feab26..9a87a14 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -2297,6 +2297,11 @@ static void drbd_connector_callback(struct cn_msg *req, struct netlink_skb_parms
 		return;
 	}
 
+	if (current_user_ns() != &init_user_ns) {
+		retcode = ERR_PERM;
+		goto fail;
+	}
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN)) {
 		retcode = ERR_PERM;
 		goto fail;
diff --git a/drivers/md/dm-log-userspace-transfer.c b/drivers/md/dm-log-userspace-transfer.c
index 1f23e04..140ca81 100644
--- a/drivers/md/dm-log-userspace-transfer.c
+++ b/drivers/md/dm-log-userspace-transfer.c
@@ -134,6 +134,9 @@ static void cn_ulog_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
 {
 	struct dm_ulog_request *tfr = (struct dm_ulog_request *)(msg + 1);
 
+	if (current_user_ns() != &init_user_ns)
+		return;
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN))
 		return;
 
diff --git a/drivers/staging/pohmelfs/config.c b/drivers/staging/pohmelfs/config.c
index b6c42cb..cd259d0 100644
--- a/drivers/staging/pohmelfs/config.c
+++ b/drivers/staging/pohmelfs/config.c
@@ -525,6 +525,9 @@ static void pohmelfs_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *n
 {
 	int err;
 
+	if (current_user_ns() != &init_user_ns)
+		return;
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN))
 		return;
 
diff --git a/drivers/video/uvesafb.c b/drivers/video/uvesafb.c
index 7f8472c..71dab8e 100644
--- a/drivers/video/uvesafb.c
+++ b/drivers/video/uvesafb.c
@@ -73,6 +73,9 @@ static void uvesafb_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *ns
 	struct uvesafb_task *utask;
 	struct uvesafb_ktask *task;
 
+	if (current_user_ns() != &init_user_ns)
+		return;
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN))
 		return;
 
-- 
1.7.5.4
* [PATCH 06/15] user namespace: make each net (net_ns) belong to a user_ns 2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn 2011-09-02 19:56 ` (unknown), Serge Hallyn @ 2011-09-02 19:56 ` Serge Hallyn 2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn ` (9 subsequent siblings) 11 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> This way we can target capabilities at the user_ns which created the net ns. Changelog: jul 8: nsproxy: don't assign netns->userns if not cloning. Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Cc: Eric W.
Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- include/net/net_namespace.h | 2 ++ kernel/nsproxy.c | 2 ++ net/core/net_namespace.c | 3 +++ 3 files changed, 7 insertions(+), 0 deletions(-) diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h index 3bb6fa0..d91fe5f 100644 --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -29,6 +29,7 @@ struct ctl_table_header; struct net_generic; struct sock; struct netns_ipvs; +struct user_namespace; #define NETDEV_HASHBITS 8 @@ -101,6 +102,7 @@ struct net { struct netns_xfrm xfrm; #endif struct netns_ipvs *ipvs; + struct user_namespace *user_ns; }; diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c index e274577..752b477 100644 --- a/kernel/nsproxy.c +++ b/kernel/nsproxy.c @@ -95,6 +95,8 @@ static struct nsproxy *create_new_namespaces(unsigned long flags, err = PTR_ERR(new_nsp->net_ns); goto out_net; } + if (flags & CLONE_NEWNET) + new_nsp->net_ns->user_ns = get_user_ns(task_cred_xxx(tsk, user_ns)); return new_nsp; diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 6f6698d..5ca95cc 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -10,6 +10,7 @@ #include <linux/nsproxy.h> #include <linux/proc_fs.h> #include <linux/file.h> +#include <linux/user_namespace.h> #include <net/net_namespace.h> #include <net/netns/generic.h> @@ -209,6 +210,7 @@ static void net_free(struct net *net) } #endif kfree(net->gen); + put_user_ns(net->user_ns); kmem_cache_free(net_cachep, net); } @@ -389,6 +391,7 @@ static int __init net_ns_init(void) rcu_assign_pointer(init_net.gen, ng); mutex_lock(&net_mutex); + init_net.user_ns = &init_user_ns; if (setup_net(&init_net)) panic("Could not setup the initial network namespace"); -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
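The ownership rule in this patch (a new net namespace takes a reference on its creator's user_ns, and net_free() drops it) can be sketched in user space as below. This is a rough model, not kernel code: the `_model` helpers are hypothetical, and a bare integer counter is assumed in place of whatever reference counting the kernel's user_namespace actually uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space model; a bare counter stands in for the kernel refcount. */
struct user_namespace { int refcount; };
struct net { struct user_namespace *user_ns; };

static struct user_namespace *get_user_ns_model(struct user_namespace *ns)
{
	assert(ns != NULL);
	ns->refcount++;
	return ns;
}

static void put_user_ns_model(struct user_namespace *ns)
{
	assert(ns != NULL && ns->refcount > 0);
	ns->refcount--;
}

/* Mirrors the CLONE_NEWNET branch: the new net pins its creator's user_ns. */
static struct net *net_create_model(struct user_namespace *creator_ns)
{
	struct net *net = malloc(sizeof(*net));

	if (!net)
		return NULL;
	net->user_ns = get_user_ns_model(creator_ns);
	return net;
}

/* Mirrors net_free(): drop the reference before the net itself goes away. */
static void net_free_model(struct net *net)
{
	assert(net != NULL);
	put_user_ns_model(net->user_ns);
	free(net);
}
```

Pairing the get in the creation path with the put in the free path is what keeps net->user_ns valid for as long as any capability check may dereference it.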
* [PATCH 07/15] user namespace: use net->user_ns for some capable calls under net/ [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> ` (6 preceding siblings ...) 2011-09-02 19:56 ` Serge Hallyn @ 2011-09-02 19:56 ` Serge Hallyn 2011-09-02 19:56 ` [PATCH 08/15] af_netlink.c: make netlink_capable userns-aware Serge Hallyn ` (7 subsequent siblings) 15 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w Cc: Serge Hallyn From: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> Just a partial conversion to show how the previous patch is expected to be used. Changelog: 6/28/11: fix typo in net/core/sock.c 7/08/11: don't target capability which authorizes module loading Signed-off-by: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> Cc: Eric W. 
Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- net/core/dev.c | 4 ++-- net/core/sock.c | 14 ++++++++------ 2 files changed, 10 insertions(+), 8 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 17d67b5..6ae955f 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5014,7 +5014,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg) case SIOCGMIIPHY: case SIOCGMIIREG: case SIOCSIFNAME: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; dev_load(net, ifr.ifr_name); rtnl_lock(); @@ -5053,7 +5053,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg) case SIOCBRADDIF: case SIOCBRDELIF: case SIOCSHWTSTAMP: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; /* fall through */ case SIOCBONDSLAVEINFOQUERY: diff --git a/net/core/sock.c b/net/core/sock.c index bc745d0..0f31675 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -420,7 +420,7 @@ static int sock_bindtodevice(struct sock *sk, char __user *optval, int optlen) /* Sorry... 
*/ ret = -EPERM; - if (!capable(CAP_NET_RAW)) + if (!ns_capable(net->user_ns, CAP_NET_RAW)) goto out; ret = -EINVAL; @@ -488,6 +488,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname, int valbool; struct linger ling; int ret = 0; + struct net *net = sock_net(sk); /* * Options without arguments @@ -508,7 +509,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname, switch (optname) { case SO_DEBUG: - if (val && !capable(CAP_NET_ADMIN)) + if (val && !ns_capable(net->user_ns, CAP_NET_ADMIN)) ret = -EACCES; else sock_valbool_flag(sk, SOCK_DBG, valbool); @@ -551,7 +552,7 @@ set_sndbuf: break; case SO_SNDBUFFORCE: - if (!capable(CAP_NET_ADMIN)) { + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) { ret = -EPERM; break; } @@ -589,7 +590,7 @@ set_rcvbuf: break; case SO_RCVBUFFORCE: - if (!capable(CAP_NET_ADMIN)) { + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) { ret = -EPERM; break; } @@ -612,7 +613,8 @@ set_rcvbuf: break; case SO_PRIORITY: - if ((val >= 0 && val <= 6) || capable(CAP_NET_ADMIN)) + if ((val >= 0 && val <= 6) || + ns_capable(net->user_ns, CAP_NET_ADMIN)) sk->sk_priority = val; else ret = -EPERM; @@ -729,7 +731,7 @@ set_rcvbuf: clear_bit(SOCK_PASSSEC, &sock->flags); break; case SO_MARK: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) ret = -EPERM; else sk->sk_mark = val; -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
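As a rough illustration of what the capable() to ns_capable(net->user_ns, ...) conversion changes, here is a simplified user-space model. It is an assumption-laden sketch: `ns_capable_model()` and `dev_ioctl_check_model()` are hypothetical, each namespace records only its parent, and the kernel's special handling of a namespace's creator is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP_NET_ADMIN 12u

/* Hypothetical model: each user namespace knows only its parent. */
struct user_namespace { const struct user_namespace *parent; };
struct net { const struct user_namespace *user_ns; };
struct task_model {
	const struct user_namespace *user_ns;
	unsigned long long effective;
};

/*
 * Simplified stand-in for ns_capable(): walk from the target namespace
 * towards the root; the capability bit counts only if the caller's own
 * namespace is the target or one of its ancestors.
 */
static bool ns_capable_model(const struct task_model *t,
			     const struct user_namespace *target,
			     unsigned int cap)
{
	assert(t != NULL);
	for (; target != NULL; target = target->parent)
		if (target == t->user_ns)
			return (t->effective >> cap) & 1u;
	return false;
}

/* Shape of a converted call site such as the SIOCSIFNAME case in dev_ioctl(). */
static int dev_ioctl_check_model(const struct task_model *t,
				 const struct net *net)
{
	assert(net != NULL);
	return ns_capable_model(t, net->user_ns, CAP_NET_ADMIN) ? 0 : -EPERM;
}
```

Under this model a container root holding CAP_NET_ADMIN passes the check for a net namespace owned by its own user namespace but not for the host's, while host root passes for both.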
* [PATCH 08/15] af_netlink.c: make netlink_capable userns-aware
       [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2011-09-02 19:56   ` [PATCH 07/15] user namespace: use net->user_ns for some capable calls under net/ Serge Hallyn
@ 2011-09-02 19:56   ` Serge Hallyn
  2011-09-02 19:56   ` [PATCH 09/15] user ns: convert ipv6 to targeted capabilities Serge Hallyn
                     ` (6 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w
  Cc: Eric Dumazet

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

netlink_capable should check for permissions against the user namespace
owning the socket in question.

Changelog:
	Per Eric Dumazet advice, use sock_net(sk) instead of #ifdef.

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 net/netlink/af_netlink.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 0a4db02..3cc0bbe 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -580,8 +580,9 @@ retry:
 
 static inline int netlink_capable(struct socket *sock, unsigned int flag)
 {
-	return (nl_table[sock->sk->sk_protocol].nl_nonroot & flag) ||
-	       capable(CAP_NET_ADMIN);
+	if (nl_table[sock->sk->sk_protocol].nl_nonroot & flag)
+		return 1;
+	return ns_capable(sock_net(sock->sk)->user_ns, CAP_NET_ADMIN);
 }
 
 static void
-- 
1.7.5.4

^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH 09/15] user ns: convert ipv6 to targeted capabilities [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> ` (8 preceding siblings ...) 2011-09-02 19:56 ` [PATCH 08/15] af_netlink.c: make netlink_capable userns-aware Serge Hallyn @ 2011-09-02 19:56 ` Serge Hallyn 2011-09-02 19:56 ` [PATCH 10/15] net/core/scm.c: target capable() calls to user_ns owning the net_ns Serge Hallyn ` (5 subsequent siblings) 15 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- net/ipv6/addrconf.c | 4 ++-- net/ipv6/af_inet6.c | 6 ++++-- net/ipv6/datagram.c | 6 +++--- net/ipv6/ip6_flowlabel.c | 24 ++++++++++++++---------- net/ipv6/ip6_tunnel.c | 4 ++-- net/ipv6/ip6mr.c | 2 +- net/ipv6/ipv6_sockglue.c | 7 ++++--- net/ipv6/netfilter/ip6_tables.c | 8 ++++---- net/ipv6/route.c | 2 +- net/ipv6/sit.c | 10 +++++----- 10 files changed, 40 insertions(+), 33 deletions(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index f012ebd..871e5cf 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -2230,7 +2230,7 @@ int addrconf_add_ifaddr(struct net *net, void __user *arg) struct in6_ifreq ireq; int err; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (copy_from_user(&ireq, arg, sizeof(struct in6_ifreq))) @@ -2249,7 +2249,7 @@ int addrconf_del_ifaddr(struct net *net, void __user *arg) struct in6_ifreq ireq; int err; - if (!capable(CAP_NET_ADMIN)) + if 
(!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (copy_from_user(&ireq, arg, sizeof(struct in6_ifreq))) diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index 3b5669a..1854ffe 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -160,7 +160,8 @@ lookup_protocol: } err = -EPERM; - if (sock->type == SOCK_RAW && !kern && !capable(CAP_NET_RAW)) + if (sock->type == SOCK_RAW && !kern && + !ns_capable(net->user_ns, CAP_NET_RAW)) goto out_rcu_unlock; sock->ops = answer->ops; @@ -281,7 +282,8 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) return -EINVAL; snum = ntohs(addr->sin6_port); - if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE)) + if (snum && snum < PROT_SOCK && + !ns_capable(sock_net(sk)->user_ns, CAP_NET_BIND_SERVICE)) return -EACCES; lock_sock(sk); diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c index 9ef1831..33b1b0f 100644 --- a/net/ipv6/datagram.c +++ b/net/ipv6/datagram.c @@ -701,7 +701,7 @@ int datagram_send_ctl(struct net *net, err = -EINVAL; goto exit_f; } - if (!capable(CAP_NET_RAW)) { + if (!ns_capable(net->user_ns, CAP_NET_RAW)) { err = -EPERM; goto exit_f; } @@ -721,7 +721,7 @@ int datagram_send_ctl(struct net *net, err = -EINVAL; goto exit_f; } - if (!capable(CAP_NET_RAW)) { + if (!ns_capable(net->user_ns, CAP_NET_RAW)) { err = -EPERM; goto exit_f; } @@ -746,7 +746,7 @@ int datagram_send_ctl(struct net *net, err = -EINVAL; goto exit_f; } - if (!capable(CAP_NET_RAW)) { + if (!ns_capable(net->user_ns, CAP_NET_RAW)) { err = -EPERM; goto exit_f; } diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c index f3caf1b..4726c02 100644 --- a/net/ipv6/ip6_flowlabel.c +++ b/net/ipv6/ip6_flowlabel.c @@ -294,21 +294,22 @@ struct ipv6_txoptions *fl6_merge_options(struct ipv6_txoptions * opt_space, return opt_space; } -static unsigned long check_linger(unsigned long ttl) +static unsigned long check_linger(unsigned long ttl, struct user_namespace *ns) { if (ttl < FL_MIN_LINGER) 
return FL_MIN_LINGER*HZ; - if (ttl > FL_MAX_LINGER && !capable(CAP_NET_ADMIN)) + if (ttl > FL_MAX_LINGER && !ns_capable(ns, CAP_NET_ADMIN)) return 0; return ttl*HZ; } -static int fl6_renew(struct ip6_flowlabel *fl, unsigned long linger, unsigned long expires) +static int fl6_renew(struct ip6_flowlabel *fl, unsigned long linger, + unsigned long expires, struct user_namespace *ns) { - linger = check_linger(linger); + linger = check_linger(linger, ns); if (!linger) return -EPERM; - expires = check_linger(expires); + expires = check_linger(expires, ns); if (!expires) return -EPERM; fl->lastuse = jiffies; @@ -375,7 +376,7 @@ fl_create(struct net *net, struct in6_flowlabel_req *freq, char __user *optval, fl->fl_net = hold_net(net); fl->expires = jiffies; - err = fl6_renew(fl, freq->flr_linger, freq->flr_expires); + err = fl6_renew(fl, freq->flr_linger, freq->flr_expires, net->user_ns); if (err) goto done; fl->share = freq->flr_share; @@ -425,7 +426,7 @@ static int mem_check(struct sock *sk) if (room <= 0 || ((count >= FL_MAX_PER_SOCK || (count > 0 && room < FL_MAX_SIZE/2) || room < FL_MAX_SIZE/4) && - !capable(CAP_NET_ADMIN))) + !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))) return -ENOBUFS; return 0; @@ -507,17 +508,20 @@ int ipv6_flowlabel_opt(struct sock *sk, char __user *optval, int optlen) read_lock_bh(&ip6_sk_fl_lock); for (sfl = np->ipv6_fl_list; sfl; sfl = sfl->next) { if (sfl->fl->label == freq.flr_label) { - err = fl6_renew(sfl->fl, freq.flr_linger, freq.flr_expires); + err = fl6_renew(sfl->fl, freq.flr_linger, freq.flr_expires, + net->user_ns); read_unlock_bh(&ip6_sk_fl_lock); return err; } } read_unlock_bh(&ip6_sk_fl_lock); - if (freq.flr_share == IPV6_FL_S_NONE && capable(CAP_NET_ADMIN)) { + if (freq.flr_share == IPV6_FL_S_NONE && + ns_capable(net->user_ns, CAP_NET_ADMIN)) { fl = fl_lookup(net, freq.flr_label); if (fl) { - err = fl6_renew(fl, freq.flr_linger, freq.flr_expires); + err = fl6_renew(fl, freq.flr_linger, freq.flr_expires, + net->user_ns); 
fl_release(fl); return err; } diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index 0bc9888..c430d69 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -1269,7 +1269,7 @@ ip6_tnl_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd) case SIOCADDTUNNEL: case SIOCCHGTUNNEL: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) break; err = -EFAULT; if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof (p))) @@ -1304,7 +1304,7 @@ ip6_tnl_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd) break; case SIOCDELTUNNEL: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) break; if (dev == ip6n->fb_tnl_dev) { diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c index 705c828..1649ccd 100644 --- a/net/ipv6/ip6mr.c +++ b/net/ipv6/ip6mr.c @@ -1582,7 +1582,7 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, uns return -ENOENT; if (optname != MRT6_INIT) { - if (sk != mrt->mroute6_sk && !capable(CAP_NET_ADMIN)) + if (sk != mrt->mroute6_sk && !ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EACCES; } diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c index 147ede38..485e181 100644 --- a/net/ipv6/ipv6_sockglue.c +++ b/net/ipv6/ipv6_sockglue.c @@ -343,7 +343,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname, break; case IPV6_TRANSPARENT: - if (!capable(CAP_NET_ADMIN)) { + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) { retv = -EPERM; break; } @@ -381,7 +381,8 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname, /* hop-by-hop / destination options are privileged option */ retv = -EPERM; - if (optname != IPV6_RTHDR && !capable(CAP_NET_RAW)) + if (optname != IPV6_RTHDR && + !ns_capable(net->user_ns, CAP_NET_RAW)) break; opt = ipv6_renew_options(sk, np->opt, optname, @@ -725,7 +726,7 @@ done: case IPV6_IPSEC_POLICY: case IPV6_XFRM_POLICY: retv = -EPERM; - if 
(!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; retv = xfrm_user_policy(sk, optname, optval, optlen); break; diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index 94874b0..7fce7d8 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -1869,7 +1869,7 @@ compat_do_ip6t_set_ctl(struct sock *sk, int cmd, void __user *user, { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -1984,7 +1984,7 @@ compat_do_ip6t_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -2006,7 +2006,7 @@ do_ip6t_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len) { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -2031,7 +2031,7 @@ do_ip6t_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 9e69eb0..f00c18d 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1938,7 +1938,7 @@ int ipv6_route_ioctl(struct net *net, unsigned int cmd, void __user *arg) switch(cmd) { case SIOCADDRT: /* Add a route */ case SIOCDELRT: /* Delete a route */ - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; err = copy_from_user(&rtmsg, arg, sizeof(struct in6_rtmsg)); diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index 00b15ac..7438711 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -308,7 +308,7 @@ static int ipip6_tunnel_get_prl(struct ip_tunnel *t, /* For simple GET or for root users, * we try harder to allocate. */ - kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ? 
+ kp = (cmax <= 1 || ns_capable(dev_net(t->dev)->user_ns, CAP_NET_ADMIN)) ? kcalloc(cmax, sizeof(*kp), GFP_KERNEL) : NULL; @@ -929,7 +929,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd) case SIOCADDTUNNEL: case SIOCCHGTUNNEL: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) goto done; err = -EFAULT; @@ -988,7 +988,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd) case SIOCDELTUNNEL: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) goto done; if (dev == sitn->fb_tunnel_dev) { @@ -1021,7 +1021,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd) case SIOCDELPRL: case SIOCCHGPRL: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) goto done; err = -EINVAL; if (dev == sitn->fb_tunnel_dev) @@ -1050,7 +1050,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd) case SIOCCHG6RD: case SIOCDEL6RD: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) goto done; err = -EFAULT; -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
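The flow-label part of this patch threads the namespace through check_linger() and fl6_renew() so the privilege test can be made against the right user_ns. A reduced user-space sketch of that control flow follows. Everything here is assumed for illustration: the `_model` names are hypothetical, the constants are placeholder values rather than the kernel's, a boolean replaces the ns_capable(ns, CAP_NET_ADMIN) call, and the renew helper omits the bookkeeping the real function does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values only; the real constants live in the kernel sources. */
#define HZ_MODEL       100UL
#define FL_MIN_LINGER  6UL
#define FL_MAX_LINGER  60UL

/* net_admin_in_ns stands in for ns_capable(ns, CAP_NET_ADMIN). */
static unsigned long check_linger_model(unsigned long ttl, bool net_admin_in_ns)
{
	if (ttl < FL_MIN_LINGER)
		return FL_MIN_LINGER * HZ_MODEL;
	if (ttl > FL_MAX_LINGER && !net_admin_in_ns)
		return 0;
	return ttl * HZ_MODEL;
}

struct flowlabel_model { unsigned long linger, expires; };

/* Simplified: only the two permission-sensitive checks are modelled. */
static int fl6_renew_model(struct flowlabel_model *fl, unsigned long linger,
			   unsigned long expires, bool net_admin_in_ns)
{
	assert(fl != NULL);
	linger = check_linger_model(linger, net_admin_in_ns);
	if (!linger)
		return -EPERM;
	expires = check_linger_model(expires, net_admin_in_ns);
	if (!expires)
		return -EPERM;
	fl->linger = linger;
	fl->expires = expires;
	return 0;
}
```

The design point is that the helper no longer decides privilege on its own; the caller supplies the namespace (here reduced to a boolean), so each call site can pass net->user_ns.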
* [PATCH 10/15] net/core/scm.c: target capable() calls to user_ns owning the net_ns [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> ` (9 preceding siblings ...) 2011-09-02 19:56 ` [PATCH 09/15] user ns: convert ipv6 to targeted capabilities Serge Hallyn @ 2011-09-02 19:56 ` Serge Hallyn 2011-09-02 19:56 ` Serge Hallyn ` (4 subsequent siblings) 15 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> The uid/gid comparisons don't have to be pulled out. This just seemed more easily proved correct. Changelog: mark struct cred arg const Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- net/core/scm.c | 41 ++++++++++++++++++++++++++++++++++------- 1 files changed, 34 insertions(+), 7 deletions(-) diff --git a/net/core/scm.c b/net/core/scm.c index 811b53f..4f376bf 100644 --- a/net/core/scm.c +++ b/net/core/scm.c @@ -43,17 +43,44 @@ * setu(g)id. 
*/ -static __inline__ int scm_check_creds(struct ucred *creds) +static __inline__ bool uidequiv(const struct cred *src, struct ucred *tgt, + struct user_namespace *ns) +{ + if (src->user_ns != ns) + goto check_capable; + if (src->uid == tgt->uid || src->euid == tgt->uid || + src->suid == tgt->uid) + return true; +check_capable: + if (ns_capable(ns, CAP_SETUID)) + return true; + return false; +} + +static __inline__ bool gidequiv(const struct cred *src, struct ucred *tgt, + struct user_namespace *ns) +{ + if (src->user_ns != ns) + goto check_capable; + if (src->gid == tgt->gid || src->egid == tgt->gid || + src->sgid == tgt->gid) + return true; +check_capable: + if (ns_capable(ns, CAP_SETGID)) + return true; + return false; +} + +static __inline__ int scm_check_creds(struct ucred *creds, struct socket *sock) { const struct cred *cred = current_cred(); + struct user_namespace *ns = sock_net(sock->sk)->user_ns; - if ((creds->pid == task_tgid_vnr(current) || capable(CAP_SYS_ADMIN)) && - ((creds->uid == cred->uid || creds->uid == cred->euid || - creds->uid == cred->suid) || capable(CAP_SETUID)) && - ((creds->gid == cred->gid || creds->gid == cred->egid || - creds->gid == cred->sgid) || capable(CAP_SETGID))) { + if ((creds->pid == task_tgid_vnr(current) || ns_capable(ns, CAP_SYS_ADMIN)) && + uidequiv(cred, creds, ns) && gidequiv(cred, creds, ns)) { return 0; } + return -EPERM; } @@ -169,7 +196,7 @@ int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p) if (cmsg->cmsg_len != CMSG_LEN(sizeof(struct ucred))) goto error; memcpy(&p->creds, CMSG_DATA(cmsg), sizeof(struct ucred)); - err = scm_check_creds(&p->creds); + err = scm_check_creds(&p->creds, sock); if (err) goto error; -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
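The uidequiv() helper introduced above encodes one rule: raw uid values are only compared when the sender's credentials belong to the socket's user namespace, and otherwise only CAP_SETUID towards that namespace is accepted. A user-space sketch of that rule is given below. The types and the `has_setuid_in_target` flag are hypothetical stand-ins (the flag replaces the ns_capable(ns, CAP_SETUID) call), so this is a model of the logic, not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int uid_model_t;

struct user_namespace { int id; };

/* Hypothetical, trimmed-down versions of struct cred and struct ucred. */
struct cred_model {
	const struct user_namespace *user_ns;
	uid_model_t uid, euid, suid;
	bool has_setuid_in_target; /* stands in for ns_capable(ns, CAP_SETUID) */
};
struct ucred_model { uid_model_t uid; };

static bool uidequiv_model(const struct cred_model *src,
			   const struct ucred_model *tgt,
			   const struct user_namespace *ns)
{
	assert(src != NULL && tgt != NULL);
	/* Raw uids are only comparable when both sides share a namespace. */
	if (src->user_ns == ns &&
	    (src->uid == tgt->uid || src->euid == tgt->uid ||
	     src->suid == tgt->uid))
		return true;
	/* Otherwise only CAP_SETUID towards the socket's namespace helps. */
	return src->has_setuid_in_target;
}
```

The gid variant would follow the same shape with gid, egid, sgid and CAP_SETGID.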
* [PATCH 11/15] userns: make some net-sysfs capable calls targeted
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
  2011-09-02 19:56 ` (unknown), Serge Hallyn
@ 2011-09-02 19:56 ` Serge Hallyn
  2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Changelog:
	jul 1: fix compilation errors (net_device != net)

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 net/core/net-sysfs.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 1683e5d..876915b 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -76,7 +76,7 @@ static ssize_t netdev_store(struct device *dev, struct device_attribute *attr,
 	unsigned long new;
 	int ret = -EINVAL;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(net)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	new = simple_strtoul(buf, &endp, 0);
@@ -261,7 +261,7 @@ static ssize_t store_ifalias(struct device *dev, struct device_attribute *attr,
 	size_t count = len;
 	ssize_t ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(netdev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	/* ignore trailing newline */
-- 
1.7.5.4

^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH 11/15] userns: make some net-sysfs capable calls targeted @ 2011-09-02 19:56 ` Serge Hallyn 0 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Changelog: jul 1: fix compilation errors (net_device != net) Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- net/core/net-sysfs.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 1683e5d..876915b 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -76,7 +76,7 @@ static ssize_t netdev_store(struct device *dev, struct device_attribute *attr, unsigned long new; int ret = -EINVAL; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(net)->user_ns, CAP_NET_ADMIN)) return -EPERM; new = simple_strtoul(buf, &endp, 0); @@ -261,7 +261,7 @@ static ssize_t store_ifalias(struct device *dev, struct device_attribute *attr, size_t count = len; ssize_t ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(netdev)->user_ns, CAP_NET_ADMIN)) return -EPERM; /* ignore trailing newline */ -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH 12/15] user_ns: target af_key capability check 2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn 2011-09-02 19:56 ` (unknown), Serge Hallyn @ 2011-09-02 19:56 ` Serge Hallyn 2011-09-02 19:56 ` [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3) Serge Hallyn ` (9 subsequent siblings) 11 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> This presumes that af_key really is complete with respect to network namespaces. Looking at the code, it appears to be. Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- net/key/af_key.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/key/af_key.c b/net/key/af_key.c index 1e733e9..1f90f4e 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -141,7 +141,7 @@ static int pfkey_create(struct net *net, struct socket *sock, int protocol, struct sock *sk; int err; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (sock->type != SOCK_RAW) return -ESOCKTNOSUPPORT; -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH 13/15] userns: net: make many network capable calls targeted [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> ` (12 preceding siblings ...) 2011-09-02 19:56 ` Serge Hallyn @ 2011-09-02 19:56 ` Serge Hallyn 2011-09-02 19:56 ` [PATCH 14/15] net: pass user_ns to cap_netlink_recv() Serge Hallyn 2011-09-02 19:56 ` [PATCH 15/15] make kernel/signal.c user ns safe (v2) Serge Hallyn 15 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> When privilege protects a namespaced network resource, having the required privilege targeted toward the user namespace which owns the resource suffices. As with other patches, a big concern here is that we cleanly separate the cases where privilege protects a network resource from cases where privilege can lead to laxer constraints on input and, subsequently, the ability to corrupt, crash, or own the host kernel. Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Cc: Eric W.
Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- net/8021q/vlan.c | 12 ++++++------ net/bridge/br_ioctl.c | 22 +++++++++++----------- net/bridge/br_sysfs_br.c | 8 ++++---- net/bridge/br_sysfs_if.c | 2 +- net/bridge/netfilter/ebtables.c | 8 ++++---- net/core/ethtool.c | 2 +- net/ipv4/arp.c | 2 +- net/ipv4/devinet.c | 4 ++-- net/ipv4/fib_frontend.c | 2 +- net/ipv4/ip_options.c | 6 +++--- net/ipv4/ip_sockglue.c | 4 ++-- net/ipv4/ipip.c | 4 ++-- net/ipv4/ipmr.c | 2 +- net/ipv4/netfilter/arp_tables.c | 8 ++++---- net/ipv4/netfilter/ip_tables.c | 8 ++++---- net/netfilter/ipset/ip_set_core.c | 2 +- net/netfilter/ipvs/ip_vs_ctl.c | 4 ++-- net/packet/af_packet.c | 2 +- 18 files changed, 51 insertions(+), 51 deletions(-) diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c index 8970ba1..7d12f63 100644 --- a/net/8021q/vlan.c +++ b/net/8021q/vlan.c @@ -558,7 +558,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg) switch (args.cmd) { case SET_VLAN_INGRESS_PRIORITY_CMD: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; vlan_dev_set_ingress_priority(dev, args.u.skb_priority, @@ -568,7 +568,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg) case SET_VLAN_EGRESS_PRIORITY_CMD: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; err = vlan_dev_set_egress_priority(dev, args.u.skb_priority, @@ -577,7 +577,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg) case SET_VLAN_FLAG_CMD: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; err = vlan_dev_change_flags(dev, args.vlan_qos ? 
args.u.flag : 0, @@ -586,7 +586,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg) case SET_VLAN_NAME_TYPE_CMD: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; if ((args.u.name_type >= 0) && (args.u.name_type < VLAN_NAME_TYPE_HIGHEST)) { @@ -602,14 +602,14 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg) case ADD_VLAN_CMD: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; err = register_vlan_device(dev, args.u.VID); break; case DEL_VLAN_CMD: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; unregister_vlan_dev(dev, NULL); err = 0; diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c index 7222fe1..c82f9cb 100644 --- a/net/bridge/br_ioctl.c +++ b/net/bridge/br_ioctl.c @@ -88,7 +88,7 @@ static int add_del_if(struct net_bridge *br, int ifindex, int isadd) struct net_device *dev; int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; dev = __dev_get_by_index(dev_net(br->dev), ifindex); @@ -178,25 +178,25 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) } case BRCTL_SET_BRIDGE_FORWARD_DELAY: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; return br_set_forward_delay(br, args[1]); case BRCTL_SET_BRIDGE_HELLO_TIME: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; return br_set_hello_time(br, args[1]); case BRCTL_SET_BRIDGE_MAX_AGE: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; return br_set_max_age(br, args[1]); case BRCTL_SET_AGEING_TIME: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; br->ageing_time = clock_t_to_jiffies(args[1]); @@ -236,14 +236,14 @@ static int 
old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) } case BRCTL_SET_BRIDGE_STP_STATE: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; br_stp_set_enabled(br, args[1]); return 0; case BRCTL_SET_BRIDGE_PRIORITY: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; spin_lock_bh(&br->lock); @@ -256,7 +256,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) struct net_bridge_port *p; int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; spin_lock_bh(&br->lock); @@ -273,7 +273,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) struct net_bridge_port *p; int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; spin_lock_bh(&br->lock); @@ -330,7 +330,7 @@ static int old_deviceless(struct net *net, void __user *uarg) { char buf[IFNAMSIZ]; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (copy_from_user(buf, (void __user *)args[1], IFNAMSIZ)) @@ -360,7 +360,7 @@ int br_ioctl_deviceless_stub(struct net *net, unsigned int cmd, void __user *uar { char buf[IFNAMSIZ]; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (copy_from_user(buf, uarg, IFNAMSIZ)) diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c index 68b893e..7f4fa3a 100644 --- a/net/bridge/br_sysfs_br.c +++ b/net/bridge/br_sysfs_br.c @@ -36,7 +36,7 @@ static ssize_t store_bridge_parm(struct device *d, unsigned long val; int err; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; val = simple_strtoul(buf, &endp, 0); @@ -132,7 +132,7 @@ static ssize_t store_stp_state(struct device *d, char *endp; unsigned long val; - if (!capable(CAP_NET_ADMIN)) + if 
(!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; val = simple_strtoul(buf, &endp, 0); @@ -267,7 +267,7 @@ static ssize_t store_group_addr(struct device *d, unsigned new_addr[6]; int i; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; if (sscanf(buf, "%x:%x:%x:%x:%x:%x", @@ -304,7 +304,7 @@ static ssize_t store_flush(struct device *d, { struct net_bridge *br = to_bridge(d); - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; br_fdb_flush(br); diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c index 6229b62..9cb4d2e 100644 --- a/net/bridge/br_sysfs_if.c +++ b/net/bridge/br_sysfs_if.c @@ -209,7 +209,7 @@ static ssize_t brport_store(struct kobject * kobj, char *endp; unsigned long val; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(p->br->dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; val = simple_strtoul(buf, &endp, 0); diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c index 5864cc4..cc1198b 100644 --- a/net/bridge/netfilter/ebtables.c +++ b/net/bridge/netfilter/ebtables.c @@ -1463,7 +1463,7 @@ static int do_ebt_set_ctl(struct sock *sk, { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch(cmd) { @@ -1485,7 +1485,7 @@ static int do_ebt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) struct ebt_replace tmp; struct ebt_table *t; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; if (copy_from_user(&tmp, user, sizeof(tmp))) @@ -2276,7 +2276,7 @@ static int compat_do_ebt_set_ctl(struct sock *sk, { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -2299,7 +2299,7 @@ static int compat_do_ebt_get_ctl(struct sock *sk, int cmd, struct compat_ebt_replace tmp; struct ebt_table *t; - if 
(!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; /* try real handler in case userland supplied needed padding */ diff --git a/net/core/ethtool.c b/net/core/ethtool.c index 6cdba5f..56878bf 100644 --- a/net/core/ethtool.c +++ b/net/core/ethtool.c @@ -1676,7 +1676,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr) case ETHTOOL_GFEATURES: break; default: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; } diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index 96a164a..023ad24 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -1175,7 +1175,7 @@ int arp_ioctl(struct net *net, unsigned int cmd, void __user *arg) switch (cmd) { case SIOCDARP: case SIOCSARP: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; case SIOCGARP: err = copy_from_user(&r, arg, sizeof(struct arpreq)); diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index bc19bd0..93b5b0b 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -728,7 +728,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg) case SIOCSIFFLAGS: ret = -EACCES; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) goto out; break; case SIOCSIFADDR: /* Set interface address (and family) */ @@ -736,7 +736,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg) case SIOCSIFDSTADDR: /* Set the destination address */ case SIOCSIFNETMASK: /* Set the netmask for the interface */ ret = -EACCES; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) goto out; ret = -EINVAL; if (sin->sin_family != AF_INET) diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 92fc5f6..8f34a07 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -437,7 +437,7 @@ int ip_rt_ioctl(struct net *net, unsigned int cmd, void __user *arg) switch (cmd) { case SIOCADDRT: /* Add a route */ case SIOCDELRT: /* 
Delete a route */ - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (copy_from_user(&rt, arg, sizeof(rt))) diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c index ec93335..21df700 100644 --- a/net/ipv4/ip_options.c +++ b/net/ipv4/ip_options.c @@ -396,7 +396,7 @@ int ip_options_compile(struct net *net, optptr[2] += 8; break; default: - if (!skb && !capable(CAP_NET_RAW)) { + if (!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) { pp_ptr = optptr + 3; goto error; } @@ -432,7 +432,7 @@ int ip_options_compile(struct net *net, opt->router_alert = optptr - iph; break; case IPOPT_CIPSO: - if ((!skb && !capable(CAP_NET_RAW)) || opt->cipso) { + if ((!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) || opt->cipso) { pp_ptr = optptr; goto error; } @@ -445,7 +445,7 @@ int ip_options_compile(struct net *net, case IPOPT_SEC: case IPOPT_SID: default: - if (!skb && !capable(CAP_NET_RAW)) { + if (!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) { pp_ptr = optptr; goto error; } diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c index 8905e92..6408507 100644 --- a/net/ipv4/ip_sockglue.c +++ b/net/ipv4/ip_sockglue.c @@ -955,13 +955,13 @@ mc_msf_out: case IP_IPSEC_POLICY: case IP_XFRM_POLICY: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) break; err = xfrm_user_policy(sk, optname, optval, optlen); break; case IP_TRANSPARENT: - if (!capable(CAP_NET_ADMIN)) { + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) { err = -EPERM; break; } diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c index 378b20b..6725832 100644 --- a/net/ipv4/ipip.c +++ b/net/ipv4/ipip.c @@ -629,7 +629,7 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd) case SIOCADDTUNNEL: case SIOCCHGTUNNEL: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) goto done; err = -EFAULT; @@ -689,7 +689,7 @@ ipip_tunnel_ioctl (struct net_device *dev, struct 
ifreq *ifr, int cmd) case SIOCDELTUNNEL: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) goto done; if (dev == ipn->fb_tunnel_dev) { diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 58e8791..309aa0c 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -1204,7 +1204,7 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi if (optname != MRT_INIT) { if (sk != rcu_dereference_raw(mrt->mroute_sk) && - !capable(CAP_NET_ADMIN)) + !ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EACCES; } diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index fd7a3f6..acc908f 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -1534,7 +1534,7 @@ static int compat_do_arpt_set_ctl(struct sock *sk, int cmd, void __user *user, { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -1678,7 +1678,7 @@ static int compat_do_arpt_get_ctl(struct sock *sk, int cmd, void __user *user, { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -1699,7 +1699,7 @@ static int do_arpt_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -1723,7 +1723,7 @@ static int do_arpt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index 24e556e..72f2cde 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -1847,7 +1847,7 @@ compat_do_ipt_set_ctl(struct sock *sk, int cmd, void __user *user, { int ret; - if (!capable(CAP_NET_ADMIN)) + if 
(!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -1962,7 +1962,7 @@ compat_do_ipt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -1984,7 +1984,7 @@ do_ipt_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len) { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -2009,7 +2009,7 @@ do_ipt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c index d7e86ef..38d69a5 100644 --- a/net/netfilter/ipset/ip_set_core.c +++ b/net/netfilter/ipset/ip_set_core.c @@ -1596,7 +1596,7 @@ ip_set_sockfn_get(struct sock *sk, int optval, void __user *user, int *len) void *data; int copylen = *len, ret = 0; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; if (optval != SO_IP_SET) return -EBADF; diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c index 2b771dc..db224ef 100644 --- a/net/netfilter/ipvs/ip_vs_ctl.c +++ b/net/netfilter/ipvs/ip_vs_ctl.c @@ -2284,7 +2284,7 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len) struct ip_vs_dest_user *udest_compat; struct ip_vs_dest_user_kern udest; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_SET_MAX) @@ -2566,7 +2566,7 @@ do_ip_vs_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) struct netns_ipvs *ipvs = net_ipvs(net); BUG_ON(!net); - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (cmd < 
IP_VS_BASE_CTL || cmd > IP_VS_SO_GET_MAX) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index c698cec..c2e6bb6 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1793,7 +1793,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol, __be16 proto = (__force __be16)protocol; /* weird, but documented */ int err; - if (!capable(CAP_NET_RAW)) + if (!ns_capable(net->user_ns, CAP_NET_RAW)) return -EPERM; if (sock->type != SOCK_DGRAM && sock->type != SOCK_RAW && sock->type != SOCK_PACKET) -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
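The targeting rule described in the patch 13/15 changelog can be sketched in userspace C. This is a minimal model under stated assumptions, not kernel code: every model_* type and function is hypothetical, the namespace hierarchy is reduced to a parent pointer, and the kernel's additional rule that the creator of a user namespace holds all capabilities over it is left out. It only illustrates why a check targeted at net->user_ns is stricter than a plain capable() for a task inside a child user namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are NOT the kernel's structures. */
struct model_user_ns {
	struct model_user_ns *parent;	/* NULL for the initial namespace */
};

struct model_net {
	struct model_user_ns *user_ns;	/* user namespace owning this netns */
};

struct model_cred {
	struct model_user_ns *user_ns;	/* namespace the task lives in */
	unsigned long cap_effective;	/* bitmask of raised capabilities */
};

#define MODEL_CAP_NET_ADMIN 12	/* same bit number as CAP_NET_ADMIN */

/* Untargeted check: only looks at the caller's own capability set. */
static bool model_capable(const struct model_cred *cred, int cap)
{
	return (cred->cap_effective & (1UL << cap)) != 0;
}

/*
 * Targeted check: the capability counts only if the caller's namespace
 * is the target namespace or one of its ancestors.
 */
static bool model_ns_capable(const struct model_cred *cred,
			     const struct model_user_ns *target, int cap)
{
	const struct model_user_ns *ns;

	for (ns = target; ns; ns = ns->parent) {
		if (ns == cred->user_ns)
			return (cred->cap_effective & (1UL << cap)) != 0;
	}
	return false;	/* caller's namespace is not above the target */
}
```

In this model, a task holding CAP_NET_ADMIN inside a child user namespace passes model_capable() everywhere, but passes model_ns_capable() only for network namespaces owned by its own namespace or a descendant. A task with the capability in the initial namespace passes for both.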
* [PATCH 14/15] net: pass user_ns to cap_netlink_recv() [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> ` (13 preceding siblings ...) 2011-09-02 19:56 ` [PATCH 13/15] userns: net: make many network capable calls targeted Serge Hallyn @ 2011-09-02 19:56 ` Serge Hallyn 2011-09-02 19:56 ` [PATCH 15/15] make kernel/signal.c user ns safe (v2) Serge Hallyn 15 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Pass user_ns to cap_netlink_recv() and make it userns-aware. cap_netlink_recv() was granting privilege if a capability was in current_cap(), regardless of the user namespace. Fix that by targeting the capability check against the user namespace which owns the skb. The namespace is passed in by the caller because sock_net() is a static inline defined in net/sock.h, which we don't want to #include where cap_netlink_recv() is defined (security/commoncap.c). Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Cc: Eric W.
Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- drivers/scsi/scsi_netlink.c | 3 ++- include/linux/security.h | 14 +++++++++----- kernel/audit.c | 6 ++++-- net/core/rtnetlink.c | 3 ++- net/decnet/netfilter/dn_rtmsg.c | 3 ++- net/ipv4/netfilter/ip_queue.c | 3 ++- net/ipv6/netfilter/ip6_queue.c | 3 ++- net/netfilter/nfnetlink.c | 2 +- net/netlink/genetlink.c | 2 +- net/xfrm/xfrm_user.c | 2 +- security/commoncap.c | 6 ++---- security/security.c | 4 ++-- security/selinux/hooks.c | 5 +++-- 13 files changed, 33 insertions(+), 23 deletions(-) diff --git a/drivers/scsi/scsi_netlink.c b/drivers/scsi/scsi_netlink.c index 26a8a45..0aa2e57 100644 --- a/drivers/scsi/scsi_netlink.c +++ b/drivers/scsi/scsi_netlink.c @@ -111,7 +111,8 @@ scsi_nl_rcv_msg(struct sk_buff *skb) goto next_msg; } - if (security_netlink_recv(skb, CAP_SYS_ADMIN)) { + if (security_netlink_recv(skb, CAP_SYS_ADMIN, + sock_net(skb->sk)->user_ns)) { err = -EPERM; goto next_msg; } diff --git a/include/linux/security.h b/include/linux/security.h index ebd2a53..cfa1f47 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -95,7 +95,8 @@ struct xfrm_user_sec_ctx; struct seq_file; extern int cap_netlink_send(struct sock *sk, struct sk_buff *skb); -extern int cap_netlink_recv(struct sk_buff *skb, int cap); +extern int cap_netlink_recv(struct sk_buff *skb, int cap, + struct user_namespace *ns); void reset_security_ops(void); @@ -797,6 +798,7 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts) * @skb. * @skb contains the sk_buff structure for the netlink message. * @cap indicates the capability required + * @ns is the user namespace which owns skb * Return 0 if permission is granted. * * Security hooks for Unix domain networking. 
@@ -1557,7 +1559,8 @@ struct security_operations { struct sembuf *sops, unsigned nsops, int alter); int (*netlink_send) (struct sock *sk, struct sk_buff *skb); - int (*netlink_recv) (struct sk_buff *skb, int cap); + int (*netlink_recv) (struct sk_buff *skb, int cap, + struct user_namespace *ns); void (*d_instantiate) (struct dentry *dentry, struct inode *inode); @@ -1806,7 +1809,7 @@ void security_d_instantiate(struct dentry *dentry, struct inode *inode); int security_getprocattr(struct task_struct *p, char *name, char **value); int security_setprocattr(struct task_struct *p, char *name, void *value, size_t size); int security_netlink_send(struct sock *sk, struct sk_buff *skb); -int security_netlink_recv(struct sk_buff *skb, int cap); +int security_netlink_recv(struct sk_buff *skb, int cap, struct user_namespace *ns); int security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen); int security_secctx_to_secid(const char *secdata, u32 seclen, u32 *secid); void security_release_secctx(char *secdata, u32 seclen); @@ -2498,9 +2501,10 @@ static inline int security_netlink_send(struct sock *sk, struct sk_buff *skb) return cap_netlink_send(sk, skb); } -static inline int security_netlink_recv(struct sk_buff *skb, int cap) +static inline int security_netlink_recv(struct sk_buff *skb, int cap, + struct user_namespace *ns) { - return cap_netlink_recv(skb, cap); + return cap_netlink_recv(skb, cap, ns); } static inline int security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen) diff --git a/kernel/audit.c b/kernel/audit.c index 0a1355c..48144c4 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -601,13 +601,15 @@ static int audit_netlink_ok(struct sk_buff *skb, u16 msg_type) case AUDIT_TTY_SET: case AUDIT_TRIM: case AUDIT_MAKE_EQUIV: - if (security_netlink_recv(skb, CAP_AUDIT_CONTROL)) + if (security_netlink_recv(skb, CAP_AUDIT_CONTROL, + sock_net(skb->sk)->user_ns)) err = -EPERM; break; case AUDIT_USER: case AUDIT_FIRST_USER_MSG ... 
AUDIT_LAST_USER_MSG: case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2: - if (security_netlink_recv(skb, CAP_AUDIT_WRITE)) + if (security_netlink_recv(skb, CAP_AUDIT_WRITE, + sock_net(skb->sk)->user_ns)) err = -EPERM; break; default: /* bad msg */ diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 99d9e95..4a444de 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -1931,7 +1931,8 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) sz_idx = type>>2; kind = type&3; - if (kind != 2 && security_netlink_recv(skb, CAP_NET_ADMIN)) + if (kind != 2 && security_netlink_recv(skb, CAP_NET_ADMIN, + net->user_ns)) return -EPERM; if (kind == 2 && nlh->nlmsg_flags&NLM_F_DUMP) { diff --git a/net/decnet/netfilter/dn_rtmsg.c b/net/decnet/netfilter/dn_rtmsg.c index 69975e0..2d052ab 100644 --- a/net/decnet/netfilter/dn_rtmsg.c +++ b/net/decnet/netfilter/dn_rtmsg.c @@ -108,7 +108,8 @@ static inline void dnrmg_receive_user_skb(struct sk_buff *skb) if (nlh->nlmsg_len < sizeof(*nlh) || skb->len < nlh->nlmsg_len) return; - if (security_netlink_recv(skb, CAP_NET_ADMIN)) + if (security_netlink_recv(skb, CAP_NET_ADMIN, + sock_net(skb->sk)->user_ns)) RCV_SKB_FAIL(-EPERM); /* Eventually we might send routing messages too */ diff --git a/net/ipv4/netfilter/ip_queue.c b/net/ipv4/netfilter/ip_queue.c index 5c9b9d9..51d7c52 100644 --- a/net/ipv4/netfilter/ip_queue.c +++ b/net/ipv4/netfilter/ip_queue.c @@ -432,7 +432,8 @@ __ipq_rcv_skb(struct sk_buff *skb) if (type <= IPQM_BASE) return; - if (security_netlink_recv(skb, CAP_NET_ADMIN)) + if (security_netlink_recv(skb, CAP_NET_ADMIN, + sock_net(skb->sk)->user_ns)) RCV_SKB_FAIL(-EPERM); spin_lock_bh(&queue_lock); diff --git a/net/ipv6/netfilter/ip6_queue.c b/net/ipv6/netfilter/ip6_queue.c index 2493948..8206bf3 100644 --- a/net/ipv6/netfilter/ip6_queue.c +++ b/net/ipv6/netfilter/ip6_queue.c @@ -433,7 +433,8 @@ __ipq_rcv_skb(struct sk_buff *skb) if (type <= IPQM_BASE) return; - if 
(security_netlink_recv(skb, CAP_NET_ADMIN)) + if (security_netlink_recv(skb, CAP_NET_ADMIN, + sock_net(skb->sk)->user_ns)) RCV_SKB_FAIL(-EPERM); spin_lock_bh(&queue_lock); diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c index 1905976..bcaff9d 100644 --- a/net/netfilter/nfnetlink.c +++ b/net/netfilter/nfnetlink.c @@ -130,7 +130,7 @@ static int nfnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) const struct nfnetlink_subsystem *ss; int type, err; - if (security_netlink_recv(skb, CAP_NET_ADMIN)) + if (security_netlink_recv(skb, CAP_NET_ADMIN, net->user_ns)) return -EPERM; /* All the messages must at least contain nfgenmsg */ diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c index 482fa57..00a101c 100644 --- a/net/netlink/genetlink.c +++ b/net/netlink/genetlink.c @@ -516,7 +516,7 @@ static int genl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) return -EOPNOTSUPP; if ((ops->flags & GENL_ADMIN_PERM) && - security_netlink_recv(skb, CAP_NET_ADMIN)) + security_netlink_recv(skb, CAP_NET_ADMIN, net->user_ns)) return -EPERM; if (nlh->nlmsg_flags & NLM_F_DUMP) { diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 0256b8a..1808e1e 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -2290,7 +2290,7 @@ static int xfrm_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) link = &xfrm_dispatch[type]; /* All operations require privileges, even GET */ - if (security_netlink_recv(skb, CAP_NET_ADMIN)) + if (security_netlink_recv(skb, CAP_NET_ADMIN, net->user_ns)) return -EPERM; if ((type == (XFRM_MSG_GETSA - XFRM_MSG_BASE) || diff --git a/security/commoncap.c b/security/commoncap.c index a93b3b7..1e48e6a 100644 --- a/security/commoncap.c +++ b/security/commoncap.c @@ -56,11 +56,9 @@ int cap_netlink_send(struct sock *sk, struct sk_buff *skb) return 0; } -int cap_netlink_recv(struct sk_buff *skb, int cap) +int cap_netlink_recv(struct sk_buff *skb, int cap, struct user_namespace *ns) { - if (!cap_raised(current_cap(), 
cap)) - return -EPERM; - return 0; + return security_capable(ns, current_cred(), cap); } EXPORT_SYMBOL(cap_netlink_recv); diff --git a/security/security.c b/security/security.c index 0e4fccf..0a1453e 100644 --- a/security/security.c +++ b/security/security.c @@ -941,9 +941,9 @@ int security_netlink_send(struct sock *sk, struct sk_buff *skb) return security_ops->netlink_send(sk, skb); } -int security_netlink_recv(struct sk_buff *skb, int cap) +int security_netlink_recv(struct sk_buff *skb, int cap, struct user_namespace *ns) { - return security_ops->netlink_recv(skb, cap); + return security_ops->netlink_recv(skb, cap, ns); } EXPORT_SYMBOL(security_netlink_recv); diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index 266a229..fe290bb 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -4723,13 +4723,14 @@ static int selinux_netlink_send(struct sock *sk, struct sk_buff *skb) return selinux_nlmsg_perm(sk, skb); } -static int selinux_netlink_recv(struct sk_buff *skb, int capability) +static int selinux_netlink_recv(struct sk_buff *skb, int capability, + struct user_namespace *ns) { int err; struct common_audit_data ad; u32 sid; - err = cap_netlink_recv(skb, capability); + err = cap_netlink_recv(skb, capability, ns); if (err) return err; -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
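For readers following the change to cap_netlink_recv(), here is a minimal userspace sketch of the difference between an untargeted check (only the caller's effective set matters) and a targeted check (the capability must hold with respect to a particular user namespace). All names below (model_userns, model_ns_capable and so on) are hypothetical and exist only for illustration. The walk up the namespace hierarchy is an assumption about how targeted capabilities roughly behaved in kernels of this era, not a copy of the kernel code.

```c
/* Userspace model only: nothing here is a kernel definition. */
#include <stdbool.h>
#include <stddef.h>

struct model_user;

struct model_userns {
	struct model_user *creator;   /* NULL for the initial namespace */
};

struct model_user {
	struct model_userns *ns;      /* namespace this user lives in */
};

struct model_cred {
	struct model_user *user;
	unsigned long long effective; /* bitmask of raised capabilities */
};

#define MODEL_CAP_NET_ADMIN 12

/* Untargeted check: look only at the caller's effective set. */
static bool model_cap_raised(const struct model_cred *cred, int cap)
{
	return (cred->effective >> cap) & 1ULL;
}

/* Targeted check: does cred hold cap with respect to target? */
static bool model_ns_capable(const struct model_cred *cred,
			     const struct model_userns *target, int cap)
{
	for (;;) {
		/* Assumed rule: the creator of a child namespace is privileged over it. */
		if (target->creator != NULL && target->creator == cred->user)
			return true;
		/* Same namespace: fall back to the effective set. */
		if (target == cred->user->ns)
			return model_cap_raised(cred, cap);
		/* Reached the initial namespace without a match. */
		if (target->creator == NULL)
			return false;
		/* Otherwise retry against the parent of the target namespace. */
		target = target->creator->ns;
	}
}

/* Fixtures: a host user creates a child namespace holding a "container root". */
static struct model_userns model_init_ns = { NULL };
static struct model_user model_host_user = { &model_init_ns };
static struct model_user model_other_host_user = { &model_init_ns };
static struct model_userns model_child_ns = { &model_host_user };
static struct model_user model_container_root = { &model_child_ns };

static const struct model_cred model_container_root_cred = {
	&model_container_root, 1ULL << MODEL_CAP_NET_ADMIN
};
static const struct model_cred model_creator_cred = { &model_host_user, 0 };
static const struct model_cred model_other_cred = { &model_other_host_user, 0 };
```

In this model the container root passes the untargeted check everywhere, but passes the targeted check only against its own namespace, which is the property the extra user_namespace argument is meant to provide.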
* [PATCH 15/15] make kernel/signal.c user ns safe (v2) [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> ` (14 preceding siblings ...) 2011-09-02 19:56 ` [PATCH 14/15] net: pass user_ns to cap_netlink_recv() Serge Hallyn @ 2011-09-02 19:56 ` Serge Hallyn 15 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w From: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Signed-off-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> --- kernel/signal.c | 26 ++++++++++++++++++++++---- 1 files changed, 22 insertions(+), 4 deletions(-) diff --git a/kernel/signal.c b/kernel/signal.c index 291c970..c07b970 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -27,6 +27,7 @@ #include <linux/capability.h> #include <linux/freezer.h> #include <linux/pid_namespace.h> +#include <linux/user_namespace.h> #include <linux/nsproxy.h> #define CREATE_TRACE_POINTS #include <trace/events/signal.h> @@ -1073,7 +1074,8 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t, q->info.si_code = SI_USER; q->info.si_pid = task_tgid_nr_ns(current, task_active_pid_ns(t)); - q->info.si_uid = current_uid(); + q->info.si_uid = user_ns_map_uid(task_cred_xxx(t, user_ns), + current_cred(), current_uid()); break; case (unsigned long) SEND_SIG_PRIV: q->info.si_signo = sig; @@ -1363,6 +1365,11 @@ int kill_pid_info_as_uid(int sig, struct siginfo *info, struct pid *pid, goto out_unlock; } pcred = __task_cred(p); + /* + * this is called (only) from drivers/usb/core/devio.c. + * Do we need to add user_ns to urb and usb_device, so + * we can pass them in here? 
+ */ if (si_fromuser(info) && euid != pcred->suid && euid != pcred->uid && uid != pcred->suid && uid != pcred->uid) { @@ -1618,7 +1625,8 @@ bool do_notify_parent(struct task_struct *tsk, int sig) */ rcu_read_lock(); info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns); - info.si_uid = __task_cred(tsk)->uid; + info.si_uid = user_ns_map_uid(task_cred_xxx(tsk->parent, user_ns), + __task_cred(tsk), __task_cred(tsk)->uid); rcu_read_unlock(); info.si_utime = cputime_to_clock_t(cputime_add(tsk->utime, @@ -1688,6 +1696,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, unsigned long flags; struct task_struct *parent; struct sighand_struct *sighand; + const struct cred *cred; if (for_ptracer) { parent = tsk->parent; @@ -1703,7 +1712,9 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, */ rcu_read_lock(); info.si_pid = task_pid_nr_ns(tsk, parent->nsproxy->pid_ns); - info.si_uid = __task_cred(tsk)->uid; + cred = __task_cred(tsk); + info.si_uid = user_ns_map_uid(task_cred_xxx(parent, user_ns), + cred, cred->uid); rcu_read_unlock(); info.si_utime = cputime_to_clock_t(tsk->utime); @@ -2122,7 +2133,10 @@ static int ptrace_signal(int signr, siginfo_t *info, info->si_errno = 0; info->si_code = SI_USER; info->si_pid = task_pid_vnr(current->parent); - info->si_uid = task_uid(current->parent); + /* we can cache cred if this performs poorly */ + info->si_uid = user_ns_map_uid(current_user_ns(), + __task_cred(current->parent), + task_uid(current->parent)); } /* If the (new) signal is now blocked, requeue it. */ @@ -2552,6 +2566,10 @@ SYSCALL_DEFINE2(rt_sigpending, sigset_t __user *, set, size_t, sigsetsize) #ifndef HAVE_ARCH_COPY_SIGINFO_TO_USER +/* + * send_signal has converted the sender's uid to the receiver + * task user namespace, so no need to convert here + */ int copy_siginfo_to_user(siginfo_t __user *to, siginfo_t *from) { int err; -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
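The signal.c changes above route every si_uid through user_ns_map_uid() so that the receiver sees the sender's uid as it appears in the receiver's user namespace. The helper itself is not part of this patch, so the following is only a userspace sketch of what such a mapping might do. It assumes three cases: same namespace returns the uid unchanged, the creator of the target namespace (or of one of its ancestors) appears as uid 0, and anyone else appears as an overflow uid. The names and the value 65534 are illustrative assumptions.

```c
/* Userspace sketch only: hypothetical names, not the kernel helper. */
#include <stddef.h>

#define SKETCH_OVERFLOW_UID 65534u   /* assumed "unknown user" value */

struct sketch_user;

struct sketch_ns {
	const struct sketch_user *creator;   /* NULL for the initial namespace */
};

struct sketch_user {
	const struct sketch_ns *ns;          /* namespace the user lives in */
	unsigned int uid;
};

/* Map the sender's uid into the namespace "to", as seen by a receiver there. */
static unsigned int sketch_map_uid(const struct sketch_ns *to,
				   const struct sketch_user *sender,
				   unsigned int uid)
{
	const struct sketch_ns *tmp;

	/* Sender and receiver share a namespace: no translation needed. */
	if (to == sender->ns)
		return uid;

	/* Sender created the target namespace or one of its ancestors. */
	for (tmp = to; tmp->creator != NULL; tmp = tmp->creator->ns) {
		if (tmp->creator == sender)
			return 0;
	}

	/* No relationship the receiver can express: report the overflow uid. */
	return SKETCH_OVERFLOW_UID;
}

/* Fixtures: host uid 1000 creates a child namespace containing uid 500. */
static const struct sketch_ns sketch_init_ns = { NULL };
static const struct sketch_user sketch_host_user = { &sketch_init_ns, 1000 };
static const struct sketch_user sketch_other_host = { &sketch_init_ns, 1001 };
static const struct sketch_ns sketch_child_ns = { &sketch_host_user };
static const struct sketch_user sketch_child_user = { &sketch_child_ns, 500 };
```

Under these assumptions a task inside the child namespace never sees a raw host uid in siginfo: it sees its own namespace's uids, 0 for the namespace creator, or the overflow value.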
* [PATCH 09/15] user ns: convert ipv6 to targeted capabilities 2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn ` (5 preceding siblings ...) [not found] ` <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> @ 2011-09-02 19:56 ` Serge Hallyn 2011-09-02 19:56 ` [PATCH 10/15] net/core/scm.c: target capable() calls to user_ns owning the net_ns Serge Hallyn ` (4 subsequent siblings) 11 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap Cc: Serge E. Hallyn From: "Serge E. Hallyn" <serge.hallyn@canonical.com> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Eric W. Biederman <ebiederm@xmission.com> --- net/ipv6/addrconf.c | 4 ++-- net/ipv6/af_inet6.c | 6 ++++-- net/ipv6/datagram.c | 6 +++--- net/ipv6/ip6_flowlabel.c | 24 ++++++++++++++---------- net/ipv6/ip6_tunnel.c | 4 ++-- net/ipv6/ip6mr.c | 2 +- net/ipv6/ipv6_sockglue.c | 7 ++++--- net/ipv6/netfilter/ip6_tables.c | 8 ++++---- net/ipv6/route.c | 2 +- net/ipv6/sit.c | 10 +++++----- 10 files changed, 40 insertions(+), 33 deletions(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index f012ebd..871e5cf 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -2230,7 +2230,7 @@ int addrconf_add_ifaddr(struct net *net, void __user *arg) struct in6_ifreq ireq; int err; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (copy_from_user(&ireq, arg, sizeof(struct in6_ifreq))) @@ -2249,7 +2249,7 @@ int addrconf_del_ifaddr(struct net *net, void __user *arg) struct in6_ifreq ireq; int err; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (copy_from_user(&ireq, arg, sizeof(struct in6_ifreq))) diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index 3b5669a..1854ffe 100644 --- a/net/ipv6/af_inet6.c +++ 
b/net/ipv6/af_inet6.c @@ -160,7 +160,8 @@ lookup_protocol: } err = -EPERM; - if (sock->type == SOCK_RAW && !kern && !capable(CAP_NET_RAW)) + if (sock->type == SOCK_RAW && !kern && + !ns_capable(net->user_ns, CAP_NET_RAW)) goto out_rcu_unlock; sock->ops = answer->ops; @@ -281,7 +282,8 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) return -EINVAL; snum = ntohs(addr->sin6_port); - if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE)) + if (snum && snum < PROT_SOCK && + !ns_capable(sock_net(sk)->user_ns, CAP_NET_BIND_SERVICE)) return -EACCES; lock_sock(sk); diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c index 9ef1831..33b1b0f 100644 --- a/net/ipv6/datagram.c +++ b/net/ipv6/datagram.c @@ -701,7 +701,7 @@ int datagram_send_ctl(struct net *net, err = -EINVAL; goto exit_f; } - if (!capable(CAP_NET_RAW)) { + if (!ns_capable(net->user_ns, CAP_NET_RAW)) { err = -EPERM; goto exit_f; } @@ -721,7 +721,7 @@ int datagram_send_ctl(struct net *net, err = -EINVAL; goto exit_f; } - if (!capable(CAP_NET_RAW)) { + if (!ns_capable(net->user_ns, CAP_NET_RAW)) { err = -EPERM; goto exit_f; } @@ -746,7 +746,7 @@ int datagram_send_ctl(struct net *net, err = -EINVAL; goto exit_f; } - if (!capable(CAP_NET_RAW)) { + if (!ns_capable(net->user_ns, CAP_NET_RAW)) { err = -EPERM; goto exit_f; } diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c index f3caf1b..4726c02 100644 --- a/net/ipv6/ip6_flowlabel.c +++ b/net/ipv6/ip6_flowlabel.c @@ -294,21 +294,22 @@ struct ipv6_txoptions *fl6_merge_options(struct ipv6_txoptions * opt_space, return opt_space; } -static unsigned long check_linger(unsigned long ttl) +static unsigned long check_linger(unsigned long ttl, struct user_namespace *ns) { if (ttl < FL_MIN_LINGER) return FL_MIN_LINGER*HZ; - if (ttl > FL_MAX_LINGER && !capable(CAP_NET_ADMIN)) + if (ttl > FL_MAX_LINGER && !ns_capable(ns, CAP_NET_ADMIN)) return 0; return ttl*HZ; } -static int fl6_renew(struct ip6_flowlabel *fl, unsigned long 
linger, unsigned long expires) +static int fl6_renew(struct ip6_flowlabel *fl, unsigned long linger, + unsigned long expires, struct user_namespace *ns) { - linger = check_linger(linger); + linger = check_linger(linger, ns); if (!linger) return -EPERM; - expires = check_linger(expires); + expires = check_linger(expires, ns); if (!expires) return -EPERM; fl->lastuse = jiffies; @@ -375,7 +376,7 @@ fl_create(struct net *net, struct in6_flowlabel_req *freq, char __user *optval, fl->fl_net = hold_net(net); fl->expires = jiffies; - err = fl6_renew(fl, freq->flr_linger, freq->flr_expires); + err = fl6_renew(fl, freq->flr_linger, freq->flr_expires, net->user_ns); if (err) goto done; fl->share = freq->flr_share; @@ -425,7 +426,7 @@ static int mem_check(struct sock *sk) if (room <= 0 || ((count >= FL_MAX_PER_SOCK || (count > 0 && room < FL_MAX_SIZE/2) || room < FL_MAX_SIZE/4) && - !capable(CAP_NET_ADMIN))) + !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))) return -ENOBUFS; return 0; @@ -507,17 +508,20 @@ int ipv6_flowlabel_opt(struct sock *sk, char __user *optval, int optlen) read_lock_bh(&ip6_sk_fl_lock); for (sfl = np->ipv6_fl_list; sfl; sfl = sfl->next) { if (sfl->fl->label == freq.flr_label) { - err = fl6_renew(sfl->fl, freq.flr_linger, freq.flr_expires); + err = fl6_renew(sfl->fl, freq.flr_linger, freq.flr_expires, + net->user_ns); read_unlock_bh(&ip6_sk_fl_lock); return err; } } read_unlock_bh(&ip6_sk_fl_lock); - if (freq.flr_share == IPV6_FL_S_NONE && capable(CAP_NET_ADMIN)) { + if (freq.flr_share == IPV6_FL_S_NONE && + ns_capable(net->user_ns, CAP_NET_ADMIN)) { fl = fl_lookup(net, freq.flr_label); if (fl) { - err = fl6_renew(fl, freq.flr_linger, freq.flr_expires); + err = fl6_renew(fl, freq.flr_linger, freq.flr_expires, + net->user_ns); fl_release(fl); return err; } diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index 0bc9888..c430d69 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -1269,7 +1269,7 @@ ip6_tnl_ioctl(struct net_device 
*dev, struct ifreq *ifr, int cmd) case SIOCADDTUNNEL: case SIOCCHGTUNNEL: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) break; err = -EFAULT; if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof (p))) @@ -1304,7 +1304,7 @@ ip6_tnl_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd) break; case SIOCDELTUNNEL: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) break; if (dev == ip6n->fb_tnl_dev) { diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c index 705c828..1649ccd 100644 --- a/net/ipv6/ip6mr.c +++ b/net/ipv6/ip6mr.c @@ -1582,7 +1582,7 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, uns return -ENOENT; if (optname != MRT6_INIT) { - if (sk != mrt->mroute6_sk && !capable(CAP_NET_ADMIN)) + if (sk != mrt->mroute6_sk && !ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EACCES; } diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c index 147ede38..485e181 100644 --- a/net/ipv6/ipv6_sockglue.c +++ b/net/ipv6/ipv6_sockglue.c @@ -343,7 +343,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname, break; case IPV6_TRANSPARENT: - if (!capable(CAP_NET_ADMIN)) { + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) { retv = -EPERM; break; } @@ -381,7 +381,8 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname, /* hop-by-hop / destination options are privileged option */ retv = -EPERM; - if (optname != IPV6_RTHDR && !capable(CAP_NET_RAW)) + if (optname != IPV6_RTHDR && + !ns_capable(net->user_ns, CAP_NET_RAW)) break; opt = ipv6_renew_options(sk, np->opt, optname, @@ -725,7 +726,7 @@ done: case IPV6_IPSEC_POLICY: case IPV6_XFRM_POLICY: retv = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; retv = xfrm_user_policy(sk, optname, optval, optlen); break; diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index 
94874b0..7fce7d8 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -1869,7 +1869,7 @@ compat_do_ip6t_set_ctl(struct sock *sk, int cmd, void __user *user, { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -1984,7 +1984,7 @@ compat_do_ip6t_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -2006,7 +2006,7 @@ do_ip6t_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len) { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -2031,7 +2031,7 @@ do_ip6t_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 9e69eb0..f00c18d 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1938,7 +1938,7 @@ int ipv6_route_ioctl(struct net *net, unsigned int cmd, void __user *arg) switch(cmd) { case SIOCADDRT: /* Add a route */ case SIOCDELRT: /* Delete a route */ - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; err = copy_from_user(&rtmsg, arg, sizeof(struct in6_rtmsg)); diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index 00b15ac..7438711 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -308,7 +308,7 @@ static int ipip6_tunnel_get_prl(struct ip_tunnel *t, /* For simple GET or for root users, * we try harder to allocate. */ - kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ? + kp = (cmax <= 1 || ns_capable(dev_net(t->dev)->user_ns, CAP_NET_ADMIN)) ? 
kcalloc(cmax, sizeof(*kp), GFP_KERNEL) : NULL; @@ -929,7 +929,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd) case SIOCADDTUNNEL: case SIOCCHGTUNNEL: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) goto done; err = -EFAULT; @@ -988,7 +988,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd) case SIOCDELTUNNEL: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) goto done; if (dev == sitn->fb_tunnel_dev) { @@ -1021,7 +1021,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd) case SIOCDELPRL: case SIOCCHGPRL: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) goto done; err = -EINVAL; if (dev == sitn->fb_tunnel_dev) @@ -1050,7 +1050,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd) case SIOCCHG6RD: case SIOCDEL6RD: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) goto done; err = -EFAULT; -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
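The ip6_flowlabel.c hunk above threads a user namespace through check_linger() and fl6_renew() so that the "may exceed the maximum linger" decision is made against the namespace owning the network namespace. The control flow can be sketched in userspace as follows. The boolean ns_admin stands in for the result of ns_capable(ns, CAP_NET_ADMIN), and the constants (tick rate, minimum and maximum linger) are assumed values chosen for illustration.

```c
/* Userspace sketch only: constants and names are illustrative assumptions. */
#include <errno.h>
#include <stdbool.h>

#define SKETCH_HZ            100UL   /* assumed tick rate */
#define SKETCH_FL_MIN_LINGER 6UL     /* assumed minimum, in seconds */
#define SKETCH_FL_MAX_LINGER 60UL    /* assumed maximum, in seconds */

/*
 * Returns the linger time in ticks, or 0 if the request exceeds the
 * maximum and the caller lacks privilege over the owning namespace.
 */
static unsigned long sketch_check_linger(unsigned long ttl, bool ns_admin)
{
	if (ttl < SKETCH_FL_MIN_LINGER)
		return SKETCH_FL_MIN_LINGER * SKETCH_HZ;
	if (ttl > SKETCH_FL_MAX_LINGER && !ns_admin)
		return 0;
	return ttl * SKETCH_HZ;
}

/* Mirrors the shape of fl6_renew(): both values must pass the check. */
static int sketch_fl6_renew(unsigned long linger, unsigned long expires,
			    bool ns_admin)
{
	if (!sketch_check_linger(linger, ns_admin))
		return -EPERM;
	if (!sketch_check_linger(expires, ns_admin))
		return -EPERM;
	return 0;
}
```

The only behavioural difference the patch introduces is where ns_admin comes from: privilege in an unrelated user namespace no longer lifts the maximum.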
* [PATCH 10/15] net/core/scm.c: target capable() calls to user_ns owning the net_ns 2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn ` (6 preceding siblings ...) 2011-09-02 19:56 ` [PATCH 09/15] user ns: convert ipv6 to targeted capabilities Serge Hallyn @ 2011-09-02 19:56 ` Serge Hallyn 2011-09-02 19:56 ` [PATCH 13/15] userns: net: make many network capable calls targeted Serge Hallyn ` (3 subsequent siblings) 11 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap Cc: Serge E. Hallyn From: "Serge E. Hallyn" <serge.hallyn@canonical.com> The uid/gid comparisons don't have to be pulled out. This just seemed more easily proved correct. Changelog: mark struct cred arg const Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Eric W. Biederman <ebiederm@xmission.com> --- net/core/scm.c | 41 ++++++++++++++++++++++++++++++++++------- 1 files changed, 34 insertions(+), 7 deletions(-) diff --git a/net/core/scm.c b/net/core/scm.c index 811b53f..4f376bf 100644 --- a/net/core/scm.c +++ b/net/core/scm.c @@ -43,17 +43,44 @@ * setu(g)id. 
*/ -static __inline__ int scm_check_creds(struct ucred *creds) +static __inline__ bool uidequiv(const struct cred *src, struct ucred *tgt, + struct user_namespace *ns) +{ + if (src->user_ns != ns) + goto check_capable; + if (src->uid == tgt->uid || src->euid == tgt->uid || + src->suid == tgt->uid) + return true; +check_capable: + if (ns_capable(ns, CAP_SETUID)) + return true; + return false; +} + +static __inline__ bool gidequiv(const struct cred *src, struct ucred *tgt, + struct user_namespace *ns) +{ + if (src->user_ns != ns) + goto check_capable; + if (src->gid == tgt->gid || src->egid == tgt->gid || + src->sgid == tgt->gid) + return true; +check_capable: + if (ns_capable(ns, CAP_SETGID)) + return true; + return false; +} + +static __inline__ int scm_check_creds(struct ucred *creds, struct socket *sock) { const struct cred *cred = current_cred(); + struct user_namespace *ns = sock_net(sock->sk)->user_ns; - if ((creds->pid == task_tgid_vnr(current) || capable(CAP_SYS_ADMIN)) && - ((creds->uid == cred->uid || creds->uid == cred->euid || - creds->uid == cred->suid) || capable(CAP_SETUID)) && - ((creds->gid == cred->gid || creds->gid == cred->egid || - creds->gid == cred->sgid) || capable(CAP_SETGID))) { + if ((creds->pid == task_tgid_vnr(current) || ns_capable(ns, CAP_SYS_ADMIN)) && + uidequiv(cred, creds, ns) && gidequiv(cred, creds, ns)) { return 0; } + return -EPERM; } @@ -169,7 +196,7 @@ int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p) if (cmsg->cmsg_len != CMSG_LEN(sizeof(struct ucred))) goto error; memcpy(&p->creds, CMSG_DATA(cmsg), sizeof(struct ucred)); - err = scm_check_creds(&p->creds); + err = scm_check_creds(&p->creds, sock); if (err) goto error; -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
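The uidequiv()/gidequiv() helpers in the scm.c patch above separate two questions: do the numeric ids match within the same user namespace, and if not, does the sender hold CAP_SETUID/CAP_SETGID over the target namespace. A small userspace sketch of the uid half follows. Namespaces are reduced to integer ids and the result of ns_capable(ns, CAP_SETUID) is passed in as a boolean, so the struct and function names here are hypothetical.

```c
/* Userspace sketch only: hypothetical types, not the kernel's struct cred. */
#include <stdbool.h>

struct sketch_cred {
	int ns_id;                     /* stand-in for cred->user_ns */
	unsigned int uid, euid, suid;
};

/*
 * May a sender with credentials src claim claimed_uid on a socket whose
 * network namespace is owned by target_ns?
 */
static bool sketch_uidequiv(const struct sketch_cred *src,
			    unsigned int claimed_uid,
			    int target_ns, bool setuid_over_target)
{
	/* Numeric comparison is only meaningful inside the same namespace. */
	if (src->ns_id == target_ns &&
	    (src->uid == claimed_uid || src->euid == claimed_uid ||
	     src->suid == claimed_uid))
		return true;

	/* Otherwise only privilege over the target namespace suffices. */
	return setuid_over_target;
}

/* Fixture: an unprivileged user with uid 1000 in namespace 1. */
static const struct sketch_cred sketch_alice = { 1, 1000, 1000, 1000 };
```

Note the last case in particular: a matching numeric uid from a different namespace is not treated as equivalent, which is the reason the comparisons were pulled out into helpers.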
* [PATCH 13/15] userns: net: make many network capable calls targeted 2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn ` (7 preceding siblings ...) 2011-09-02 19:56 ` [PATCH 10/15] net/core/scm.c: target capable() calls to user_ns owning the net_ns Serge Hallyn @ 2011-09-02 19:56 ` Serge Hallyn 2011-09-02 19:56 ` [PATCH 14/15] net: pass user_ns to cap_netlink_recv() Serge Hallyn ` (2 subsequent siblings) 11 siblings, 0 replies; 69+ messages in thread From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw) To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap Cc: Serge E. Hallyn From: "Serge E. Hallyn" <serge.hallyn@canonical.com> When privilege is protecting a namespaced network resource, having the required privilege targeted toward the user namespace which owns the resource suffices. As with other patches, a big concern here is that we cleanly separate the cases where privilege protects a network resource from the cases where privilege can lead to laxer constraints on input and, subsequently, the ability to corrupt, crash, or own the host kernel. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Eric W.
Biederman <ebiederm@xmission.com> --- net/8021q/vlan.c | 12 ++++++------ net/bridge/br_ioctl.c | 22 +++++++++++----------- net/bridge/br_sysfs_br.c | 8 ++++---- net/bridge/br_sysfs_if.c | 2 +- net/bridge/netfilter/ebtables.c | 8 ++++---- net/core/ethtool.c | 2 +- net/ipv4/arp.c | 2 +- net/ipv4/devinet.c | 4 ++-- net/ipv4/fib_frontend.c | 2 +- net/ipv4/ip_options.c | 6 +++--- net/ipv4/ip_sockglue.c | 4 ++-- net/ipv4/ipip.c | 4 ++-- net/ipv4/ipmr.c | 2 +- net/ipv4/netfilter/arp_tables.c | 8 ++++---- net/ipv4/netfilter/ip_tables.c | 8 ++++---- net/netfilter/ipset/ip_set_core.c | 2 +- net/netfilter/ipvs/ip_vs_ctl.c | 4 ++-- net/packet/af_packet.c | 2 +- 18 files changed, 51 insertions(+), 51 deletions(-) diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c index 8970ba1..7d12f63 100644 --- a/net/8021q/vlan.c +++ b/net/8021q/vlan.c @@ -558,7 +558,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg) switch (args.cmd) { case SET_VLAN_INGRESS_PRIORITY_CMD: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; vlan_dev_set_ingress_priority(dev, args.u.skb_priority, @@ -568,7 +568,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg) case SET_VLAN_EGRESS_PRIORITY_CMD: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; err = vlan_dev_set_egress_priority(dev, args.u.skb_priority, @@ -577,7 +577,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg) case SET_VLAN_FLAG_CMD: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; err = vlan_dev_change_flags(dev, args.vlan_qos ? 
args.u.flag : 0, @@ -586,7 +586,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg) case SET_VLAN_NAME_TYPE_CMD: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; if ((args.u.name_type >= 0) && (args.u.name_type < VLAN_NAME_TYPE_HIGHEST)) { @@ -602,14 +602,14 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg) case ADD_VLAN_CMD: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; err = register_vlan_device(dev, args.u.VID); break; case DEL_VLAN_CMD: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; unregister_vlan_dev(dev, NULL); err = 0; diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c index 7222fe1..c82f9cb 100644 --- a/net/bridge/br_ioctl.c +++ b/net/bridge/br_ioctl.c @@ -88,7 +88,7 @@ static int add_del_if(struct net_bridge *br, int ifindex, int isadd) struct net_device *dev; int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; dev = __dev_get_by_index(dev_net(br->dev), ifindex); @@ -178,25 +178,25 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) } case BRCTL_SET_BRIDGE_FORWARD_DELAY: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; return br_set_forward_delay(br, args[1]); case BRCTL_SET_BRIDGE_HELLO_TIME: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; return br_set_hello_time(br, args[1]); case BRCTL_SET_BRIDGE_MAX_AGE: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; return br_set_max_age(br, args[1]); case BRCTL_SET_AGEING_TIME: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; br->ageing_time = clock_t_to_jiffies(args[1]); @@ -236,14 +236,14 @@ static int 
old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) } case BRCTL_SET_BRIDGE_STP_STATE: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; br_stp_set_enabled(br, args[1]); return 0; case BRCTL_SET_BRIDGE_PRIORITY: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; spin_lock_bh(&br->lock); @@ -256,7 +256,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) struct net_bridge_port *p; int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; spin_lock_bh(&br->lock); @@ -273,7 +273,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) struct net_bridge_port *p; int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; spin_lock_bh(&br->lock); @@ -330,7 +330,7 @@ static int old_deviceless(struct net *net, void __user *uarg) { char buf[IFNAMSIZ]; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (copy_from_user(buf, (void __user *)args[1], IFNAMSIZ)) @@ -360,7 +360,7 @@ int br_ioctl_deviceless_stub(struct net *net, unsigned int cmd, void __user *uar { char buf[IFNAMSIZ]; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (copy_from_user(buf, uarg, IFNAMSIZ)) diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c index 68b893e..7f4fa3a 100644 --- a/net/bridge/br_sysfs_br.c +++ b/net/bridge/br_sysfs_br.c @@ -36,7 +36,7 @@ static ssize_t store_bridge_parm(struct device *d, unsigned long val; int err; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; val = simple_strtoul(buf, &endp, 0); @@ -132,7 +132,7 @@ static ssize_t store_stp_state(struct device *d, char *endp; unsigned long val; - if (!capable(CAP_NET_ADMIN)) + if 
(!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; val = simple_strtoul(buf, &endp, 0); @@ -267,7 +267,7 @@ static ssize_t store_group_addr(struct device *d, unsigned new_addr[6]; int i; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; if (sscanf(buf, "%x:%x:%x:%x:%x:%x", @@ -304,7 +304,7 @@ static ssize_t store_flush(struct device *d, { struct net_bridge *br = to_bridge(d); - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; br_fdb_flush(br); diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c index 6229b62..9cb4d2e 100644 --- a/net/bridge/br_sysfs_if.c +++ b/net/bridge/br_sysfs_if.c @@ -209,7 +209,7 @@ static ssize_t brport_store(struct kobject * kobj, char *endp; unsigned long val; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(dev_net(p->br->dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; val = simple_strtoul(buf, &endp, 0); diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c index 5864cc4..cc1198b 100644 --- a/net/bridge/netfilter/ebtables.c +++ b/net/bridge/netfilter/ebtables.c @@ -1463,7 +1463,7 @@ static int do_ebt_set_ctl(struct sock *sk, { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch(cmd) { @@ -1485,7 +1485,7 @@ static int do_ebt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) struct ebt_replace tmp; struct ebt_table *t; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; if (copy_from_user(&tmp, user, sizeof(tmp))) @@ -2276,7 +2276,7 @@ static int compat_do_ebt_set_ctl(struct sock *sk, { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -2299,7 +2299,7 @@ static int compat_do_ebt_get_ctl(struct sock *sk, int cmd, struct compat_ebt_replace tmp; struct ebt_table *t; - if 
(!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; /* try real handler in case userland supplied needed padding */ diff --git a/net/core/ethtool.c b/net/core/ethtool.c index 6cdba5f..56878bf 100644 --- a/net/core/ethtool.c +++ b/net/core/ethtool.c @@ -1676,7 +1676,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr) case ETHTOOL_GFEATURES: break; default: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; } diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index 96a164a..023ad24 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -1175,7 +1175,7 @@ int arp_ioctl(struct net *net, unsigned int cmd, void __user *arg) switch (cmd) { case SIOCDARP: case SIOCSARP: - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; case SIOCGARP: err = copy_from_user(&r, arg, sizeof(struct arpreq)); diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index bc19bd0..93b5b0b 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -728,7 +728,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg) case SIOCSIFFLAGS: ret = -EACCES; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) goto out; break; case SIOCSIFADDR: /* Set interface address (and family) */ @@ -736,7 +736,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg) case SIOCSIFDSTADDR: /* Set the destination address */ case SIOCSIFNETMASK: /* Set the netmask for the interface */ ret = -EACCES; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) goto out; ret = -EINVAL; if (sin->sin_family != AF_INET) diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 92fc5f6..8f34a07 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -437,7 +437,7 @@ int ip_rt_ioctl(struct net *net, unsigned int cmd, void __user *arg) switch (cmd) { case SIOCADDRT: /* Add a route */ case SIOCDELRT: /* 
Delete a route */ - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (copy_from_user(&rt, arg, sizeof(rt))) diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c index ec93335..21df700 100644 --- a/net/ipv4/ip_options.c +++ b/net/ipv4/ip_options.c @@ -396,7 +396,7 @@ int ip_options_compile(struct net *net, optptr[2] += 8; break; default: - if (!skb && !capable(CAP_NET_RAW)) { + if (!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) { pp_ptr = optptr + 3; goto error; } @@ -432,7 +432,7 @@ int ip_options_compile(struct net *net, opt->router_alert = optptr - iph; break; case IPOPT_CIPSO: - if ((!skb && !capable(CAP_NET_RAW)) || opt->cipso) { + if ((!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) || opt->cipso) { pp_ptr = optptr; goto error; } @@ -445,7 +445,7 @@ int ip_options_compile(struct net *net, case IPOPT_SEC: case IPOPT_SID: default: - if (!skb && !capable(CAP_NET_RAW)) { + if (!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) { pp_ptr = optptr; goto error; } diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c index 8905e92..6408507 100644 --- a/net/ipv4/ip_sockglue.c +++ b/net/ipv4/ip_sockglue.c @@ -955,13 +955,13 @@ mc_msf_out: case IP_IPSEC_POLICY: case IP_XFRM_POLICY: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) break; err = xfrm_user_policy(sk, optname, optval, optlen); break; case IP_TRANSPARENT: - if (!capable(CAP_NET_ADMIN)) { + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) { err = -EPERM; break; } diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c index 378b20b..6725832 100644 --- a/net/ipv4/ipip.c +++ b/net/ipv4/ipip.c @@ -629,7 +629,7 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd) case SIOCADDTUNNEL: case SIOCCHGTUNNEL: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) goto done; err = -EFAULT; @@ -689,7 +689,7 @@ ipip_tunnel_ioctl (struct net_device *dev, struct 
ifreq *ifr, int cmd) case SIOCDELTUNNEL: err = -EPERM; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) goto done; if (dev == ipn->fb_tunnel_dev) { diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 58e8791..309aa0c 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -1204,7 +1204,7 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi if (optname != MRT_INIT) { if (sk != rcu_dereference_raw(mrt->mroute_sk) && - !capable(CAP_NET_ADMIN)) + !ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EACCES; } diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index fd7a3f6..acc908f 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -1534,7 +1534,7 @@ static int compat_do_arpt_set_ctl(struct sock *sk, int cmd, void __user *user, { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -1678,7 +1678,7 @@ static int compat_do_arpt_get_ctl(struct sock *sk, int cmd, void __user *user, { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -1699,7 +1699,7 @@ static int do_arpt_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -1723,7 +1723,7 @@ static int do_arpt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index 24e556e..72f2cde 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -1847,7 +1847,7 @@ compat_do_ipt_set_ctl(struct sock *sk, int cmd, void __user *user, { int ret; - if (!capable(CAP_NET_ADMIN)) + if 
(!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -1962,7 +1962,7 @@ compat_do_ipt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -1984,7 +1984,7 @@ do_ipt_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len) { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { @@ -2009,7 +2009,7 @@ do_ipt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) { int ret; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c index d7e86ef..38d69a5 100644 --- a/net/netfilter/ipset/ip_set_core.c +++ b/net/netfilter/ipset/ip_set_core.c @@ -1596,7 +1596,7 @@ ip_set_sockfn_get(struct sock *sk, int optval, void __user *user, int *len) void *data; int copylen = *len, ret = 0; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; if (optval != SO_IP_SET) return -EBADF; diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c index 2b771dc..db224ef 100644 --- a/net/netfilter/ipvs/ip_vs_ctl.c +++ b/net/netfilter/ipvs/ip_vs_ctl.c @@ -2284,7 +2284,7 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len) struct ip_vs_dest_user *udest_compat; struct ip_vs_dest_user_kern udest; - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_SET_MAX) @@ -2566,7 +2566,7 @@ do_ip_vs_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) struct netns_ipvs *ipvs = net_ipvs(net); BUG_ON(!net); - if (!capable(CAP_NET_ADMIN)) + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (cmd < 
IP_VS_BASE_CTL || cmd > IP_VS_SO_GET_MAX) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index c698cec..c2e6bb6 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1793,7 +1793,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol, __be16 proto = (__force __be16)protocol; /* weird, but documented */ int err; - if (!capable(CAP_NET_RAW)) + if (!ns_capable(net->user_ns, CAP_NET_RAW)) return -EPERM; if (sock->type != SOCK_DGRAM && sock->type != SOCK_RAW && sock->type != SOCK_PACKET) -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
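The conversions in patch 13 all follow one pattern: capable(CAP_X) becomes ns_capable(net->user_ns, CAP_X) (or sock_net(sk)->user_ns), so the check is made against the user namespace that owns the network namespace being acted on. The following is a minimal userspace sketch of that idea, not the kernel implementation. The model_* types and model_ns_capable() are hypothetical names, and the model deliberately ignores details such as any special treatment of a namespace's creator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Capability numbers as in linux/capability.h. */
#define CAP_NET_ADMIN 12
#define CAP_NET_RAW   13

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_user_ns {
	const struct model_user_ns *parent;	/* NULL for the initial namespace */
};

struct model_cred {
	const struct model_user_ns *user_ns;	/* namespace the task lives in */
	unsigned long long cap_effective;	/* bitmask of raised capabilities */
};

struct model_net {
	const struct model_user_ns *user_ns;	/* owner of this network namespace */
};

/*
 * A capability counts toward "target" only if the task's own user
 * namespace is "target" itself or one of its ancestors.
 */
static bool model_ns_capable(const struct model_user_ns *target,
			     const struct model_cred *cred, int cap)
{
	const struct model_user_ns *ns;

	assert(cred != NULL);
	for (ns = target; ns != NULL; ns = ns->parent) {
		if (ns == cred->user_ns)
			return (cred->cap_effective >> cap) & 1ULL;
	}
	return false;
}
```

In this model, a task holding CAP_NET_ADMIN in a child user namespace passes model_ns_capable(net->user_ns, ...) for a network namespace owned by that child, but fails it for one owned by the initial namespace. That is the behaviour the cover letter describes for veth configuration.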
* [PATCH 14/15] net: pass user_ns to cap_netlink_recv()
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
                   ` (8 preceding siblings ...)
  2011-09-02 19:56 ` [PATCH 13/15] userns: net: make many network capable calls targeted Serge Hallyn
@ 2011-09-02 19:56 ` Serge Hallyn
  2011-09-02 19:56 ` [PATCH 15/15] make kernel/signal.c user ns safe (v2) Serge Hallyn
  2011-09-13 14:43 ` user namespaces v3: continue targetting capabilities Serge E. Hallyn
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap
  Cc: Serge E. Hallyn

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

and make cap_netlink_recv() userns-aware

cap_netlink_recv() granted privilege whenever the capability was in
current_cap(), regardless of the user namespace. Fix that by targeting
the capability check against the user namespace which owns the skb.

The user namespace is passed in by the caller because sock_net() is a
static inline defined in net/sock.h, which we don't want to #include
where cap_netlink_recv() is defined (security/commoncap.c).

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W.
Biederman <ebiederm@xmission.com> --- drivers/scsi/scsi_netlink.c | 3 ++- include/linux/security.h | 14 +++++++++----- kernel/audit.c | 6 ++++-- net/core/rtnetlink.c | 3 ++- net/decnet/netfilter/dn_rtmsg.c | 3 ++- net/ipv4/netfilter/ip_queue.c | 3 ++- net/ipv6/netfilter/ip6_queue.c | 3 ++- net/netfilter/nfnetlink.c | 2 +- net/netlink/genetlink.c | 2 +- net/xfrm/xfrm_user.c | 2 +- security/commoncap.c | 6 ++---- security/security.c | 4 ++-- security/selinux/hooks.c | 5 +++-- 13 files changed, 33 insertions(+), 23 deletions(-) diff --git a/drivers/scsi/scsi_netlink.c b/drivers/scsi/scsi_netlink.c index 26a8a45..0aa2e57 100644 --- a/drivers/scsi/scsi_netlink.c +++ b/drivers/scsi/scsi_netlink.c @@ -111,7 +111,8 @@ scsi_nl_rcv_msg(struct sk_buff *skb) goto next_msg; } - if (security_netlink_recv(skb, CAP_SYS_ADMIN)) { + if (security_netlink_recv(skb, CAP_SYS_ADMIN, + sock_net(skb->sk)->user_ns)) { err = -EPERM; goto next_msg; } diff --git a/include/linux/security.h b/include/linux/security.h index ebd2a53..cfa1f47 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -95,7 +95,8 @@ struct xfrm_user_sec_ctx; struct seq_file; extern int cap_netlink_send(struct sock *sk, struct sk_buff *skb); -extern int cap_netlink_recv(struct sk_buff *skb, int cap); +extern int cap_netlink_recv(struct sk_buff *skb, int cap, + struct user_namespace *ns); void reset_security_ops(void); @@ -797,6 +798,7 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts) * @skb. * @skb contains the sk_buff structure for the netlink message. * @cap indicates the capability required + * @ns is the user namespace which owns skb * Return 0 if permission is granted. * * Security hooks for Unix domain networking. 
@@ -1557,7 +1559,8 @@ struct security_operations { struct sembuf *sops, unsigned nsops, int alter); int (*netlink_send) (struct sock *sk, struct sk_buff *skb); - int (*netlink_recv) (struct sk_buff *skb, int cap); + int (*netlink_recv) (struct sk_buff *skb, int cap, + struct user_namespace *ns); void (*d_instantiate) (struct dentry *dentry, struct inode *inode); @@ -1806,7 +1809,7 @@ void security_d_instantiate(struct dentry *dentry, struct inode *inode); int security_getprocattr(struct task_struct *p, char *name, char **value); int security_setprocattr(struct task_struct *p, char *name, void *value, size_t size); int security_netlink_send(struct sock *sk, struct sk_buff *skb); -int security_netlink_recv(struct sk_buff *skb, int cap); +int security_netlink_recv(struct sk_buff *skb, int cap, struct user_namespace *ns); int security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen); int security_secctx_to_secid(const char *secdata, u32 seclen, u32 *secid); void security_release_secctx(char *secdata, u32 seclen); @@ -2498,9 +2501,10 @@ static inline int security_netlink_send(struct sock *sk, struct sk_buff *skb) return cap_netlink_send(sk, skb); } -static inline int security_netlink_recv(struct sk_buff *skb, int cap) +static inline int security_netlink_recv(struct sk_buff *skb, int cap, + struct user_namespace *ns) { - return cap_netlink_recv(skb, cap); + return cap_netlink_recv(skb, cap, ns); } static inline int security_secid_to_secctx(u32 secid, char **secdata, u32 *seclen) diff --git a/kernel/audit.c b/kernel/audit.c index 0a1355c..48144c4 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -601,13 +601,15 @@ static int audit_netlink_ok(struct sk_buff *skb, u16 msg_type) case AUDIT_TTY_SET: case AUDIT_TRIM: case AUDIT_MAKE_EQUIV: - if (security_netlink_recv(skb, CAP_AUDIT_CONTROL)) + if (security_netlink_recv(skb, CAP_AUDIT_CONTROL, + sock_net(skb->sk)->user_ns)) err = -EPERM; break; case AUDIT_USER: case AUDIT_FIRST_USER_MSG ... 
AUDIT_LAST_USER_MSG: case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2: - if (security_netlink_recv(skb, CAP_AUDIT_WRITE)) + if (security_netlink_recv(skb, CAP_AUDIT_WRITE, + sock_net(skb->sk)->user_ns)) err = -EPERM; break; default: /* bad msg */ diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 99d9e95..4a444de 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -1931,7 +1931,8 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) sz_idx = type>>2; kind = type&3; - if (kind != 2 && security_netlink_recv(skb, CAP_NET_ADMIN)) + if (kind != 2 && security_netlink_recv(skb, CAP_NET_ADMIN, + net->user_ns)) return -EPERM; if (kind == 2 && nlh->nlmsg_flags&NLM_F_DUMP) { diff --git a/net/decnet/netfilter/dn_rtmsg.c b/net/decnet/netfilter/dn_rtmsg.c index 69975e0..2d052ab 100644 --- a/net/decnet/netfilter/dn_rtmsg.c +++ b/net/decnet/netfilter/dn_rtmsg.c @@ -108,7 +108,8 @@ static inline void dnrmg_receive_user_skb(struct sk_buff *skb) if (nlh->nlmsg_len < sizeof(*nlh) || skb->len < nlh->nlmsg_len) return; - if (security_netlink_recv(skb, CAP_NET_ADMIN)) + if (security_netlink_recv(skb, CAP_NET_ADMIN, + sock_net(skb->sk)->user_ns)) RCV_SKB_FAIL(-EPERM); /* Eventually we might send routing messages too */ diff --git a/net/ipv4/netfilter/ip_queue.c b/net/ipv4/netfilter/ip_queue.c index 5c9b9d9..51d7c52 100644 --- a/net/ipv4/netfilter/ip_queue.c +++ b/net/ipv4/netfilter/ip_queue.c @@ -432,7 +432,8 @@ __ipq_rcv_skb(struct sk_buff *skb) if (type <= IPQM_BASE) return; - if (security_netlink_recv(skb, CAP_NET_ADMIN)) + if (security_netlink_recv(skb, CAP_NET_ADMIN, + sock_net(skb->sk)->user_ns)) RCV_SKB_FAIL(-EPERM); spin_lock_bh(&queue_lock); diff --git a/net/ipv6/netfilter/ip6_queue.c b/net/ipv6/netfilter/ip6_queue.c index 2493948..8206bf3 100644 --- a/net/ipv6/netfilter/ip6_queue.c +++ b/net/ipv6/netfilter/ip6_queue.c @@ -433,7 +433,8 @@ __ipq_rcv_skb(struct sk_buff *skb) if (type <= IPQM_BASE) return; - if 
(security_netlink_recv(skb, CAP_NET_ADMIN)) + if (security_netlink_recv(skb, CAP_NET_ADMIN, + sock_net(skb->sk)->user_ns)) RCV_SKB_FAIL(-EPERM); spin_lock_bh(&queue_lock); diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c index 1905976..bcaff9d 100644 --- a/net/netfilter/nfnetlink.c +++ b/net/netfilter/nfnetlink.c @@ -130,7 +130,7 @@ static int nfnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) const struct nfnetlink_subsystem *ss; int type, err; - if (security_netlink_recv(skb, CAP_NET_ADMIN)) + if (security_netlink_recv(skb, CAP_NET_ADMIN, net->user_ns)) return -EPERM; /* All the messages must at least contain nfgenmsg */ diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c index 482fa57..00a101c 100644 --- a/net/netlink/genetlink.c +++ b/net/netlink/genetlink.c @@ -516,7 +516,7 @@ static int genl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) return -EOPNOTSUPP; if ((ops->flags & GENL_ADMIN_PERM) && - security_netlink_recv(skb, CAP_NET_ADMIN)) + security_netlink_recv(skb, CAP_NET_ADMIN, net->user_ns)) return -EPERM; if (nlh->nlmsg_flags & NLM_F_DUMP) { diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 0256b8a..1808e1e 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -2290,7 +2290,7 @@ static int xfrm_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) link = &xfrm_dispatch[type]; /* All operations require privileges, even GET */ - if (security_netlink_recv(skb, CAP_NET_ADMIN)) + if (security_netlink_recv(skb, CAP_NET_ADMIN, net->user_ns)) return -EPERM; if ((type == (XFRM_MSG_GETSA - XFRM_MSG_BASE) || diff --git a/security/commoncap.c b/security/commoncap.c index a93b3b7..1e48e6a 100644 --- a/security/commoncap.c +++ b/security/commoncap.c @@ -56,11 +56,9 @@ int cap_netlink_send(struct sock *sk, struct sk_buff *skb) return 0; } -int cap_netlink_recv(struct sk_buff *skb, int cap) +int cap_netlink_recv(struct sk_buff *skb, int cap, struct user_namespace *ns) { - if (!cap_raised(current_cap(), 
cap)) - return -EPERM; - return 0; + return security_capable(ns, current_cred(), cap); } EXPORT_SYMBOL(cap_netlink_recv); diff --git a/security/security.c b/security/security.c index 0e4fccf..0a1453e 100644 --- a/security/security.c +++ b/security/security.c @@ -941,9 +941,9 @@ int security_netlink_send(struct sock *sk, struct sk_buff *skb) return security_ops->netlink_send(sk, skb); } -int security_netlink_recv(struct sk_buff *skb, int cap) +int security_netlink_recv(struct sk_buff *skb, int cap, struct user_namespace *ns) { - return security_ops->netlink_recv(skb, cap); + return security_ops->netlink_recv(skb, cap, ns); } EXPORT_SYMBOL(security_netlink_recv); diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index 266a229..fe290bb 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -4723,13 +4723,14 @@ static int selinux_netlink_send(struct sock *sk, struct sk_buff *skb) return selinux_nlmsg_perm(sk, skb); } -static int selinux_netlink_recv(struct sk_buff *skb, int capability) +static int selinux_netlink_recv(struct sk_buff *skb, int capability, + struct user_namespace *ns) { int err; struct common_audit_data ad; u32 sid; - err = cap_netlink_recv(skb, capability); + err = cap_netlink_recv(skb, capability, ns); if (err) return err; -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 69+ messages in thread
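To make the effect of patch 14 concrete, here is a small userspace sketch contrasting the old check (only the sender's effective set is consulted) with the new one (the caller also passes the user namespace that owns the socket's network namespace). All sketch_* names are hypothetical. The real function calls security_capable(ns, current_cred(), cap), whose full rules are not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define CAP_NET_ADMIN 12	/* value from linux/capability.h */

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sketch_user_ns {
	const struct sketch_user_ns *parent;	/* NULL for the initial namespace */
};

struct sketch_cred {
	const struct sketch_user_ns *user_ns;
	unsigned long long cap_effective;
};

/* Old behaviour: the namespace of the receiving socket is never looked at. */
static int sketch_netlink_recv_old(const struct sketch_cred *cred, int cap)
{
	assert(cred != NULL);
	return ((cred->cap_effective >> cap) & 1ULL) ? 0 : -EPERM;
}

/*
 * New behaviour: the caller resolves the owning namespace, as in
 * sock_net(skb->sk)->user_ns, and the capability only counts if the
 * sender lives in that namespace or in one of its ancestors.
 */
static int sketch_netlink_recv_new(const struct sketch_cred *cred, int cap,
				   const struct sketch_user_ns *ns)
{
	for (; ns != NULL; ns = ns->parent) {
		if (ns == cred->user_ns)
			return sketch_netlink_recv_old(cred, cap);
	}
	return -EPERM;
}
```

With a sender that has CAP_NET_ADMIN only inside a child user namespace, the old function returns 0 for any socket, while the new one returns -EPERM when the socket's network namespace is owned by the initial user namespace.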
* [PATCH 15/15] make kernel/signal.c user ns safe (v2)
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
                   ` (9 preceding siblings ...)
  2011-09-02 19:56 ` [PATCH 14/15] net: pass user_ns to cap_netlink_recv() Serge Hallyn
@ 2011-09-02 19:56 ` Serge Hallyn
  2011-09-13 14:43 ` user namespaces v3: continue targetting capabilities Serge E. Hallyn
  11 siblings, 0 replies; 69+ messages in thread
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap
  Cc: Serge Hallyn

From: Serge Hallyn <serge.hallyn@canonical.com>

Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
---
 kernel/signal.c |   26 ++++++++++++++++++++++----
 1 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 291c970..c07b970 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -27,6 +27,7 @@
 #include <linux/capability.h>
 #include <linux/freezer.h>
 #include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
 #include <linux/nsproxy.h>
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -1073,7 +1074,8 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 			q->info.si_code = SI_USER;
 			q->info.si_pid = task_tgid_nr_ns(current,
 							task_active_pid_ns(t));
-			q->info.si_uid = current_uid();
+			q->info.si_uid = user_ns_map_uid(task_cred_xxx(t, user_ns),
+						current_cred(), current_uid());
 			break;
 		case (unsigned long) SEND_SIG_PRIV:
 			q->info.si_signo = sig;
@@ -1363,6 +1365,11 @@ int kill_pid_info_as_uid(int sig, struct siginfo *info, struct pid *pid,
 		goto out_unlock;
 	}
 	pcred = __task_cred(p);
+	/*
+	 * this is called (only) from drivers/usb/core/devio.c.
+	 * Do we need to add user_ns to urb and usb_device, so
+	 * we can pass them in here?
+	 */
 	if (si_fromuser(info) &&
 	    euid != pcred->suid && euid != pcred->uid &&
 	    uid != pcred->suid && uid != pcred->uid) {
@@ -1618,7 +1625,8 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	 */
 	rcu_read_lock();
 	info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns);
-	info.si_uid = __task_cred(tsk)->uid;
+	info.si_uid = user_ns_map_uid(task_cred_xxx(tsk->parent, user_ns),
+				__task_cred(tsk), __task_cred(tsk)->uid);
 	rcu_read_unlock();
 
 	info.si_utime = cputime_to_clock_t(cputime_add(tsk->utime,
@@ -1688,6 +1696,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	unsigned long flags;
 	struct task_struct *parent;
 	struct sighand_struct *sighand;
+	const struct cred *cred;
 
 	if (for_ptracer) {
 		parent = tsk->parent;
@@ -1703,7 +1712,9 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	 */
 	rcu_read_lock();
 	info.si_pid = task_pid_nr_ns(tsk, parent->nsproxy->pid_ns);
-	info.si_uid = __task_cred(tsk)->uid;
+	cred = __task_cred(tsk);
+	info.si_uid = user_ns_map_uid(task_cred_xxx(parent, user_ns),
+			cred, cred->uid);
 	rcu_read_unlock();
 
 	info.si_utime = cputime_to_clock_t(tsk->utime);
@@ -2122,7 +2133,10 @@ static int ptrace_signal(int signr, siginfo_t *info,
 		info->si_errno = 0;
 		info->si_code = SI_USER;
 		info->si_pid = task_pid_vnr(current->parent);
-		info->si_uid = task_uid(current->parent);
+		/* we can cache cred if this performs poorly */
+		info->si_uid = user_ns_map_uid(current_user_ns(),
+				__task_cred(current->parent),
+				task_uid(current->parent));
 	}
 
 	/* If the (new) signal is now blocked, requeue it. */
@@ -2552,6 +2566,10 @@ SYSCALL_DEFINE2(rt_sigpending, sigset_t __user *, set, size_t, sigsetsize)
 
 #ifndef HAVE_ARCH_COPY_SIGINFO_TO_USER
 
+/*
+ * send_signal has converted the sender's uid to the receiver
+ * task user namespace, so no need to convert here
+ */
 int copy_siginfo_to_user(siginfo_t __user *to, siginfo_t *from)
 {
 	int err;
-- 
1.7.5.4
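Patch 15 routes every si_uid it fills in through user_ns_map_uid(), so the receiver sees the sender's uid as it appears in the receiver's user namespace. Below is a simplified userspace model of how that mapping is understood to have behaved in kernels of this era, stated as an assumption and not taken from this thread: the uid is unchanged within one namespace, it is 0 if the sender created the target namespace or one of its ancestors, and it is the overflow uid otherwise. The m_* names and the value 65534 are illustrative stand-ins.

```c
#include <assert.h>
#include <stddef.h>

#define M_OVERFLOWUID 65534u	/* assumed default overflow uid */

struct m_user_ns;

/* Hypothetical stand-in for a user (uid) within one user namespace. */
struct m_user {
	const struct m_user_ns *user_ns;	/* namespace this user belongs to */
	unsigned int uid;
};

/* Hypothetical stand-in for a user namespace. */
struct m_user_ns {
	const struct m_user *creator;	/* NULL for the initial namespace */
};

/* Map the sender's uid into the namespace "to" of the signal receiver. */
static unsigned int m_map_uid(const struct m_user_ns *to,
			      const struct m_user *sender)
{
	const struct m_user_ns *ns;

	assert(to != NULL && sender != NULL);
	if (to == sender->user_ns)
		return sender->uid;	/* same namespace: no translation */

	/* Did the sender create "to" or one of its ancestors? */
	for (ns = to; ns->creator != NULL; ns = ns->creator->user_ns) {
		if (ns->creator == sender)
			return 0;
	}
	return M_OVERFLOWUID;	/* no useful relationship */
}
```

Under these assumptions, a host user with uid 500 who created a container's user namespace shows up as uid 0 in siginfo delivered inside the container, while a signal travelling from the container to a task in the initial namespace carries the overflow uid.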
* Re: user namespaces v3: continue targetting capabilities
  2011-09-02 19:56 user namespaces v3: continue targetting capabilities Serge Hallyn
                   ` (10 preceding siblings ...)
  2011-09-02 19:56 ` [PATCH 15/15] make kernel/signal.c user ns safe (v2) Serge Hallyn
@ 2011-09-13 14:43 ` Serge E. Hallyn
  11 siblings, 0 replies; 69+ messages in thread
From: Serge E. Hallyn @ 2011-09-13 14:43 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells, ebiederm, rdunlap

I did a bit of basic performance testing - just running unixbench and
doing a kernel compile (without profiling) with and without this
patchset, with USER_NS enabled for both. I could find no meaningful
impact.

473.01user 32.48system 9:05.44elapsed 92%CPU (0avgtext+0avgdata 430752maxresident)k
112736inputs+576936outputs (8major+22057422minor)pagefaults 0swaps

473.78user 33.12system 9:06.14elapsed 92%CPU (0avgtext+0avgdata 430752maxresident)k
116656inputs+576936outputs (12major+22056621minor)pagefaults 0swaps

and with:

474.09user 31.62system 9:05.70elapsed 92%CPU (0avgtext+0avgdata 430752maxresident)k
112648inputs+576936outputs (7major+22056909minor)pagefaults 0swaps

472.54user 33.26system 9:05.43elapsed 92%CPU (0avgtext+0avgdata 430608maxresident)k
116656inputs+576936outputs (12major+22058358minor)pagefaults 0swaps

I'll append the full unixbench outputs below, but index score without
the patchset was 1594.3, and with the patchset was 1597.4.

thanks,
-serge

=====================================================================
unixbench without patchset:
=====================================================================
   BYTE UNIX Benchmarks (Version 5.1.3)

   System: marula: GNU/Linux
   OS: GNU/Linux -- 3.0.0-11-server -- #17-Ubuntu SMP Fri Sep 9 19:31:36 UTC 2011
   Machine: x86_64 (x86_64)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   CPU 0: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz (4800.3 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   02:43:44 up 3:00, 1 user, load average: 0.05, 0.03, 0.03; runlevel 2

------------------------------------------------------------------------
Benchmark Run: Mon Sep 12 2011 02:43:44 - 03:11:55
1 CPU in system; running 1 parallel copy of tests

Dhrystone 2 using register variables      28147322.1 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    3289.7 MWIPS (10.0 s, 7 samples)
Execl Throughput                              4557.5 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks      1145450.6 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks         312941.7 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks      1969030.8 KBps  (30.0 s, 2 samples)
Pipe Throughput                            2080076.5 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                331910.6 lps   (10.0 s, 7 samples)
Process Creation                             14921.7 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  6989.7 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   913.9 lpm   (60.0 s, 2 samples)
System Call Overhead                       3453367.4 lps   (10.0 s, 7 samples)

System Benchmarks Index Values            BASELINE      RESULT    INDEX
Dhrystone 2 using register variables      116700.0  28147322.1   2411.9
Double-Precision Whetstone                    55.0      3289.7    598.1
Execl Throughput                              43.0      4557.5   1059.9
File Copy 1024 bufsize 2000 maxblocks       3960.0   1145450.6   2892.6
File Copy 256 bufsize 500 maxblocks         1655.0    312941.7   1890.9
File Copy 4096 bufsize 8000 maxblocks       5800.0   1969030.8   3394.9
Pipe Throughput                            12440.0   2080076.5   1672.1
Pipe-based Context Switching                4000.0    331910.6    829.8
Process Creation                             126.0     14921.7   1184.3
Shell Scripts (1 concurrent)                  42.4      6989.7   1648.5
Shell Scripts (8 concurrent)                   6.0       913.9   1523.2
System Call Overhead                       15000.0   3453367.4   2302.2
                                                               ========
System Benchmarks Index Score                                    1594.3

=====================================================================
unixbench with patchset:
=====================================================================
   BYTE UNIX Benchmarks (Version 5.1.3)

   System: marula: GNU/Linux
   OS: GNU/Linux -- 3.0.0-11-server -- #17userns1 SMP Mon Sep 12 13:42:40 UTC 2011
   Machine: x86_64 (x86_64)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   CPU 0: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz (4799.6 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   12:42:07 up 8 min, 1 user, load average: 0.00, 0.01, 0.02; runlevel 2

------------------------------------------------------------------------
Benchmark Run: Mon Sep 12 2011 12:42:07 - 13:10:19
1 CPU in system; running 1 parallel copy of tests

Dhrystone 2 using register variables      28232156.4 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    3290.0 MWIPS (10.0 s, 7 samples)
Execl Throughput                              4553.7 lps   (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks      1142317.5 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks         317068.8 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks      1956611.4 KBps  (30.0 s, 2 samples)
Pipe Throughput                            2086728.8 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                343275.1 lps   (10.0 s, 7 samples)
Process Creation                             14718.6 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  6989.0 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   913.6 lpm   (60.0 s, 2 samples)
System Call Overhead                       3434956.0 lps   (10.0 s, 7 samples)

System Benchmarks Index Values            BASELINE      RESULT    INDEX
Dhrystone 2 using register variables      116700.0  28232156.4   2419.2
Double-Precision Whetstone                    55.0      3290.0    598.2
Execl Throughput                              43.0      4553.7   1059.0
File Copy 1024 bufsize 2000 maxblocks       3960.0   1142317.5   2884.6
File Copy 256 bufsize 500 maxblocks         1655.0    317068.8   1915.8
File Copy 4096 bufsize 8000 maxblocks       5800.0   1956611.4   3373.5
Pipe Throughput                            12440.0   2086728.8   1677.4
Pipe-based Context Switching                4000.0    343275.1    858.2
Process Creation                             126.0     14718.6   1168.1
Shell Scripts (1 concurrent)                  42.4      6989.0   1648.3
Shell Scripts (8 concurrent)                   6.0       913.6   1522.6
System Call Overhead                       15000.0   3434956.0   2290.0
                                                               ========
System Benchmarks Index Score                                    1597.4
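The two index scores quoted in this message can be cross-checked from the per-test INDEX columns. The sketch below assumes the overall unixbench score is the geometric mean of the twelve per-test index values, which the numbers in the tables are consistent with. The function and array names are made up for illustration, and the n-th root is found by bisection so that the example does not depend on libm.

```c
#include <assert.h>
#include <stddef.h>

/* Per-test INDEX values copied from the two unixbench tables. */
static const double index_without[12] = {
	2411.9, 598.1, 1059.9, 2892.6, 1890.9, 3394.9,
	1672.1, 829.8, 1184.3, 1648.5, 1523.2, 2302.2,
};
static const double index_with[12] = {
	2419.2, 598.2, 1059.0, 2884.6, 1915.8, 3373.5,
	1677.4, 858.2, 1168.1, 1648.3, 1522.6, 2290.0,
};

/* x raised to a non-negative integer power. */
static double int_pow(double x, size_t n)
{
	double r = 1.0;

	while (n--)
		r *= x;
	return r;
}

/* Geometric mean: the n-th root of the product, located by bisection. */
static double geometric_mean(const double *v, size_t n)
{
	double product = 1.0, lo = 0.0, hi = 0.0;
	size_t i;
	int iter;

	assert(v != NULL && n > 0);
	for (i = 0; i < n; i++) {
		product *= v[i];
		if (v[i] > hi)
			hi = v[i];	/* the mean cannot exceed the maximum */
	}
	for (iter = 0; iter < 200; iter++) {
		double mid = 0.5 * (lo + hi);

		if (int_pow(mid, n) < product)
			lo = mid;
		else
			hi = mid;
	}
	return 0.5 * (lo + hi);
}
```

Calling geometric_mean(index_without, 12) and geometric_mean(index_with, 12) reproduces the reported 1594.3 and 1597.4 to within rounding of the inputs. The difference between the two scores is about 0.2%, in line with the "no meaningful impact" conclusion.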