[PATCH v2] cgroup: allow management of subtrees by cgroup namespaces

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2] cgroup: allow management of subtrees by cgroup namespaces
@ 2016-05-01 13:41 ` Aleksa Sarai
  0 siblings, 0 replies; 17+ messages in thread
From: Aleksa Sarai @ 2016-05-01 13:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups, linux-kernel, dev, Aleksa Sarai, Aleksa Sarai

This is an updated version of v1 of this patchset[1], which properly
unlocks all of the relevant locks in the error path of copy_cgroup_ns().

[1]: https://lkml.org/lkml/2016/5/1/77

Aleksa Sarai (1):
  cgroup: allow management of subtrees by new cgroup namespaces

 kernel/cgroup.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 2 deletions(-)

-- 
2.8.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2] cgroup: allow management of subtrees by cgroup namespaces
@ 2016-05-01 13:41 ` Aleksa Sarai
  0 siblings, 0 replies; 17+ messages in thread
From: Aleksa Sarai @ 2016-05-01 13:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dev-IGmTWi+3HBZvNhPySn5qfx2eb7JE58TQ, Aleksa Sarai, Aleksa Sarai

This is an updated version of v1 of this patchset[1], which properly
unlocks all of the relevant locks in the error path of copy_cgroup_ns().

[1]: https://lkml.org/lkml/2016/5/1/77

Aleksa Sarai (1):
  cgroup: allow management of subtrees by new cgroup namespaces

 kernel/cgroup.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 2 deletions(-)

-- 
2.8.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
  2016-05-01 13:41 ` Aleksa Sarai
  (?)
@ 2016-05-01 13:41 ` Aleksa Sarai
  2016-05-02  9:32     ` Aleksa Sarai
  2016-05-02 22:00     ` James Bottomley
  -1 siblings, 2 replies; 17+ messages in thread
From: Aleksa Sarai @ 2016-05-01 13:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups, linux-kernel, dev, Aleksa Sarai, Aleksa Sarai

Allow an unprivileged processes to control subtrees of their associated
cgroup, a necessary feature if an unprivileged container (set up with an
unprivileged user namespace) wishes to take advantage of cgroups for its
own subprocesses.

Change the mode of the cgroup directory for each cgroup association,
allowing the process to create subtrees and modify the limits of the
subtrees *without* allowing the process to modify its own limits. Due to
the cgroup core restrictions and unix permission model, this allows for
processes to create new subtrees without breaking the cgroup limits for
the process.

In addition, this change doesn't add any odd or new functionality (it
essentially emulates a privileged user allowing a process to create
subtrees of its current cgroup association). This means that client code
can take advantage of this without being aware of the kernel change.

It should be noted that the mode changing isn't done when a process
attaches to an existing cgroup namespace, because the process which
created the cgroup namespace may have decided to disallow other
processes from modifying the subtrees it set up. Such a process can do
so by creating another cgroup namespace with a subtree it owns as the
root of that namespace (then changing the file mode such that only
sufficiently capable processes in the associated user namespace can
modify the subtree setup).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Cc: dev@opencontainers.org
---
 kernel/cgroup.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 909a7d3..8f944f3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3554,6 +3554,14 @@ static int cgroup_kn_set_ugid(struct kernfs_node *kn)
 	return kernfs_setattr(kn, &iattr);
 }
 
+static int cgroup_kn_set_mode(struct kernfs_node *kn, umode_t mode)
+{
+	struct iattr iattr = { .ia_valid = ATTR_MODE,
+			       .ia_mode = mode, };
+
+	return kernfs_setattr(kn, &iattr);
+}
+
 static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
 			   struct cftype *cft)
 {
@@ -6228,8 +6236,12 @@ struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
 					struct user_namespace *user_ns,
 					struct cgroup_namespace *old_ns)
 {
+	struct cgroup_subsys *ss;
 	struct cgroup_namespace *new_ns;
 	struct css_set *cset;
+	int ssid, err;
+	umode_t mode[CGROUP_SUBSYS_COUNT];
+	u16 updated_mask = 0;
 
 	BUG_ON(!old_ns);
 
@@ -6244,11 +6256,54 @@ struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
 
 	mutex_lock(&cgroup_mutex);
 	spin_lock_bh(&css_set_lock);
-
 	cset = task_css_set(current);
 	get_css_set(cset);
-
 	spin_unlock_bh(&css_set_lock);
+
+	/*
+	 * When creating a new cgroup namespace, we change the permissions of
+	 * the cgroup's directory to be a+w. This is necessary in order to
+	 * allow new cgroup namespaces to manage their own subtrees. This does
+	 * not allow for an escape from cgroup policy for three reasons:
+	 *
+	 * 1. cgroups are hierarchical, so any subtree must (at the very least)
+	 *    obey the original cgroup's restrictions.
+	 *
+	 * 2. The unix permission model for directories does not allow a user
+	 *    with write access to a directory to directly modify the dentries.
+	 *    While a user can unlink such files in a normal directory, in
+	 *    cgroupfs this is not allowed.
+	 *
+	 * 3. cgroup core doesn't allow tasks to be migrated by users that have
+	 *    write access to two subtrees unless they also have write access to
+	 *    the common ancestor of the two subtrees. Thus you cannot use a
+	 *    complicit process in less restrictive cgroup to overcome your own
+	 *    cgroup restriction.
+	 *
+	 * Therefore, we can safely change the mode of the cgroup without any
+	 * ill effects. We don't do this on cgroupns_install(), because the
+	 * owner of the cgroup may have decided to disallow modifications to
+	 * the hierarchy (which can be done by creating a nested cgroup
+	 * namespace in a cgroup you now own).
+	 */
+	rcu_read_lock();
+	for_each_subsys(ss, ssid) {
+		struct kernfs_node *kn = cset->subsys[ssid]->cgroup->kn;
+
+		kernfs_get(kn);
+		kernfs_break_active_protection(kn);
+		mode[ssid] = kn->mode;
+		err = cgroup_kn_set_mode(kn, mode[ssid] |
+					     (S_IROTH | S_IWOTH | S_IXOTH));
+		kernfs_unbreak_active_protection(kn);
+		kernfs_put(kn);
+		if (err)
+			goto err_unset_mode;
+
+		updated_mask |= 1 << ssid;
+	}
+	rcu_read_unlock();
+
 	mutex_unlock(&cgroup_mutex);
 
 	new_ns = alloc_cgroup_ns();
@@ -6261,6 +6316,21 @@ struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
 	new_ns->root_cset = cset;
 
 	return new_ns;
+
+err_unset_mode:
+	/* Clean up the mode changes. */
+	do_each_subsys_mask(ss, ssid, updated_mask) {
+		struct kernfs_node *kn = cset->subsys[ssid]->cgroup->kn;
+
+		kernfs_break_active_protection(kn);
+		cgroup_kn_set_mode(kn, mode[ssid]);
+		kernfs_unbreak_active_protection(kn);
+	} while_each_subsys_mask();
+
+	rcu_read_unlock();
+	mutex_unlock(&cgroup_mutex);
+
+	return ERR_PTR(err);
 }
 
 static inline struct cgroup_namespace *to_cg_ns(struct ns_common *ns)
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
@ 2016-05-02  9:32     ` Aleksa Sarai
  0 siblings, 0 replies; 17+ messages in thread
From: Aleksa Sarai @ 2016-05-02  9:32 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups, linux-kernel, dev, Aleksa Sarai

> +	 * 3. cgroup core doesn't allow tasks to be migrated by users that have
> +	 *    write access to two subtrees unless they also have write access to
> +	 *    the common ancestor of the two subtrees. Thus you cannot use a
> +	 *    complicit process in less restrictive cgroup to overcome your own
> +	 *    cgroup restriction.

It appears this restriction isn't actually being applied on cgroupv1. 
I'll send an updated patch which makes sure the cgroup.proc common 
ancestor restriction is enforced for all hierarchies.

-- 
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
@ 2016-05-02  9:32     ` Aleksa Sarai
  0 siblings, 0 replies; 17+ messages in thread
From: Aleksa Sarai @ 2016-05-02  9:32 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dev-IGmTWi+3HBZvNhPySn5qfx2eb7JE58TQ, Aleksa Sarai

> +	 * 3. cgroup core doesn't allow tasks to be migrated by users that have
> +	 *    write access to two subtrees unless they also have write access to
> +	 *    the common ancestor of the two subtrees. Thus you cannot use a
> +	 *    complicit process in less restrictive cgroup to overcome your own
> +	 *    cgroup restriction.

It appears this restriction isn't actually being applied on cgroupv1. 
I'll send an updated patch which makes sure the cgroup.proc common 
ancestor restriction is enforced for all hierarchies.

-- 
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
@ 2016-05-02 22:00     ` James Bottomley
  0 siblings, 0 replies; 17+ messages in thread
From: James Bottomley @ 2016-05-02 22:00 UTC (permalink / raw)
  To: Aleksa Sarai, Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups, linux-kernel, dev, Aleksa Sarai

On Sun, 2016-05-01 at 23:41 +1000, Aleksa Sarai wrote:
> Allow an unprivileged processes to control subtrees of their 
> associated cgroup, a necessary feature if an unprivileged container 
> (set up with an unprivileged user namespace) wishes to take advantage 
> of cgroups for its own subprocesses.
> 
> Change the mode of the cgroup directory for each cgroup association,
> allowing the process to create subtrees and modify the limits of the
> subtrees *without* allowing the process to modify its own limits. Due 
> to the cgroup core restrictions and unix permission model, this 
> allows for processes to create new subtrees without breaking the 
> cgroup limits for the process.

Actually, that's not really what this patch does.  If you unshare
without having created any cgroups, it sets the other permission of the
entire top level hierarchy to o+rwx:

jejb@jarvis:~> unshare -r --cgroup
root@jarvis:~# ls -l /sys/fs/cgroup/
total 0
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 blkio/
lrwxrwxrwx 1 nobody nogroup 11 May  2 14:50 cpu -> cpu,cpuacct/
lrwxrwxrwx 1 nobody nogroup 11 May  2 14:50 cpuacct -> cpu,cpuacct/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 cpu,cpuacct/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 cpuset/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 devices/
dr-xr-xrwx 3 nobody nogroup  0 May  2 15:04 freezer/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 hugetlb/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 memory/
lrwxrwxrwx 1 nobody nogroup 16 May  2 14:50 net_cls -> net_cls,net_prio/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 net_cls,net_prio/
lrwxrwxrwx 1 nobody nogroup 16 May  2 14:50 net_prio -> net_cls,net_prio/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 perf_evesnt/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 pids/
dr-xr-xr-x 4 nobody nogroup  0 May  2 14:50 systemd/

But this is visible even to an unprivileged user not in any namespace,
meaning that now any user may create a cgroup control directory:

jejb@jarvis:~> ls -l /sys/fs/cgroup/
total 0
dr-xr-xrwx 2 root root  0 May  2 14:50 blkio/
lrwxrwxrwx 1 root root 11 May  2 14:50 cpu -> cpu,cpuacct/
lrwxrwxrwx 1 root root 11 May  2 14:50 cpuacct -> cpu,cpuacct/
dr-xr-xrwx 2 root root  0 May  2 14:50 cpu,cpuacct/
dr-xr-xrwx 2 root root  0 May  2 14:50 cpuset/
dr-xr-xrwx 2 root root  0 May  2 14:50 devices/
dr-xr-xrwx 3 root root  0 May  2 15:04 freezer/
dr-xr-xrwx 2 root root  0 May  2 14:50 hugetlb/
dr-xr-xrwx 2 root root  0 May  2 14:50 memory/
lrwxrwxrwx 1 root root 16 May  2 14:50 net_cls -> net_cls,net_prio/
dr-xr-xrwx 2 root root  0 May  2 14:50 net_cls,net_prio/
lrwxrwxrwx 1 root root 16 May  2 14:50 net_prio -> net_cls,net_prio/
dr-xr-xrwx 2 root root  0 May  2 14:50 perf_event/
dr-xr-xrwx 2 root root  0 May  2 14:50 pids/
dr-xr-xr-x 4 root root  0 May  2 14:50 systemd/

ironically, this now makes the root group a permission denier (at least
for my distribution), because if I were in the root group (and not
root), the r-x on the group would rule the rwx on other ... I really
don't think that sounds correct.

Perhaps what you should to be arguing then that the default permissions
of the cgroup directories need to be all rwx for everyone and then your
patch becomes unnecessary?

Alternatively, if the desire is fully to virtualize /sys/fs/cgroups,
then I think we have to decide how that would happen.  I think the
default requirements would be that a pid namespace be established (so
only the tasks in that pid namespace would be able to be controlled by
the cgroup namespace.  That, I think requires that any given cgroup
namespace "own" a pid namespace (being the one present when it was
created) but that it only gets a new virtual set of directories owned
by the userns owner if there's a pid namespace established for the
cgroup and cgroup->user_ns == pid_ns->user_ns (meaning we established a
user ns then a pid one then a cgroup one, so it's now safe to treat
root in the user_ns as owning the virtualized cgroup directories).  We
could do this in the same way that proc gets virtualized after
remounting (in a new mount namespace) on fork into a pid namespace.

James

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
@ 2016-05-02 22:00     ` James Bottomley
  0 siblings, 0 replies; 17+ messages in thread
From: James Bottomley @ 2016-05-02 22:00 UTC (permalink / raw)
  To: Aleksa Sarai, Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dev-IGmTWi+3HBZvNhPySn5qfx2eb7JE58TQ, Aleksa Sarai

On Sun, 2016-05-01 at 23:41 +1000, Aleksa Sarai wrote:
> Allow an unprivileged processes to control subtrees of their 
> associated cgroup, a necessary feature if an unprivileged container 
> (set up with an unprivileged user namespace) wishes to take advantage 
> of cgroups for its own subprocesses.
> 
> Change the mode of the cgroup directory for each cgroup association,
> allowing the process to create subtrees and modify the limits of the
> subtrees *without* allowing the process to modify its own limits. Due 
> to the cgroup core restrictions and unix permission model, this 
> allows for processes to create new subtrees without breaking the 
> cgroup limits for the process.

Actually, that's not really what this patch does.  If you unshare
without having created any cgroups, it sets the other permission of the
entire top level hierarchy to o+rwx:

jejb@jarvis:~> unshare -r --cgroup
root@jarvis:~# ls -l /sys/fs/cgroup/
total 0
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 blkio/
lrwxrwxrwx 1 nobody nogroup 11 May  2 14:50 cpu -> cpu,cpuacct/
lrwxrwxrwx 1 nobody nogroup 11 May  2 14:50 cpuacct -> cpu,cpuacct/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 cpu,cpuacct/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 cpuset/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 devices/
dr-xr-xrwx 3 nobody nogroup  0 May  2 15:04 freezer/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 hugetlb/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 memory/
lrwxrwxrwx 1 nobody nogroup 16 May  2 14:50 net_cls -> net_cls,net_prio/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 net_cls,net_prio/
lrwxrwxrwx 1 nobody nogroup 16 May  2 14:50 net_prio -> net_cls,net_prio/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 perf_evesnt/
dr-xr-xrwx 2 nobody nogroup  0 May  2 14:50 pids/
dr-xr-xr-x 4 nobody nogroup  0 May  2 14:50 systemd/

But this is visible even to an unprivileged user not in any namespace,
meaning that now any user may create a cgroup control directory:

jejb@jarvis:~> ls -l /sys/fs/cgroup/
total 0
dr-xr-xrwx 2 root root  0 May  2 14:50 blkio/
lrwxrwxrwx 1 root root 11 May  2 14:50 cpu -> cpu,cpuacct/
lrwxrwxrwx 1 root root 11 May  2 14:50 cpuacct -> cpu,cpuacct/
dr-xr-xrwx 2 root root  0 May  2 14:50 cpu,cpuacct/
dr-xr-xrwx 2 root root  0 May  2 14:50 cpuset/
dr-xr-xrwx 2 root root  0 May  2 14:50 devices/
dr-xr-xrwx 3 root root  0 May  2 15:04 freezer/
dr-xr-xrwx 2 root root  0 May  2 14:50 hugetlb/
dr-xr-xrwx 2 root root  0 May  2 14:50 memory/
lrwxrwxrwx 1 root root 16 May  2 14:50 net_cls -> net_cls,net_prio/
dr-xr-xrwx 2 root root  0 May  2 14:50 net_cls,net_prio/
lrwxrwxrwx 1 root root 16 May  2 14:50 net_prio -> net_cls,net_prio/
dr-xr-xrwx 2 root root  0 May  2 14:50 perf_event/
dr-xr-xrwx 2 root root  0 May  2 14:50 pids/
dr-xr-xr-x 4 root root  0 May  2 14:50 systemd/

ironically, this now makes the root group a permission denier (at least
for my distribution), because if I were in the root group (and not
root), the r-x on the group would rule the rwx on other ... I really
don't think that sounds correct.

Perhaps what you should to be arguing then that the default permissions
of the cgroup directories need to be all rwx for everyone and then your
patch becomes unnecessary?

Alternatively, if the desire is fully to virtualize /sys/fs/cgroups,
then I think we have to decide how that would happen.  I think the
default requirements would be that a pid namespace be established (so
only the tasks in that pid namespace would be able to be controlled by
the cgroup namespace.  That, I think requires that any given cgroup
namespace "own" a pid namespace (being the one present when it was
created) but that it only gets a new virtual set of directories owned
by the userns owner if there's a pid namespace established for the
cgroup and cgroup->user_ns == pid_ns->user_ns (meaning we established a
user ns then a pid one then a cgroup one, so it's now safe to treat
root in the user_ns as owning the virtualized cgroup directories).  We
could do this in the same way that proc gets virtualized after
remounting (in a new mount namespace) on fork into a pid namespace.

James

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
@ 2016-05-03  1:59       ` Aleksa Sarai
  0 siblings, 0 replies; 17+ messages in thread
From: Aleksa Sarai @ 2016-05-03  1:59 UTC (permalink / raw)
  To: James Bottomley, Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups, linux-kernel, dev, Aleksa Sarai

>> Change the mode of the cgroup directory for each cgroup association,
>> allowing the process to create subtrees and modify the limits of the
>> subtrees *without* allowing the process to modify its own limits. Due
>> to the cgroup core restrictions and unix permission model, this
>> allows for processes to create new subtrees without breaking the
>> cgroup limits for the process.
>
> Actually, that's not really what this patch does.  If you unshare
> without having created any cgroups, it sets the other permission of the
> entire top level hierarchy to o+rwx:

While that is odd, it makes sense (because that's the "current cgroup" 
you are in). But I agree with your point that this patch is less than ideal.

> ironically, this now makes the root group a permission denier (at least
> for my distribution), because if I were in the root group (and not
> root), the r-x on the group would rule the rwx on other ... I really
> don't think that sounds correct.

You're right, that's odd. I'm confused why your root cgroups have u-w 
though.

>
> Perhaps what you should to be arguing then that the default permissions
> of the cgroup directories need to be all rwx for everyone and then your
> patch becomes unnecessary?

I don't think that would be the nicest way of dealing with this (then a 
process can make very large numbers of cgroups all over the tree, which 
might not cause huge issues but would still be a pain for administrators 
and systemds alike).

> Alternatively, if the desire is fully to virtualize /sys/fs/cgroups,
> then I think we have to decide how that would happen.  I think the
> default requirements would be that a pid namespace be established (so
> only the tasks in that pid namespace would be able to be controlled by
> the cgroup namespace.  That, I think requires that any given cgroup
> namespace "own" a pid namespace (being the one present when it was
> created) but that it only gets a new virtual set of directories owned
> by the userns owner if there's a pid namespace established for the
> cgroup and cgroup->user_ns == pid_ns->user_ns (meaning we established a
> user ns then a pid one then a cgroup one, so it's now safe to treat
> root in the user_ns as owning the virtualized cgroup directories).

I know this is probably a stupid question, but why couldn't we just 
compare the user_ns with the tcred->user_ns? Or are you worried about a 
process in a cgroup namespace moving processes to a subtree that isn't 
in the same pid namespace (even though they're in the same user 
namespace)? I don't mind implementing that this way (although we'd have 
to change a bunch of the checks with pid_ns to use the 
cgroup_ns->pid_ns), I'm just wondering if it's necessary.

> We could do this in the same way that proc gets virtualized after
> remounting (in a new mount namespace) on fork into a pid namespace.

I actually really like this idea. I'll get to work on it.

-- 
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
@ 2016-05-03  1:59       ` Aleksa Sarai
  0 siblings, 0 replies; 17+ messages in thread
From: Aleksa Sarai @ 2016-05-03  1:59 UTC (permalink / raw)
  To: James Bottomley, Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dev-IGmTWi+3HBZvNhPySn5qfx2eb7JE58TQ, Aleksa Sarai

>> Change the mode of the cgroup directory for each cgroup association,
>> allowing the process to create subtrees and modify the limits of the
>> subtrees *without* allowing the process to modify its own limits. Due
>> to the cgroup core restrictions and unix permission model, this
>> allows for processes to create new subtrees without breaking the
>> cgroup limits for the process.
>
> Actually, that's not really what this patch does.  If you unshare
> without having created any cgroups, it sets the other permission of the
> entire top level hierarchy to o+rwx:

While that is odd, it makes sense (because that's the "current cgroup" 
you are in). But I agree with your point that this patch is less than ideal.

> ironically, this now makes the root group a permission denier (at least
> for my distribution), because if I were in the root group (and not
> root), the r-x on the group would rule the rwx on other ... I really
> don't think that sounds correct.

You're right, that's odd. I'm confused why your root cgroups have u-w 
though.

>
> Perhaps what you should to be arguing then that the default permissions
> of the cgroup directories need to be all rwx for everyone and then your
> patch becomes unnecessary?

I don't think that would be the nicest way of dealing with this (then a 
process can make very large numbers of cgroups all over the tree, which 
might not cause huge issues but would still be a pain for administrators 
and systemds alike).

> Alternatively, if the desire is fully to virtualize /sys/fs/cgroups,
> then I think we have to decide how that would happen.  I think the
> default requirements would be that a pid namespace be established (so
> only the tasks in that pid namespace would be able to be controlled by
> the cgroup namespace.  That, I think requires that any given cgroup
> namespace "own" a pid namespace (being the one present when it was
> created) but that it only gets a new virtual set of directories owned
> by the userns owner if there's a pid namespace established for the
> cgroup and cgroup->user_ns == pid_ns->user_ns (meaning we established a
> user ns then a pid one then a cgroup one, so it's now safe to treat
> root in the user_ns as owning the virtualized cgroup directories).

I know this is probably a stupid question, but why couldn't we just 
compare the user_ns with the tcred->user_ns? Or are you worried about a 
process in a cgroup namespace moving processes to a subtree that isn't 
in the same pid namespace (even though they're in the same user 
namespace)? I don't mind implementing that this way (although we'd have 
to change a bunch of the checks with pid_ns to use the 
cgroup_ns->pid_ns), I'm just wondering if it's necessary.

> We could do this in the same way that proc gets virtualized after
> remounting (in a new mount namespace) on fork into a pid namespace.

I actually really like this idea. I'll get to work on it.

-- 
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
@ 2016-05-03  2:26         ` James Bottomley
  0 siblings, 0 replies; 17+ messages in thread
From: James Bottomley @ 2016-05-03  2:26 UTC (permalink / raw)
  To: Aleksa Sarai, Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups, linux-kernel, dev, Aleksa Sarai

On Tue, 2016-05-03 at 11:59 +1000, Aleksa Sarai wrote:
> > > Change the mode of the cgroup directory for each cgroup
> > > association,
> > > allowing the process to create subtrees and modify the limits of
> > > the
> > > subtrees *without* allowing the process to modify its own limits.
> > > Due
> > > to the cgroup core restrictions and unix permission model, this
> > > allows for processes to create new subtrees without breaking the
> > > cgroup limits for the process.
> > 
> > Actually, that's not really what this patch does.  If you unshare
> > without having created any cgroups, it sets the other permission of 
> > the entire top level hierarchy to o+rwx:
> 
> While that is odd, it makes sense (because that's the "current 
> cgroup" you are in). But I agree with your point that this patch is 
> less than ideal.
> 
> > ironically, this now makes the root group a permission denier (at 
> > least for my distribution), because if I were in the root group 
> > (and not root), the r-x on the group would rule the rwx on other 
> > ... I really
> > don't think that sounds correct.
> 
> You're right, that's odd. I'm confused why your root cgroups have u-w
> though.

I've never bothered to inquire.  This is openSUSE Leap 42.1, so it's
either something systemd or something suse.

> > Perhaps what you should to be arguing then that the default 
> > permissions of the cgroup directories need to be all rwx for 
> > everyone and then your patch becomes unnecessary?
> 
> I don't think that would be the nicest way of dealing with this (then 
> a process can make very large numbers of cgroups all over the tree,
> which might not cause huge issues but would still be a pain for
> administrators and systemds alike).

Beware of what you cite as a problem.  Any user can enter a user
namespace and then unshare a cgroup namespace.  This means that what
you seem to want is equivalent to any user at all being able to create
a cgroup hierarchy.  This means that either it is a problem, and the
cgroup namespace will have to be restricted in some way over how it can
create subordinate cgroups or it's not a problem and we might as well
just see what happens if any old user can do it.

> > Alternatively, if the desire is fully to virtualize /sys/fs/cgroups
> > , then I think we have to decide how that would happen.  I think 
> > the default requirements would be that a pid namespace be 
> > established (so only the tasks in that pid namespace would be able 
> > to be controlled by the cgroup namespace.  That, I think requires 
> > that any given cgroup namespace "own" a pid namespace (being the 
> > one present when it was created) but that it only gets a new 
> > virtual set of directories owned by the userns owner if there's a 
> > pid namespace established for the cgroup and cgroup->user_ns == 
> > pid_ns->user_ns (meaning we established a user ns then a pid one 
> > then a cgroup one, so it's now safe to treat root in the user_ns as
> > owning the virtualized cgroup directories).
> 
> I know this is probably a stupid question, but why couldn't we just 
> compare the user_ns with the tcred->user_ns?

If any old user namespace can unshare a cgroup namespace and manipulate
the tree, then that condition is just fine.  If we're going to require
they have to create a pid namespace as well, then you need a more
elaborate condition.

>  Or are you worried about a process in a cgroup namespace moving 
> processes to a subtree that isn't in the same pid namespace (even 
> though they're in the same user namespace)?

The corner case I'm worrying about is what happens to a process owned
by the user that gets moved by the administrator to a more confining
cgroup after the establishment of the cgroup namespace?  If we allow
too much capability to the user_ns->owner, then they could just take it
out again.  The semantics of who can do what after the namespace is
established seem to need better definition.  One answer might be that
after the cgroup namespace is established, the real admin can't safely
move the processes, which is why they should be better confined (say
within a pid namespace) so it's not *all* processes owned by this user
that can escape control, merely ones that the user has declared a
desire to control the cgroups for).

James

>  I don't mind implementing that this way (although we'd have to
> change a bunch of the checks with pid_ns to use the cgroup_ns
> ->pid_ns), I'm just wondering if it's necessary.
> 
> > We could do this in the same way that proc gets virtualized after
> > remounting (in a new mount namespace) on fork into a pid namespace.
> 
> I actually really like this idea. I'll get to work on it.
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
@ 2016-05-03  2:26         ` James Bottomley
  0 siblings, 0 replies; 17+ messages in thread
From: James Bottomley @ 2016-05-03  2:26 UTC (permalink / raw)
  To: Aleksa Sarai, Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dev-IGmTWi+3HBZvNhPySn5qfx2eb7JE58TQ, Aleksa Sarai

On Tue, 2016-05-03 at 11:59 +1000, Aleksa Sarai wrote:
> > > Change the mode of the cgroup directory for each cgroup
> > > association,
> > > allowing the process to create subtrees and modify the limits of
> > > the
> > > subtrees *without* allowing the process to modify its own limits.
> > > Due
> > > to the cgroup core restrictions and unix permission model, this
> > > allows for processes to create new subtrees without breaking the
> > > cgroup limits for the process.
> > 
> > Actually, that's not really what this patch does.  If you unshare
> > without having created any cgroups, it sets the other permission of 
> > the entire top level hierarchy to o+rwx:
> 
> While that is odd, it makes sense (because that's the "current 
> cgroup" you are in). But I agree with your point that this patch is 
> less than ideal.
> 
> > ironically, this now makes the root group a permission denier (at 
> > least for my distribution), because if I were in the root group 
> > (and not root), the r-x on the group would rule the rwx on other 
> > ... I really
> > don't think that sounds correct.
> 
> You're right, that's odd. I'm confused why your root cgroups have u-w
> though.

I've never bothered to inquire.  This is openSUSE Leap 42.1, so it's
either something systemd or something suse.

> > Perhaps what you should to be arguing then that the default 
> > permissions of the cgroup directories need to be all rwx for 
> > everyone and then your patch becomes unnecessary?
> 
> I don't think that would be the nicest way of dealing with this (then 
> a process can make very large numbers of cgroups all over the tree,
> which might not cause huge issues but would still be a pain for
> administrators and systemds alike).

Beware of what you cite as a problem.  Any user can enter a user
namespace and then unshare a cgroup namespace.  This means that what
you seem to want is equivalent to any user at all being able to create
a cgroup hierarchy.  This means that either it is a problem, and the
cgroup namespace will have to be restricted in some way over how it can
create subordinate cgroups or it's not a problem and we might as well
just see what happens if any old user can do it.

> > Alternatively, if the desire is fully to virtualize /sys/fs/cgroups
> > , then I think we have to decide how that would happen.  I think 
> > the default requirements would be that a pid namespace be 
> > established (so only the tasks in that pid namespace would be able 
> > to be controlled by the cgroup namespace.  That, I think requires 
> > that any given cgroup namespace "own" a pid namespace (being the 
> > one present when it was created) but that it only gets a new 
> > virtual set of directories owned by the userns owner if there's a 
> > pid namespace established for the cgroup and cgroup->user_ns == 
> > pid_ns->user_ns (meaning we established a user ns then a pid one 
> > then a cgroup one, so it's now safe to treat root in the user_ns as
> > owning the virtualized cgroup directories).
> 
> I know this is probably a stupid question, but why couldn't we just 
> compare the user_ns with the tcred->user_ns?

If any old user namespace can unshare a cgroup namespace and manipulate
the tree, then that condition is just fine.  If we're going to require
they have to create a pid namespace as well, then you need a more
elaborate condition.

>  Or are you worried about a process in a cgroup namespace moving 
> processes to a subtree that isn't in the same pid namespace (even 
> though they're in the same user namespace)?

The corner case I'm worrying about is what happens to a process owned
by the user that gets moved by the administrator to a more confining
cgroup after the establishment of the cgroup namespace?  If we allow
too much capability to the user_ns->owner, then they could just take it
out again.  The semantics of who can do what after the namespace is
established seem to need better definition.  One answer might be that
after the cgroup namespace is established, the real admin can't safely
move the processes, which is why they should be better confined (say
within a pid namespace) so it's not *all* processes owned by this user
that can escape control, merely ones that the user has declared a
desire to control the cgroups for).

James


>  I don't mind implementing that this way (although we'd have to
> change a bunch of the checks with pid_ns to use the cgroup_ns
> ->pid_ns), I'm just wondering if it's necessary.
> 
> > We could do this in the same way that proc gets virtualized after
> > remounting (in a new mount namespace) on fork into a pid namespace.
> 
> I actually really like this idea. I'll get to work on it.
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
@ 2016-05-03  6:48           ` Aleksa Sarai
  0 siblings, 0 replies; 17+ messages in thread
From: Aleksa Sarai @ 2016-05-03  6:48 UTC (permalink / raw)
  To: James Bottomley, Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups, linux-kernel, dev, Aleksa Sarai

>>> Perhaps what you should to be arguing then that the default
>>> permissions of the cgroup directories need to be all rwx for
>>> everyone and then your patch becomes unnecessary?
>>
>> I don't think that would be the nicest way of dealing with this (then
>> a process can make very large numbers of cgroups all over the tree,
>> which might not cause huge issues but would still be a pain for
>> administrators and systemds alike).
>
> Beware of what you cite as a problem.  Any user can enter a user
> namespace and then unshare a cgroup namespace.  This means that what
> you seem to want is equivalent to any user at all being able to create
> a cgroup hierarchy.

They should only be allowed to make subtrees of the cgroup *they 
currently reside in* IMO. Making the hierarchies chmod(0777) would allow 
any process in any hierarchy to create cgroups anywhere in the tree. It 
would just make management within cgroupv2 much harder (especially with 
the no internal process semantics of cgroupv2). This wouldn't happen 
with the cgroup namespace.

It should be noted that cgroupv1 doesn't have the same protection I 
outline below, so this would actually cause cgroup escapes in that case 
(which we should obviously avoid).

> [...] This means that either it is a problem, and the
> cgroup namespace will have to be restricted in some way over how it can
> create subordinate cgroups or it's not a problem and we might as well
> just see what happens if any old user can do it.

See above. The only restriction I think is necessary is that the process 
can only create subtrees of the current cgroup it is in.

>>> Alternatively, if the desire is fully to virtualize /sys/fs/cgroups
>>> , then I think we have to decide how that would happen.  I think
>>> the default requirements would be that a pid namespace be
>>> established (so only the tasks in that pid namespace would be able
>>> to be controlled by the cgroup namespace.  That, I think requires
>>> that any given cgroup namespace "own" a pid namespace (being the
>>> one present when it was created) but that it only gets a new
>>> virtual set of directories owned by the userns owner if there's a
>>> pid namespace established for the cgroup and cgroup->user_ns ==
>>> pid_ns->user_ns (meaning we established a user ns then a pid one
>>> then a cgroup one, so it's now safe to treat root in the user_ns as
>>> owning the virtualized cgroup directories).
>>
>> I know this is probably a stupid question, but why couldn't we just
>> compare the user_ns with the tcred->user_ns?
>
> If any old user namespace can unshare a cgroup namespace and manipulate
> the tree, then that condition is just fine.  If we're going to require
> they have to create a pid namespace as well, then you need a more
> elaborate condition.

Well, I guess my question was more like "if we don't require the pid 
namespace pinning, what bad things will happen?". You've described the 
corner case, and I'm not sure it's a problem for cgroupv2. It is a 
problem for cgroupv1 (unfortunately), due to backwards compatibility 
reasons. So, we have to decide whether we are going to add a restriction 
for cgroup namespaces, so that this functionality can be implemented for 
both versions -- or should we only implement the minimal version (which 
would only work on cgroupv2).

If we decide to implement both, we have to agree on the restrictions 
*immediately* because the cgroup namespace was merged in 4.6-rc1 so 
changing the restrictions on it in 4.7 would probably be frowned upon.

>>   Or are you worried about a process in a cgroup namespace moving
>> processes to a subtree that isn't in the same pid namespace (even
>> though they're in the same user namespace)?
>
> The corner case I'm worrying about is what happens to a process owned
> by the user that gets moved by the administrator to a more confining
> cgroup after the establishment of the cgroup namespace?  If we allow
> too much capability to the user_ns->owner, then they could just take it
> out again.  The semantics of who can do what after the namespace is
> established seem to need better definition.  One answer might be that
> after the cgroup namespace is established, the real admin can't safely
> move the processes, which is why they should be better confined (say
> within a pid namespace) so it's not *all* processes owned by this user
> that can escape control, merely ones that the user has declared a
> desire to control the cgroups for).

My thinking was that rename(2) would make this a simple decision, but I 
just realised that rename(2) doesn't let you change the hierarchy. But 
it should be noted that cgroupv2 has a fix for this: you can't move a 
task to another cgroup unless you have attach rights (cgroup.procs) to 
the common ancestor of the current cgroup and the target cgroup.

What this means is that you can only "fight the administrator" in the 
case that the admin has decided to move you inside the subtree of the 
cgroup namespace you are in. Itherwise, you can't move back to your old 
cgroup. Of course, it means that the cgroup namespace is now broken -- 
but that's what the admin wanted to do. I don't think that should be a 
problem.

Since this isn't available in cgroupv1, there's a question about whether 
this functionality should be allowed for cgroupv1. Since cgroupv2 still 
doesn't support all of the controllers needed for many container 
runtimes to work, I don't think we should not implement this for 
cgroupv1. But that's just my opinion.

-- 
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
@ 2016-05-03  6:48           ` Aleksa Sarai
  0 siblings, 0 replies; 17+ messages in thread
From: Aleksa Sarai @ 2016-05-03  6:48 UTC (permalink / raw)
  To: James Bottomley, Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dev-IGmTWi+3HBZvNhPySn5qfx2eb7JE58TQ, Aleksa Sarai

>>> Perhaps what you should to be arguing then that the default
>>> permissions of the cgroup directories need to be all rwx for
>>> everyone and then your patch becomes unnecessary?
>>
>> I don't think that would be the nicest way of dealing with this (then
>> a process can make very large numbers of cgroups all over the tree,
>> which might not cause huge issues but would still be a pain for
>> administrators and systemds alike).
>
> Beware of what you cite as a problem.  Any user can enter a user
> namespace and then unshare a cgroup namespace.  This means that what
> you seem to want is equivalent to any user at all being able to create
> a cgroup hierarchy.

They should only be allowed to make subtrees of the cgroup *they 
currently reside in* IMO. Making the hierarchies chmod(0777) would allow 
any process in any hierarchy to create cgroups anywhere in the tree. It 
would just make management within cgroupv2 much harder (especially with 
the no internal process semantics of cgroupv2). This wouldn't happen 
with the cgroup namespace.

It should be noted that cgroupv1 doesn't have the same protection I 
outline below, so this would actually cause cgroup escapes in that case 
(which we should obviously avoid).

> [...] This means that either it is a problem, and the
> cgroup namespace will have to be restricted in some way over how it can
> create subordinate cgroups or it's not a problem and we might as well
> just see what happens if any old user can do it.

See above. The only restriction I think is necessary is that the process 
can only create subtrees of the current cgroup it is in.

>>> Alternatively, if the desire is fully to virtualize /sys/fs/cgroups
>>> , then I think we have to decide how that would happen.  I think
>>> the default requirements would be that a pid namespace be
>>> established (so only the tasks in that pid namespace would be able
>>> to be controlled by the cgroup namespace.  That, I think requires
>>> that any given cgroup namespace "own" a pid namespace (being the
>>> one present when it was created) but that it only gets a new
>>> virtual set of directories owned by the userns owner if there's a
>>> pid namespace established for the cgroup and cgroup->user_ns ==
>>> pid_ns->user_ns (meaning we established a user ns then a pid one
>>> then a cgroup one, so it's now safe to treat root in the user_ns as
>>> owning the virtualized cgroup directories).
>>
>> I know this is probably a stupid question, but why couldn't we just
>> compare the user_ns with the tcred->user_ns?
>
> If any old user namespace can unshare a cgroup namespace and manipulate
> the tree, then that condition is just fine.  If we're going to require
> they have to create a pid namespace as well, then you need a more
> elaborate condition.

Well, I guess my question was more like "if we don't require the pid 
namespace pinning, what bad things will happen?". You've described the 
corner case, and I'm not sure it's a problem for cgroupv2. It is a 
problem for cgroupv1 (unfortunately), due to backwards compatibility 
reasons. So, we have to decide whether we are going to add a restriction 
for cgroup namespaces, so that this functionality can be implemented for 
both versions -- or should we only implement the minimal version (which 
would only work on cgroupv2).

If we decide to implement both, we have to agree on the restrictions 
*immediately* because the cgroup namespace was merged in 4.6-rc1 so 
changing the restrictions on it in 4.7 would probably be frowned upon.

>>   Or are you worried about a process in a cgroup namespace moving
>> processes to a subtree that isn't in the same pid namespace (even
>> though they're in the same user namespace)?
>
> The corner case I'm worrying about is what happens to a process owned
> by the user that gets moved by the administrator to a more confining
> cgroup after the establishment of the cgroup namespace?  If we allow
> too much capability to the user_ns->owner, then they could just take it
> out again.  The semantics of who can do what after the namespace is
> established seem to need better definition.  One answer might be that
> after the cgroup namespace is established, the real admin can't safely
> move the processes, which is why they should be better confined (say
> within a pid namespace) so it's not *all* processes owned by this user
> that can escape control, merely ones that the user has declared a
> desire to control the cgroups for).

My thinking was that rename(2) would make this a simple decision, but I 
just realised that rename(2) doesn't let you change the hierarchy. But 
it should be noted that cgroupv2 has a fix for this: you can't move a 
task to another cgroup unless you have attach rights (cgroup.procs) to 
the common ancestor of the current cgroup and the target cgroup.

What this means is that you can only "fight the administrator" in the 
case that the admin has decided to move you inside the subtree of the 
cgroup namespace you are in. Itherwise, you can't move back to your old 
cgroup. Of course, it means that the cgroup namespace is now broken -- 
but that's what the admin wanted to do. I don't think that should be a 
problem.

Since this isn't available in cgroupv1, there's a question about whether 
this functionality should be allowed for cgroupv1. Since cgroupv2 still 
doesn't support all of the controllers needed for many container 
runtimes to work, I don't think we should not implement this for 
cgroupv1. But that's just my opinion.

-- 
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
@ 2016-05-03 14:26             ` James Bottomley
  0 siblings, 0 replies; 17+ messages in thread
From: James Bottomley @ 2016-05-03 14:26 UTC (permalink / raw)
  To: Aleksa Sarai, Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups, linux-kernel, dev, Aleksa Sarai

On Tue, 2016-05-03 at 16:48 +1000, Aleksa Sarai wrote:
> > > > Perhaps what you should to be arguing then that the default
> > > > permissions of the cgroup directories need to be all rwx for
> > > > everyone and then your patch becomes unnecessary?
> > > 
> > > I don't think that would be the nicest way of dealing with this 
> > > (then a process can make very large numbers of cgroups all over 
> > > the tree, which might not cause huge issues but would still be a 
> > > pain for administrators and systemds alike).
> > 
> > Beware of what you cite as a problem.  Any user can enter a user
> > namespace and then unshare a cgroup namespace.  This means that 
> > what you seem to want is equivalent to any user at all being able 
> > to create a cgroup hierarchy.
> 
> They should only be allowed to make subtrees of the cgroup *they 
> currently reside in* IMO.

For the usual case that is the top level cgroup because most processes
don't get initially confined.  If there is initial confinement by
something, then whatever it is could alter the permissions as well.

So if the default case is equivalent to making all the initial top
level cgroups rwx, we should understand the implications of that and
the best way to concentrate minds is to ask what happens if it were the
default.

>  Making the hierarchies chmod(0777) would allow any process in any 
> hierarchy to create cgroups anywhere in the tree. It would just make 
> management within cgroupv2 much harder (especially with the no
>  internal process semantics of cgroupv2). This wouldn't happen with
> the cgroup namespace.
> 
> It should be noted that cgroupv1 doesn't have the same protection I 
> outline below, so this would actually cause cgroup escapes in that 
> case (which we should obviously avoid).

> > [...] This means that either it is a problem, and the
> > cgroup namespace will have to be restricted in some way over how it 
> > can create subordinate cgroups or it's not a problem and we might 
> > as well just see what happens if any old user can do it.
> 
> See above. The only restriction I think is necessary is that the 
> process can only create subtrees of the current cgroup it is in.

I don't think there's a disagreement about that.  The problem is that
the current cgroup is usually the top level one because most systems
don't place processes into any other cgroup in the default case.

> > > > Alternatively, if the desire is fully to virtualize 
> > > > /sys/fs/cgroups, then I think we have to decide how that would 
> > > > happen.  I think the default requirements would be that a pid 
> > > > namespace be established (so only the tasks in that pid 
> > > > namespace would be able to be controlled by the cgroup 
> > > > namespace.  That, I think requires that any given cgroup 
> > > > namespace "own" a pid namespace (being the one present when it 
> > > > was created) but that it only gets a new virtual set of 
> > > > directories owned by the userns owner if there's a pid
> > > > namespace established for the cgroup and cgroup->user_ns ==
> > > > pid_ns->user_ns (meaning we established a user ns then a pid 
> > > > one then a cgroup one, so it's now safe to treat root in the
> > > > user_ns as owning the virtualized cgroup directories).
> > > 
> > > I know this is probably a stupid question, but why couldn't we 
> > > just compare the user_ns with the tcred->user_ns?
> > 
> > If any old user namespace can unshare a cgroup namespace and 
> > manipulate the tree, then that condition is just fine.  If we're 
> > going to require they have to create a pid namespace as well, then 
> > you need a more elaborate condition. 

> Well, I guess my question was more like "if we don't require the pid 
> namespace pinning, what bad things will happen?". You've described 
> the corner case, and I'm not sure it's a problem for cgroupv2. It is 
> a problem for cgroupv1 (unfortunately), due to backwards 
> compatibility reasons. So, we have to decide whether we are going to 
> add a restriction for cgroup namespaces, so that this functionality 
> can be implemented for both versions -- or should we only implement 
> the minimal version (which would only work on cgroupv2).
> 
> If we decide to implement both, we have to agree on the restrictions 
> *immediately* because the cgroup namespace was merged in 4.6-rc1 so 
> changing the restrictions on it in 4.7 would probably be frowned
> upon.

No, that horse has left the stable: the cgroup namespace applies to
both v1 and v2.

> > >   Or are you worried about a process in a cgroup namespace moving
> > > processes to a subtree that isn't in the same pid namespace (even
> > > though they're in the same user namespace)?
> > 
> > The corner case I'm worrying about is what happens to a process 
> > owned by the user that gets moved by the administrator to a more
> > confining cgroup after the establishment of the cgroup namespace? 
> >  If we allow too much capability to the user_ns->owner, then they 
> > could just take it out again.  The semantics of who can do what 
> > after the namespace is established seem to need better definition. 
> >  One answer might be that after the cgroup namespace is established
> > , the real admin can't safely move the processes, which is why they 
> > should be better confined (say within a pid namespace) so it's not 
> > *all* processes owned by this user that can escape control, merely 
> > ones that the user has declared a desire to control the cgroups
> > for).
> 
> My thinking was that rename(2) would make this a simple decision, but 
> I just realised that rename(2) doesn't let you change the hierarchy.
> But it should be noted that cgroupv2 has a fix for this: you can't 
> move a task to another cgroup unless you have attach rights 
> (cgroup.procs) to the common ancestor of the current cgroup and the
> target cgroup.

Currently the decision is made in cgroup_procs_write_permission() and
actually is blind to the user namespace, so this needs updating anyway.

James


> What this means is that you can only "fight the administrator" in the
> case that the admin has decided to move you inside the subtree of the
> cgroup namespace you are in. Itherwise, you can't move back to your 
> old cgroup. Of course, it means that the cgroup namespace is now 
> broken -- but that's what the admin wanted to do. I don't think that 
> should be a problem.
> 
> Since this isn't available in cgroupv1, there's a question about 
> whether this functionality should be allowed for cgroupv1. Since 
> cgroupv2 still doesn't support all of the controllers needed for many 
> container runtimes to work, I don't think we should not implement 
> this for cgroupv1. But that's just my opinion.
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
@ 2016-05-03 14:26             ` James Bottomley
  0 siblings, 0 replies; 17+ messages in thread
From: James Bottomley @ 2016-05-03 14:26 UTC (permalink / raw)
  To: Aleksa Sarai, Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dev-IGmTWi+3HBZvNhPySn5qfx2eb7JE58TQ, Aleksa Sarai

On Tue, 2016-05-03 at 16:48 +1000, Aleksa Sarai wrote:
> > > > Perhaps what you should to be arguing then that the default
> > > > permissions of the cgroup directories need to be all rwx for
> > > > everyone and then your patch becomes unnecessary?
> > > 
> > > I don't think that would be the nicest way of dealing with this 
> > > (then a process can make very large numbers of cgroups all over 
> > > the tree, which might not cause huge issues but would still be a 
> > > pain for administrators and systemds alike).
> > 
> > Beware of what you cite as a problem.  Any user can enter a user
> > namespace and then unshare a cgroup namespace.  This means that 
> > what you seem to want is equivalent to any user at all being able 
> > to create a cgroup hierarchy.
> 
> They should only be allowed to make subtrees of the cgroup *they 
> currently reside in* IMO.

For the usual case that is the top level cgroup because most processes
don't get initially confined.  If there is initial confinement by
something, then whatever it is could alter the permissions as well.

So if the default case is equivalent to making all the initial top
level cgroups rwx, we should understand the implications of that and
the best way to concentrate minds is to ask what happens if it were the
default.

>  Making the hierarchies chmod(0777) would allow any process in any 
> hierarchy to create cgroups anywhere in the tree. It would just make 
> management within cgroupv2 much harder (especially with the no
>  internal process semantics of cgroupv2). This wouldn't happen with
> the cgroup namespace.
> 
> It should be noted that cgroupv1 doesn't have the same protection I 
> outline below, so this would actually cause cgroup escapes in that 
> case (which we should obviously avoid).

> > [...] This means that either it is a problem, and the
> > cgroup namespace will have to be restricted in some way over how it 
> > can create subordinate cgroups or it's not a problem and we might 
> > as well just see what happens if any old user can do it.
> 
> See above. The only restriction I think is necessary is that the 
> process can only create subtrees of the current cgroup it is in.

I don't think there's a disagreement about that.  The problem is that
the current cgroup is usually the top level one because most systems
don't place processes into any other cgroup in the default case.

> > > > Alternatively, if the desire is fully to virtualize 
> > > > /sys/fs/cgroups, then I think we have to decide how that would 
> > > > happen.  I think the default requirements would be that a pid 
> > > > namespace be established (so only the tasks in that pid 
> > > > namespace would be able to be controlled by the cgroup 
> > > > namespace.  That, I think requires that any given cgroup 
> > > > namespace "own" a pid namespace (being the one present when it 
> > > > was created) but that it only gets a new virtual set of 
> > > > directories owned by the userns owner if there's a pid
> > > > namespace established for the cgroup and cgroup->user_ns ==
> > > > pid_ns->user_ns (meaning we established a user ns then a pid 
> > > > one then a cgroup one, so it's now safe to treat root in the
> > > > user_ns as owning the virtualized cgroup directories).
> > > 
> > > I know this is probably a stupid question, but why couldn't we 
> > > just compare the user_ns with the tcred->user_ns?
> > 
> > If any old user namespace can unshare a cgroup namespace and 
> > manipulate the tree, then that condition is just fine.  If we're 
> > going to require they have to create a pid namespace as well, then 
> > you need a more elaborate condition. 

> Well, I guess my question was more like "if we don't require the pid 
> namespace pinning, what bad things will happen?". You've described 
> the corner case, and I'm not sure it's a problem for cgroupv2. It is 
> a problem for cgroupv1 (unfortunately), due to backwards 
> compatibility reasons. So, we have to decide whether we are going to 
> add a restriction for cgroup namespaces, so that this functionality 
> can be implemented for both versions -- or should we only implement 
> the minimal version (which would only work on cgroupv2).
> 
> If we decide to implement both, we have to agree on the restrictions 
> *immediately* because the cgroup namespace was merged in 4.6-rc1 so 
> changing the restrictions on it in 4.7 would probably be frowned
> upon.

No, that horse has left the stable: the cgroup namespace applies to
both v1 and v2.

> > >   Or are you worried about a process in a cgroup namespace moving
> > > processes to a subtree that isn't in the same pid namespace (even
> > > though they're in the same user namespace)?
> > 
> > The corner case I'm worrying about is what happens to a process 
> > owned by the user that gets moved by the administrator to a more
> > confining cgroup after the establishment of the cgroup namespace? 
> >  If we allow too much capability to the user_ns->owner, then they 
> > could just take it out again.  The semantics of who can do what 
> > after the namespace is established seem to need better definition. 
> >  One answer might be that after the cgroup namespace is established
> > , the real admin can't safely move the processes, which is why they 
> > should be better confined (say within a pid namespace) so it's not 
> > *all* processes owned by this user that can escape control, merely 
> > ones that the user has declared a desire to control the cgroups
> > for).
> 
> My thinking was that rename(2) would make this a simple decision, but 
> I just realised that rename(2) doesn't let you change the hierarchy.
> But it should be noted that cgroupv2 has a fix for this: you can't 
> move a task to another cgroup unless you have attach rights 
> (cgroup.procs) to the common ancestor of the current cgroup and the
> target cgroup.

Currently the decision is made in cgroup_procs_write_permission() and
actually is blind to the user namespace, so this needs updating anyway.

James


> What this means is that you can only "fight the administrator" in the
> case that the admin has decided to move you inside the subtree of the
> cgroup namespace you are in. Itherwise, you can't move back to your 
> old cgroup. Of course, it means that the cgroup namespace is now 
> broken -- but that's what the admin wanted to do. I don't think that 
> should be a problem.
> 
> Since this isn't available in cgroupv1, there's a question about 
> whether this functionality should be allowed for cgroupv1. Since 
> cgroupv2 still doesn't support all of the controllers needed for many 
> container runtimes to work, I don't think we should not implement 
> this for cgroupv1. But that's just my opinion.
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
@ 2016-05-04  9:49               ` Aleksa Sarai
  0 siblings, 0 replies; 17+ messages in thread
From: Aleksa Sarai @ 2016-05-04  9:49 UTC (permalink / raw)
  To: James Bottomley, Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups, linux-kernel, dev, Aleksa Sarai

>>>>> Perhaps what you should to be arguing then that the default
>>>>> permissions of the cgroup directories need to be all rwx for
>>>>> everyone and then your patch becomes unnecessary?
>>>>
>>>> I don't think that would be the nicest way of dealing with this
>>>> (then a process can make very large numbers of cgroups all over
>>>> the tree, which might not cause huge issues but would still be a
>>>> pain for administrators and systemds alike).
>>>
>>> Beware of what you cite as a problem.  Any user can enter a user
>>> namespace and then unshare a cgroup namespace.  This means that
>>> what you seem to want is equivalent to any user at all being able
>>> to create a cgroup hierarchy.
>>
>> They should only be allowed to make subtrees of the cgroup *they
>> currently reside in* IMO.
>
> For the usual case that is the top level cgroup because most processes
> don't get initially confined.  If there is initial confinement by
> something, then whatever it is could alter the permissions as well.
>
> So if the default case is equivalent to making all the initial top
> level cgroups rwx, we should understand the implications of that and
> the best way to concentrate minds is to ask what happens if it were the
> default.

A patchset I worked on (and then trashed) before writing this one would 
create a cgroup under your current cgroup, then would make you the owner 
of the new cgroup (and move you to it, making it the root of the 
namespace). This would alleviate this particular issue, but brings up 
many others (such as making sure there's no name clashes, and the fact 
that processes will start moving around in cgroups and whether or not 
userspace will be sufficiently alerted to the changes). In addition, the 
code was quite bad.

My ideal solution would be something like the above, because it means 
that we don't have to have disagreement about who "owns" a particular 
node in the cgroup hierarchy. Then we don't even have to virtualise 
/sys/fs/cgroups because there can be a global agreement on who owns what.

The only issue I could think of was the name clashes, and the fact that 
processes will now be moving around cgroups without explicitly writing 
to cgroup.procs.

>> If we decide to implement both, we have to agree on the restrictions
>> *immediately* because the cgroup namespace was merged in 4.6-rc1 so
>> changing the restrictions on it in 4.7 would probably be frowned
>> upon.
>
> No, that horse has left the stable: the cgroup namespace applies to
> both v1 and v2.

I was referring to the "what restrictions should apply to cgroup.procs 
in a cgroup namespace" question, because if we don't agree on this 
before 4.7 we would break back-compat.

>> My thinking was that rename(2) would make this a simple decision, but
>> I just realised that rename(2) doesn't let you change the hierarchy.
>> But it should be noted that cgroupv2 has a fix for this: you can't
>> move a task to another cgroup unless you have attach rights
>> (cgroup.procs) to the common ancestor of the current cgroup and the
>> target cgroup.
>
> Currently the decision is made in cgroup_procs_write_permission() and
> actually is blind to the user namespace, so this needs updating anyway.

Yeah, but we can't apply it (the common ancestor restriction) to 
cgroupv1 (back-compat). Maybe we could combine both updates as one 
"correcting the semantics" patch?

-- 
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces
@ 2016-05-04  9:49               ` Aleksa Sarai
  0 siblings, 0 replies; 17+ messages in thread
From: Aleksa Sarai @ 2016-05-04  9:49 UTC (permalink / raw)
  To: James Bottomley, Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dev-IGmTWi+3HBZvNhPySn5qfx2eb7JE58TQ, Aleksa Sarai

>>>>> Perhaps what you should to be arguing then that the default
>>>>> permissions of the cgroup directories need to be all rwx for
>>>>> everyone and then your patch becomes unnecessary?
>>>>
>>>> I don't think that would be the nicest way of dealing with this
>>>> (then a process can make very large numbers of cgroups all over
>>>> the tree, which might not cause huge issues but would still be a
>>>> pain for administrators and systemds alike).
>>>
>>> Beware of what you cite as a problem.  Any user can enter a user
>>> namespace and then unshare a cgroup namespace.  This means that
>>> what you seem to want is equivalent to any user at all being able
>>> to create a cgroup hierarchy.
>>
>> They should only be allowed to make subtrees of the cgroup *they
>> currently reside in* IMO.
>
> For the usual case that is the top level cgroup because most processes
> don't get initially confined.  If there is initial confinement by
> something, then whatever it is could alter the permissions as well.
>
> So if the default case is equivalent to making all the initial top
> level cgroups rwx, we should understand the implications of that and
> the best way to concentrate minds is to ask what happens if it were the
> default.

A patchset I worked on (and then trashed) before writing this one would 
create a cgroup under your current cgroup, then would make you the owner 
of the new cgroup (and move you to it, making it the root of the 
namespace). This would alleviate this particular issue, but brings up 
many others (such as making sure there's no name clashes, and the fact 
that processes will start moving around in cgroups and whether or not 
userspace will be sufficiently alerted to the changes). In addition, the 
code was quite bad.

My ideal solution would be something like the above, because it means 
that we don't have to have disagreement about who "owns" a particular 
node in the cgroup hierarchy. Then we don't even have to virtualise 
/sys/fs/cgroups because there can be a global agreement on who owns what.

The only issue I could think of was the name clashes, and the fact that 
processes will now be moving around cgroups without explicitly writing 
to cgroup.procs.

>> If we decide to implement both, we have to agree on the restrictions
>> *immediately* because the cgroup namespace was merged in 4.6-rc1 so
>> changing the restrictions on it in 4.7 would probably be frowned
>> upon.
>
> No, that horse has left the stable: the cgroup namespace applies to
> both v1 and v2.

I was referring to the "what restrictions should apply to cgroup.procs 
in a cgroup namespace" question, because if we don't agree on this 
before 4.7 we would break back-compat.

>> My thinking was that rename(2) would make this a simple decision, but
>> I just realised that rename(2) doesn't let you change the hierarchy.
>> But it should be noted that cgroupv2 has a fix for this: you can't
>> move a task to another cgroup unless you have attach rights
>> (cgroup.procs) to the common ancestor of the current cgroup and the
>> target cgroup.
>
> Currently the decision is made in cgroup_procs_write_permission() and
> actually is blind to the user namespace, so this needs updating anyway.

Yeah, but we can't apply it (the common ancestor restriction) to 
cgroupv1 (back-compat). Maybe we could combine both updates as one 
"correcting the semantics" patch?

-- 
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-05-04  9:49 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-01 13:41 [PATCH v2] cgroup: allow management of subtrees by cgroup namespaces Aleksa Sarai
2016-05-01 13:41 ` Aleksa Sarai
2016-05-01 13:41 ` [PATCH v2] cgroup: allow management of subtrees by new " Aleksa Sarai
2016-05-02  9:32   ` Aleksa Sarai
2016-05-02  9:32     ` Aleksa Sarai
2016-05-02 22:00   ` James Bottomley
2016-05-02 22:00     ` James Bottomley
2016-05-03  1:59     ` Aleksa Sarai
2016-05-03  1:59       ` Aleksa Sarai
2016-05-03  2:26       ` James Bottomley
2016-05-03  2:26         ` James Bottomley
2016-05-03  6:48         ` Aleksa Sarai
2016-05-03  6:48           ` Aleksa Sarai
2016-05-03 14:26           ` James Bottomley
2016-05-03 14:26             ` James Bottomley
2016-05-04  9:49             ` Aleksa Sarai
2016-05-04  9:49               ` Aleksa Sarai

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.