* [RFC PATCH] cgroup: add cgroup.signal
@ 2021-04-23 17:13 Christian Brauner
       [not found] ` <20210423171351.3614430-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2021-04-23 17:13 UTC (permalink / raw)
  To: Tejun Heo, Roman Gushchin
  Cc: Shakeel Butt, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Christian Brauner

From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

Introduce a new cgroup.signal file which is available in non-root
cgroups and allows sending signals to cgroups. As with all new cgroup
features this is recursive by default. This first implementation is
restricted to sending SIGKILL, with the possibility of being extended to
other signals in the future, which I think is very likely.

There are various use-cases for this interface. In container-land,
assuming a conservative layout where a container is treated as a
separate machine, each container usually has a delegated cgroup.

This means there is a 1:1 mapping between container and cgroup. If the
container in addition uses a separate pid namespace then killing a
container becomes a simple kill -9 <container-init-pid> from an ancestor
pid namespace.

However, there are quite a few scenarios where that isn't true. For
example, there are containers that on purpose share the cgroup with
other processes that are supposed to be bound to the lifetime of the
container but are not in the container's pid namespace. Or containers
that are in a delegated cgroup but share the pid namespace with the host
or other containers.

Other use-cases arise for service management in systemd and potentially
userspace oom implementations.

I'm honestly a bit worried because this patch feels way too
straightforward right now and I've been holding it back because I fear I
keep missing obvious problems. In any case, here are the semantics and
the corner-cases that came to my mind and how I reasoned about them:
- Signals are specified by writing the signal number into cgroup.signal.
  An alternative would be to allow writing the signal name but I don't
  think that's worth it. Callers like systemd can easily do a snprintf()
  with the signal's define/enum (see the sketch after this list).
- Since signaling is a one-time event and we're holding cgroup_mutex,
  as we do for the freezer, we don't need to worry about tasks joining
  the cgroup while we're signaling the cgroup. Freezer needed to care
  about this because a task could join or leave frozen/non-frozen
  cgroups. Since we only support SIGKILL currently and SIGKILL works for
  frozen tasks there's also no significant interaction with frozen
  cgroups.
- Since signaling leads to an event and not a state change the
  cgroup.signal file is write-only.
- Since we currently only support SIGKILL we don't need to generate a
  separate notification and can rely on the unpopulated notification
  mechanism. If we support more signals we can introduce a separate
  notification in cgroup.events.
- We skip signaling kthreads; this also means a cgroup that has a kthread
  but receives a SIGKILL signal will not become empty and consequently
  no populated event will be generated. This could potentially be
  handled if we were to generate an event whenever we are finished
  sending a signal. I'm not sure that's worth it.
- Freezer doesn't care about tasks in different pid namespaces, i.e. if
  you have two tasks in different pid namespaces the cgroup would still
  be frozen.
  The cgroup.signal mechanism should consequently behave the same way,
  i.e.  signal all processes and ignore in which pid namespace they
  exist. This would obviously mean that if you e.g. had a task from an
  ancestor pid namespace join a delegated cgroup of a container in a
  child pid namespace the container can kill that task. But I think this
  is fine and actually the semantics we want since the cgroup has been
  delegated.
- It is currently not possible to signal tasks in ancestor or sibling
  pid namespaces with regular signal APIs. This is not even possible
  with pidfds right now as I've added a check, access_pidfd_pidns(),
  which verifies that the task to be signaled is in the same or a
  descendant pid namespace as the caller. However, with cgroup.signal
  this would be possible whenever a task lives in a cgroup that is
  delegated to a caller in an ancestor or sibling pid namespace.
- SIGKILLing a task that is PID 1 in a pid namespace which is
  located in a delegated cgroup but which has children in non-delegated
  cgroups (further up the hierarchy for example) would necessarily cause
  those children to die too.
- We skip signaling tasks that already have pending fatal signals.
- If we are located in a cgroup that is going to get SIGKILLed we'll be
  SIGKILLed as well which is fine and doesn't require us to do anything
  special.
- We're holding the read-side of tasklist lock while we're signaling
  tasks. That seems fine since kill(-1, SIGKILL) holds the read-side
  of tasklist lock walking all processes and is a way for unprivileged
  users to trigger tasklist lock being held for a long time. In contrast
  it would require a delegated cgroup with lots of processes and a deep
  hierarchy to allow for something similar with this interface.
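
As a rough illustration of the intended usage from userspace (not part
of the patch; the helper name and error handling are made up):

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: recursively SIGKILL every process in the cgroup
 * at @cgrp_path by writing the signal number into its cgroup.signal file.
 */
static int cgroup_kill_all(const char *cgrp_path)
{
	char file[4096], buf[16];
	int fd, n;

	snprintf(file, sizeof(file), "%s/cgroup.signal", cgrp_path);
	fd = open(file, O_WRONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	/* Only SIGKILL (9) is accepted by this first implementation. */
	n = snprintf(buf, sizeof(buf), "%d", SIGKILL);
	if (write(fd, buf, n) != n) {
		close(fd);
		return -1;
	}

	return close(fd);
}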

Fwiw, in addition to the system manager and container use-cases I think
this has the potential to be picked up by the "kill" tool. In the future
I'd hope we can do: kill -9 --cgroup /sys/fs/cgroup/delegated

Once we see this is a feasible direction and I haven't missed a bunch of
obvious problems I'll add proper tests and send out a non-RFC version of
this patch. Manual testing indicates it works as expected.

Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
---
 kernel/cgroup/cgroup.c | 75 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9153b20e5cc6..fff90e0bf5ae 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3654,6 +3654,76 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static void __cgroup_signal(struct cgroup *cgrp, int signr)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	css_task_iter_start(&cgrp->self, 0, &it);
+	while ((task = css_task_iter_next(&it))) {
+		/* Ignore kernel threads here. */
+		if (task->flags & PF_KTHREAD)
+			continue;
+
+		/* Skip dying tasks. */
+		if (__fatal_signal_pending(task))
+			continue;
+
+		send_sig(signr, task, 0);
+	}
+	css_task_iter_end(&it);
+}
+
+static void cgroup_signal(struct cgroup *cgrp, int signr)
+{
+	struct cgroup_subsys_state *css;
+	struct cgroup *dsct;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	/*
+	 * Propagate changes downwards the cgroup tree.
+	 */
+	read_lock(&tasklist_lock);
+	css_for_each_descendant_pre(css, &cgrp->self) {
+		dsct = css->cgroup;
+
+		if (cgroup_is_dead(dsct))
+			continue;
+
+		__cgroup_signal(dsct, signr);
+	}
+	read_unlock(&tasklist_lock);
+}
+
+static ssize_t cgroup_signal_write(struct kernfs_open_file *of, char *buf,
+				   size_t nbytes, loff_t off)
+{
+	struct cgroup *cgrp;
+	ssize_t ret;
+	int signr;
+
+	ret = kstrtoint(strstrip(buf), 0, &signr);
+	if (ret)
+		return ret;
+
+	/* Only allow SIGKILL for now. */
+	if (signr != SIGKILL)
+		return -EINVAL;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENOENT;
+
+	cgroup_signal(cgrp, signr);
+
+	cgroup_kn_unlock(of->kn);
+
+	return nbytes;
+}
+
 static int cgroup_file_open(struct kernfs_open_file *of)
 {
 	struct cftype *cft = of_cft(of);
@@ -4846,6 +4916,11 @@ static struct cftype cgroup_base_files[] = {
 		.seq_show = cgroup_freeze_show,
 		.write = cgroup_freeze_write,
 	},
+	{
+		.name = "cgroup.signal",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.write = cgroup_signal_write,
+	},
 	{
 		.name = "cpu.stat",
 		.seq_show = cpu_stat_show,

base-commit: d434405aaab7d0ebc516b68a8fc4100922d7f5ef
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] cgroup: add cgroup.signal
       [not found] ` <20210423171351.3614430-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2021-04-23 19:01   ` Roman Gushchin
       [not found]     ` <YIMZkjzNFypjZao9-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
  2021-04-26 14:42   ` Michal Koutný
  2021-04-26 19:03   ` Michal Koutný
  2 siblings, 1 reply; 15+ messages in thread
From: Roman Gushchin @ 2021-04-23 19:01 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Shakeel Butt, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Christian Brauner

On Fri, Apr 23, 2021 at 07:13:51PM +0200, Christian Brauner wrote:
> From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> 
> Introduce a new cgroup.signal file which is available in non-root
> cgroups and allows to send signals to cgroups. As with all new cgroup
> features this is recursive by default. This first implementation is
> restricted to sending SIGKILL with the possibility to be extended to
> other signals in the future which I think is very likely.
> 
> There are various use-cases of this interface. In container-land
> assuming a conservative layout where a container is treated as a
> separate machine each container usually has a delegated cgroup.
> 
> This means there is a 1:1 mapping between container and cgroup. If the
> container in addition uses a separate pid namespace then killing a
> container becomes a simple kill -9 <container-init-pid> from an ancestor
> pid namespace.
> 
> However, there are quite a few scenarios where that isn't true. For
> example, there are containers that share the cgroup with other processes
> on purpose that are supposed to be bound to the lifetime of the
> container but are not in the same pidns of the container. Containers
> that are in a delegated cgroup but share the pid namespace with the host
> or other containers.
> 
> Other use-cases arise for service management in systemd and potentially
> userspace oom implementations.
> 
> I'm honestly a bit worried because this patch feels way too
> straightforward right now and I've been holding it back because I fear I
> keep missing obvious problems. In any case, here are the semantics and
> the corner-cases that came to my mind and how I reasoned about them:
> - Signals are specified by writing the signal number into cgroup.signal.
>   An alternative would be to allow writing the signal name but I don't
>   think that's worth it. Callers like systemd can easily do a snprintf()
>   with the signal's define/enum.
> - Since signaling is a one-time event and we're holding cgroup_mutex()
>   as we do for freezer we don't need to worry about tasks joining the
>   cgroup while we're signaling the cgroup. Freezer needed to care about
>   this because a task could join or leave frozen/non-frozen cgroups.
>   Since we only support SIGKILL currently and SIGKILL works for frozen
>   tasks there's also not significant interaction with frozen cgroups.

Hello Christian!

I'm worried that we might miss a fork'ing task. If a task has started
fork'ing, holding the tasklist_lock will not prevent it from joining the
cgroup after we release it. So I guess the task could escape being
killed, which would obviously break the user's expectations. Something
to check here.


> - Since signaling leads to an event and not a state change the
>   cgroup.signal file is write-only.
> - Since we currently only support SIGKILL we don't need to generate a
>   separate notification and can rely on the unpopulated notification
>   mechanism. If we support more signals we can introduce a separate
>   notification in cgroup.events.

An alternative interface is to have echo "1" > cgroup.kill.
I'm not sure what's better long-term.
Given that it's unlikely that new signals will appear, maybe it's better
to decide right now if we're going to support anything beyond SIGKILL and
implement it all together. Otherwise applications would need to handle
-EINVAL on some signals on some kernel versions...

> - We skip signaling kthreads; this also means a cgroup that has a kthread
>   but receives a SIGKILL signal will not become empty and consequently
>   no populated event will be generated. This could potentially be
>   handled if we were to generate an event whenever we are finished
>   sending a signal. I'm not sure that's worth it.

I think this is fine. Generally it's expected that all kthreads are in
the root cgroup (at least on cgroup v2). If it's not the case, a user should
be prepared for various side-effects. The freezer has them too. There is no
way to treat kthreads as normal processes without bad consequences.

> - Freezer doesn't care about tasks in different pid namespaces, i.e. if
>   you have two tasks in different pid namespaces the cgroup would still
>   be frozen.
>   The cgroup.signal mechanism should consequently behave the same way,
>   i.e.  signal all processes and ignore in which pid namespace they
>   exist. This would obviously mean that if you e.g. had a task from an
>   ancestor pid namespace join a delegated cgroup of a container in a
>   child pid namespace the container can kill that task. But I think this
>   is fine and actually the semantics we want since the cgroup has been
>   delegated.
> - It is currently not possible to signal tasks in ancestor or sibling
>   pid namespaces with regular signal APIs. This is not even possible
>   with pidfds right now as I've added a check access_pidfd_pidns() which
>   verifies that the task to be signaled is in the same or a descendant
>   pid namespace as the caller. However, with cgroup.signal this would be
>   possible whenever a task lives in a cgroup that is delegated to a
>   caller in an ancestor or sibling pid namespace.

I agree.

> - SIGKILLing a task that is PID 1 in a pid namespace which is
>   located in a delegated cgroup but which has children in non-delegated
>   cgroups (further up the hierarchy for example) would necessarily cause
>   those children to die too.
> - We skip signaling tasks that already have pending fatal signals.
> - If we are located in a cgroup that is going to get SIGKILLed we'll be
>   SIGKILLed as well which is fine and doesn't require us to do anything
>   special.
> - We're holding the read-side of tasklist lock while we're signaling
>   tasks. That seems fine since kill(-1, SIGKILL) holds the read-side
>   of tasklist lock walking all processes and is a way for unprivileged
>   users to trigger tasklist lock being held for a long time. In contrast
>   it would require a delegated cgroup with lots of processes and a deep
>   hierarchy to allow for something similar with this interface.
> 
> Fwiw, in addition to the system manager and container use-cases I think
> this has the potential to be picked up by the "kill" tool. In the future
> I'd hope we can do: kill -9 --cgroup /sys/fs/cgroup/delegated
> 
> Once we see this is a feasible direction and I haven't missed a bunch of
> obvious problems I'll add proper tests and send out a non-RFC version of
> this patch. Manual testing indicates it works as expected.

Overall it sounds very reasonable and makes total sense to me. Many userspace
applications can use the new interface instead of reading cgroup.procs in
a loop and killing all processes, or using the freezer and killing a frozen
list of tasks. It will simplify the code and make it more reliable.
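
For comparison, the status quo in userspace looks roughly like this (a
simplified sketch; error handling, recursion into child cgroups and the
race with forking tasks are all ignored):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * One pass: SIGKILL everything currently listed in cgroup.procs. Real
 * implementations repeat this (and walk child cgroups) until the cgroup
 * reads back empty.
 */
static int kill_cgroup_procs_once(const char *procs_path)
{
	FILE *f = fopen(procs_path, "re");
	char line[64];

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		pid_t pid = strtol(line, NULL, 10);

		if (pid > 0)
			kill(pid, SIGKILL);
	}

	return fclose(f);
}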

Thanks!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] cgroup: add cgroup.signal
       [not found]     ` <YIMZkjzNFypjZao9-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
@ 2021-04-26 14:42       ` Michal Koutný
  2021-04-26 15:15         ` Christian Brauner
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Koutný @ 2021-04-26 14:42 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christian Brauner, Tejun Heo, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Christian Brauner

Hello.

On Fri, Apr 23, 2021 at 12:01:38PM -0700, Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> wrote:
> Overall it sounds very reasonable and makes total sense to me.
I agree this sounds like very desired convenience...

> Many userspace applications can use the new interface instead of
> reading cgroup.procs in a cycle and killing all processes or using the
> freezer and kill a frozen list of tasks.
...however, exactly because of this, I'm not convinced it justifies
yet another way of doing it and implementing that in the kernel. (AFAIU,
both of those ways should be reliable too (assuming reading cgroup.procs
of the _default_ hierarchy), please correct me if I'm wrong.)

> It will simplify the code and make it more reliable.
It's not cost-free though; part of the complexity is moved to the
kernel.
As Roman already pointed out earlier, there is an unclear situation wrt
forking tasks. Something similar had to be solved for the freezer, hence
why not let userspace rely on that already? Having a similar codepath
for signalling the cgroups seems like a way to have two similar
codepaths side by side where one of them serves just to simplify
userspace tools.

Michal

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] cgroup: add cgroup.signal
       [not found] ` <20210423171351.3614430-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2021-04-23 19:01   ` Roman Gushchin
@ 2021-04-26 14:42   ` Michal Koutný
  2021-04-26 15:29     ` Christian Brauner
  2021-04-26 19:03   ` Michal Koutný
  2 siblings, 1 reply; 15+ messages in thread
From: Michal Koutný @ 2021-04-26 14:42 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Roman Gushchin, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Christian Brauner

Hello Christian,
I have some questions to understand the motivation here.

On Fri, Apr 23, 2021 at 07:13:51PM +0200, Christian Brauner <brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> - Signals are specified by writing the signal number into cgroup.signal.
>   An alternative would be to allow writing the signal name but I don't
>   think that's worth it. Callers like systemd can easily do a snprintf()
>   with the signal's define/enum.
> - Since signaling is a one-time event and we're holding cgroup_mutex()
>   as we do for freezer we don't need to worry about tasks joining the
>   cgroup while we're signaling the cgroup. Freezer needed to care about
>   this because a task could join or leave frozen/non-frozen cgroups.
>   Since we only support SIGKILL currently and SIGKILL works for frozen
>   tasks there's also not significant interaction with frozen cgroups.
> - Since signaling leads to an event and not a state change the
>   cgroup.signal file is write-only.
Have you considered accepting a cgroup fd in pidfd_send_signal() and
realizing this operation through that syscall? (Just asking as it may
prevent some of these consequences while bringing other unclarities.)


> - Since we currently only support SIGKILL we don't need to generate a
>   separate notification and can rely on the unpopulated notification
>   mechanism. If we support more signals we can introduce a separate
>   notification in cgroup.events.
What kind of notification do you have in mind here?

> - Freezer doesn't care about tasks in different pid namespaces, i.e. if
>   you have two tasks in different pid namespaces the cgroup would still
>   be frozen.
>   The cgroup.signal mechanism should consequently behave the same way,
>   i.e.  signal all processes and ignore in which pid namespace they
>   exist. This would obviously mean that if you e.g. had a task from an
>   ancestor pid namespace join a delegated cgroup of a container in a
>   child pid namespace the container can kill that task. But I think this
>   is fine and actually the semantics we want since the cgroup has been
>   delegated.
What do you mean by a delegated cgroup in this context?

> - We're holding the read-side of tasklist lock while we're signaling
>   tasks. That seems fine since kill(-1, SIGKILL) holds the read-side
>   of tasklist lock walking all processes and is a way for unprivileged
>   users to trigger tasklist lock being held for a long time. In contrast
>   it would require a delegated cgroup with lots of processes and a deep
>   hierarchy to allow for something similar with this interface.
I'd rather not proliferate tasklist_lock users if it's avoidable (the
freezer avoids it, for example).

> Fwiw, in addition to the system manager and container use-cases I think
> this has the potential to be picked up by the "kill" tool. In the future
> I'd hope we can do: kill -9 --cgroup /sys/fs/cgroup/delegated
(OT: FTR, there's `systemctl kill` already ;-))

Michal

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] cgroup: add cgroup.signal
  2021-04-26 14:42       ` Michal Koutný
@ 2021-04-26 15:15         ` Christian Brauner
  2021-04-26 19:02           ` Michal Koutný
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2021-04-26 15:15 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Roman Gushchin, Christian Brauner, Tejun Heo, Shakeel Butt,
	Zefan Li, Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Apr 26, 2021 at 04:42:07PM +0200, Michal Koutný wrote:
> Hello.

(Tiny favor, can you please leave a newline before your inline replies.
Would make it easier to follow.)

> 
> On Fri, Apr 23, 2021 at 12:01:38PM -0700, Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> wrote:
> > Overall it sounds very reasonable and makes total sense to me.
> I agree this sounds like very desired convenience...
> 
> > Many userspace applications can use the new interface instead of
> > reading cgroup.procs in a cycle and killing all processes or using the
> > freezer and kill a frozen list of tasks.
> ...however, exactly because of this, I'm not convinced it's justifying
> yet another way how to do it and implement that in kernel. (AFAIU, both
> those ways should be reliable too (assuming reading cgroup.procs of the
> _default_ hierarchy), please correct me if I'm wrong.)
> 
> > It will simplify the code and make it more reliable.
> It's not cost free though, part of the complexity is moved to the
> kernel.
> As Roman already pointed earlier, there are is unclear situation wrt
> forking tasks. The similar had to be solved for the freezer hence why
> not let uspace rely on that already? Having similar codepaths for
> signalling the cgroups seems like a way to have two similar codepaths
> side by side where one of them serves just to simplify uspace tools.

The established way of doing things in userspace stems from cgroup v1,
where the concept of "kill this cgroup and all its descendants" didn't
really make sense in the face of multiple hierarchies. This is in pretty
stark contrast to cgroup v2 where, for the first time, that concept does
make a lot of sense. So the "traditional" way of killing cgroups is a
holdover from the legacy cgroup world.

Since cgroups organize and manage resources and processes, killing
cgroups is arguably a core cgroup feature and in some form was always
planned. It just hasn't been high-priority.

We often tend to bring up "this is just convenience and therefore not
worth it" as an argument (I've sure done it myself) but I think that's
quite often not a very good argument. We should very much try to make
interfaces simpler to use for userspace. In this specific instance the
code comes down from an algorithm that recursively kills all cgroups to
a single write into a file. Which seems like a good win.

This also allows for quite a few things that aren't currently possible
in userspace, some of which were in the original mail (at least
implicitly).
For example, you can delegate killing of a privileged cgroup to a less
privileged or sandboxed process by having the privileged process open
the cgroup.kill file and then handing it off. In general, similar to
freezing, you can send around that file descriptor. You can kill
processes in ancestor or sibling pid namespaces as long as they are
encompassed in the same cgroup. And other useful things.
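
To sketch the delegation example (purely illustrative; the helper names
are made up and the "write 1 to kill" semantics follow the cgroup.kill
proposal from this thread):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Privileged side: open the kill knob of the delegated cgroup and hand
 * the fd to a sandboxed process (e.g. via SCM_RIGHTS, not shown).
 */
static int open_kill_fd(const char *cgrp_path)
{
	char path[4096];

	snprintf(path, sizeof(path), "%s/cgroup.kill", cgrp_path);
	return open(path, O_WRONLY | O_CLOEXEC);
}

/*
 * Sandboxed side: never sees the cgroup path or any pids, yet can take
 * the whole delegated subtree down through the received fd.
 */
static int kill_via_fd(int fd)
{
	return write(fd, "1", 1) == 1 ? 0 : -1;
}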

But really, the simplification alone is already quite good.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] cgroup: add cgroup.signal
  2021-04-26 14:42   ` Michal Koutný
@ 2021-04-26 15:29     ` Christian Brauner
  2021-04-26 16:08       ` Shakeel Butt
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2021-04-26 15:29 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Christian Brauner, Tejun Heo, Roman Gushchin, Shakeel Butt,
	Zefan Li, Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Apr 26, 2021 at 04:42:45PM +0200, Michal Koutný wrote:
> Hello Christian,
> I have some questions to understand the motivation here.
> 
> On Fri, Apr 23, 2021 at 07:13:51PM +0200, Christian Brauner <brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > - Signals are specified by writing the signal number into cgroup.signal.
> >   An alternative would be to allow writing the signal name but I don't
> >   think that's worth it. Callers like systemd can easily do a snprintf()
> >   with the signal's define/enum.
> > - Since signaling is a one-time event and we're holding cgroup_mutex()
> >   as we do for freezer we don't need to worry about tasks joining the
> >   cgroup while we're signaling the cgroup. Freezer needed to care about
> >   this because a task could join or leave frozen/non-frozen cgroups.
> >   Since we only support SIGKILL currently and SIGKILL works for frozen
> >   tasks there's also not significant interaction with frozen cgroups.
> > - Since signaling leads to an event and not a state change the
> >   cgroup.signal file is write-only.
> Have you considered accepting a cgroup fd to pidfd_send_signal and
> realize this operation through this syscall? (Just asking as it may
> prevent some of these consequences whereas bring other unclarities.)

That's semantically quite wrong on several fronts, I think.
pidfd_send_signal() operates on pidfds (and for a quirky historical
reason /proc/<pid> though that should die at some point). Making this
operate on cgroup fds is essentially implicit multiplexing which is
pretty nasty imho. In addition, this is a cgroup concept not a pidfd
concept.
I've also removed the "arbitrary signal" feature from the cgroup.signal
knob. I think Roman's right that it should simply be cgroup.kill.

> 
> 
> > - Since we currently only support SIGKILL we don't need to generate a
> >   separate notification and can rely on the unpopulated notification
> >   mechanism. If we support more signals we can introduce a separate
> >   notification in cgroup.events.
> What kind of notification do you have in mind here?

This is now irrelevant with dumbing this down to cgroup.kill.

> 
> > - Freezer doesn't care about tasks in different pid namespaces, i.e. if
> >   you have two tasks in different pid namespaces the cgroup would still
> >   be frozen.
> >   The cgroup.signal mechanism should consequently behave the same way,
> >   i.e.  signal all processes and ignore in which pid namespace they
> >   exist. This would obviously mean that if you e.g. had a task from an
> >   ancestor pid namespace join a delegated cgroup of a container in a
> >   child pid namespace the container can kill that task. But I think this
> >   is fine and actually the semantics we want since the cgroup has been
> >   delegated.
> What do you mean by a delegated cgroup in this context?

Hm? I mean a cgroup that is delegated to a specific user according to
the official cgroup2 documentation.

brauner@wittgenstein|/sys/fs/cgroup/payload.f1
> ls -al
total 0
drwxrwxr-x  6 root   100000 0 Apr 25 11:51 .
dr-xr-xr-x 41 root   root   0 Apr 25 11:51 ..
-r--r--r--  1 root   root   0 Apr 26 17:20 cgroup.controllers
-r--r--r--  1 root   root   0 Apr 26 17:20 cgroup.events
-rw-r--r--  1 root   root   0 Apr 26 17:20 cgroup.freeze
-rw-r--r--  1 root   root   0 Apr 26 17:20 cgroup.max.depth
-rw-r--r--  1 root   root   0 Apr 26 17:20 cgroup.max.descendants
-rw-rw-r--  1 root   100000 0 Apr 25 11:51 cgroup.procs
-r--r--r--  1 root   root   0 Apr 26 17:20 cgroup.stat
-rw-rw-r--  1 root   100000 0 Apr 25 11:51 cgroup.subtree_control
-rw-rw-r--  1 root   100000 0 Apr 25 11:51 cgroup.threads
-rw-r--r--  1 root   root   0 Apr 26 17:20 cgroup.type
-rw-r--r--  1 root   root   0 Apr 26 17:20 cpu.pressure
-r--r--r--  1 root   root   0 Apr 26 17:20 cpu.stat
drwxr-xr-x  2 100000 100000 0 Apr 25 11:51 init.scope
-rw-r--r--  1 root   root   0 Apr 26 17:20 io.pressure
drwxr-xr-x  2 100000 100000 0 Apr 25 11:51 .lxc
-rw-r--r--  1 root   root   0 Apr 26 17:20 memory.pressure
drwxr-xr-x 78 100000 100000 0 Apr 26 15:24 system.slice
drwxr-xr-x  2 100000 100000 0 Apr 25 11:52 user.slice

> 
> > - We're holding the read-side of tasklist lock while we're signaling
> >   tasks. That seems fine since kill(-1, SIGKILL) holds the read-side
> >   of tasklist lock walking all processes and is a way for unprivileged
> >   users to trigger tasklist lock being held for a long time. In contrast
> >   it would require a delegated cgroup with lots of processes and a deep
> >   hierarchy to allow for something similar with this interface.
> I'd better not proliferate tasklist_lock users if it's avoidable (such
> as freezer does).
> 
> > Fwiw, in addition to the system manager and container use-cases I think
> > this has the potential to be picked up by the "kill" tool. In the future
> > I'd hope we can do: kill -9 --cgroup /sys/fs/cgroup/delegated
> (OT: FTR, there's `systemctl kill` already ;-))

Not every distro uses systemd. :)
And actually systemd is one of the users of this feature.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] cgroup: add cgroup.signal
  2021-04-26 15:29     ` Christian Brauner
@ 2021-04-26 16:08       ` Shakeel Butt
       [not found]         ` <CALvZod5=eLQMdVXxuhj9ia=PkoRvT5oBxeqZAVtQpSukZ=tCxA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Shakeel Butt @ 2021-04-26 16:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Michal Koutný,
	Christian Brauner, Tejun Heo, Roman Gushchin, Zefan Li,
	Johannes Weiner, Cgroups

On Mon, Apr 26, 2021 at 8:29 AM Christian Brauner
<christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> wrote:
>
[...]
> > Have you considered accepting a cgroup fd to pidfd_send_signal and
> > realize this operation through this syscall? (Just asking as it may
> > prevent some of these consequences whereas bring other unclarities.)
>
> That's semantically quite wrong on several fronts, I think.
> pidfd_send_signal() operates on pidfds (and for a quirky historical
> reason /proc/<pid> though that should die at some point). Making this
> operate on cgroup fds is essentially implicit multiplexing which is
> pretty nasty imho. In addition, this is a cgroup concept not a pidfd
> concept.

What's your take on a new syscall cgroupfd_send_signal()? One
complexity would be potentially different semantics for v1 and v2.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] cgroup: add cgroup.signal
       [not found]         ` <CALvZod5=eLQMdVXxuhj9ia=PkoRvT5oBxeqZAVtQpSukZ=tCxA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2021-04-26 16:24           ` Christian Brauner
  0 siblings, 0 replies; 15+ messages in thread
From: Christian Brauner @ 2021-04-26 16:24 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Koutný,
	Christian Brauner, Tejun Heo, Roman Gushchin, Zefan Li,
	Johannes Weiner, Cgroups

On Mon, Apr 26, 2021 at 09:08:32AM -0700, Shakeel Butt wrote:
> On Mon, Apr 26, 2021 at 8:29 AM Christian Brauner
> <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> [...]
> > > Have you considered accepting a cgroup fd to pidfd_send_signal and
> > > realize this operation through this syscall? (Just asking as it may
> > > prevent some of these consequences whereas bring other unclarities.)
> >
> > That's semantically quite wrong on several fronts, I think.
> > pidfd_send_signal() operates on pidfds (and for a quirky historical
> > reason /proc/<pid> though that should die at some point). Making this
> > operate on cgroup fds is essentially implicit multiplexing which is
> > pretty nasty imho. In addition, this is a cgroup concept not a pidfd
> > concept.
> 
> What's your take on a new syscall cgroupfd_send_signal()? One
> complexity would be potentially different semantics for v1 and v2.

I would think that this would be labeled overkill and has - imho - almost
zero chance of being accepted.
You'd essentially need to argue that it's sensible for a (virtual)
filesystem to have a separate syscall or syscalls (a signal-sending one
at that).
In addition you'd break the filesystem API model that cgroups are
designed around, which seems unpleasant.
(And fwiw, an ioctl() would have the same issues, though it has
precedent if you look at nsfs. But nsfs doesn't have any real
filesystem interaction model apart from being open()able and is
basically a broken-out part of procfs.)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] cgroup: add cgroup.signal
  2021-04-26 15:15         ` Christian Brauner
@ 2021-04-26 19:02           ` Michal Koutný
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Koutný @ 2021-04-26 19:02 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Roman Gushchin, Christian Brauner, Tejun Heo, Shakeel Butt,
	Zefan Li, Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA

(Thanks for your explanations in the other thread.)

On Mon, Apr 26, 2021 at 05:15:14PM +0200, Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> wrote:
> Since cgroups organize and manage resources and processes killing
> cgroups is arguably a core cgroup feature and in some form was always
> planned. It just hasn't been high-priority.

This holds for v1 as well; actually, cgroup.signal had been considered
[1] (but dropped in favor of the freezer IIUC).

> We should very much try to make interfaces simpler to use for
> userspace.

Another way of seeing this is to have one canonical way to do that
(i.e. with freezing).

> In this specific instance the code comes down from an algorithm to
> recursively kill all cgroups to a single write into a file. Which
> seems like a good win.

I'm considering the SIGKILL-only implementation now, i.e. the recursion
would still be needed for other signals.

> You can kill processes in ancestor or sibling pid namespaces as long
> as they are encompassed in the same cgroup. And other useful things.

This seems like the main differentiating point, the ability to pass
around a suicidal igniter that'll blow you up and your whole house,
including any privileged or invisible visitors in it. (Rewording just
for the fun of the simile.)

So with the restriction to mere SIGKILL of a cgroup and this reasoning,
I'll look at the patch itself (replying directly to the original message).

> But really, the simplifcation alone is already quite good.

Yes, let's keep this a simplification :-)

Michal
 
[1] https://lore.kernel.org/lkml/CALdu-PBLCNXTaZuODyzw_g_FQNyLqK_FsdObC=HjtEp75vkdFQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] cgroup: add cgroup.signal
       [not found] ` <20210423171351.3614430-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2021-04-23 19:01   ` Roman Gushchin
  2021-04-26 14:42   ` Michal Koutný
@ 2021-04-26 19:03   ` Michal Koutný
  2021-04-27  9:36     ` Christian Brauner
  2 siblings, 1 reply; 15+ messages in thread
From: Michal Koutný @ 2021-04-26 19:03 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Roman Gushchin, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Christian Brauner

On Fri, Apr 23, 2021 at 07:13:51PM +0200, Christian Brauner <brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> +static void __cgroup_signal(struct cgroup *cgrp, int signr)
> +{
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +
> +	lockdep_assert_held(&cgroup_mutex);
> +
> +	css_task_iter_start(&cgrp->self, 0, &it);

I think you'd need CSS_TASK_ITER_PROCS here to avoid signalling
multithreaded processes multiple times? (OTOH, with committing just to
SIGKILL this may be irrelevant.)

> +static void cgroup_signal(struct cgroup *cgrp, int signr)
> [...]
> +	read_lock(&tasklist_lock);

(Thinking out loud.)
I wonder if it's possible to get rid of this? A check similar to the one
the freezer does in cgroup_post_fork(), but perhaps in cgroup_can_fork().
The fork/clone would apparently fail for the soon-to-die parent but
there's already a similar error path returning ENODEV (heh, the macabre
name cgroup_is_dead() is already occupied).

This way the cgroup killer wouldn't unnecessarily preempt tasklist_lock
exclusive requestors system-wide.


> @@ -4846,6 +4916,11 @@ static struct cftype cgroup_base_files[] = {
> +	{
> +		.name = "cgroup.signal",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.write = cgroup_signal_write,

I think this shouldn't be visible in threaded cgroups (or return an
error when attempting to kill those).

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] cgroup: add cgroup.signal
  2021-04-26 19:03   ` Michal Koutný
@ 2021-04-27  9:36     ` Christian Brauner
  2021-04-27 14:29       ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2021-04-27  9:36 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Christian Brauner, Tejun Heo, Roman Gushchin, Shakeel Butt,
	Zefan Li, Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Apr 26, 2021 at 09:03:00PM +0200, Michal Koutný wrote:
> On Fri, Apr 23, 2021 at 07:13:51PM +0200, Christian Brauner <brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > +static void __cgroup_signal(struct cgroup *cgrp, int signr)
> > +{
> > +	struct css_task_iter it;
> > +	struct task_struct *task;
> > +
> > +	lockdep_assert_held(&cgroup_mutex);
> > +
> > +	css_task_iter_start(&cgrp->self, 0, &it);
> 
> I think here you'd need CSS_TASK_ITER_PROCS here to avoid signalling
> multithreaded processes multiple times? (OTOH, with committing just to
> SIGKILL this may be irrelevant.)

I thought about this optimization but (see below) given that it should
work with threaded cgroups we can't only walk thread-group leaders,
afaiu.
Instead I tried to rely on __fatal_signal_pending(). If any thread in a
thread-group catches SIGKILL, the signal will be set as pending for
each thread in the thread-group, and __fatal_signal_pending() should
therefore be enough to skip useless send_sig() calls.

> 
> > +static void cgroup_signal(struct cgroup *cgrp, int signr)
> > [...]
> > +	read_lock(&tasklist_lock);
> 
> (Thinking loudly.)
> I wonder if it's possible to get rid of this? A similar check that
> freezer does in cgroup_post_fork() but perhaps in cgroup_can_fork(). The
> fork/clone would apparently fail for the soon-to-die parent but there's
> already a similar error path returning ENODEV (heh, the macabre name
> cgroup_is_dead() is already occupied).
> 
> This way the cgroup killer wouldn't unnecessarily preempt tasklist_lock
> exclusive requestors system-wide.

Good point. I've been playing with revamping this whole thing. We should
definitely do a post_fork() check too, but that's fairly cheap if we
simply introduce a CGRP_KILL flag and set it prior to killing the
cgroup.

> 
> 
> > @@ -4846,6 +4916,11 @@ static struct cftype cgroup_base_files[] = {
> > +	{
> > +		.name = "cgroup.signal",
> > +		.flags = CFTYPE_NOT_ON_ROOT,
> > +		.write = cgroup_signal_write,
> 
> I think this shouldn't be visible in threaded cgroups (or return an
> error when attempting to kill those).

I've been wondering about this too but then decided to follow the
freezer in that regard. I think it should also be fine because a kill to
a thread will cause the whole thread-group to be taken down, which
arguably is the semantics we want anyway.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] cgroup: add cgroup.signal
  2021-04-27  9:36     ` Christian Brauner
@ 2021-04-27 14:29       ` Tejun Heo
       [not found]         ` <YIgfrP5J3aXHfM1i-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2021-04-27 14:29 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Michal Koutný,
	Christian Brauner, Roman Gushchin, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA

Hello,

On Tue, Apr 27, 2021 at 11:36:06AM +0200, Christian Brauner wrote:
> I thought about this optimization but (see below) given that it should
> work with threaded cgroups we can't only walk thread-group leaders,
> afaiu.

CSS_TASK_ITER_PROCS|CSS_TASK_ITER_THREADED iterates all thread group leaders
in the threaded domain and is used to implement cgroup.procs. This should
work, right?

> > > @@ -4846,6 +4916,11 @@ static struct cftype cgroup_base_files[] = {
> > > +	{
> > > +		.name = "cgroup.signal",
> > > +		.flags = CFTYPE_NOT_ON_ROOT,
> > > +		.write = cgroup_signal_write,
> > 
> > I think this shouldn't be visible in threaded cgroups (or return an
> > error when attempting to kill those).
> 
> I've been wondering about this too but then decided to follow freezer in
> that regard. I think it should also be fine because a kill to a thread
> will cause the whole thread-group to be taken down which arguably is the
> semantics we want anyway.

I'd align it with cgroup.procs. Killing is a process-level operation (unlike
arbitrary signal delivery which I think is another reason to confine this to
killing) and threaded cgroups should be invisible to process-level
operations.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] cgroup: add cgroup.signal
       [not found]         ` <YIgfrP5J3aXHfM1i-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
@ 2021-04-28 14:37           ` Christian Brauner
  2021-04-28 16:04             ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2021-04-28 14:37 UTC (permalink / raw)
  To: Tejun Heo, Michal Koutný, Roman Gushchin
  Cc: Christian Brauner, Shakeel Butt, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue, Apr 27, 2021 at 10:29:00AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Apr 27, 2021 at 11:36:06AM +0200, Christian Brauner wrote:
> > I thought about this optimization but (see below) given that it should
> > work with threaded cgroups we can't only walk thread-group leaders,
> > afaiu.
> 
> CSS_TASK_ITER_PROCS|CSS_TASK_ITER_THREADED iterates all thread group leaders
> in the threaded domain and is used to implement cgroup.procs. This should
> work, right?

Yes, I switched to that.

> 
> > > > @@ -4846,6 +4916,11 @@ static struct cftype cgroup_base_files[] = {
> > > > +	{
> > > > +		.name = "cgroup.signal",
> > > > +		.flags = CFTYPE_NOT_ON_ROOT,
> > > > +		.write = cgroup_signal_write,
> > > 
> > > I think this shouldn't be visible in threaded cgroups (or return an
> > > error when attempting to kill those).
> > 
> > I've been wondering about this too but then decided to follow freezer in
> > that regard. I think it should also be fine because a kill to a thread
> > will cause the whole thread-group to be taken down which arguably is the
> > semantics we want anyway.
> 
> I'd align it with cgroup.procs. Killing is a process-level operation (unlike
> arbitrary signal delivery which I think is another reason to confine this to
> killing) and threaded cgroups should be invisible to process-level
> operations.

Ok, so we make a write to cgroup.kill in threaded cgroups return
EOPNOTSUPP, which is equivalent to what a read on cgroup.procs would yield.

Tejun, Roman, Michal, I've been thinking a bit about the escaping
children during fork() when killing a cgroup and I would like to propose
we simply take the write-side of threadgroup_rwsem during cgroup.kill.

This would give us robust protection against escaping children during
fork() since fork()ing takes the read-side already in cgroup_can_fork().
And cgroup.kill should be sufficiently rare that this isn't an
additional burden.

Other ways seem more fragile, where the child can potentially escape
being killed. The most obvious case is when CLONE_INTO_CGROUP is not
used. If a cgroup.kill is initiated after cgroup_can_fork() and after
the parent's fatal_signal_pending() check, we will wait for the parent to
release the siglock in cgroup_kill(). Once it does we deliver the fatal
signal to the parent. But if we haven't passed cgroup_post_fork() fast
enough the child can be placed into that cgroup right after the kill.
That's not super bad per se since the child isn't technically visible in
the target cgroup anyway, but it would feel a bit cleaner if it died
right away. We could minimize the window by raising a flag, say CGRP_KILL:

static void __cgroup_kill(struct cgroup *cgrp)
{
       struct css_task_iter it;
       struct task_struct *task;

       spin_lock_irq(&css_set_lock);
       set_bit(CGRP_KILL, &cgrp->flags);
       spin_unlock_irq(&css_set_lock);

       css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
       while ((task = css_task_iter_next(&it)))
               send_sig(SIGKILL, task, 0); /* KILL IT */
       css_task_iter_end(&it);

       spin_lock_irq(&css_set_lock);
       clear_bit(CGRP_KILL, &cgrp->flags);
       spin_unlock_irq(&css_set_lock);
}

but that's also not clean enough for my taste. So I think the
threadgroup_rwsem might be quite clean and acceptable?

static void cgroup_kill(struct cgroup *cgrp)
{
       struct cgroup_subsys_state *css;
       struct cgroup *dsct;

       lockdep_assert_held(&cgroup_mutex);

       percpu_down_write(&cgroup_threadgroup_rwsem);
       cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
               __cgroup_kill(dsct);
       percpu_up_write(&cgroup_threadgroup_rwsem);
}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] cgroup: add cgroup.signal
  2021-04-28 14:37           ` Christian Brauner
@ 2021-04-28 16:04             ` Tejun Heo
       [not found]               ` <YImHjGGuIt0ebL0G-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2021-04-28 16:04 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Michal Koutný,
	Roman Gushchin, Christian Brauner, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA

Hello,

On Wed, Apr 28, 2021 at 04:37:46PM +0200, Christian Brauner wrote:
> > I'd align it with cgroup.procs. Killing is a process-level operation (unlike
> > arbitrary signal delivery which I think is another reason to confine this to
> > killing) and threaded cgroups should be invisible to process-level
> > operations.
> 
> Ok, so we make write to cgroup.kill in threaded cgroups EOPNOTSUPP which
> is equivalent what a read on cgroup.procs would yield.

Sounds good to me.

> Tejun, Roman, Michal, I've been thinking a bit about the escaping
> children during fork() when killing a cgroup and I would like to propose
> we simply take the write-side of threadgroup_rwsem during cgroup.kill.
> 
> This would give us robust protection against escaping children during
> fork() since fork()ing takes the read-side already in cgroup_can_fork().
> And cgroup.kill should be sufficiently rare that this isn't an
> additional burden.
> 
> Other ways seems more fragile where the child can potentially escape
> being killed. The most obvious case is when CLONE_INTO_CGROUP is not
> used. If a cgroup.kill is initiated after cgroup_can_fork() and after
> the parent's fatal_signal_pending() check we will wait for the parent to
> release the siglock in cgroup_kill(). Once it does we deliver the fatal
> signal to the parent. But if we haven't passed cgroup_post_fork() fast
> enough the child can be placed into that cgroup right after the kill.
> That's not super bad per se since the child isn't technically visible in
> the target cgroup anyway but it feels a bit cleaner if it would die
> right away. We could minimize the window by raising a flag say CGRP_KILL
> say:

So, yeah, I wouldn't worry about the case where migration is competing
against killing. The order of operations simply isn't defined and any
outcome is fine. As for the specific synchronization method to use, my gut
feeling is to go with whatever aligns better with the freezer
implementation, but I don't have strong feelings otherwise. Roman, what
do you think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] cgroup: add cgroup.signal
       [not found]               ` <YImHjGGuIt0ebL0G-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
@ 2021-04-28 18:12                 ` Roman Gushchin
  0 siblings, 0 replies; 15+ messages in thread
From: Roman Gushchin @ 2021-04-28 18:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Brauner, Michal Koutný,
	Christian Brauner, Shakeel Butt, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Apr 28, 2021 at 12:04:28PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 28, 2021 at 04:37:46PM +0200, Christian Brauner wrote:
> > > I'd align it with cgroup.procs. Killing is a process-level operation (unlike
> > > arbitrary signal delivery which I think is another reason to confine this to
> > > killing) and threaded cgroups should be invisible to process-level
> > > operations.
> > 
> > Ok, so we make write to cgroup.kill in threaded cgroups EOPNOTSUPP which
> > is equivalent what a read on cgroup.procs would yield.
> 
> Sounds good to me.
> 
> > Tejun, Roman, Michal, I've been thinking a bit about the escaping
> > children during fork() when killing a cgroup and I would like to propose
> > we simply take the write-side of threadgroup_rwsem during cgroup.kill.
> > 
> > This would give us robust protection against escaping children during
> > fork() since fork()ing takes the read-side already in cgroup_can_fork().
> > And cgroup.kill should be sufficiently rare that this isn't an
> > additional burden.
> > 
> > Other ways seems more fragile where the child can potentially escape
> > being killed. The most obvious case is when CLONE_INTO_CGROUP is not
> > used. If a cgroup.kill is initiated after cgroup_can_fork() and after
> > the parent's fatal_signal_pending() check we will wait for the parent to
> > release the siglock in cgroup_kill(). Once it does we deliver the fatal
> > signal to the parent. But if we haven't passed cgroup_post_fork() fast
> > enough the child can be placed into that cgroup right after the kill.
> > That's not super bad per se since the child isn't technically visible in
> > the target cgroup anyway but it feels a bit cleaner if it would die
> > right away. We could minimize the window by raising a flag say CGRP_KILL
> > say:
> 
> So, yeah, I wouldn't worry about the case where migration is competing
> against killing. The order of operations simply isn't defined and any
> outcome is fine. As for the specific synchronization method to use, my gut
> feeling is whatever which aligns better with the freezer implementation but
> I don't have strong feelings otherwise. Roman, what do you think?

I'd introduce a CGRP_KILL flag and check it in cgroup_post_fork(), similar
to how we check CGRP_FREEZE. That would solve the problem with a fork bomb.
Migrations and kills are synchronized via cgroup_mutex, so we guarantee
that all tasks (and their descendants) that were in the cgroup at the moment
when a user asked to kill the cgroup will die. Tasks moved into the cgroup
later shouldn't be killed.
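
Roughly something like the following (just a sketch mirroring the
CGRP_FREEZE handling; CGRP_KILL doesn't exist yet and the exact placement
in cgroup_post_fork() is still to be worked out):

/* Sketch: in cgroup_post_fork(), while css_set_lock is still held. */
struct cgroup *cgrp = cset->dfl_cgrp;
bool kill = test_bit(CGRP_KILL, &cgrp->flags);

...

/* After dropping css_set_lock, take the escaping child down right away. */
if (unlikely(kill))
       do_send_sig_info(SIGKILL, SEND_SIG_NOINFO, child, PIDTYPE_TGID);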

Thanks!

^ permalink raw reply	[flat|nested] 15+ messages in thread
