All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/5] cgroup: introduce cgroup.kill
@ 2021-04-29 12:01 Christian Brauner
       [not found] ` <20210429120113.2238065-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2021-04-29 12:01 UTC (permalink / raw)
  To: Tejun Heo, Roman Gushchin
  Cc: Shakeel Butt, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Christian Brauner

From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

Introduce the cgroup.kill file. It does what it says on the tin and
allows a caller to kill a cgroup by writing "1" into cgroup.kill.
The file is available in non-root cgroups.

Killing cgroups is a process directed operation, i.e. the whole
thread-group is affected. Consequently trying to write to cgroup.kill in
threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
aligns with cgroup.procs where reads in threaded-cgroups are rejected
with EOPNOTSUPP.

The cgroup.kill file is write-only since killing a cgroup is an event
not which makes it different from e.g. freezer where a cgroup
transitions between the two states.

As with all new cgroup features cgroup.kill is recursive by default.

Killing a cgroup is protected against concurrent migrations through the
cgroup mutex. To protect against forkbombs and to mitigate the effect of
racing forks a new CGRP_KILL css set lock protected flag is introduced
that is set prior to killing a cgroup and unset after the cgroup has
been killed. We can then check in cgroup_post_fork() where we hold the
css set lock already whether the cgroup is currently being killed. If so
we send the child a SIGKILL signal immediately taking it down as soon as
it returns to userspace. To make the killing of the child semantically
clean it is killed after all cgroup attachment operations have been
finalized.

There are various use-cases of this interface:
- Containers usually have a conservative layout where each container
  usually has a delegated cgroup. For such layouts there is a 1:1
  mapping between container and cgroup. If the container in addition
  uses a separate pid namespace then killing a container usually becomes
  a simple kill -9 <container-init-pid> from an ancestor pid namespace.
  However, there are quite a few scenarios where that isn't true. For
  example, there are containers that share the cgroup with other
  processes on purpose that are supposed to be bound to the lifetime of
  the container but are not in the same pidns of the container.
  Containers that are in a delegated cgroup but share the pid namespace
  with the host or other containers.
- Service managers such as systemd use cgroups to group and organize
  processes belonging to a service. They usually rely on a recursive
  algorithm now to kill a service. With cgroup.kill this becomes a
  simple write to cgroup.kill.
- Userspace OOM implementations can make good use of this feature to
  efficiently take down whole cgroups quickly.
- The kill program can gain a new
  kill --cgroup /sys/fs/cgroup/delegated
  flag to take down cgroups.

A few observations about the semantics:
- If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
  not specified we are not taking cgroup mutex meaning the cgroup can be
  killed while a process in that cgroup is forking.
  If the kill request happens right before cgroup_can_fork() and before
  the parent grabs its siglock the parent is guaranteed to see the
  pending SIGKILL. In addition we perform another check in
  cgroup_post_fork() whether the cgroup is being killed and is so take
  down the child (see above). This is robust enough and protects gainst
  forkbombs. If userspace really really wants to have stricter
  protection the simple solution would be to grab the write side of the
  cgroup threadgroup rwsem which will force all ongoing forks to
  complete before killing starts. We concluded that this is not
  necessary as the semantics for concurrent forking should simply align
  with freezer where a similar check as cgroup_post_fork() is performed.

  For all other cases CLONE_INTO_CGROUP is required. In this case we
  will grab the cgroup mutex so the cgroup can't be killed while we
  fork. Once we're done with the fork and have dropped cgroup mutex we
  are visible and will be found by any subsequent kill request.

- We obviously don't kill kthreads. This means a cgroup that has a
  kthread will not become empty after killing and consequently no
  unpopulated event will be generated. The assumption is that kthreads
  should be in the root cgroup only anyway so this is not an issue.
- We skip killing tasks that already have pending fatal signals.
- Freezer doesn't care about tasks in different pid namespaces, i.e. if
  you have two tasks in different pid namespaces the cgroup would still
  be frozen. The cgroup.kill mechanism consequently behaves the same
  way, i.e. we kill all processes and ignore in which pid namespace they
  exist.
- If the caller is located in a cgroup that is killed the caller will
  obviously be killed as well.

Cc: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
---
 include/linux/cgroup-defs.h |  3 ++
 kernel/cgroup/cgroup.c      | 90 +++++++++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 559ee05f86b2..43fef771009a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -71,6 +71,9 @@ enum {
 
 	/* Cgroup is frozen. */
 	CGRP_FROZEN,
+
+	/* Control group has to be killed. */
+	CGRP_KILL,
 };
 
 /* cgroup_root->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9153b20e5cc6..000f58b6ea35 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3654,6 +3654,80 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static void __cgroup_kill(struct cgroup *cgrp)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	spin_lock_irq(&css_set_lock);
+	set_bit(CGRP_KILL, &cgrp->flags);
+	spin_unlock_irq(&css_set_lock);
+
+	css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
+	while ((task = css_task_iter_next(&it))) {
+		/* Ignore kernel threads here. */
+		if (task->flags & PF_KTHREAD)
+			continue;
+
+		/* Skip tasks that are already dying. */
+		if (__fatal_signal_pending(task))
+			continue;
+
+		send_sig(SIGKILL, task, 0);
+	}
+	css_task_iter_end(&it);
+
+	spin_lock_irq(&css_set_lock);
+	clear_bit(CGRP_KILL, &cgrp->flags);
+	spin_unlock_irq(&css_set_lock);
+}
+
+static void cgroup_kill(struct cgroup *cgrp)
+{
+	struct cgroup_subsys_state *css;
+	struct cgroup *dsct;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
+		__cgroup_kill(dsct);
+}
+
+static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
+				 size_t nbytes, loff_t off)
+{
+	ssize_t ret = 0;
+	int kill;
+	struct cgroup *cgrp;
+
+	ret = kstrtoint(strstrip(buf), 0, &kill);
+	if (ret)
+		return ret;
+
+	if (kill != 1)
+		return -ERANGE;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENOENT;
+
+	/*
+	 * Killing is a process directed operation, i.e. the whole thread-group
+	 * is taken down so act like we do for cgroup.procs and only make this
+	 * writable in non-threaded cgroups.
+	 */
+	if (cgroup_is_threaded(cgrp))
+		ret = -EOPNOTSUPP;
+	else
+		cgroup_kill(cgrp);
+
+	cgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
 static int cgroup_file_open(struct kernfs_open_file *of)
 {
 	struct cftype *cft = of_cft(of);
@@ -4846,6 +4920,11 @@ static struct cftype cgroup_base_files[] = {
 		.seq_show = cgroup_freeze_show,
 		.write = cgroup_freeze_write,
 	},
+	{
+		.name = "cgroup.kill",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.write = cgroup_kill_write,
+	},
 	{
 		.name = "cpu.stat",
 		.seq_show = cpu_stat_show,
@@ -6077,6 +6156,7 @@ void cgroup_post_fork(struct task_struct *child,
 		      struct kernel_clone_args *kargs)
 	__releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
 {
+	bool kill = false;
 	struct cgroup_subsys *ss;
 	struct css_set *cset;
 	int i;
@@ -6088,6 +6168,12 @@ void cgroup_post_fork(struct task_struct *child,
 
 	/* init tasks are special, only link regular threads */
 	if (likely(child->pid)) {
+		struct cgroup *cgrp;
+
+		/* Check whether the target cgroup is being killed. */
+		cgrp = kargs->cgrp ?: cset->dfl_cgrp;
+		kill = test_bit(CGRP_KILL, &cgrp->flags);
+
 		WARN_ON_ONCE(!list_empty(&child->cg_list));
 		cset->nr_tasks++;
 		css_set_move_task(child, NULL, cset, false);
@@ -6135,6 +6221,10 @@ void cgroup_post_fork(struct task_struct *child,
 		put_css_set(rcset);
 	}
 
+	/* Take down child immediately. */
+	if (kill)
+		send_sig(SIGKILL, child, 0);
+
 	cgroup_css_set_put_fork(kargs);
 }
 

base-commit: 9f4ad9e425a1d3b6a34617b8ea226d56a119a717
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/5] docs/cgroup: add entry for cgroup.kill
       [not found] ` <20210429120113.2238065-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2021-04-29 12:01   ` Christian Brauner
       [not found]     ` <20210429120113.2238065-2-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2021-04-29 12:01   ` [PATCH 3/5] tests/cgroup: use cgroup.kill in cg_killall() Christian Brauner
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2021-04-29 12:01 UTC (permalink / raw)
  To: Tejun Heo, Roman Gushchin
  Cc: Shakeel Butt, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Christian Brauner

From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

Give a brief overview of the cgroup.kill functionality.

Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
---
 Documentation/admin-guide/cgroup-v2.rst | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 64c62b979f2f..c9f656a84590 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -949,6 +949,23 @@ All cgroup core files are prefixed with "cgroup."
 	it's possible to delete a frozen (and empty) cgroup, as well as
 	create new sub-cgroups.
 
+  cgroup.kill
+	A write-only single value file which exists in non-root cgroups.
+	The only allowed value is "1".
+
+	Writing "1" to the file causes the cgroup and all descendant cgroups to
+	be killed. This means that all processes located in the affected cgroup
+	tree will be killed via SIGKILL.
+
+	Killing a cgroup tree will deal with concurrent forks appropriately and
+	is protected against migrations. If callers require strict guarantees
+	they can issue the cgroup.kill request after a freezing the cgroup via
+	cgroup.freeze.
+
+	In a threaded cgroup, writing this file fails with EOPNOTSUPP as
+	killing cgroups is a process directed operation, i.e. it affects
+	the whole thread-group.
+
 Controllers
 ===========
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/5] tests/cgroup: use cgroup.kill in cg_killall()
       [not found] ` <20210429120113.2238065-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2021-04-29 12:01   ` [PATCH 2/5] docs/cgroup: add entry for cgroup.kill Christian Brauner
@ 2021-04-29 12:01   ` Christian Brauner
       [not found]     ` <20210429120113.2238065-3-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2021-04-29 12:01   ` [PATCH 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait() Christian Brauner
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2021-04-29 12:01 UTC (permalink / raw)
  To: Tejun Heo, Roman Gushchin
  Cc: Shakeel Butt, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Christian Brauner

From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

If cgroup.kill file is supported make use of it.

Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
---
 tools/testing/selftests/cgroup/cgroup_util.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index 027014662fb2..3e27cd9bda75 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -252,6 +252,10 @@ int cg_killall(const char *cgroup)
 	char buf[PAGE_SIZE];
 	char *ptr = buf;
 
+	/* If cgroup.kill exists use it. */
+        if (!cg_write(cgroup, "cgroup.kill", "1"))
+		return 0;
+
 	if (cg_read(cgroup, "cgroup.procs", buf, sizeof(buf)))
 		return -1;
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait()
       [not found] ` <20210429120113.2238065-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2021-04-29 12:01   ` [PATCH 2/5] docs/cgroup: add entry for cgroup.kill Christian Brauner
  2021-04-29 12:01   ` [PATCH 3/5] tests/cgroup: use cgroup.kill in cg_killall() Christian Brauner
@ 2021-04-29 12:01   ` Christian Brauner
       [not found]     ` <20210429120113.2238065-4-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2021-04-29 12:01   ` [PATCH 5/5] tests/cgroup: test cgroup.kill Christian Brauner
  2021-04-30  3:20   ` [PATCH 1/5] cgroup: introduce cgroup.kill Roman Gushchin
  4 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2021-04-29 12:01 UTC (permalink / raw)
  To: Tejun Heo, Roman Gushchin
  Cc: Shakeel Butt, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Christian Brauner

From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

as they will be used by the tests for cgroup killing.

Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
---
 tools/testing/selftests/cgroup/cgroup_util.c  | 47 +++++++++++++++
 tools/testing/selftests/cgroup/cgroup_util.h  |  2 +
 tools/testing/selftests/cgroup/test_freezer.c | 57 -------------------
 3 files changed, 49 insertions(+), 57 deletions(-)

diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index 3e27cd9bda75..e8b397215888 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -5,10 +5,12 @@
 #include <errno.h>
 #include <fcntl.h>
 #include <linux/limits.h>
+#include <poll.h>
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
+#include <sys/inotify.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <sys/wait.h>
@@ -580,3 +582,48 @@ int clone_into_cgroup_run_wait(const char *cgroup)
 	(void)clone_reap(pid, WEXITED);
 	return 0;
 }
+
+int cg_prepare_for_wait(const char *cgroup)
+{
+	int fd, ret = -1;
+
+	fd = inotify_init1(0);
+	if (fd == -1)
+		return fd;
+
+	ret = inotify_add_watch(fd, cg_control(cgroup, "cgroup.events"),
+				IN_MODIFY);
+	if (ret == -1) {
+		close(fd);
+		fd = -1;
+	}
+
+	return fd;
+}
+
+int cg_wait_for(int fd)
+{
+	int ret = -1;
+	struct pollfd fds = {
+		.fd = fd,
+		.events = POLLIN,
+	};
+
+	while (true) {
+		ret = poll(&fds, 1, 10000);
+
+		if (ret == -1) {
+			if (errno == EINTR)
+				continue;
+
+			break;
+		}
+
+		if (ret > 0 && fds.revents & POLLIN) {
+			ret = 0;
+			break;
+		}
+	}
+
+	return ret;
+}
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h
index 5a1305dd1f0b..82e59cdf16e7 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -54,3 +54,5 @@ extern pid_t clone_into_cgroup(int cgroup_fd);
 extern int clone_reap(pid_t pid, int options);
 extern int clone_into_cgroup_run_wait(const char *cgroup);
 extern int dirfd_open_opath(const char *dir);
+extern int cg_prepare_for_wait(const char *cgroup);
+extern int cg_wait_for(int fd);
diff --git a/tools/testing/selftests/cgroup/test_freezer.c b/tools/testing/selftests/cgroup/test_freezer.c
index 23d8fa4a3e4e..ff519029f6f4 100644
--- a/tools/testing/selftests/cgroup/test_freezer.c
+++ b/tools/testing/selftests/cgroup/test_freezer.c
@@ -7,9 +7,7 @@
 #include <unistd.h>
 #include <stdio.h>
 #include <errno.h>
-#include <poll.h>
 #include <stdlib.h>
-#include <sys/inotify.h>
 #include <string.h>
 #include <sys/wait.h>
 
@@ -54,61 +52,6 @@ static int cg_freeze_nowait(const char *cgroup, bool freeze)
 	return cg_write(cgroup, "cgroup.freeze", freeze ? "1" : "0");
 }
 
-/*
- * Prepare for waiting on cgroup.events file.
- */
-static int cg_prepare_for_wait(const char *cgroup)
-{
-	int fd, ret = -1;
-
-	fd = inotify_init1(0);
-	if (fd == -1) {
-		debug("Error: inotify_init1() failed\n");
-		return fd;
-	}
-
-	ret = inotify_add_watch(fd, cg_control(cgroup, "cgroup.events"),
-				IN_MODIFY);
-	if (ret == -1) {
-		debug("Error: inotify_add_watch() failed\n");
-		close(fd);
-		fd = -1;
-	}
-
-	return fd;
-}
-
-/*
- * Wait for an event. If there are no events for 10 seconds,
- * treat this an error.
- */
-static int cg_wait_for(int fd)
-{
-	int ret = -1;
-	struct pollfd fds = {
-		.fd = fd,
-		.events = POLLIN,
-	};
-
-	while (true) {
-		ret = poll(&fds, 1, 10000);
-
-		if (ret == -1) {
-			if (errno == EINTR)
-				continue;
-			debug("Error: poll() failed\n");
-			break;
-		}
-
-		if (ret > 0 && fds.revents & POLLIN) {
-			ret = 0;
-			break;
-		}
-	}
-
-	return ret;
-}
-
 /*
  * Attach a task to the given cgroup and wait for a cgroup frozen event.
  * All transient events (e.g. populated) are ignored.
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/5] tests/cgroup: test cgroup.kill
       [not found] ` <20210429120113.2238065-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (2 preceding siblings ...)
  2021-04-29 12:01   ` [PATCH 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait() Christian Brauner
@ 2021-04-29 12:01   ` Christian Brauner
       [not found]     ` <20210429120113.2238065-5-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2021-04-30  3:20   ` [PATCH 1/5] cgroup: introduce cgroup.kill Roman Gushchin
  4 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2021-04-29 12:01 UTC (permalink / raw)
  To: Tejun Heo, Roman Gushchin
  Cc: Shakeel Butt, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Christian Brauner

From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

Test that the new cgroup.kill feature works as intended.

Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
---
 tools/testing/selftests/cgroup/.gitignore  |   3 +-
 tools/testing/selftests/cgroup/Makefile    |   2 +
 tools/testing/selftests/cgroup/test_kill.c | 293 +++++++++++++++++++++
 3 files changed, 297 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/cgroup/test_kill.c

diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
index 84cfcabea838..be9643ef6285 100644
--- a/tools/testing/selftests/cgroup/.gitignore
+++ b/tools/testing/selftests/cgroup/.gitignore
@@ -2,4 +2,5 @@
 test_memcontrol
 test_core
 test_freezer
-test_kmem
\ No newline at end of file
+test_kmem
+test_kill
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index f027d933595b..87b7b0d90773 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -9,6 +9,7 @@ TEST_GEN_PROGS = test_memcontrol
 TEST_GEN_PROGS += test_kmem
 TEST_GEN_PROGS += test_core
 TEST_GEN_PROGS += test_freezer
+TEST_GEN_PROGS += test_kill
 
 include ../lib.mk
 
@@ -16,3 +17,4 @@ $(OUTPUT)/test_memcontrol: cgroup_util.c ../clone3/clone3_selftests.h
 $(OUTPUT)/test_kmem: cgroup_util.c ../clone3/clone3_selftests.h
 $(OUTPUT)/test_core: cgroup_util.c ../clone3/clone3_selftests.h
 $(OUTPUT)/test_freezer: cgroup_util.c ../clone3/clone3_selftests.h
+$(OUTPUT)/test_kill: cgroup_util.c ../clone3/clone3_selftests.h
diff --git a/tools/testing/selftests/cgroup/test_kill.c b/tools/testing/selftests/cgroup/test_kill.c
new file mode 100644
index 000000000000..c4e7b2e87395
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_kill.c
@@ -0,0 +1,293 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <stdbool.h>
+#include <linux/limits.h>
+#include <sys/ptrace.h>
+#include <sys/types.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <errno.h>
+#include <poll.h>
+#include <stdlib.h>
+#include <sys/inotify.h>
+#include <string.h>
+#include <sys/wait.h>
+
+#include "../kselftest.h"
+#include "cgroup_util.h"
+
+#define DEBUG
+#ifdef DEBUG
+#define debug(args...) fprintf(stderr, args)
+#else
+#define debug(args...)
+#endif
+
+/*
+ * Kill the given cgroup and wait for the inotify signal.
+ * If there are no events in 10 seconds, treat this as an error.
+ * Then check that the cgroup is in the desired state.
+ */
+static int cg_kill_wait(const char *cgroup)
+{
+	int fd, ret = -1;
+
+	fd = cg_prepare_for_wait(cgroup);
+	if (fd < 0)
+		return fd;
+
+	ret = cg_write(cgroup, "cgroup.kill", "1");
+	if (ret) {
+		debug("Error: cg_write() failed\n");
+		goto out;
+	}
+
+	ret = cg_wait_for(fd);
+	if (ret)
+		goto out;
+
+	ret = cg_read_strcmp(cgroup, "cgroup.events", "populated 0\n");
+out:
+	close(fd);
+	return ret;
+}
+
+/*
+ * A simple process running in a sleep loop until being
+ * re-parented.
+ */
+static int child_fn(const char *cgroup, void *arg)
+{
+	int ppid = getppid();
+
+	while (getppid() == ppid)
+		usleep(1000);
+
+	return getppid() == ppid;
+}
+
+static int test_cgkill_simple(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cgroup = NULL;
+	int i;
+
+	cgroup = cg_name(root, "cg_test_simple");
+	if (!cgroup)
+		goto cleanup;
+
+	if (cg_create(cgroup))
+		goto cleanup;
+
+	for (i = 0; i < 100; i++)
+		cg_run_nowait(cgroup, child_fn, NULL);
+
+	if (cg_wait_for_proc_count(cgroup, 100))
+		goto cleanup;
+
+        if (cg_write(cgroup, "cgroup.kill", "1"))
+		goto cleanup;
+
+	if (cg_read_strcmp(cgroup, "cgroup.events", "populated 1\n"))
+		goto cleanup;
+
+	if (cg_kill_wait(cgroup))
+		goto cleanup;
+
+	if (cg_read_strcmp(cgroup, "cgroup.events", "populated 0\n"))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (cgroup)
+		cg_destroy(cgroup);
+	free(cgroup);
+	return ret;
+}
+
+/*
+ * The test creates the following hierarchy:
+ *       A
+ *    / / \ \
+ *   B  E  I K
+ *  /\  |
+ * C  D F
+ *      |
+ *      G
+ *      |
+ *      H
+ *
+ * with a process in C, H and 3 processes in K.
+ * Then it tries to kill the whole tree.
+ */
+static int test_cgkill_tree(const char *root)
+{
+	char *cgroup[10] = {0};
+	int ret = KSFT_FAIL;
+	int i;
+
+	cgroup[0] = cg_name(root, "cg_test_tree_A");
+	if (!cgroup[0])
+		goto cleanup;
+
+	cgroup[1] = cg_name(cgroup[0], "B");
+	if (!cgroup[1])
+		goto cleanup;
+
+	cgroup[2] = cg_name(cgroup[1], "C");
+	if (!cgroup[2])
+		goto cleanup;
+
+	cgroup[3] = cg_name(cgroup[1], "D");
+	if (!cgroup[3])
+		goto cleanup;
+
+	cgroup[4] = cg_name(cgroup[0], "E");
+	if (!cgroup[4])
+		goto cleanup;
+
+	cgroup[5] = cg_name(cgroup[4], "F");
+	if (!cgroup[5])
+		goto cleanup;
+
+	cgroup[6] = cg_name(cgroup[5], "G");
+	if (!cgroup[6])
+		goto cleanup;
+
+	cgroup[7] = cg_name(cgroup[6], "H");
+	if (!cgroup[7])
+		goto cleanup;
+
+	cgroup[8] = cg_name(cgroup[0], "I");
+	if (!cgroup[8])
+		goto cleanup;
+
+	cgroup[9] = cg_name(cgroup[0], "K");
+	if (!cgroup[9])
+		goto cleanup;
+
+	for (i = 0; i < 10; i++)
+		if (cg_create(cgroup[i]))
+			goto cleanup;
+
+	cg_run_nowait(cgroup[2], child_fn, NULL);
+	cg_run_nowait(cgroup[7], child_fn, NULL);
+	cg_run_nowait(cgroup[9], child_fn, NULL);
+	cg_run_nowait(cgroup[9], child_fn, NULL);
+	cg_run_nowait(cgroup[9], child_fn, NULL);
+
+	/*
+	 * Wait until all child processes will enter
+	 * corresponding cgroups.
+	 */
+
+	if (cg_wait_for_proc_count(cgroup[2], 1) ||
+	    cg_wait_for_proc_count(cgroup[7], 1) ||
+	    cg_wait_for_proc_count(cgroup[9], 3))
+		goto cleanup;
+
+	/*
+	 * Kill A and check that we get an empty notification.
+	 */
+	if (cg_kill_wait(cgroup[0]))
+		goto cleanup;
+
+	if (cg_read_strcmp(cgroup[0], "cgroup.events", "populated 0\n"))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	for (i = 9; i >= 0 && cgroup[i]; i--) {
+		cg_destroy(cgroup[i]);
+		free(cgroup[i]);
+	}
+
+	return ret;
+}
+
+static int forkbomb_fn(const char *cgroup, void *arg)
+{
+	int ppid;
+
+	fork();
+	fork();
+
+	ppid = getppid();
+
+	while (getppid() == ppid)
+		usleep(1000);
+
+	return getppid() == ppid;
+}
+
+/*
+ * The test runs a fork bomb in a cgroup and tries to kill it.
+ */
+static int test_cgkill_forkbomb(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cgroup = NULL;
+
+	cgroup = cg_name(root, "cg_forkbomb_test");
+	if (!cgroup)
+		goto cleanup;
+
+	if (cg_create(cgroup))
+		goto cleanup;
+
+	cg_run_nowait(cgroup, forkbomb_fn, NULL);
+
+	usleep(100000);
+
+	if (cg_kill_wait(cgroup))
+		goto cleanup;
+
+	if (cg_wait_for_proc_count(cgroup, 0))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (cgroup)
+		cg_destroy(cgroup);
+	free(cgroup);
+	return ret;
+}
+
+#define T(x) { x, #x }
+struct cgkill_test {
+	int (*fn)(const char *root);
+	const char *name;
+} tests[] = {
+	T(test_cgkill_simple),
+	T(test_cgkill_tree),
+	T(test_cgkill_forkbomb),
+};
+#undef T
+
+int main(int argc, char *argv[])
+{
+	char root[PATH_MAX];
+	int i, ret = EXIT_SUCCESS;
+
+	if (cg_find_unified_root(root, sizeof(root)))
+		ksft_exit_skip("cgroup v2 isn't mounted\n");
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		switch (tests[i].fn(root)) {
+		case KSFT_PASS:
+			ksft_test_result_pass("%s\n", tests[i].name);
+			break;
+		case KSFT_SKIP:
+			ksft_test_result_skip("%s\n", tests[i].name);
+			break;
+		default:
+			ret = EXIT_FAILURE;
+			ksft_test_result_fail("%s\n", tests[i].name);
+			break;
+		}
+	}
+
+	return ret;
+}
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/5] cgroup: introduce cgroup.kill
       [not found] ` <20210429120113.2238065-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (3 preceding siblings ...)
  2021-04-29 12:01   ` [PATCH 5/5] tests/cgroup: test cgroup.kill Christian Brauner
@ 2021-04-30  3:20   ` Roman Gushchin
       [not found]     ` <YIt3a5R5tYgIpoVQ-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
  4 siblings, 1 reply; 17+ messages in thread
From: Roman Gushchin @ 2021-04-30  3:20 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Shakeel Butt, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Christian Brauner

On Thu, Apr 29, 2021 at 02:01:09PM +0200, Christian Brauner wrote:
> From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> 
> Introduce the cgroup.kill file. It does what it says on the tin and
> allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> The file is available in non-root cgroups.
> 
> Killing cgroups is a process directed operation, i.e. the whole
> thread-group is affected. Consequently trying to write to cgroup.kill in
> threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> aligns with cgroup.procs where reads in threaded-cgroups are rejected
> with EOPNOTSUPP.
> 
> The cgroup.kill file is write-only since killing a cgroup is an event
> not which makes it different from e.g. freezer where a cgroup
> transitions between the two states.
> 
> As with all new cgroup features cgroup.kill is recursive by default.
> 
> Killing a cgroup is protected against concurrent migrations through the
> cgroup mutex. To protect against forkbombs and to mitigate the effect of
> racing forks a new CGRP_KILL css set lock protected flag is introduced
> that is set prior to killing a cgroup and unset after the cgroup has
> been killed. We can then check in cgroup_post_fork() where we hold the
> css set lock already whether the cgroup is currently being killed. If so
> we send the child a SIGKILL signal immediately taking it down as soon as
> it returns to userspace. To make the killing of the child semantically
> clean it is killed after all cgroup attachment operations have been
> finalized.
> 
> There are various use-cases of this interface:
> - Containers usually have a conservative layout where each container
>   usually has a delegated cgroup. For such layouts there is a 1:1
>   mapping between container and cgroup. If the container in addition
>   uses a separate pid namespace then killing a container usually becomes
>   a simple kill -9 <container-init-pid> from an ancestor pid namespace.
>   However, there are quite a few scenarios where that isn't true. For
>   example, there are containers that share the cgroup with other
>   processes on purpose that are supposed to be bound to the lifetime of
>   the container but are not in the same pidns of the container.
>   Containers that are in a delegated cgroup but share the pid namespace
>   with the host or other containers.
> - Service managers such as systemd use cgroups to group and organize
>   processes belonging to a service. They usually rely on a recursive
>   algorithm now to kill a service. With cgroup.kill this becomes a
>   simple write to cgroup.kill.
> - Userspace OOM implementations can make good use of this feature to
>   efficiently take down whole cgroups quickly.
> - The kill program can gain a new
>   kill --cgroup /sys/fs/cgroup/delegated
>   flag to take down cgroups.
> 
> A few observations about the semantics:
> - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
>   not specified we are not taking cgroup mutex meaning the cgroup can be
>   killed while a process in that cgroup is forking.
>   If the kill request happens right before cgroup_can_fork() and before
>   the parent grabs its siglock the parent is guaranteed to see the
>   pending SIGKILL. In addition we perform another check in
>   cgroup_post_fork() whether the cgroup is being killed and is so take
>   down the child (see above). This is robust enough and protects gainst
>   forkbombs. If userspace really really wants to have stricter
>   protection the simple solution would be to grab the write side of the
>   cgroup threadgroup rwsem which will force all ongoing forks to
>   complete before killing starts. We concluded that this is not
>   necessary as the semantics for concurrent forking should simply align
>   with freezer where a similar check as cgroup_post_fork() is performed.
> 
>   For all other cases CLONE_INTO_CGROUP is required. In this case we
>   will grab the cgroup mutex so the cgroup can't be killed while we
>   fork. Once we're done with the fork and have dropped cgroup mutex we
>   are visible and will be found by any subsequent kill request.
> 
> - We obviously don't kill kthreads. This means a cgroup that has a
>   kthread will not become empty after killing and consequently no
>   unpopulated event will be generated. The assumption is that kthreads
>   should be in the root cgroup only anyway so this is not an issue.
> - We skip killing tasks that already have pending fatal signals.
> - Freezer doesn't care about tasks in different pid namespaces, i.e. if
>   you have two tasks in different pid namespaces the cgroup would still
>   be frozen. The cgroup.kill mechanism consequently behaves the same
>   way, i.e. we kill all processes and ignore in which pid namespace they
>   exist.
> - If the caller is located in a cgroup that is killed the caller will
>   obviously be killed as well.
> 
> Cc: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

Hello Christian!

This version looks very good to me, thanks for working on it!
One small not below.

> ---
>  include/linux/cgroup-defs.h |  3 ++
>  kernel/cgroup/cgroup.c      | 90 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 93 insertions(+)
> 
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 559ee05f86b2..43fef771009a 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -71,6 +71,9 @@ enum {
>  
>  	/* Cgroup is frozen. */
>  	CGRP_FROZEN,
> +
> +	/* Control group has to be killed. */
> +	CGRP_KILL,
>  };
>  
>  /* cgroup_root->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 9153b20e5cc6..000f58b6ea35 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3654,6 +3654,80 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>  
> +static void __cgroup_kill(struct cgroup *cgrp)
> +{
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +
> +	lockdep_assert_held(&cgroup_mutex);
> +
> +	spin_lock_irq(&css_set_lock);
> +	set_bit(CGRP_KILL, &cgrp->flags);
> +	spin_unlock_irq(&css_set_lock);
> +
> +	css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
> +	while ((task = css_task_iter_next(&it))) {
> +		/* Ignore kernel threads here. */
> +		if (task->flags & PF_KTHREAD)
> +			continue;
> +
> +		/* Skip tasks that are already dying. */
> +		if (__fatal_signal_pending(task))
> +			continue;
> +
> +		send_sig(SIGKILL, task, 0);
> +	}
> +	css_task_iter_end(&it);
> +
> +	spin_lock_irq(&css_set_lock);
> +	clear_bit(CGRP_KILL, &cgrp->flags);
> +	spin_unlock_irq(&css_set_lock);
> +}
> +
> +static void cgroup_kill(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys_state *css;
> +	struct cgroup *dsct;
> +
> +	lockdep_assert_held(&cgroup_mutex);
> +
> +	cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
> +		__cgroup_kill(dsct);
> +}
> +
> +static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
> +				 size_t nbytes, loff_t off)
> +{
> +	ssize_t ret = 0;
> +	int kill;
> +	struct cgroup *cgrp;
> +
> +	ret = kstrtoint(strstrip(buf), 0, &kill);
> +	if (ret)
> +		return ret;
> +
> +	if (kill != 1)
> +		return -ERANGE;
> +
> +	cgrp = cgroup_kn_lock_live(of->kn, false);
> +	if (!cgrp)
> +		return -ENOENT;
> +
> +	/*
> +	 * Killing is a process directed operation, i.e. the whole thread-group
> +	 * is taken down so act like we do for cgroup.procs and only make this
> +	 * writable in non-threaded cgroups.
> +	 */
> +	if (cgroup_is_threaded(cgrp))
> +		ret = -EOPNOTSUPP;
> +	else
> +		cgroup_kill(cgrp);
> +
> +	cgroup_kn_unlock(of->kn);
> +
> +	return ret ?: nbytes;
> +}
> +
>  static int cgroup_file_open(struct kernfs_open_file *of)
>  {
>  	struct cftype *cft = of_cft(of);
> @@ -4846,6 +4920,11 @@ static struct cftype cgroup_base_files[] = {
>  		.seq_show = cgroup_freeze_show,
>  		.write = cgroup_freeze_write,
>  	},
> +	{
> +		.name = "cgroup.kill",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.write = cgroup_kill_write,
> +	},
>  	{
>  		.name = "cpu.stat",
>  		.seq_show = cpu_stat_show,
> @@ -6077,6 +6156,7 @@ void cgroup_post_fork(struct task_struct *child,
>  		      struct kernel_clone_args *kargs)
>  	__releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
>  {
> +	bool kill = false;
>  	struct cgroup_subsys *ss;
>  	struct css_set *cset;
>  	int i;
> @@ -6088,6 +6168,12 @@ void cgroup_post_fork(struct task_struct *child,
>  
>  	/* init tasks are special, only link regular threads */
>  	if (likely(child->pid)) {
> +		struct cgroup *cgrp;
> +
> +		/* Check whether the target cgroup is being killed. */
> +		cgrp = kargs->cgrp ?: cset->dfl_cgrp;
> +		kill = test_bit(CGRP_KILL, &cgrp->flags);
> +
>  		WARN_ON_ONCE(!list_empty(&child->cg_list));
>  		cset->nr_tasks++;
>  		css_set_move_task(child, NULL, cset, false);
> @@ -6135,6 +6221,10 @@ void cgroup_post_fork(struct task_struct *child,
>  		put_css_set(rcset);
>  	}
>  
> +	/* Take down child immediately. */
> +	if (kill)
> +		send_sig(SIGKILL, child, 0);
> +

This part can be hot, so I'd integrate it better with the freezer check:
instead of getting child task's cgroup's flags and checking individual
bits twice we can do it once.

Other than that the patch looks good to me.

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] docs/cgroup: add entry for cgroup.kill
       [not found]     ` <20210429120113.2238065-2-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2021-04-30  3:22       ` Roman Gushchin
       [not found]         ` <YIt32/aQJfkw53ic-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Roman Gushchin @ 2021-04-30  3:22 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Shakeel Butt, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Christian Brauner

On Thu, Apr 29, 2021 at 02:01:10PM +0200, Christian Brauner wrote:
> From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> 
> Give a brief overview of the cgroup.kill functionality.
> 
> Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 64c62b979f2f..c9f656a84590 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -949,6 +949,23 @@ All cgroup core files are prefixed with "cgroup."
>  	it's possible to delete a frozen (and empty) cgroup, as well as
>  	create new sub-cgroups.
>  
> +  cgroup.kill
> +	A write-only single value file which exists in non-root cgroups.
> +	The only allowed value is "1".
> +
> +	Writing "1" to the file causes the cgroup and all descendant cgroups to
> +	be killed. This means that all processes located in the affected cgroup
> +	tree will be killed via SIGKILL.
> +
> +	Killing a cgroup tree will deal with concurrent forks appropriately and
> +	is protected against migrations. If callers require strict guarantees
> +	they can issue the cgroup.kill request after a freezing the cgroup via
> +	cgroup.freeze.

Hm, is it necessarily? What additional guarantees adds using the freezer?

> +
> +	In a threaded cgroup, writing this file fails with EOPNOTSUPP as
> +	killing cgroups is a process directed operation, i.e. it affects
> +	the whole thread-group.
> +
>  Controllers
>  ===========
>  
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/5] tests/cgroup: use cgroup.kill in cg_killall()
       [not found]     ` <20210429120113.2238065-3-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2021-04-30  3:23       ` Roman Gushchin
       [not found]         ` <YIt4NhikbQKc0Vku-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Roman Gushchin @ 2021-04-30  3:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Shakeel Butt, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Christian Brauner

On Thu, Apr 29, 2021 at 02:01:11PM +0200, Christian Brauner wrote:
> From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> 
> If cgroup.kill file is supported make use of it.
> 
> Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> ---
>  tools/testing/selftests/cgroup/cgroup_util.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
> index 027014662fb2..3e27cd9bda75 100644
> --- a/tools/testing/selftests/cgroup/cgroup_util.c
> +++ b/tools/testing/selftests/cgroup/cgroup_util.c
> @@ -252,6 +252,10 @@ int cg_killall(const char *cgroup)
>  	char buf[PAGE_SIZE];
>  	char *ptr = buf;
>  
> +	/* If cgroup.kill exists use it. */
> +        if (!cg_write(cgroup, "cgroup.kill", "1"))
   ^^^^^^^^
   spaces?

> +		return 0;
> +
>  	if (cg_read(cgroup, "cgroup.procs", buf, sizeof(buf)))
>  		return -1;
>  
> -- 
> 2.27.0
> 

Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>

with a fixed formatting.

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait()
       [not found]     ` <20210429120113.2238065-4-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2021-04-30  3:45       ` Roman Gushchin
  0 siblings, 0 replies; 17+ messages in thread
From: Roman Gushchin @ 2021-04-30  3:45 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Shakeel Butt, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Christian Brauner

On Thu, Apr 29, 2021 at 02:01:12PM +0200, Christian Brauner wrote:
> From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> 
> as they will be used by the tests for cgroup killing.
> 
> Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] tests/cgroup: test cgroup.kill
       [not found]     ` <20210429120113.2238065-5-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2021-04-30  4:05       ` Roman Gushchin
       [not found]         ` <YIuCDz3h9/ZQPCMV-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Roman Gushchin @ 2021-04-30  4:05 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Shakeel Butt, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Christian Brauner

On Thu, Apr 29, 2021 at 02:01:13PM +0200, Christian Brauner wrote:
> From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> 
> Test that the new cgroup.kill feature works as intended.
> 
> Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> ---
>  tools/testing/selftests/cgroup/.gitignore  |   3 +-
>  tools/testing/selftests/cgroup/Makefile    |   2 +
>  tools/testing/selftests/cgroup/test_kill.c | 293 +++++++++++++++++++++
>  3 files changed, 297 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/cgroup/test_kill.c
> 
> diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
> index 84cfcabea838..be9643ef6285 100644
> --- a/tools/testing/selftests/cgroup/.gitignore
> +++ b/tools/testing/selftests/cgroup/.gitignore
> @@ -2,4 +2,5 @@
>  test_memcontrol
>  test_core
>  test_freezer
> -test_kmem
> \ No newline at end of file
> +test_kmem
> +test_kill
> diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
> index f027d933595b..87b7b0d90773 100644
> --- a/tools/testing/selftests/cgroup/Makefile
> +++ b/tools/testing/selftests/cgroup/Makefile
> @@ -9,6 +9,7 @@ TEST_GEN_PROGS = test_memcontrol
>  TEST_GEN_PROGS += test_kmem
>  TEST_GEN_PROGS += test_core
>  TEST_GEN_PROGS += test_freezer
> +TEST_GEN_PROGS += test_kill
>  
>  include ../lib.mk
>  
> @@ -16,3 +17,4 @@ $(OUTPUT)/test_memcontrol: cgroup_util.c ../clone3/clone3_selftests.h
>  $(OUTPUT)/test_kmem: cgroup_util.c ../clone3/clone3_selftests.h
>  $(OUTPUT)/test_core: cgroup_util.c ../clone3/clone3_selftests.h
>  $(OUTPUT)/test_freezer: cgroup_util.c ../clone3/clone3_selftests.h
> +$(OUTPUT)/test_kill: cgroup_util.c ../clone3/clone3_selftests.h
> diff --git a/tools/testing/selftests/cgroup/test_kill.c b/tools/testing/selftests/cgroup/test_kill.c
> new file mode 100644
> index 000000000000..c4e7b2e87395
> --- /dev/null
> +++ b/tools/testing/selftests/cgroup/test_kill.c
> @@ -0,0 +1,293 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <stdbool.h>
> +#include <linux/limits.h>
> +#include <sys/ptrace.h>
> +#include <sys/types.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +#include <stdio.h>
> +#include <errno.h>
> +#include <poll.h>
> +#include <stdlib.h>
> +#include <sys/inotify.h>
> +#include <string.h>
> +#include <sys/wait.h>
> +
> +#include "../kselftest.h"
> +#include "cgroup_util.h"
> +
> +#define DEBUG
> +#ifdef DEBUG
> +#define debug(args...) fprintf(stderr, args)
> +#else
> +#define debug(args...)
> +#endif
> +
> +/*
> + * Kill the given cgroup and wait for the inotify signal.
> + * If there are no events in 10 seconds, treat this as an error.
> + * Then check that the cgroup is in the desired state.
> + */
> +static int cg_kill_wait(const char *cgroup)
> +{
> +	int fd, ret = -1;
> +
> +	fd = cg_prepare_for_wait(cgroup);
> +	if (fd < 0)
> +		return fd;
> +
> +	ret = cg_write(cgroup, "cgroup.kill", "1");
> +	if (ret) {
> +		debug("Error: cg_write() failed\n");
> +		goto out;
> +	}
> +
> +	ret = cg_wait_for(fd);
> +	if (ret)
> +		goto out;
> +
> +	ret = cg_read_strcmp(cgroup, "cgroup.events", "populated 0\n");
> +out:
> +	close(fd);
> +	return ret;
> +}
> +
> +/*
> + * A simple process running in a sleep loop until being
> + * re-parented.
> + */
> +static int child_fn(const char *cgroup, void *arg)
> +{
> +	int ppid = getppid();
> +
> +	while (getppid() == ppid)
> +		usleep(1000);
> +
> +	return getppid() == ppid;
> +}
> +
> +static int test_cgkill_simple(const char *root)
> +{
> +	int ret = KSFT_FAIL;
> +	char *cgroup = NULL;
> +	int i;
> +
> +	cgroup = cg_name(root, "cg_test_simple");
> +	if (!cgroup)
> +		goto cleanup;
> +
> +	if (cg_create(cgroup))
> +		goto cleanup;
> +
> +	for (i = 0; i < 100; i++)
> +		cg_run_nowait(cgroup, child_fn, NULL);
> +
> +	if (cg_wait_for_proc_count(cgroup, 100))
> +		goto cleanup;
> +
> +        if (cg_write(cgroup, "cgroup.kill", "1"))
> +		goto cleanup;
> +
> +	if (cg_read_strcmp(cgroup, "cgroup.events", "populated 1\n"))
> +		goto cleanup;
> +
> +	if (cg_kill_wait(cgroup))
> +		goto cleanup;
> +
> +	if (cg_read_strcmp(cgroup, "cgroup.events", "populated 0\n"))
> +		goto cleanup;
> +
> +	ret = KSFT_PASS;
> +
> +cleanup:
> +	if (cgroup)
> +		cg_destroy(cgroup);
> +	free(cgroup);
> +	return ret;
> +}
> +
> +/*
> + * The test creates the following hierarchy:
> + *       A
> + *    / / \ \
> + *   B  E  I K
> + *  /\  |
> + * C  D F
> + *      |
> + *      G
> + *      |
> + *      H
> + *
> + * with a process in C, H and 3 processes in K.
> + * Then it tries to kill the whole tree.
> + */
> +static int test_cgkill_tree(const char *root)
> +{
> +	char *cgroup[10] = {0};
> +	int ret = KSFT_FAIL;
> +	int i;
> +
> +	cgroup[0] = cg_name(root, "cg_test_tree_A");
> +	if (!cgroup[0])
> +		goto cleanup;
> +
> +	cgroup[1] = cg_name(cgroup[0], "B");
> +	if (!cgroup[1])
> +		goto cleanup;
> +
> +	cgroup[2] = cg_name(cgroup[1], "C");
> +	if (!cgroup[2])
> +		goto cleanup;
> +
> +	cgroup[3] = cg_name(cgroup[1], "D");
> +	if (!cgroup[3])
> +		goto cleanup;
> +
> +	cgroup[4] = cg_name(cgroup[0], "E");
> +	if (!cgroup[4])
> +		goto cleanup;
> +
> +	cgroup[5] = cg_name(cgroup[4], "F");
> +	if (!cgroup[5])
> +		goto cleanup;
> +
> +	cgroup[6] = cg_name(cgroup[5], "G");
> +	if (!cgroup[6])
> +		goto cleanup;
> +
> +	cgroup[7] = cg_name(cgroup[6], "H");
> +	if (!cgroup[7])
> +		goto cleanup;
> +
> +	cgroup[8] = cg_name(cgroup[0], "I");
> +	if (!cgroup[8])
> +		goto cleanup;
> +
> +	cgroup[9] = cg_name(cgroup[0], "K");
> +	if (!cgroup[9])
> +		goto cleanup;
> +
> +	for (i = 0; i < 10; i++)
> +		if (cg_create(cgroup[i]))
> +			goto cleanup;
> +
> +	cg_run_nowait(cgroup[2], child_fn, NULL);
> +	cg_run_nowait(cgroup[7], child_fn, NULL);
> +	cg_run_nowait(cgroup[9], child_fn, NULL);
> +	cg_run_nowait(cgroup[9], child_fn, NULL);
> +	cg_run_nowait(cgroup[9], child_fn, NULL);
> +
> +	/*
> +	 * Wait until all child processes will enter
> +	 * corresponding cgroups.
> +	 */
> +
> +	if (cg_wait_for_proc_count(cgroup[2], 1) ||
> +	    cg_wait_for_proc_count(cgroup[7], 1) ||
> +	    cg_wait_for_proc_count(cgroup[9], 3))
> +		goto cleanup;
> +
> +	/*
> +	 * Kill A and check that we get an empty notification.
> +	 */
> +	if (cg_kill_wait(cgroup[0]))
> +		goto cleanup;
> +
> +	if (cg_read_strcmp(cgroup[0], "cgroup.events", "populated 0\n"))
> +		goto cleanup;
> +
> +	ret = KSFT_PASS;
> +
> +cleanup:
> +	for (i = 9; i >= 0 && cgroup[i]; i--) {
> +		cg_destroy(cgroup[i]);
> +		free(cgroup[i]);
> +	}
> +
> +	return ret;
> +}
> +
> +static int forkbomb_fn(const char *cgroup, void *arg)
> +{
> +	int ppid;
> +
> +	fork();
> +	fork();
> +
> +	ppid = getppid();
> +
> +	while (getppid() == ppid)
> +		usleep(1000);
> +
> +	return getppid() == ppid;
> +}
> +
> +/*
> + * The test runs a fork bomb in a cgroup and tries to kill it.
> + */
> +static int test_cgkill_forkbomb(const char *root)
> +{
> +	int ret = KSFT_FAIL;
> +	char *cgroup = NULL;
> +
> +	cgroup = cg_name(root, "cg_forkbomb_test");
> +	if (!cgroup)
> +		goto cleanup;
> +
> +	if (cg_create(cgroup))
> +		goto cleanup;
> +
> +	cg_run_nowait(cgroup, forkbomb_fn, NULL);
> +
> +	usleep(100000);
> +
> +	if (cg_kill_wait(cgroup))
> +		goto cleanup;
> +
> +	if (cg_wait_for_proc_count(cgroup, 0))
> +		goto cleanup;
> +
> +	ret = KSFT_PASS;
> +
> +cleanup:
> +	if (cgroup)
> +		cg_destroy(cgroup);
> +	free(cgroup);
> +	return ret;
> +}
> +
> +#define T(x) { x, #x }
> +struct cgkill_test {
> +	int (*fn)(const char *root);
> +	const char *name;
> +} tests[] = {
> +	T(test_cgkill_simple),
> +	T(test_cgkill_tree),
> +	T(test_cgkill_forkbomb),

I'm a little bit worried about the forkbomb test: how reliable it is and won't
it kill the system, but Idk how make it better. Maybe instead of a true fork
bomb we need to spawn children by a single process? I'm not exactly sure.

Other than that the patch looks good to me.

And thank you for writing tests!

Roman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/5] tests/cgroup: use cgroup.kill in cg_killall()
       [not found]         ` <YIt4NhikbQKc0Vku-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
@ 2021-04-30  6:42           ` Christian Brauner
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2021-04-30  6:42 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christian Brauner, Tejun Heo, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, Apr 29, 2021 at 08:23:34PM -0700, Roman Gushchin wrote:
> On Thu, Apr 29, 2021 at 02:01:11PM +0200, Christian Brauner wrote:
> > From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > 
> > If cgroup.kill file is supported make use of it.
> > 
> > Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> > Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  tools/testing/selftests/cgroup/cgroup_util.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
> > index 027014662fb2..3e27cd9bda75 100644
> > --- a/tools/testing/selftests/cgroup/cgroup_util.c
> > +++ b/tools/testing/selftests/cgroup/cgroup_util.c
> > @@ -252,6 +252,10 @@ int cg_killall(const char *cgroup)
> >  	char buf[PAGE_SIZE];
> >  	char *ptr = buf;
> >  
> > +	/* If cgroup.kill exists use it. */
> > +        if (!cg_write(cgroup, "cgroup.kill", "1"))
>    ^^^^^^^^
>    spaces?

Huh, sorry about that weird how that got in there...
Will fix.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] tests/cgroup: test cgroup.kill
       [not found]         ` <YIuCDz3h9/ZQPCMV-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
@ 2021-04-30  6:47           ` Christian Brauner
  2021-05-01  2:49             ` Roman Gushchin
  0 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2021-04-30  6:47 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christian Brauner, Tejun Heo, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, Apr 29, 2021 at 09:05:35PM -0700, Roman Gushchin wrote:
> On Thu, Apr 29, 2021 at 02:01:13PM +0200, Christian Brauner wrote:
> > From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > 
> > Test that the new cgroup.kill feature works as intended.
> > 
> > Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> > Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  tools/testing/selftests/cgroup/.gitignore  |   3 +-
> >  tools/testing/selftests/cgroup/Makefile    |   2 +
> >  tools/testing/selftests/cgroup/test_kill.c | 293 +++++++++++++++++++++
> >  3 files changed, 297 insertions(+), 1 deletion(-)
> >  create mode 100644 tools/testing/selftests/cgroup/test_kill.c
> > 
> > diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
> > index 84cfcabea838..be9643ef6285 100644
> > --- a/tools/testing/selftests/cgroup/.gitignore
> > +++ b/tools/testing/selftests/cgroup/.gitignore
> > @@ -2,4 +2,5 @@
> >  test_memcontrol
> >  test_core
> >  test_freezer
> > -test_kmem
> > \ No newline at end of file
> > +test_kmem
> > +test_kill
> > diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
> > index f027d933595b..87b7b0d90773 100644
> > --- a/tools/testing/selftests/cgroup/Makefile
> > +++ b/tools/testing/selftests/cgroup/Makefile
> > @@ -9,6 +9,7 @@ TEST_GEN_PROGS = test_memcontrol
> >  TEST_GEN_PROGS += test_kmem
> >  TEST_GEN_PROGS += test_core
> >  TEST_GEN_PROGS += test_freezer
> > +TEST_GEN_PROGS += test_kill
> >  
> >  include ../lib.mk
> >  
> > @@ -16,3 +17,4 @@ $(OUTPUT)/test_memcontrol: cgroup_util.c ../clone3/clone3_selftests.h
> >  $(OUTPUT)/test_kmem: cgroup_util.c ../clone3/clone3_selftests.h
> >  $(OUTPUT)/test_core: cgroup_util.c ../clone3/clone3_selftests.h
> >  $(OUTPUT)/test_freezer: cgroup_util.c ../clone3/clone3_selftests.h
> > +$(OUTPUT)/test_kill: cgroup_util.c ../clone3/clone3_selftests.h
> > diff --git a/tools/testing/selftests/cgroup/test_kill.c b/tools/testing/selftests/cgroup/test_kill.c
> > new file mode 100644
> > index 000000000000..c4e7b2e87395
> > --- /dev/null
> > +++ b/tools/testing/selftests/cgroup/test_kill.c
> > @@ -0,0 +1,293 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#include <stdbool.h>
> > +#include <linux/limits.h>
> > +#include <sys/ptrace.h>
> > +#include <sys/types.h>
> > +#include <sys/mman.h>
> > +#include <unistd.h>
> > +#include <stdio.h>
> > +#include <errno.h>
> > +#include <poll.h>
> > +#include <stdlib.h>
> > +#include <sys/inotify.h>
> > +#include <string.h>
> > +#include <sys/wait.h>
> > +
> > +#include "../kselftest.h"
> > +#include "cgroup_util.h"
> > +
> > +#define DEBUG
> > +#ifdef DEBUG
> > +#define debug(args...) fprintf(stderr, args)
> > +#else
> > +#define debug(args...)
> > +#endif
> > +
> > +/*
> > + * Kill the given cgroup and wait for the inotify signal.
> > + * If there are no events in 10 seconds, treat this as an error.
> > + * Then check that the cgroup is in the desired state.
> > + */
> > +static int cg_kill_wait(const char *cgroup)
> > +{
> > +	int fd, ret = -1;
> > +
> > +	fd = cg_prepare_for_wait(cgroup);
> > +	if (fd < 0)
> > +		return fd;
> > +
> > +	ret = cg_write(cgroup, "cgroup.kill", "1");
> > +	if (ret) {
> > +		debug("Error: cg_write() failed\n");
> > +		goto out;
> > +	}
> > +
> > +	ret = cg_wait_for(fd);
> > +	if (ret)
> > +		goto out;
> > +
> > +	ret = cg_read_strcmp(cgroup, "cgroup.events", "populated 0\n");
> > +out:
> > +	close(fd);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * A simple process running in a sleep loop until being
> > + * re-parented.
> > + */
> > +static int child_fn(const char *cgroup, void *arg)
> > +{
> > +	int ppid = getppid();
> > +
> > +	while (getppid() == ppid)
> > +		usleep(1000);
> > +
> > +	return getppid() == ppid;
> > +}
> > +
> > +static int test_cgkill_simple(const char *root)
> > +{
> > +	int ret = KSFT_FAIL;
> > +	char *cgroup = NULL;
> > +	int i;
> > +
> > +	cgroup = cg_name(root, "cg_test_simple");
> > +	if (!cgroup)
> > +		goto cleanup;
> > +
> > +	if (cg_create(cgroup))
> > +		goto cleanup;
> > +
> > +	for (i = 0; i < 100; i++)
> > +		cg_run_nowait(cgroup, child_fn, NULL);
> > +
> > +	if (cg_wait_for_proc_count(cgroup, 100))
> > +		goto cleanup;
> > +
> > +        if (cg_write(cgroup, "cgroup.kill", "1"))
> > +		goto cleanup;
> > +
> > +	if (cg_read_strcmp(cgroup, "cgroup.events", "populated 1\n"))
> > +		goto cleanup;
> > +
> > +	if (cg_kill_wait(cgroup))
> > +		goto cleanup;
> > +
> > +	if (cg_read_strcmp(cgroup, "cgroup.events", "populated 0\n"))
> > +		goto cleanup;
> > +
> > +	ret = KSFT_PASS;
> > +
> > +cleanup:
> > +	if (cgroup)
> > +		cg_destroy(cgroup);
> > +	free(cgroup);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * The test creates the following hierarchy:
> > + *       A
> > + *    / / \ \
> > + *   B  E  I K
> > + *  /\  |
> > + * C  D F
> > + *      |
> > + *      G
> > + *      |
> > + *      H
> > + *
> > + * with a process in C, H and 3 processes in K.
> > + * Then it tries to kill the whole tree.
> > + */
> > +static int test_cgkill_tree(const char *root)
> > +{
> > +	char *cgroup[10] = {0};
> > +	int ret = KSFT_FAIL;
> > +	int i;
> > +
> > +	cgroup[0] = cg_name(root, "cg_test_tree_A");
> > +	if (!cgroup[0])
> > +		goto cleanup;
> > +
> > +	cgroup[1] = cg_name(cgroup[0], "B");
> > +	if (!cgroup[1])
> > +		goto cleanup;
> > +
> > +	cgroup[2] = cg_name(cgroup[1], "C");
> > +	if (!cgroup[2])
> > +		goto cleanup;
> > +
> > +	cgroup[3] = cg_name(cgroup[1], "D");
> > +	if (!cgroup[3])
> > +		goto cleanup;
> > +
> > +	cgroup[4] = cg_name(cgroup[0], "E");
> > +	if (!cgroup[4])
> > +		goto cleanup;
> > +
> > +	cgroup[5] = cg_name(cgroup[4], "F");
> > +	if (!cgroup[5])
> > +		goto cleanup;
> > +
> > +	cgroup[6] = cg_name(cgroup[5], "G");
> > +	if (!cgroup[6])
> > +		goto cleanup;
> > +
> > +	cgroup[7] = cg_name(cgroup[6], "H");
> > +	if (!cgroup[7])
> > +		goto cleanup;
> > +
> > +	cgroup[8] = cg_name(cgroup[0], "I");
> > +	if (!cgroup[8])
> > +		goto cleanup;
> > +
> > +	cgroup[9] = cg_name(cgroup[0], "K");
> > +	if (!cgroup[9])
> > +		goto cleanup;
> > +
> > +	for (i = 0; i < 10; i++)
> > +		if (cg_create(cgroup[i]))
> > +			goto cleanup;
> > +
> > +	cg_run_nowait(cgroup[2], child_fn, NULL);
> > +	cg_run_nowait(cgroup[7], child_fn, NULL);
> > +	cg_run_nowait(cgroup[9], child_fn, NULL);
> > +	cg_run_nowait(cgroup[9], child_fn, NULL);
> > +	cg_run_nowait(cgroup[9], child_fn, NULL);
> > +
> > +	/*
> > +	 * Wait until all child processes will enter
> > +	 * corresponding cgroups.
> > +	 */
> > +
> > +	if (cg_wait_for_proc_count(cgroup[2], 1) ||
> > +	    cg_wait_for_proc_count(cgroup[7], 1) ||
> > +	    cg_wait_for_proc_count(cgroup[9], 3))
> > +		goto cleanup;
> > +
> > +	/*
> > +	 * Kill A and check that we get an empty notification.
> > +	 */
> > +	if (cg_kill_wait(cgroup[0]))
> > +		goto cleanup;
> > +
> > +	if (cg_read_strcmp(cgroup[0], "cgroup.events", "populated 0\n"))
> > +		goto cleanup;
> > +
> > +	ret = KSFT_PASS;
> > +
> > +cleanup:
> > +	for (i = 9; i >= 0 && cgroup[i]; i--) {
> > +		cg_destroy(cgroup[i]);
> > +		free(cgroup[i]);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static int forkbomb_fn(const char *cgroup, void *arg)
> > +{
> > +	int ppid;
> > +
> > +	fork();
> > +	fork();
> > +
> > +	ppid = getppid();
> > +
> > +	while (getppid() == ppid)
> > +		usleep(1000);
> > +
> > +	return getppid() == ppid;
> > +}
> > +
> > +/*
> > + * The test runs a fork bomb in a cgroup and tries to kill it.
> > + */
> > +static int test_cgkill_forkbomb(const char *root)
> > +{
> > +	int ret = KSFT_FAIL;
> > +	char *cgroup = NULL;
> > +
> > +	cgroup = cg_name(root, "cg_forkbomb_test");
> > +	if (!cgroup)
> > +		goto cleanup;
> > +
> > +	if (cg_create(cgroup))
> > +		goto cleanup;
> > +
> > +	cg_run_nowait(cgroup, forkbomb_fn, NULL);
> > +
> > +	usleep(100000);
> > +
> > +	if (cg_kill_wait(cgroup))
> > +		goto cleanup;
> > +
> > +	if (cg_wait_for_proc_count(cgroup, 0))
> > +		goto cleanup;
> > +
> > +	ret = KSFT_PASS;
> > +
> > +cleanup:
> > +	if (cgroup)
> > +		cg_destroy(cgroup);
> > +	free(cgroup);
> > +	return ret;
> > +}
> > +
> > +#define T(x) { x, #x }
> > +struct cgkill_test {
> > +	int (*fn)(const char *root);
> > +	const char *name;
> > +} tests[] = {
> > +	T(test_cgkill_simple),
> > +	T(test_cgkill_tree),
> > +	T(test_cgkill_forkbomb),
> 
> I'm a little bit worried about the forkbomb test: how reliable it is and won't
> it kill the system, but Idk how make it better. Maybe instead of a true fork
> bomb we need to spawn children by a single process? I'm not exactly sure.

I had the tests run in tight a loop and it didn't take down the system.
The cgroup was nicely killed everytime which is why I kept it. Freezer
performs the same test. So what we could do is freeze first and then
kill under the assumption that the freezer tests don't suffer from the
same problem.

> 
> Other than that the patch looks good to me.
> 
> And thank you for writing tests!
> 
> Roman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] docs/cgroup: add entry for cgroup.kill
       [not found]         ` <YIt32/aQJfkw53ic-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
@ 2021-04-30  6:50           ` Christian Brauner
  2021-05-01  2:48             ` Roman Gushchin
  0 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2021-04-30  6:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christian Brauner, Tejun Heo, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, Apr 29, 2021 at 08:22:03PM -0700, Roman Gushchin wrote:
> On Thu, Apr 29, 2021 at 02:01:10PM +0200, Christian Brauner wrote:
> > From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > 
> > Give a brief overview of the cgroup.kill functionality.
> > 
> > Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> > Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 64c62b979f2f..c9f656a84590 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -949,6 +949,23 @@ All cgroup core files are prefixed with "cgroup."
> >  	it's possible to delete a frozen (and empty) cgroup, as well as
> >  	create new sub-cgroups.
> >  
> > +  cgroup.kill
> > +	A write-only single value file which exists in non-root cgroups.
> > +	The only allowed value is "1".
> > +
> > +	Writing "1" to the file causes the cgroup and all descendant cgroups to
> > +	be killed. This means that all processes located in the affected cgroup
> > +	tree will be killed via SIGKILL.
> > +
> > +	Killing a cgroup tree will deal with concurrent forks appropriately and
> > +	is protected against migrations. If callers require strict guarantees
> > +	they can issue the cgroup.kill request after a freezing the cgroup via
> > +	cgroup.freeze.
> 
> Hm, is it necessarily? What additional guarantees adds using the freezer?

Every new process that get's added is frozen. So even if the a process
ends up escaping the cgroup.kill request somehow it will be frozen in
the cgroup and can't itself fork again right away. So you could do:

echo 1 > cgroup.freeze
wait for frozen notification

echo 1 > cgroup.kill
wait for unpopulated notification

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/5] cgroup: introduce cgroup.kill
       [not found]     ` <YIt3a5R5tYgIpoVQ-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
@ 2021-04-30  7:28       ` Christian Brauner
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2021-04-30  7:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christian Brauner, Tejun Heo, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, Apr 29, 2021 at 08:20:11PM -0700, Roman Gushchin wrote:
> On Thu, Apr 29, 2021 at 02:01:09PM +0200, Christian Brauner wrote:
> > From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > 
> > Introduce the cgroup.kill file. It does what it says on the tin and
> > allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> > The file is available in non-root cgroups.
> > 
> > Killing cgroups is a process directed operation, i.e. the whole
> > thread-group is affected. Consequently trying to write to cgroup.kill in
> > threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> > aligns with cgroup.procs where reads in threaded-cgroups are rejected
> > with EOPNOTSUPP.
> > 
> > The cgroup.kill file is write-only since killing a cgroup is an event
> > not which makes it different from e.g. freezer where a cgroup
> > transitions between the two states.
> > 
> > As with all new cgroup features cgroup.kill is recursive by default.
> > 
> > Killing a cgroup is protected against concurrent migrations through the
> > cgroup mutex. To protect against forkbombs and to mitigate the effect of
> > racing forks a new CGRP_KILL css set lock protected flag is introduced
> > that is set prior to killing a cgroup and unset after the cgroup has
> > been killed. We can then check in cgroup_post_fork() where we hold the
> > css set lock already whether the cgroup is currently being killed. If so
> > we send the child a SIGKILL signal immediately taking it down as soon as
> > it returns to userspace. To make the killing of the child semantically
> > clean it is killed after all cgroup attachment operations have been
> > finalized.
> > 
> > There are various use-cases of this interface:
> > - Containers usually have a conservative layout where each container
> >   usually has a delegated cgroup. For such layouts there is a 1:1
> >   mapping between container and cgroup. If the container in addition
> >   uses a separate pid namespace then killing a container usually becomes
> >   a simple kill -9 <container-init-pid> from an ancestor pid namespace.
> >   However, there are quite a few scenarios where that isn't true. For
> >   example, there are containers that share the cgroup with other
> >   processes on purpose that are supposed to be bound to the lifetime of
> >   the container but are not in the same pidns of the container.
> >   Containers that are in a delegated cgroup but share the pid namespace
> >   with the host or other containers.
> > - Service managers such as systemd use cgroups to group and organize
> >   processes belonging to a service. They usually rely on a recursive
> >   algorithm now to kill a service. With cgroup.kill this becomes a
> >   simple write to cgroup.kill.
> > - Userspace OOM implementations can make good use of this feature to
> >   efficiently take down whole cgroups quickly.
> > - The kill program can gain a new
> >   kill --cgroup /sys/fs/cgroup/delegated
> >   flag to take down cgroups.
> > 
> > A few observations about the semantics:
> > - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
> >   not specified we are not taking cgroup mutex meaning the cgroup can be
> >   killed while a process in that cgroup is forking.
> >   If the kill request happens right before cgroup_can_fork() and before
> >   the parent grabs its siglock the parent is guaranteed to see the
> >   pending SIGKILL. In addition we perform another check in
> >   cgroup_post_fork() whether the cgroup is being killed and is so take
> >   down the child (see above). This is robust enough and protects gainst
> >   forkbombs. If userspace really really wants to have stricter
> >   protection the simple solution would be to grab the write side of the
> >   cgroup threadgroup rwsem which will force all ongoing forks to
> >   complete before killing starts. We concluded that this is not
> >   necessary as the semantics for concurrent forking should simply align
> >   with freezer where a similar check as cgroup_post_fork() is performed.
> > 
> >   For all other cases CLONE_INTO_CGROUP is required. In this case we
> >   will grab the cgroup mutex so the cgroup can't be killed while we
> >   fork. Once we're done with the fork and have dropped cgroup mutex we
> >   are visible and will be found by any subsequent kill request.
> > 
> > - We obviously don't kill kthreads. This means a cgroup that has a
> >   kthread will not become empty after killing and consequently no
> >   unpopulated event will be generated. The assumption is that kthreads
> >   should be in the root cgroup only anyway so this is not an issue.
> > - We skip killing tasks that already have pending fatal signals.
> > - Freezer doesn't care about tasks in different pid namespaces, i.e. if
> >   you have two tasks in different pid namespaces the cgroup would still
> >   be frozen. The cgroup.kill mechanism consequently behaves the same
> >   way, i.e. we kill all processes and ignore in which pid namespace they
> >   exist.
> > - If the caller is located in a cgroup that is killed the caller will
> >   obviously be killed as well.
> > 
> > Cc: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> > Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> > Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> 
> Hello Christian!
> 
> This version looks very good to me, thanks for working on it!
> One small not below.
> 
> > ---
> >  include/linux/cgroup-defs.h |  3 ++
> >  kernel/cgroup/cgroup.c      | 90 +++++++++++++++++++++++++++++++++++++
> >  2 files changed, 93 insertions(+)
> > 
> > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> > index 559ee05f86b2..43fef771009a 100644
> > --- a/include/linux/cgroup-defs.h
> > +++ b/include/linux/cgroup-defs.h
> > @@ -71,6 +71,9 @@ enum {
> >  
> >  	/* Cgroup is frozen. */
> >  	CGRP_FROZEN,
> > +
> > +	/* Control group has to be killed. */
> > +	CGRP_KILL,
> >  };
> >  
> >  /* cgroup_root->flags */
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index 9153b20e5cc6..000f58b6ea35 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -3654,6 +3654,80 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
> >  	return nbytes;
> >  }
> >  
> > +static void __cgroup_kill(struct cgroup *cgrp)
> > +{
> > +	struct css_task_iter it;
> > +	struct task_struct *task;
> > +
> > +	lockdep_assert_held(&cgroup_mutex);
> > +
> > +	spin_lock_irq(&css_set_lock);
> > +	set_bit(CGRP_KILL, &cgrp->flags);
> > +	spin_unlock_irq(&css_set_lock);
> > +
> > +	css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
> > +	while ((task = css_task_iter_next(&it))) {
> > +		/* Ignore kernel threads here. */
> > +		if (task->flags & PF_KTHREAD)
> > +			continue;
> > +
> > +		/* Skip tasks that are already dying. */
> > +		if (__fatal_signal_pending(task))
> > +			continue;
> > +
> > +		send_sig(SIGKILL, task, 0);
> > +	}
> > +	css_task_iter_end(&it);
> > +
> > +	spin_lock_irq(&css_set_lock);
> > +	clear_bit(CGRP_KILL, &cgrp->flags);
> > +	spin_unlock_irq(&css_set_lock);
> > +}
> > +
> > +static void cgroup_kill(struct cgroup *cgrp)
> > +{
> > +	struct cgroup_subsys_state *css;
> > +	struct cgroup *dsct;
> > +
> > +	lockdep_assert_held(&cgroup_mutex);
> > +
> > +	cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
> > +		__cgroup_kill(dsct);
> > +}
> > +
> > +static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
> > +				 size_t nbytes, loff_t off)
> > +{
> > +	ssize_t ret = 0;
> > +	int kill;
> > +	struct cgroup *cgrp;
> > +
> > +	ret = kstrtoint(strstrip(buf), 0, &kill);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (kill != 1)
> > +		return -ERANGE;
> > +
> > +	cgrp = cgroup_kn_lock_live(of->kn, false);
> > +	if (!cgrp)
> > +		return -ENOENT;
> > +
> > +	/*
> > +	 * Killing is a process directed operation, i.e. the whole thread-group
> > +	 * is taken down so act like we do for cgroup.procs and only make this
> > +	 * writable in non-threaded cgroups.
> > +	 */
> > +	if (cgroup_is_threaded(cgrp))
> > +		ret = -EOPNOTSUPP;
> > +	else
> > +		cgroup_kill(cgrp);
> > +
> > +	cgroup_kn_unlock(of->kn);
> > +
> > +	return ret ?: nbytes;
> > +}
> > +
> >  static int cgroup_file_open(struct kernfs_open_file *of)
> >  {
> >  	struct cftype *cft = of_cft(of);
> > @@ -4846,6 +4920,11 @@ static struct cftype cgroup_base_files[] = {
> >  		.seq_show = cgroup_freeze_show,
> >  		.write = cgroup_freeze_write,
> >  	},
> > +	{
> > +		.name = "cgroup.kill",
> > +		.flags = CFTYPE_NOT_ON_ROOT,
> > +		.write = cgroup_kill_write,
> > +	},
> >  	{
> >  		.name = "cpu.stat",
> >  		.seq_show = cpu_stat_show,
> > @@ -6077,6 +6156,7 @@ void cgroup_post_fork(struct task_struct *child,
> >  		      struct kernel_clone_args *kargs)
> >  	__releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
> >  {
> > +	bool kill = false;
> >  	struct cgroup_subsys *ss;
> >  	struct css_set *cset;
> >  	int i;
> > @@ -6088,6 +6168,12 @@ void cgroup_post_fork(struct task_struct *child,
> >  
> >  	/* init tasks are special, only link regular threads */
> >  	if (likely(child->pid)) {
> > +		struct cgroup *cgrp;
> > +
> > +		/* Check whether the target cgroup is being killed. */
> > +		cgrp = kargs->cgrp ?: cset->dfl_cgrp;
> > +		kill = test_bit(CGRP_KILL, &cgrp->flags);
> > +
> >  		WARN_ON_ONCE(!list_empty(&child->cg_list));
> >  		cset->nr_tasks++;
> >  		css_set_move_task(child, NULL, cset, false);
> > @@ -6135,6 +6221,10 @@ void cgroup_post_fork(struct task_struct *child,
> >  		put_css_set(rcset);
> >  	}
> >  
> > +	/* Take down child immediately. */
> > +	if (kill)
> > +		send_sig(SIGKILL, child, 0);
> > +
> 
> This part can be hot, so I'd integrate it better with the freezer check:
> instead of getting child task's cgroup's flags and checking individual
> bits twice we can do it once.

Oh, you essentially mean don't deref twice. Right. I'll tweak that.

Christian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] docs/cgroup: add entry for cgroup.kill
  2021-04-30  6:50           ` Christian Brauner
@ 2021-05-01  2:48             ` Roman Gushchin
       [not found]               ` <YIzBc5IgiaIFykc5-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Roman Gushchin @ 2021-05-01  2:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christian Brauner, Tejun Heo, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Apr 30, 2021 at 08:50:36AM +0200, Christian Brauner wrote:
> On Thu, Apr 29, 2021 at 08:22:03PM -0700, Roman Gushchin wrote:
> > On Thu, Apr 29, 2021 at 02:01:10PM +0200, Christian Brauner wrote:
> > > From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > > 
> > > Give a brief overview of the cgroup.kill functionality.
> > > 
> > > Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> > > Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > > ---
> > >  Documentation/admin-guide/cgroup-v2.rst | 17 +++++++++++++++++
> > >  1 file changed, 17 insertions(+)
> > > 
> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > index 64c62b979f2f..c9f656a84590 100644
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -949,6 +949,23 @@ All cgroup core files are prefixed with "cgroup."
> > >  	it's possible to delete a frozen (and empty) cgroup, as well as
> > >  	create new sub-cgroups.
> > >  
> > > +  cgroup.kill
> > > +	A write-only single value file which exists in non-root cgroups.
> > > +	The only allowed value is "1".
> > > +
> > > +	Writing "1" to the file causes the cgroup and all descendant cgroups to
> > > +	be killed. This means that all processes located in the affected cgroup
> > > +	tree will be killed via SIGKILL.
> > > +
> > > +	Killing a cgroup tree will deal with concurrent forks appropriately and
> > > +	is protected against migrations. If callers require strict guarantees
> > > +	they can issue the cgroup.kill request after a freezing the cgroup via
> > > +	cgroup.freeze.
> > 
> > Hm, is it necessarily? What additional guarantees adds using the freezer?
> 
> Every new process that get's added is frozen. So even if the a process
> ends up escaping the cgroup.kill request somehow it will be frozen in
> the cgroup and can't itself fork again right away. So you could do:

Right, but how it can escape? I think one of the main reasons of introducing
a dedicated cgroup.kill interface is to avoid a necessity to use freezer
as a synchronization mechanism.

I'd just drop the sentence about the freezer here.

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] tests/cgroup: test cgroup.kill
  2021-04-30  6:47           ` Christian Brauner
@ 2021-05-01  2:49             ` Roman Gushchin
  0 siblings, 0 replies; 17+ messages in thread
From: Roman Gushchin @ 2021-05-01  2:49 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christian Brauner, Tejun Heo, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Apr 30, 2021 at 08:47:40AM +0200, Christian Brauner wrote:
> On Thu, Apr 29, 2021 at 09:05:35PM -0700, Roman Gushchin wrote:
> > On Thu, Apr 29, 2021 at 02:01:13PM +0200, Christian Brauner wrote:
> > > From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > > 
> > > Test that the new cgroup.kill feature works as intended.
> > > 
> > > Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> > > Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > > ---
> > >  tools/testing/selftests/cgroup/.gitignore  |   3 +-
> > >  tools/testing/selftests/cgroup/Makefile    |   2 +
> > >  tools/testing/selftests/cgroup/test_kill.c | 293 +++++++++++++++++++++
> > >  3 files changed, 297 insertions(+), 1 deletion(-)
> > >  create mode 100644 tools/testing/selftests/cgroup/test_kill.c
> > > 
> > > diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
> > > index 84cfcabea838..be9643ef6285 100644
> > > --- a/tools/testing/selftests/cgroup/.gitignore
> > > +++ b/tools/testing/selftests/cgroup/.gitignore
> > > @@ -2,4 +2,5 @@
> > >  test_memcontrol
> > >  test_core
> > >  test_freezer
> > > -test_kmem
> > > \ No newline at end of file
> > > +test_kmem
> > > +test_kill
> > > diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
> > > index f027d933595b..87b7b0d90773 100644
> > > --- a/tools/testing/selftests/cgroup/Makefile
> > > +++ b/tools/testing/selftests/cgroup/Makefile
> > > @@ -9,6 +9,7 @@ TEST_GEN_PROGS = test_memcontrol
> > >  TEST_GEN_PROGS += test_kmem
> > >  TEST_GEN_PROGS += test_core
> > >  TEST_GEN_PROGS += test_freezer
> > > +TEST_GEN_PROGS += test_kill
> > >  
> > >  include ../lib.mk
> > >  
> > > @@ -16,3 +17,4 @@ $(OUTPUT)/test_memcontrol: cgroup_util.c ../clone3/clone3_selftests.h
> > >  $(OUTPUT)/test_kmem: cgroup_util.c ../clone3/clone3_selftests.h
> > >  $(OUTPUT)/test_core: cgroup_util.c ../clone3/clone3_selftests.h
> > >  $(OUTPUT)/test_freezer: cgroup_util.c ../clone3/clone3_selftests.h
> > > +$(OUTPUT)/test_kill: cgroup_util.c ../clone3/clone3_selftests.h
> > > diff --git a/tools/testing/selftests/cgroup/test_kill.c b/tools/testing/selftests/cgroup/test_kill.c
> > > new file mode 100644
> > > index 000000000000..c4e7b2e87395
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/cgroup/test_kill.c
> > > @@ -0,0 +1,293 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#include <stdbool.h>
> > > +#include <linux/limits.h>
> > > +#include <sys/ptrace.h>
> > > +#include <sys/types.h>
> > > +#include <sys/mman.h>
> > > +#include <unistd.h>
> > > +#include <stdio.h>
> > > +#include <errno.h>
> > > +#include <poll.h>
> > > +#include <stdlib.h>
> > > +#include <sys/inotify.h>
> > > +#include <string.h>
> > > +#include <sys/wait.h>
> > > +
> > > +#include "../kselftest.h"
> > > +#include "cgroup_util.h"
> > > +
> > > +#define DEBUG
> > > +#ifdef DEBUG
> > > +#define debug(args...) fprintf(stderr, args)
> > > +#else
> > > +#define debug(args...)
> > > +#endif
> > > +
> > > +/*
> > > + * Kill the given cgroup and wait for the inotify signal.
> > > + * If there are no events in 10 seconds, treat this as an error.
> > > + * Then check that the cgroup is in the desired state.
> > > + */
> > > +static int cg_kill_wait(const char *cgroup)
> > > +{
> > > +	int fd, ret = -1;
> > > +
> > > +	fd = cg_prepare_for_wait(cgroup);
> > > +	if (fd < 0)
> > > +		return fd;
> > > +
> > > +	ret = cg_write(cgroup, "cgroup.kill", "1");
> > > +	if (ret) {
> > > +		debug("Error: cg_write() failed\n");
> > > +		goto out;
> > > +	}
> > > +
> > > +	ret = cg_wait_for(fd);
> > > +	if (ret)
> > > +		goto out;
> > > +
> > > +	ret = cg_read_strcmp(cgroup, "cgroup.events", "populated 0\n");
> > > +out:
> > > +	close(fd);
> > > +	return ret;
> > > +}
> > > +
> > > +/*
> > > + * A simple process running in a sleep loop until being
> > > + * re-parented.
> > > + */
> > > +static int child_fn(const char *cgroup, void *arg)
> > > +{
> > > +	int ppid = getppid();
> > > +
> > > +	while (getppid() == ppid)
> > > +		usleep(1000);
> > > +
> > > +	return getppid() == ppid;
> > > +}
> > > +
> > > +static int test_cgkill_simple(const char *root)
> > > +{
> > > +	int ret = KSFT_FAIL;
> > > +	char *cgroup = NULL;
> > > +	int i;
> > > +
> > > +	cgroup = cg_name(root, "cg_test_simple");
> > > +	if (!cgroup)
> > > +		goto cleanup;
> > > +
> > > +	if (cg_create(cgroup))
> > > +		goto cleanup;
> > > +
> > > +	for (i = 0; i < 100; i++)
> > > +		cg_run_nowait(cgroup, child_fn, NULL);
> > > +
> > > +	if (cg_wait_for_proc_count(cgroup, 100))
> > > +		goto cleanup;
> > > +
> > > +        if (cg_write(cgroup, "cgroup.kill", "1"))
> > > +		goto cleanup;
> > > +
> > > +	if (cg_read_strcmp(cgroup, "cgroup.events", "populated 1\n"))
> > > +		goto cleanup;
> > > +
> > > +	if (cg_kill_wait(cgroup))
> > > +		goto cleanup;
> > > +
> > > +	if (cg_read_strcmp(cgroup, "cgroup.events", "populated 0\n"))
> > > +		goto cleanup;
> > > +
> > > +	ret = KSFT_PASS;
> > > +
> > > +cleanup:
> > > +	if (cgroup)
> > > +		cg_destroy(cgroup);
> > > +	free(cgroup);
> > > +	return ret;
> > > +}
> > > +
> > > +/*
> > > + * The test creates the following hierarchy:
> > > + *       A
> > > + *    / / \ \
> > > + *   B  E  I K
> > > + *  /\  |
> > > + * C  D F
> > > + *      |
> > > + *      G
> > > + *      |
> > > + *      H
> > > + *
> > > + * with a process in C, H and 3 processes in K.
> > > + * Then it tries to kill the whole tree.
> > > + */
> > > +static int test_cgkill_tree(const char *root)
> > > +{
> > > +	char *cgroup[10] = {0};
> > > +	int ret = KSFT_FAIL;
> > > +	int i;
> > > +
> > > +	cgroup[0] = cg_name(root, "cg_test_tree_A");
> > > +	if (!cgroup[0])
> > > +		goto cleanup;
> > > +
> > > +	cgroup[1] = cg_name(cgroup[0], "B");
> > > +	if (!cgroup[1])
> > > +		goto cleanup;
> > > +
> > > +	cgroup[2] = cg_name(cgroup[1], "C");
> > > +	if (!cgroup[2])
> > > +		goto cleanup;
> > > +
> > > +	cgroup[3] = cg_name(cgroup[1], "D");
> > > +	if (!cgroup[3])
> > > +		goto cleanup;
> > > +
> > > +	cgroup[4] = cg_name(cgroup[0], "E");
> > > +	if (!cgroup[4])
> > > +		goto cleanup;
> > > +
> > > +	cgroup[5] = cg_name(cgroup[4], "F");
> > > +	if (!cgroup[5])
> > > +		goto cleanup;
> > > +
> > > +	cgroup[6] = cg_name(cgroup[5], "G");
> > > +	if (!cgroup[6])
> > > +		goto cleanup;
> > > +
> > > +	cgroup[7] = cg_name(cgroup[6], "H");
> > > +	if (!cgroup[7])
> > > +		goto cleanup;
> > > +
> > > +	cgroup[8] = cg_name(cgroup[0], "I");
> > > +	if (!cgroup[8])
> > > +		goto cleanup;
> > > +
> > > +	cgroup[9] = cg_name(cgroup[0], "K");
> > > +	if (!cgroup[9])
> > > +		goto cleanup;
> > > +
> > > +	for (i = 0; i < 10; i++)
> > > +		if (cg_create(cgroup[i]))
> > > +			goto cleanup;
> > > +
> > > +	cg_run_nowait(cgroup[2], child_fn, NULL);
> > > +	cg_run_nowait(cgroup[7], child_fn, NULL);
> > > +	cg_run_nowait(cgroup[9], child_fn, NULL);
> > > +	cg_run_nowait(cgroup[9], child_fn, NULL);
> > > +	cg_run_nowait(cgroup[9], child_fn, NULL);
> > > +
> > > +	/*
> > > +	 * Wait until all child processes will enter
> > > +	 * corresponding cgroups.
> > > +	 */
> > > +
> > > +	if (cg_wait_for_proc_count(cgroup[2], 1) ||
> > > +	    cg_wait_for_proc_count(cgroup[7], 1) ||
> > > +	    cg_wait_for_proc_count(cgroup[9], 3))
> > > +		goto cleanup;
> > > +
> > > +	/*
> > > +	 * Kill A and check that we get an empty notification.
> > > +	 */
> > > +	if (cg_kill_wait(cgroup[0]))
> > > +		goto cleanup;
> > > +
> > > +	if (cg_read_strcmp(cgroup[0], "cgroup.events", "populated 0\n"))
> > > +		goto cleanup;
> > > +
> > > +	ret = KSFT_PASS;
> > > +
> > > +cleanup:
> > > +	for (i = 9; i >= 0 && cgroup[i]; i--) {
> > > +		cg_destroy(cgroup[i]);
> > > +		free(cgroup[i]);
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static int forkbomb_fn(const char *cgroup, void *arg)
> > > +{
> > > +	int ppid;
> > > +
> > > +	fork();
> > > +	fork();
> > > +
> > > +	ppid = getppid();
> > > +
> > > +	while (getppid() == ppid)
> > > +		usleep(1000);
> > > +
> > > +	return getppid() == ppid;
> > > +}
> > > +
> > > +/*
> > > + * The test runs a fork bomb in a cgroup and tries to kill it.
> > > + */
> > > +static int test_cgkill_forkbomb(const char *root)
> > > +{
> > > +	int ret = KSFT_FAIL;
> > > +	char *cgroup = NULL;
> > > +
> > > +	cgroup = cg_name(root, "cg_forkbomb_test");
> > > +	if (!cgroup)
> > > +		goto cleanup;
> > > +
> > > +	if (cg_create(cgroup))
> > > +		goto cleanup;
> > > +
> > > +	cg_run_nowait(cgroup, forkbomb_fn, NULL);
> > > +
> > > +	usleep(100000);
> > > +
> > > +	if (cg_kill_wait(cgroup))
> > > +		goto cleanup;
> > > +
> > > +	if (cg_wait_for_proc_count(cgroup, 0))
> > > +		goto cleanup;
> > > +
> > > +	ret = KSFT_PASS;
> > > +
> > > +cleanup:
> > > +	if (cgroup)
> > > +		cg_destroy(cgroup);
> > > +	free(cgroup);
> > > +	return ret;
> > > +}
> > > +
> > > +#define T(x) { x, #x }
> > > +struct cgkill_test {
> > > +	int (*fn)(const char *root);
> > > +	const char *name;
> > > +} tests[] = {
> > > +	T(test_cgkill_simple),
> > > +	T(test_cgkill_tree),
> > > +	T(test_cgkill_forkbomb),
> > 
> > I'm a little bit worried about the forkbomb test: how reliable it is and won't
> > it kill the system, but Idk how make it better. Maybe instead of a true fork
> > bomb we need to spawn children by a single process? I'm not exactly sure.
> 
> I had the tests run in tight a loop and it didn't take down the system.
> The cgroup was nicely killed everytime which is why I kept it. Freezer
> performs the same test. So what we could do is freeze first and then
> kill under the assumption that the freezer tests don't suffer from the
> same problem.

Ok, I'm fine with it.
Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] docs/cgroup: add entry for cgroup.kill
       [not found]               ` <YIzBc5IgiaIFykc5-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
@ 2021-05-01 11:56                 ` Christian Brauner
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2021-05-01 11:56 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christian Brauner, Tejun Heo, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Apr 30, 2021 at 07:48:19PM -0700, Roman Gushchin wrote:
> On Fri, Apr 30, 2021 at 08:50:36AM +0200, Christian Brauner wrote:
> > On Thu, Apr 29, 2021 at 08:22:03PM -0700, Roman Gushchin wrote:
> > > On Thu, Apr 29, 2021 at 02:01:10PM +0200, Christian Brauner wrote:
> > > > From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > > > 
> > > > Give a brief overview of the cgroup.kill functionality.
> > > > 
> > > > Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> > > > Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > > > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > > > ---
> > > >  Documentation/admin-guide/cgroup-v2.rst | 17 +++++++++++++++++
> > > >  1 file changed, 17 insertions(+)
> > > > 
> > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > > index 64c62b979f2f..c9f656a84590 100644
> > > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > > @@ -949,6 +949,23 @@ All cgroup core files are prefixed with "cgroup."
> > > >  	it's possible to delete a frozen (and empty) cgroup, as well as
> > > >  	create new sub-cgroups.
> > > >  
> > > > +  cgroup.kill
> > > > +	A write-only single value file which exists in non-root cgroups.
> > > > +	The only allowed value is "1".
> > > > +
> > > > +	Writing "1" to the file causes the cgroup and all descendant cgroups to
> > > > +	be killed. This means that all processes located in the affected cgroup
> > > > +	tree will be killed via SIGKILL.
> > > > +
> > > > +	Killing a cgroup tree will deal with concurrent forks appropriately and
> > > > +	is protected against migrations. If callers require strict guarantees
> > > > +	they can issue the cgroup.kill request after a freezing the cgroup via
> > > > +	cgroup.freeze.
> > > 
> > > Hm, is it necessarily? What additional guarantees adds using the freezer?
> > 
> > Every new process that get's added is frozen. So even if the a process
> > ends up escaping the cgroup.kill request somehow it will be frozen in
> > the cgroup and can't itself fork again right away. So you could do:
> 
> Right, but how it can escape? I think one of the main reasons of introducing
> a dedicated cgroup.kill interface is to avoid a necessity to use freezer
> as a synchronization mechanism.
> 
> I'd just drop the sentence about the freezer here.

Thanks, will do.

Christian

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-05-01 11:56 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-29 12:01 [PATCH 1/5] cgroup: introduce cgroup.kill Christian Brauner
     [not found] ` <20210429120113.2238065-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2021-04-29 12:01   ` [PATCH 2/5] docs/cgroup: add entry for cgroup.kill Christian Brauner
     [not found]     ` <20210429120113.2238065-2-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2021-04-30  3:22       ` Roman Gushchin
     [not found]         ` <YIt32/aQJfkw53ic-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2021-04-30  6:50           ` Christian Brauner
2021-05-01  2:48             ` Roman Gushchin
     [not found]               ` <YIzBc5IgiaIFykc5-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2021-05-01 11:56                 ` Christian Brauner
2021-04-29 12:01   ` [PATCH 3/5] tests/cgroup: use cgroup.kill in cg_killall() Christian Brauner
     [not found]     ` <20210429120113.2238065-3-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2021-04-30  3:23       ` Roman Gushchin
     [not found]         ` <YIt4NhikbQKc0Vku-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2021-04-30  6:42           ` Christian Brauner
2021-04-29 12:01   ` [PATCH 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait() Christian Brauner
     [not found]     ` <20210429120113.2238065-4-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2021-04-30  3:45       ` Roman Gushchin
2021-04-29 12:01   ` [PATCH 5/5] tests/cgroup: test cgroup.kill Christian Brauner
     [not found]     ` <20210429120113.2238065-5-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2021-04-30  4:05       ` Roman Gushchin
     [not found]         ` <YIuCDz3h9/ZQPCMV-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2021-04-30  6:47           ` Christian Brauner
2021-05-01  2:49             ` Roman Gushchin
2021-04-30  3:20   ` [PATCH 1/5] cgroup: introduce cgroup.kill Roman Gushchin
     [not found]     ` <YIt3a5R5tYgIpoVQ-cx5fftMpWqeCjSd+JxjunQ2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2021-04-30  7:28       ` Christian Brauner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.