containers.lists.linux.dev archive mirror
* [PATCH v2 1/5] cgroup: introduce cgroup.kill
@ 2021-05-03 14:39 Christian Brauner
  2021-05-03 14:39 ` [PATCH v2 2/5] docs/cgroup: add entry for cgroup.kill Christian Brauner
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Christian Brauner @ 2021-05-03 14:39 UTC (permalink / raw)
  To: Tejun Heo, Roman Gushchin
  Cc: Shakeel Butt, Zefan Li, Johannes Weiner, cgroups, containers,
	Christian Brauner

From: Christian Brauner <christian.brauner@ubuntu.com>

Introduce the cgroup.kill file. It does what it says on the tin and
allows a caller to kill a cgroup by writing "1" into cgroup.kill.
The file is available in non-root cgroups.

Killing cgroups is a process directed operation, i.e. the whole
thread-group is affected. Consequently trying to write to cgroup.kill in
threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
aligns with cgroup.procs where reads in threaded-cgroups are rejected
with EOPNOTSUPP.

The cgroup.kill file is write-only since killing a cgroup is an event,
not a state, which makes it different from e.g. the freezer where a
cgroup transitions between two states.

As with all new cgroup features cgroup.kill is recursive by default.

Killing a cgroup is protected against concurrent migrations through the
cgroup mutex. To protect against forkbombs and to mitigate the effect of
racing forks a new CGRP_KILL css set lock protected flag is introduced
that is set prior to killing a cgroup and unset after the cgroup has
been killed. We can then check in cgroup_post_fork() where we hold the
css set lock already whether the cgroup is currently being killed. If so
we send the child a SIGKILL signal immediately taking it down as soon as
it returns to userspace. To make the killing of the child semantically
clean it is killed after all cgroup attachment operations have been
finalized.

There are various use-cases of this interface:
- Containers usually have a conservative layout where each container
  usually has a delegated cgroup. For such layouts there is a 1:1
  mapping between container and cgroup. If the container in addition
  uses a separate pid namespace then killing a container usually becomes
  a simple kill -9 <container-init-pid> from an ancestor pid namespace.
  However, there are quite a few scenarios where that isn't true. For
  example, there are containers that deliberately share their cgroup
  with other processes that are supposed to be bound to the lifetime of
  the container but are not in the container's pid namespace. There are
  also containers that live in a delegated cgroup but share the pid
  namespace with the host or other containers.
- Service managers such as systemd use cgroups to group and organize
  processes belonging to a service. They currently rely on a recursive
  algorithm to kill a service. With cgroup.kill this becomes a simple
  write to cgroup.kill.
- Userspace OOM implementations can make good use of this feature to
  efficiently take down whole cgroups quickly.
- The kill program can gain a new
  kill --cgroup /sys/fs/cgroup/delegated
  flag to take down cgroups.

A few observations about the semantics:
- If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
  not specified we are not taking cgroup mutex meaning the cgroup can be
  killed while a process in that cgroup is forking.
  If the kill request happens right before cgroup_can_fork() and before
  the parent grabs its siglock the parent is guaranteed to see the
  pending SIGKILL. In addition we perform another check in
  cgroup_post_fork() whether the cgroup is being killed and if so take
  down the child (see above). This is robust enough and protects against
  forkbombs. If userspace really wants to have stricter
  protection the simple solution would be to grab the write side of the
  cgroup threadgroup rwsem which will force all ongoing forks to
  complete before killing starts. We concluded that this is not
  necessary as the semantics for concurrent forking should simply align
  with freezer where a similar check as cgroup_post_fork() is performed.

  For all other cases CLONE_INTO_CGROUP is required. In this case we
  will grab the cgroup mutex so the cgroup can't be killed while we
  fork. Once we're done with the fork and have dropped cgroup mutex we
  are visible and will be found by any subsequent kill request.
- We obviously don't kill kthreads. This means a cgroup that has a
  kthread will not become empty after killing and consequently no
  unpopulated event will be generated. The assumption is that kthreads
  should be in the root cgroup only anyway so this is not an issue.
- We skip killing tasks that already have pending fatal signals.
- Freezer doesn't care about tasks in different pid namespaces, i.e. if
  you have two tasks in different pid namespaces the cgroup would still
  be frozen. The cgroup.kill mechanism consequently behaves the same
  way, i.e. we kill all processes and ignore in which pid namespace they
  exist.
- If the caller is located in a cgroup that is killed the caller will
  obviously be killed as well.

Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---

The series can be pulled from

git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/cgroup.kill.v5.14

/* v2 */
- Roman Gushchin <guro@fb.com>:
  - Retrieve cgrp->flags only once and check CGRP_* bits on it.
---
 include/linux/cgroup-defs.h |   3 +
 kernel/cgroup/cgroup.c      | 127 ++++++++++++++++++++++++++++++++----
 2 files changed, 116 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 559ee05f86b2..43fef771009a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -71,6 +71,9 @@ enum {
 
 	/* Cgroup is frozen. */
 	CGRP_FROZEN,
+
+	/* Control group has to be killed. */
+	CGRP_KILL,
 };
 
 /* cgroup_root->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9153b20e5cc6..aee84b99534a 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3654,6 +3654,80 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static void __cgroup_kill(struct cgroup *cgrp)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	spin_lock_irq(&css_set_lock);
+	set_bit(CGRP_KILL, &cgrp->flags);
+	spin_unlock_irq(&css_set_lock);
+
+	css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
+	while ((task = css_task_iter_next(&it))) {
+		/* Ignore kernel threads here. */
+		if (task->flags & PF_KTHREAD)
+			continue;
+
+		/* Skip tasks that are already dying. */
+		if (__fatal_signal_pending(task))
+			continue;
+
+		send_sig(SIGKILL, task, 0);
+	}
+	css_task_iter_end(&it);
+
+	spin_lock_irq(&css_set_lock);
+	clear_bit(CGRP_KILL, &cgrp->flags);
+	spin_unlock_irq(&css_set_lock);
+}
+
+static void cgroup_kill(struct cgroup *cgrp)
+{
+	struct cgroup_subsys_state *css;
+	struct cgroup *dsct;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
+		__cgroup_kill(dsct);
+}
+
+static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
+				 size_t nbytes, loff_t off)
+{
+	ssize_t ret = 0;
+	int kill;
+	struct cgroup *cgrp;
+
+	ret = kstrtoint(strstrip(buf), 0, &kill);
+	if (ret)
+		return ret;
+
+	if (kill != 1)
+		return -ERANGE;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENOENT;
+
+	/*
+	 * Killing is a process directed operation, i.e. the whole thread-group
+	 * is taken down so act like we do for cgroup.procs and only make this
+	 * writable in non-threaded cgroups.
+	 */
+	if (cgroup_is_threaded(cgrp))
+		ret = -EOPNOTSUPP;
+	else
+		cgroup_kill(cgrp);
+
+	cgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
 static int cgroup_file_open(struct kernfs_open_file *of)
 {
 	struct cftype *cft = of_cft(of);
@@ -4846,6 +4920,11 @@ static struct cftype cgroup_base_files[] = {
 		.seq_show = cgroup_freeze_show,
 		.write = cgroup_freeze_write,
 	},
+	{
+		.name = "cgroup.kill",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.write = cgroup_kill_write,
+	},
 	{
 		.name = "cpu.stat",
 		.seq_show = cpu_stat_show,
@@ -6077,6 +6156,8 @@ void cgroup_post_fork(struct task_struct *child,
 		      struct kernel_clone_args *kargs)
 	__releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
 {
+	unsigned long cgrp_flags = 0;
+	bool kill = false;
 	struct cgroup_subsys *ss;
 	struct css_set *cset;
 	int i;
@@ -6088,6 +6169,11 @@ void cgroup_post_fork(struct task_struct *child,
 
 	/* init tasks are special, only link regular threads */
 	if (likely(child->pid)) {
+		if (kargs->cgrp)
+			cgrp_flags = kargs->cgrp->flags;
+		else
+			cgrp_flags = cset->dfl_cgrp->flags;
+
 		WARN_ON_ONCE(!list_empty(&child->cg_list));
 		cset->nr_tasks++;
 		css_set_move_task(child, NULL, cset, false);
@@ -6096,23 +6182,32 @@ void cgroup_post_fork(struct task_struct *child,
 		cset = NULL;
 	}
 
-	/*
-	 * If the cgroup has to be frozen, the new task has too.  Let's set
-	 * the JOBCTL_TRAP_FREEZE jobctl bit to get the task into the
-	 * frozen state.
-	 */
-	if (unlikely(cgroup_task_freeze(child))) {
-		spin_lock(&child->sighand->siglock);
-		WARN_ON_ONCE(child->frozen);
-		child->jobctl |= JOBCTL_TRAP_FREEZE;
-		spin_unlock(&child->sighand->siglock);
+	if (!(child->flags & PF_KTHREAD)) {
+		if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
+			/*
+			 * If the cgroup has to be frozen, the new task has
+			 * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
+			 * get the task into the frozen state.
+			 */
+			spin_lock(&child->sighand->siglock);
+			WARN_ON_ONCE(child->frozen);
+			child->jobctl |= JOBCTL_TRAP_FREEZE;
+			spin_unlock(&child->sighand->siglock);
+
+			/*
+			 * Calling cgroup_update_frozen() isn't required here,
+			 * because it will be called anyway a bit later from
+			 * do_freezer_trap(). So we avoid cgroup's transient
+			 * switch from the frozen state and back.
+			 */
+		}
 
 		/*
-		 * Calling cgroup_update_frozen() isn't required here,
-		 * because it will be called anyway a bit later from
-		 * do_freezer_trap(). So we avoid cgroup's transient switch
-		 * from the frozen state and back.
+		 * If the cgroup is to be killed notice it now and take the
+		 * child down right after we finished preparing it for
+		 * userspace.
 		 */
+		kill = test_bit(CGRP_KILL, &cgrp_flags);
 	}
 
 	spin_unlock_irq(&css_set_lock);
@@ -6135,6 +6230,10 @@ void cgroup_post_fork(struct task_struct *child,
 		put_css_set(rcset);
 	}
 
+	/* Cgroup has to be killed so take down child immediately. */
+	if (kill)
+		send_sig(SIGKILL, child, 0);
+
 	cgroup_css_set_put_fork(kargs);
 }
 

base-commit: 9f4ad9e425a1d3b6a34617b8ea226d56a119a717
-- 
2.27.0



* [PATCH v2 2/5] docs/cgroup: add entry for cgroup.kill
  2021-05-03 14:39 [PATCH v2 1/5] cgroup: introduce cgroup.kill Christian Brauner
@ 2021-05-03 14:39 ` Christian Brauner
  2021-05-05 16:29   ` Shakeel Butt
  2021-05-03 14:39 ` [PATCH v2 3/5] tests/cgroup: use cgroup.kill in cg_killall() Christian Brauner
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2021-05-03 14:39 UTC (permalink / raw)
  To: Tejun Heo, Roman Gushchin
  Cc: Shakeel Butt, Zefan Li, Johannes Weiner, cgroups, containers,
	Christian Brauner

From: Christian Brauner <christian.brauner@ubuntu.com>

Give a brief overview of the cgroup.kill functionality.

Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Roman Gushchin <guro@fb.com>:
  - Drop sentence that mentions combined usage of cgroup.kill and
    cgroup.freezer.
---
 Documentation/admin-guide/cgroup-v2.rst | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 64c62b979f2f..6adc749d863e 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -949,6 +949,21 @@ All cgroup core files are prefixed with "cgroup."
 	it's possible to delete a frozen (and empty) cgroup, as well as
 	create new sub-cgroups.
 
+  cgroup.kill
+	A write-only single value file which exists in non-root cgroups.
+	The only allowed value is "1".
+
+	Writing "1" to the file causes the cgroup and all descendant cgroups to
+	be killed. This means that all processes located in the affected cgroup
+	tree will be killed via SIGKILL.
+
+	Killing a cgroup tree will deal with concurrent forks appropriately and
+	is protected against migrations.
+
+	In a threaded cgroup, writing this file fails with EOPNOTSUPP as
+	killing cgroups is a process directed operation, i.e. it affects
+	the whole thread-group.
+
 Controllers
 ===========
 
-- 
2.27.0



* [PATCH v2 3/5] tests/cgroup: use cgroup.kill in cg_killall()
  2021-05-03 14:39 [PATCH v2 1/5] cgroup: introduce cgroup.kill Christian Brauner
  2021-05-03 14:39 ` [PATCH v2 2/5] docs/cgroup: add entry for cgroup.kill Christian Brauner
@ 2021-05-03 14:39 ` Christian Brauner
  2021-05-05 16:42   ` Shakeel Butt
  2021-05-03 14:39 ` [PATCH v2 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait() Christian Brauner
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2021-05-03 14:39 UTC (permalink / raw)
  To: Tejun Heo, Roman Gushchin
  Cc: Shakeel Butt, Zefan Li, Johannes Weiner, cgroups, containers,
	Christian Brauner

From: Christian Brauner <christian.brauner@ubuntu.com>

If the cgroup.kill file is supported, make use of it.

Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Roman Gushchin <guro@fb.com>:
  - Fix whitespace.
---
 tools/testing/selftests/cgroup/cgroup_util.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index 027014662fb2..f60f7d764690 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -252,6 +252,10 @@ int cg_killall(const char *cgroup)
 	char buf[PAGE_SIZE];
 	char *ptr = buf;
 
+	/* If cgroup.kill exists use it. */
+	if (!cg_write(cgroup, "cgroup.kill", "1"))
+		return 0;
+
 	if (cg_read(cgroup, "cgroup.procs", buf, sizeof(buf)))
 		return -1;
 
-- 
2.27.0



* [PATCH v2 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait()
  2021-05-03 14:39 [PATCH v2 1/5] cgroup: introduce cgroup.kill Christian Brauner
  2021-05-03 14:39 ` [PATCH v2 2/5] docs/cgroup: add entry for cgroup.kill Christian Brauner
  2021-05-03 14:39 ` [PATCH v2 3/5] tests/cgroup: use cgroup.kill in cg_killall() Christian Brauner
@ 2021-05-03 14:39 ` Christian Brauner
  2021-05-05 16:43   ` Shakeel Butt
  2021-05-03 14:39 ` [PATCH v2 5/5] tests/cgroup: test cgroup.kill Christian Brauner
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2021-05-03 14:39 UTC (permalink / raw)
  To: Tejun Heo, Roman Gushchin
  Cc: Shakeel Butt, Zefan Li, Johannes Weiner, cgroups, containers,
	Christian Brauner

From: Christian Brauner <christian.brauner@ubuntu.com>

as they will be used by the tests for cgroup killing.

Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged
---
 tools/testing/selftests/cgroup/cgroup_util.c  | 47 +++++++++++++++
 tools/testing/selftests/cgroup/cgroup_util.h  |  2 +
 tools/testing/selftests/cgroup/test_freezer.c | 57 -------------------
 3 files changed, 49 insertions(+), 57 deletions(-)

diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index f60f7d764690..623cec04ad42 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -5,10 +5,12 @@
 #include <errno.h>
 #include <fcntl.h>
 #include <linux/limits.h>
+#include <poll.h>
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
+#include <sys/inotify.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <sys/wait.h>
@@ -580,3 +582,48 @@ int clone_into_cgroup_run_wait(const char *cgroup)
 	(void)clone_reap(pid, WEXITED);
 	return 0;
 }
+
+int cg_prepare_for_wait(const char *cgroup)
+{
+	int fd, ret = -1;
+
+	fd = inotify_init1(0);
+	if (fd == -1)
+		return fd;
+
+	ret = inotify_add_watch(fd, cg_control(cgroup, "cgroup.events"),
+				IN_MODIFY);
+	if (ret == -1) {
+		close(fd);
+		fd = -1;
+	}
+
+	return fd;
+}
+
+int cg_wait_for(int fd)
+{
+	int ret = -1;
+	struct pollfd fds = {
+		.fd = fd,
+		.events = POLLIN,
+	};
+
+	while (true) {
+		ret = poll(&fds, 1, 10000);
+
+		if (ret == -1) {
+			if (errno == EINTR)
+				continue;
+
+			break;
+		}
+
+		if (ret > 0 && fds.revents & POLLIN) {
+			ret = 0;
+			break;
+		}
+	}
+
+	return ret;
+}
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h
index 5a1305dd1f0b..82e59cdf16e7 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -54,3 +54,5 @@ extern pid_t clone_into_cgroup(int cgroup_fd);
 extern int clone_reap(pid_t pid, int options);
 extern int clone_into_cgroup_run_wait(const char *cgroup);
 extern int dirfd_open_opath(const char *dir);
+extern int cg_prepare_for_wait(const char *cgroup);
+extern int cg_wait_for(int fd);
diff --git a/tools/testing/selftests/cgroup/test_freezer.c b/tools/testing/selftests/cgroup/test_freezer.c
index 23d8fa4a3e4e..ff519029f6f4 100644
--- a/tools/testing/selftests/cgroup/test_freezer.c
+++ b/tools/testing/selftests/cgroup/test_freezer.c
@@ -7,9 +7,7 @@
 #include <unistd.h>
 #include <stdio.h>
 #include <errno.h>
-#include <poll.h>
 #include <stdlib.h>
-#include <sys/inotify.h>
 #include <string.h>
 #include <sys/wait.h>
 
@@ -54,61 +52,6 @@ static int cg_freeze_nowait(const char *cgroup, bool freeze)
 	return cg_write(cgroup, "cgroup.freeze", freeze ? "1" : "0");
 }
 
-/*
- * Prepare for waiting on cgroup.events file.
- */
-static int cg_prepare_for_wait(const char *cgroup)
-{
-	int fd, ret = -1;
-
-	fd = inotify_init1(0);
-	if (fd == -1) {
-		debug("Error: inotify_init1() failed\n");
-		return fd;
-	}
-
-	ret = inotify_add_watch(fd, cg_control(cgroup, "cgroup.events"),
-				IN_MODIFY);
-	if (ret == -1) {
-		debug("Error: inotify_add_watch() failed\n");
-		close(fd);
-		fd = -1;
-	}
-
-	return fd;
-}
-
-/*
- * Wait for an event. If there are no events for 10 seconds,
- * treat this an error.
- */
-static int cg_wait_for(int fd)
-{
-	int ret = -1;
-	struct pollfd fds = {
-		.fd = fd,
-		.events = POLLIN,
-	};
-
-	while (true) {
-		ret = poll(&fds, 1, 10000);
-
-		if (ret == -1) {
-			if (errno == EINTR)
-				continue;
-			debug("Error: poll() failed\n");
-			break;
-		}
-
-		if (ret > 0 && fds.revents & POLLIN) {
-			ret = 0;
-			break;
-		}
-	}
-
-	return ret;
-}
-
 /*
  * Attach a task to the given cgroup and wait for a cgroup frozen event.
  * All transient events (e.g. populated) are ignored.
-- 
2.27.0



* [PATCH v2 5/5] tests/cgroup: test cgroup.kill
  2021-05-03 14:39 [PATCH v2 1/5] cgroup: introduce cgroup.kill Christian Brauner
                   ` (2 preceding siblings ...)
  2021-05-03 14:39 ` [PATCH v2 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait() Christian Brauner
@ 2021-05-03 14:39 ` Christian Brauner
  2021-05-05 18:34   ` Shakeel Butt
  2021-05-03 17:18 ` [PATCH v2 1/5] cgroup: introduce cgroup.kill Shakeel Butt
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2021-05-03 14:39 UTC (permalink / raw)
  To: Tejun Heo, Roman Gushchin
  Cc: Shakeel Butt, Zefan Li, Johannes Weiner, cgroups, containers,
	Christian Brauner

From: Christian Brauner <christian.brauner@ubuntu.com>

Test that the new cgroup.kill feature works as intended.

Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Remove debug logging (It wasn't very helpful in the first place.).
---
 tools/testing/selftests/cgroup/.gitignore  |   3 +-
 tools/testing/selftests/cgroup/Makefile    |   2 +
 tools/testing/selftests/cgroup/test_kill.c | 291 +++++++++++++++++++++
 3 files changed, 295 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/cgroup/test_kill.c

diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
index 84cfcabea838..be9643ef6285 100644
--- a/tools/testing/selftests/cgroup/.gitignore
+++ b/tools/testing/selftests/cgroup/.gitignore
@@ -2,4 +2,5 @@
 test_memcontrol
 test_core
 test_freezer
-test_kmem
\ No newline at end of file
+test_kmem
+test_kill
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index f027d933595b..59e222460581 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -9,6 +9,7 @@ TEST_GEN_PROGS = test_memcontrol
 TEST_GEN_PROGS += test_kmem
 TEST_GEN_PROGS += test_core
 TEST_GEN_PROGS += test_freezer
+TEST_GEN_PROGS += test_kill
 
 include ../lib.mk
 
@@ -16,3 +17,4 @@ $(OUTPUT)/test_memcontrol: cgroup_util.c ../clone3/clone3_selftests.h
 $(OUTPUT)/test_kmem: cgroup_util.c ../clone3/clone3_selftests.h
 $(OUTPUT)/test_core: cgroup_util.c ../clone3/clone3_selftests.h
 $(OUTPUT)/test_freezer: cgroup_util.c ../clone3/clone3_selftests.h
+$(OUTPUT)/test_kill: cgroup_util.c ../clone3/clone3_selftests.h ../pidfd/pidfd.h
diff --git a/tools/testing/selftests/cgroup/test_kill.c b/tools/testing/selftests/cgroup/test_kill.c
new file mode 100644
index 000000000000..3e1132deed33
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_kill.c
@@ -0,0 +1,291 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <errno.h>
+#include <linux/limits.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "../pidfd/pidfd.h"
+#include "cgroup_util.h"
+
+/*
+ * Kill the given cgroup and wait for the inotify signal.
+ * If there are no events in 10 seconds, treat this as an error.
+ * Then check that the cgroup is in the desired state.
+ */
+static int cg_kill_wait(const char *cgroup)
+{
+	int fd, ret = -1;
+
+	fd = cg_prepare_for_wait(cgroup);
+	if (fd < 0)
+		return fd;
+
+	ret = cg_write(cgroup, "cgroup.kill", "1");
+	if (ret)
+		goto out;
+
+	ret = cg_wait_for(fd);
+	if (ret)
+		goto out;
+
+	ret = cg_read_strcmp(cgroup, "cgroup.events", "populated 0\n");
+out:
+	close(fd);
+	return ret;
+}
+
+/*
+ * A simple process running in a sleep loop until being
+ * re-parented.
+ */
+static int child_fn(const char *cgroup, void *arg)
+{
+	int ppid = getppid();
+
+	while (getppid() == ppid)
+		usleep(1000);
+
+	return getppid() == ppid;
+}
+
+static int test_cgkill_simple(const char *root)
+{
+	pid_t pids[100];
+	int ret = KSFT_FAIL;
+	char *cgroup = NULL;
+	int i;
+
+	cgroup = cg_name(root, "cg_test_simple");
+	if (!cgroup)
+		goto cleanup;
+
+	if (cg_create(cgroup))
+		goto cleanup;
+
+	for (i = 0; i < 100; i++)
+		pids[i] = cg_run_nowait(cgroup, child_fn, NULL);
+
+	if (cg_wait_for_proc_count(cgroup, 100))
+		goto cleanup;
+
+	if (cg_write(cgroup, "cgroup.kill", "1"))
+		goto cleanup;
+
+	if (cg_read_strcmp(cgroup, "cgroup.events", "populated 1\n"))
+		goto cleanup;
+
+	if (cg_kill_wait(cgroup))
+		goto cleanup;
+
+	if (cg_read_strcmp(cgroup, "cgroup.events", "populated 0\n"))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	for (i = 0; i < 100; i++)
+		wait_for_pid(pids[i]);
+
+	if (cgroup)
+		cg_destroy(cgroup);
+	free(cgroup);
+	return ret;
+}
+
+/*
+ * The test creates the following hierarchy:
+ *       A
+ *    / / \ \
+ *   B  E  I K
+ *  /\  |
+ * C  D F
+ *      |
+ *      G
+ *      |
+ *      H
+ *
+ * with a process in C, H and 3 processes in K.
+ * Then it tries to kill the whole tree.
+ */
+static int test_cgkill_tree(const char *root)
+{
+	pid_t pids[5];
+	char *cgroup[10] = {0};
+	int ret = KSFT_FAIL;
+	int i;
+
+	cgroup[0] = cg_name(root, "cg_test_tree_A");
+	if (!cgroup[0])
+		goto cleanup;
+
+	cgroup[1] = cg_name(cgroup[0], "B");
+	if (!cgroup[1])
+		goto cleanup;
+
+	cgroup[2] = cg_name(cgroup[1], "C");
+	if (!cgroup[2])
+		goto cleanup;
+
+	cgroup[3] = cg_name(cgroup[1], "D");
+	if (!cgroup[3])
+		goto cleanup;
+
+	cgroup[4] = cg_name(cgroup[0], "E");
+	if (!cgroup[4])
+		goto cleanup;
+
+	cgroup[5] = cg_name(cgroup[4], "F");
+	if (!cgroup[5])
+		goto cleanup;
+
+	cgroup[6] = cg_name(cgroup[5], "G");
+	if (!cgroup[6])
+		goto cleanup;
+
+	cgroup[7] = cg_name(cgroup[6], "H");
+	if (!cgroup[7])
+		goto cleanup;
+
+	cgroup[8] = cg_name(cgroup[0], "I");
+	if (!cgroup[8])
+		goto cleanup;
+
+	cgroup[9] = cg_name(cgroup[0], "K");
+	if (!cgroup[9])
+		goto cleanup;
+
+	for (i = 0; i < 10; i++)
+		if (cg_create(cgroup[i]))
+			goto cleanup;
+
+	pids[0] = cg_run_nowait(cgroup[2], child_fn, NULL);
+	pids[1] = cg_run_nowait(cgroup[7], child_fn, NULL);
+	pids[2] = cg_run_nowait(cgroup[9], child_fn, NULL);
+	pids[3] = cg_run_nowait(cgroup[9], child_fn, NULL);
+	pids[4] = cg_run_nowait(cgroup[9], child_fn, NULL);
+
+	/*
+	 * Wait until all child processes will enter
+	 * corresponding cgroups.
+	 */
+
+	if (cg_wait_for_proc_count(cgroup[2], 1) ||
+	    cg_wait_for_proc_count(cgroup[7], 1) ||
+	    cg_wait_for_proc_count(cgroup[9], 3))
+		goto cleanup;
+
+	/*
+	 * Kill A and check that we get an empty notification.
+	 */
+	if (cg_kill_wait(cgroup[0]))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	for (i = 0; i < 5; i++)
+		wait_for_pid(pids[i]);
+
+	for (i = 9; i >= 0 && cgroup[i]; i--) {
+		cg_destroy(cgroup[i]);
+		free(cgroup[i]);
+	}
+
+	return ret;
+}
+
+static int forkbomb_fn(const char *cgroup, void *arg)
+{
+	int ppid;
+
+	fork();
+	fork();
+
+	ppid = getppid();
+
+	while (getppid() == ppid)
+		usleep(1000);
+
+	return getppid() == ppid;
+}
+
+/*
+ * The test runs a fork bomb in a cgroup and tries to kill it.
+ */
+static int test_cgkill_forkbomb(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cgroup = NULL;
+	pid_t pid = -ESRCH;
+
+	cgroup = cg_name(root, "cg_forkbomb_test");
+	if (!cgroup)
+		goto cleanup;
+
+	if (cg_create(cgroup))
+		goto cleanup;
+
+	pid = cg_run_nowait(cgroup, forkbomb_fn, NULL);
+	if (pid < 0)
+		goto cleanup;
+
+	usleep(100000);
+
+	if (cg_kill_wait(cgroup))
+		goto cleanup;
+
+	if (cg_wait_for_proc_count(cgroup, 0))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (cgroup)
+		cg_destroy(cgroup);
+	if (pid > 0)
+		wait_for_pid(pid);
+	free(cgroup);
+	return ret;
+}
+
+#define T(x) { x, #x }
+struct cgkill_test {
+	int (*fn)(const char *root);
+	const char *name;
+} tests[] = {
+	T(test_cgkill_simple),
+	T(test_cgkill_tree),
+	T(test_cgkill_forkbomb),
+};
+#undef T
+
+int main(int argc, char *argv[])
+{
+	char root[PATH_MAX];
+	int i, ret = EXIT_SUCCESS;
+
+	if (cg_find_unified_root(root, sizeof(root)))
+		ksft_exit_skip("cgroup v2 isn't mounted\n");
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		switch (tests[i].fn(root)) {
+		case KSFT_PASS:
+			ksft_test_result_pass("%s\n", tests[i].name);
+			break;
+		case KSFT_SKIP:
+			ksft_test_result_skip("%s\n", tests[i].name);
+			break;
+		default:
+			ret = EXIT_FAILURE;
+			ksft_test_result_fail("%s\n", tests[i].name);
+			break;
+		}
+	}
+
+	return ret;
+}
-- 
2.27.0



* Re: [PATCH v2 1/5] cgroup: introduce cgroup.kill
  2021-05-03 14:39 [PATCH v2 1/5] cgroup: introduce cgroup.kill Christian Brauner
                   ` (3 preceding siblings ...)
  2021-05-03 14:39 ` [PATCH v2 5/5] tests/cgroup: test cgroup.kill Christian Brauner
@ 2021-05-03 17:18 ` Shakeel Butt
  2021-05-04  1:47 ` Serge E. Hallyn
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Shakeel Butt @ 2021-05-03 17:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Roman Gushchin, Zefan Li, Johannes Weiner, Cgroups,
	containers, Christian Brauner

On Mon, May 3, 2021 at 7:40 AM Christian Brauner <brauner@kernel.org> wrote:
>
> From: Christian Brauner <christian.brauner@ubuntu.com>
>
> Introduce the cgroup.kill file. It does what it says on the tin and
> allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> The file is available in non-root cgroups.
>
> Killing cgroups is a process directed operation, i.e. the whole
> thread-group is affected. Consequently trying to write to cgroup.kill in
> threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> aligns with cgroup.procs where reads in threaded-cgroups are rejected
> with EOPNOTSUPP.
>
> The cgroup.kill file is write-only since killing a cgroup is an event,
> not a state, which makes it different from e.g. the freezer where a
> cgroup transitions between two states.
>
> As with all new cgroup features cgroup.kill is recursive by default.
>
> Killing a cgroup is protected against concurrent migrations through the
> cgroup mutex. To protect against forkbombs and to mitigate the effect of
> racing forks a new CGRP_KILL css set lock protected flag is introduced
> that is set prior to killing a cgroup and unset after the cgroup has
> been killed. We can then check in cgroup_post_fork() where we hold the
> css set lock already whether the cgroup is currently being killed. If so
> we send the child a SIGKILL signal immediately taking it down as soon as
> it returns to userspace. To make the killing of the child semantically
> clean it is killed after all cgroup attachment operations have been
> finalized.
>
> There are various use-cases of this interface:
> - Containers usually have a conservative layout where each container
>   usually has a delegated cgroup. For such layouts there is a 1:1
>   mapping between container and cgroup. If the container in addition
>   uses a separate pid namespace then killing a container usually becomes
>   a simple kill -9 <container-init-pid> from an ancestor pid namespace.
>   However, there are quite a few scenarios where that isn't true. For
>   example, there are containers that deliberately share their cgroup
>   with other processes that are supposed to be bound to the lifetime of
>   the container but are not in the container's pid namespace. There are
>   also containers that live in a delegated cgroup but share the pid
>   namespace with the host or other containers.
> - Service managers such as systemd use cgroups to group and organize
>   processes belonging to a service. They usually rely on a recursive
>   algorithm now to kill a service. With cgroup.kill this becomes a
>   simple write to cgroup.kill.
> - Userspace OOM implementations can make good use of this feature to
>   efficiently take down whole cgroups quickly.
> - The kill program can gain a new
>   kill --cgroup /sys/fs/cgroup/delegated
>   flag to take down cgroups.
>
> A few observations about the semantics:
> - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
>   not specified we are not taking cgroup mutex meaning the cgroup can be
>   killed while a process in that cgroup is forking.
>   If the kill request happens right before cgroup_can_fork() and before
>   the parent grabs its siglock the parent is guaranteed to see the
>   pending SIGKILL. In addition we perform another check in
>   cgroup_post_fork() whether the cgroup is being killed and if so take
>   down the child (see above). This is robust enough and protects against
>   forkbombs. If userspace really wants stricter
>   protection the simple solution would be to grab the write side of the
>   cgroup threadgroup rwsem which will force all ongoing forks to
>   complete before killing starts. We concluded that this is not
>   necessary as the semantics for concurrent forking should simply align
>   with freezer where a similar check as cgroup_post_fork() is performed.
>
>   For all other cases CLONE_INTO_CGROUP is required. In this case we
>   will grab the cgroup mutex so the cgroup can't be killed while we
>   fork. Once we're done with the fork and have dropped cgroup mutex we
>   are visible and will be found by any subsequent kill request.
> - We obviously don't kill kthreads. This means a cgroup that has a
>   kthread will not become empty after killing and consequently no
>   unpopulated event will be generated. The assumption is that kthreads
>   should be in the root cgroup only anyway so this is not an issue.
> - We skip killing tasks that already have pending fatal signals.
> - Freezer doesn't care about tasks in different pid namespaces, i.e. if
>   you have two tasks in different pid namespaces the cgroup would still
>   be frozen. The cgroup.kill mechanism consequently behaves the same
>   way, i.e. we kill all processes and ignore in which pid namespace they
>   exist.
> - If the caller is located in a cgroup that is killed the caller will
>   obviously be killed as well.
>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: cgroups@vger.kernel.org
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/5] cgroup: introduce cgroup.kill
  2021-05-03 14:39 [PATCH v2 1/5] cgroup: introduce cgroup.kill Christian Brauner
                   ` (4 preceding siblings ...)
  2021-05-03 17:18 ` [PATCH v2 1/5] cgroup: introduce cgroup.kill Shakeel Butt
@ 2021-05-04  1:47 ` Serge E. Hallyn
  2021-05-04 18:29 ` Shakeel Butt
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Serge E. Hallyn @ 2021-05-04  1:47 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Roman Gushchin, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups, containers, Christian Brauner

On Mon, May 03, 2021 at 04:39:19PM +0200, Christian Brauner wrote:
> From: Christian Brauner <christian.brauner@ubuntu.com>
> 
> Introduce the cgroup.kill file. It does what it says on the tin and
> allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> The file is available in non-root cgroups.
> 
> Killing cgroups is a process directed operation, i.e. the whole
> thread-group is affected. Consequently trying to write to cgroup.kill in
> threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> aligns with cgroup.procs where reads in threaded-cgroups are rejected
> with EOPNOTSUPP.
> 
> The cgroup.kill file is write-only since killing a cgroup is an event,
> not a state, which makes it different from e.g. freezer where a cgroup
> transitions between two states.
> 
> As with all new cgroup features cgroup.kill is recursive by default.
> 
> Killing a cgroup is protected against concurrent migrations through the
> cgroup mutex. To protect against forkbombs and to mitigate the effect of
> racing forks a new CGRP_KILL css set lock protected flag is introduced
> that is set prior to killing a cgroup and unset after the cgroup has
> been killed. We can then check in cgroup_post_fork() where we hold the
> css set lock already whether the cgroup is currently being killed. If so
> we send the child a SIGKILL signal immediately taking it down as soon as
> it returns to userspace. To make the killing of the child semantically
> clean it is killed after all cgroup attachment operations have been
> finalized.
> 
> There are various use-cases of this interface:
> - Containers usually have a conservative layout where each container
>   usually has a delegated cgroup. For such layouts there is a 1:1
>   mapping between container and cgroup. If the container in addition
>   uses a separate pid namespace then killing a container usually becomes
>   a simple kill -9 <container-init-pid> from an ancestor pid namespace.
>   However, there are quite a few scenarios where that isn't true. For
>   example, some containers deliberately share their cgroup with other
>   processes that are supposed to be bound to the container's lifetime
>   but are not in the container's pid namespace, and some containers
>   live in a delegated cgroup but share the pid namespace with the host
>   or other containers.
> - Service managers such as systemd use cgroups to group and organize
>   processes belonging to a service. They currently rely on a recursive
>   algorithm to kill a service. With cgroup.kill this becomes a simple
>   write to cgroup.kill.
> - Userspace OOM implementations can make good use of this feature to
>   efficiently take down whole cgroups quickly.
> - The kill program can gain a new
>   kill --cgroup /sys/fs/cgroup/delegated
>   flag to take down cgroups.
> 
> A few observations about the semantics:
> - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
>   not specified we are not taking cgroup mutex meaning the cgroup can be
>   killed while a process in that cgroup is forking.
>   If the kill request happens right before cgroup_can_fork() and before
>   the parent grabs its siglock the parent is guaranteed to see the
>   pending SIGKILL. In addition we perform another check in
>   cgroup_post_fork() whether the cgroup is being killed and if so take
>   down the child (see above). This is robust enough and protects against
>   forkbombs. If userspace really wants stricter
>   protection the simple solution would be to grab the write side of the
>   cgroup threadgroup rwsem which will force all ongoing forks to
>   complete before killing starts. We concluded that this is not
>   necessary as the semantics for concurrent forking should simply align
>   with freezer where a similar check as cgroup_post_fork() is performed.
> 
>   For all other cases CLONE_INTO_CGROUP is required. In this case we
>   will grab the cgroup mutex so the cgroup can't be killed while we
>   fork. Once we're done with the fork and have dropped cgroup mutex we
>   are visible and will be found by any subsequent kill request.
> - We obviously don't kill kthreads. This means a cgroup that has a
>   kthread will not become empty after killing and consequently no
>   unpopulated event will be generated. The assumption is that kthreads
>   should be in the root cgroup only anyway so this is not an issue.
> - We skip killing tasks that already have pending fatal signals.
> - Freezer doesn't care about tasks in different pid namespaces, i.e. if
>   you have two tasks in different pid namespaces the cgroup would still
>   be frozen. The cgroup.kill mechanism consequently behaves the same
>   way, i.e. we kill all processes and ignore in which pid namespace they
>   exist.
> - If the caller is located in a cgroup that is killed the caller will
>   obviously be killed as well.
> 
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: cgroups@vger.kernel.org
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> ---
> 
> The series can be pulled from
> 
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/cgroup.kill.v5.14
> 
> /* v2 */
> - Roman Gushchin <guro@fb.com>:
>   - Retrieve cgrp->flags only once and check CGRP_* bits on it.
> ---
>  include/linux/cgroup-defs.h |   3 +
>  kernel/cgroup/cgroup.c      | 127 ++++++++++++++++++++++++++++++++----
>  2 files changed, 116 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 559ee05f86b2..43fef771009a 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -71,6 +71,9 @@ enum {
>  
>  	/* Cgroup is frozen. */
>  	CGRP_FROZEN,
> +
> +	/* Control group has to be killed. */
> +	CGRP_KILL,
>  };
>  
>  /* cgroup_root->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 9153b20e5cc6..aee84b99534a 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3654,6 +3654,80 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>  
> +static void __cgroup_kill(struct cgroup *cgrp)
> +{
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +
> +	lockdep_assert_held(&cgroup_mutex);
> +
> +	spin_lock_irq(&css_set_lock);
> +	set_bit(CGRP_KILL, &cgrp->flags);
> +	spin_unlock_irq(&css_set_lock);
> +
> +	css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
> +	while ((task = css_task_iter_next(&it))) {
> +		/* Ignore kernel threads here. */
> +		if (task->flags & PF_KTHREAD)
> +			continue;
> +
> +		/* Skip tasks that are already dying. */
> +		if (__fatal_signal_pending(task))
> +			continue;
> +
> +		send_sig(SIGKILL, task, 0);
> +	}
> +	css_task_iter_end(&it);
> +
> +	spin_lock_irq(&css_set_lock);
> +	clear_bit(CGRP_KILL, &cgrp->flags);
> +	spin_unlock_irq(&css_set_lock);
> +}
> +
> +static void cgroup_kill(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys_state *css;
> +	struct cgroup *dsct;
> +
> +	lockdep_assert_held(&cgroup_mutex);
> +
> +	cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
> +		__cgroup_kill(dsct);
> +}
> +
> +static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
> +				 size_t nbytes, loff_t off)
> +{
> +	ssize_t ret = 0;
> +	int kill;
> +	struct cgroup *cgrp;
> +
> +	ret = kstrtoint(strstrip(buf), 0, &kill);
> +	if (ret)
> +		return ret;
> +
> +	if (kill != 1)
> +		return -ERANGE;
> +
> +	cgrp = cgroup_kn_lock_live(of->kn, false);
> +	if (!cgrp)
> +		return -ENOENT;
> +
> +	/*
> +	 * Killing is a process directed operation, i.e. the whole thread-group
> +	 * is taken down so act like we do for cgroup.procs and only make this
> +	 * writable in non-threaded cgroups.
> +	 */
> +	if (cgroup_is_threaded(cgrp))
> +		ret = -EOPNOTSUPP;
> +	else
> +		cgroup_kill(cgrp);
> +
> +	cgroup_kn_unlock(of->kn);
> +
> +	return ret ?: nbytes;
> +}
> +
>  static int cgroup_file_open(struct kernfs_open_file *of)
>  {
>  	struct cftype *cft = of_cft(of);
> @@ -4846,6 +4920,11 @@ static struct cftype cgroup_base_files[] = {
>  		.seq_show = cgroup_freeze_show,
>  		.write = cgroup_freeze_write,
>  	},
> +	{
> +		.name = "cgroup.kill",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.write = cgroup_kill_write,
> +	},
>  	{
>  		.name = "cpu.stat",
>  		.seq_show = cpu_stat_show,
> @@ -6077,6 +6156,8 @@ void cgroup_post_fork(struct task_struct *child,
>  		      struct kernel_clone_args *kargs)
>  	__releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
>  {
> +	unsigned long cgrp_flags = 0;
> +	bool kill = false;
>  	struct cgroup_subsys *ss;
>  	struct css_set *cset;
>  	int i;
> @@ -6088,6 +6169,11 @@ void cgroup_post_fork(struct task_struct *child,
>  
>  	/* init tasks are special, only link regular threads */
>  	if (likely(child->pid)) {
> +		if (kargs->cgrp)
> +			cgrp_flags = kargs->cgrp->flags;
> +		else
> +			cgrp_flags = cset->dfl_cgrp->flags;
> +
>  		WARN_ON_ONCE(!list_empty(&child->cg_list));
>  		cset->nr_tasks++;
>  		css_set_move_task(child, NULL, cset, false);
> @@ -6096,23 +6182,32 @@ void cgroup_post_fork(struct task_struct *child,
>  		cset = NULL;
>  	}
>  
> -	/*
> -	 * If the cgroup has to be frozen, the new task has too.  Let's set
> -	 * the JOBCTL_TRAP_FREEZE jobctl bit to get the task into the
> -	 * frozen state.
> -	 */
> -	if (unlikely(cgroup_task_freeze(child))) {
> -		spin_lock(&child->sighand->siglock);
> -		WARN_ON_ONCE(child->frozen);
> -		child->jobctl |= JOBCTL_TRAP_FREEZE;
> -		spin_unlock(&child->sighand->siglock);
> +	if (!(child->flags & PF_KTHREAD)) {
> +		if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
> +			/*
> +			 * If the cgroup has to be frozen, the new task has
> +			 * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
> +			 * get the task into the frozen state.
> +			 */
> +			spin_lock(&child->sighand->siglock);
> +			WARN_ON_ONCE(child->frozen);
> +			child->jobctl |= JOBCTL_TRAP_FREEZE;
> +			spin_unlock(&child->sighand->siglock);
> +
> +			/*
> +			 * Calling cgroup_update_frozen() isn't required here,
> +			 * because it will be called anyway a bit later from
> +			 * do_freezer_trap(). So we avoid cgroup's transient
> +			 * switch from the frozen state and back.
> +			 */
> +		}
>  
>  		/*
> -		 * Calling cgroup_update_frozen() isn't required here,
> -		 * because it will be called anyway a bit later from
> -		 * do_freezer_trap(). So we avoid cgroup's transient switch
> -		 * from the frozen state and back.
> +		 * If the cgroup is to be killed notice it now and take the
> +		 * child down right after we finished preparing it for
> +		 * userspace.
>  		 */
> +		kill = test_bit(CGRP_KILL, &cgrp_flags);
>  	}
>  
>  	spin_unlock_irq(&css_set_lock);
> @@ -6135,6 +6230,10 @@ void cgroup_post_fork(struct task_struct *child,
>  		put_css_set(rcset);
>  	}
>  
> +	/* Cgroup has to be killed so take down child immediately. */
> +	if (kill)
> +		send_sig(SIGKILL, child, 0);
> +
>  	cgroup_css_set_put_fork(kargs);
>  }
>  
> 
> base-commit: 9f4ad9e425a1d3b6a34617b8ea226d56a119a717
> -- 
> 2.27.0
> 


* Re: [PATCH v2 1/5] cgroup: introduce cgroup.kill
  2021-05-03 14:39 [PATCH v2 1/5] cgroup: introduce cgroup.kill Christian Brauner
                   ` (5 preceding siblings ...)
  2021-05-04  1:47 ` Serge E. Hallyn
@ 2021-05-04 18:29 ` Shakeel Butt
  2021-05-05 17:57 ` Roman Gushchin
  2021-05-05 18:31 ` Eric W. Biederman
  8 siblings, 0 replies; 18+ messages in thread
From: Shakeel Butt @ 2021-05-04 18:29 UTC (permalink / raw)
  To: Christian Brauner, Michal Hocko
  Cc: Tejun Heo, Roman Gushchin, Zefan Li, Johannes Weiner, Cgroups,
	containers, Christian Brauner, Linux MM

+Michal Hocko

On Mon, May 3, 2021 at 7:40 AM Christian Brauner <brauner@kernel.org> wrote:
>
> From: Christian Brauner <christian.brauner@ubuntu.com>
>
> Introduce the cgroup.kill file. It does what it says on the tin and
> allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> The file is available in non-root cgroups.
>
> Killing cgroups is a process directed operation, i.e. the whole
> thread-group is affected. Consequently trying to write to cgroup.kill in
> threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> aligns with cgroup.procs where reads in threaded-cgroups are rejected
> with EOPNOTSUPP.
>
> The cgroup.kill file is write-only since killing a cgroup is an event,
> not a state, which makes it different from e.g. freezer where a cgroup
> transitions between two states.
>
> As with all new cgroup features cgroup.kill is recursive by default.
>
> Killing a cgroup is protected against concurrent migrations through the
> cgroup mutex. To protect against forkbombs and to mitigate the effect of
> racing forks a new CGRP_KILL css set lock protected flag is introduced
> that is set prior to killing a cgroup and unset after the cgroup has
> been killed. We can then check in cgroup_post_fork() where we hold the
> css set lock already whether the cgroup is currently being killed. If so
> we send the child a SIGKILL signal immediately taking it down as soon as
> it returns to userspace. To make the killing of the child semantically
> clean it is killed after all cgroup attachment operations have been
> finalized.
>
> There are various use-cases of this interface:
> - Containers usually have a conservative layout where each container
>   usually has a delegated cgroup. For such layouts there is a 1:1
>   mapping between container and cgroup. If the container in addition
>   uses a separate pid namespace then killing a container usually becomes
>   a simple kill -9 <container-init-pid> from an ancestor pid namespace.
>   However, there are quite a few scenarios where that isn't true. For
>   example, some containers deliberately share their cgroup with other
>   processes that are supposed to be bound to the container's lifetime
>   but are not in the container's pid namespace, and some containers
>   live in a delegated cgroup but share the pid namespace with the host
>   or other containers.
> - Service managers such as systemd use cgroups to group and organize
>   processes belonging to a service. They currently rely on a recursive
>   algorithm to kill a service. With cgroup.kill this becomes a simple
>   write to cgroup.kill.
> - Userspace OOM implementations can make good use of this feature to
>   efficiently take down whole cgroups quickly.

Just to further add the motivation for userspace oom-killers. Instead
of traversing the tree, opening cgroup.procs and manually killing the
processes under memory pressure, the userspace oom-killer can just
keep the list of cgroup.kill files open and just write '1' when it
decides to trigger the oom-kill.

Michal, what do you think of putting the processes killed through this
interface into the oom_reaper_list as well? Will there be any
concerns?

> - The kill program can gain a new
>   kill --cgroup /sys/fs/cgroup/delegated
>   flag to take down cgroups.
>
> A few observations about the semantics:
> - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
>   not specified we are not taking cgroup mutex meaning the cgroup can be
>   killed while a process in that cgroup is forking.
>   If the kill request happens right before cgroup_can_fork() and before
>   the parent grabs its siglock the parent is guaranteed to see the
>   pending SIGKILL. In addition we perform another check in
>   cgroup_post_fork() whether the cgroup is being killed and if so take
>   down the child (see above). This is robust enough and protects against
>   forkbombs. If userspace really wants stricter
>   protection the simple solution would be to grab the write side of the
>   cgroup threadgroup rwsem which will force all ongoing forks to
>   complete before killing starts. We concluded that this is not
>   necessary as the semantics for concurrent forking should simply align
>   with freezer where a similar check as cgroup_post_fork() is performed.
>
>   For all other cases CLONE_INTO_CGROUP is required. In this case we
>   will grab the cgroup mutex so the cgroup can't be killed while we
>   fork. Once we're done with the fork and have dropped cgroup mutex we
>   are visible and will be found by any subsequent kill request.
> - We obviously don't kill kthreads. This means a cgroup that has a
>   kthread will not become empty after killing and consequently no
>   unpopulated event will be generated. The assumption is that kthreads
>   should be in the root cgroup only anyway so this is not an issue.
> - We skip killing tasks that already have pending fatal signals.
> - Freezer doesn't care about tasks in different pid namespaces, i.e. if
>   you have two tasks in different pid namespaces the cgroup would still
>   be frozen. The cgroup.kill mechanism consequently behaves the same
>   way, i.e. we kill all processes and ignore in which pid namespace they
>   exist.
> - If the caller is located in a cgroup that is killed the caller will
>   obviously be killed as well.
>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: cgroups@vger.kernel.org
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
[...]


* Re: [PATCH v2 2/5] docs/cgroup: add entry for cgroup.kill
  2021-05-03 14:39 ` [PATCH v2 2/5] docs/cgroup: add entry for cgroup.kill Christian Brauner
@ 2021-05-05 16:29   ` Shakeel Butt
  0 siblings, 0 replies; 18+ messages in thread
From: Shakeel Butt @ 2021-05-05 16:29 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Roman Gushchin, Zefan Li, Johannes Weiner, Cgroups,
	containers, Christian Brauner

On Mon, May 3, 2021 at 7:40 AM Christian Brauner <brauner@kernel.org> wrote:
>
> From: Christian Brauner <christian.brauner@ubuntu.com>
>
> Give a brief overview of the cgroup.kill functionality.
>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: cgroups@vger.kernel.org
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH v2 3/5] tests/cgroup: use cgroup.kill in cg_killall()
  2021-05-03 14:39 ` [PATCH v2 3/5] tests/cgroup: use cgroup.kill in cg_killall() Christian Brauner
@ 2021-05-05 16:42   ` Shakeel Butt
  0 siblings, 0 replies; 18+ messages in thread
From: Shakeel Butt @ 2021-05-05 16:42 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Roman Gushchin, Zefan Li, Johannes Weiner, Cgroups,
	containers, Christian Brauner

On Mon, May 3, 2021 at 7:40 AM Christian Brauner <brauner@kernel.org> wrote:
>
> From: Christian Brauner <christian.brauner@ubuntu.com>
>
> If cgroup.kill file is supported make use of it.
>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: cgroups@vger.kernel.org
> Acked-by: Roman Gushchin <guro@fb.com>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> ---
> /* v2 */
> - Roman Gushchin <guro@fb.com>:
>   - Fix whitespace.
> ---
>  tools/testing/selftests/cgroup/cgroup_util.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
> index 027014662fb2..f60f7d764690 100644
> --- a/tools/testing/selftests/cgroup/cgroup_util.c
> +++ b/tools/testing/selftests/cgroup/cgroup_util.c
> @@ -252,6 +252,10 @@ int cg_killall(const char *cgroup)
>         char buf[PAGE_SIZE];
>         char *ptr = buf;
>
> +       /* If cgroup.kill exists use it. */
> +       if (!cg_write(cgroup, "cgroup.kill", "1"))
> +               return 0;

Now cg_killall will kill all the processes in the tree rooted at
cgroup which I think is fine.

> +
>         if (cg_read(cgroup, "cgroup.procs", buf, sizeof(buf)))
>                 return -1;
>
> --
> 2.27.0
>


* Re: [PATCH v2 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait()
  2021-05-03 14:39 ` [PATCH v2 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait() Christian Brauner
@ 2021-05-05 16:43   ` Shakeel Butt
  0 siblings, 0 replies; 18+ messages in thread
From: Shakeel Butt @ 2021-05-05 16:43 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Roman Gushchin, Zefan Li, Johannes Weiner, Cgroups,
	containers, Christian Brauner

On Mon, May 3, 2021 at 7:40 AM Christian Brauner <brauner@kernel.org> wrote:
>
> From: Christian Brauner <christian.brauner@ubuntu.com>
>
> as they will be used by the tests for cgroup killing.
>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: cgroups@vger.kernel.org
> Acked-by: Roman Gushchin <guro@fb.com>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH v2 1/5] cgroup: introduce cgroup.kill
  2021-05-03 14:39 [PATCH v2 1/5] cgroup: introduce cgroup.kill Christian Brauner
                   ` (6 preceding siblings ...)
  2021-05-04 18:29 ` Shakeel Butt
@ 2021-05-05 17:57 ` Roman Gushchin
  2021-05-05 18:46   ` Christian Brauner
  2021-05-05 18:31 ` Eric W. Biederman
  8 siblings, 1 reply; 18+ messages in thread
From: Roman Gushchin @ 2021-05-05 17:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Shakeel Butt, Zefan Li, Johannes Weiner, cgroups,
	containers, Christian Brauner

On Mon, May 03, 2021 at 04:39:19PM +0200, Christian Brauner wrote:
> From: Christian Brauner <christian.brauner@ubuntu.com>
> 
> Introduce the cgroup.kill file. It does what it says on the tin and
> allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> The file is available in non-root cgroups.
> 
> Killing cgroups is a process directed operation, i.e. the whole
> thread-group is affected. Consequently trying to write to cgroup.kill in
> threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> aligns with cgroup.procs where reads in threaded-cgroups are rejected
> with EOPNOTSUPP.
> 
> The cgroup.kill file is write-only since killing a cgroup is an event,
> not a state, which makes it different from e.g. freezer where a cgroup
> transitions between two states.
> 
> As with all new cgroup features cgroup.kill is recursive by default.
> 
> Killing a cgroup is protected against concurrent migrations through the
> cgroup mutex. To protect against forkbombs and to mitigate the effect of
> racing forks a new CGRP_KILL css set lock protected flag is introduced
> that is set prior to killing a cgroup and unset after the cgroup has
> been killed. We can then check in cgroup_post_fork() where we hold the
> css set lock already whether the cgroup is currently being killed. If so
> we send the child a SIGKILL signal immediately taking it down as soon as
> it returns to userspace. To make the killing of the child semantically
> clean it is killed after all cgroup attachment operations have been
> finalized.
> 
> There are various use-cases of this interface:
> - Containers usually have a conservative layout where each container
>   usually has a delegated cgroup. For such layouts there is a 1:1
>   mapping between container and cgroup. If the container in addition
>   uses a separate pid namespace then killing a container usually becomes
>   a simple kill -9 <container-init-pid> from an ancestor pid namespace.
>   However, there are quite a few scenarios where that isn't true. For
>   example, some containers deliberately share their cgroup with other
>   processes that are supposed to be bound to the container's lifetime
>   but are not in the container's pid namespace, and some containers
>   live in a delegated cgroup but share the pid namespace with the host
>   or other containers.
> - Service managers such as systemd use cgroups to group and organize
>   processes belonging to a service. They currently rely on a recursive
>   algorithm to kill a service. With cgroup.kill this becomes a simple
>   write to cgroup.kill.
> - Userspace OOM implementations can make good use of this feature to
>   efficiently take down whole cgroups quickly.
> - The kill program can gain a new
>   kill --cgroup /sys/fs/cgroup/delegated
>   flag to take down cgroups.
> 
> A few observations about the semantics:
> - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
>   not specified we are not taking cgroup mutex meaning the cgroup can be
>   killed while a process in that cgroup is forking.
>   If the kill request happens right before cgroup_can_fork() and before
>   the parent grabs its siglock the parent is guaranteed to see the
>   pending SIGKILL. In addition we perform another check in
>   cgroup_post_fork() whether the cgroup is being killed and if so take
>   down the child (see above). This is robust enough and protects against
>   forkbombs. If userspace really wants stricter
>   protection the simple solution would be to grab the write side of the
>   cgroup threadgroup rwsem which will force all ongoing forks to
>   complete before killing starts. We concluded that this is not
>   necessary as the semantics for concurrent forking should simply align
>   with freezer where a similar check as cgroup_post_fork() is performed.
> 
>   For all other cases CLONE_INTO_CGROUP is required. In this case we
>   will grab the cgroup mutex so the cgroup can't be killed while we
>   fork. Once we're done with the fork and have dropped cgroup mutex we
>   are visible and will be found by any subsequent kill request.
> - We obviously don't kill kthreads. This means a cgroup that has a
>   kthread will not become empty after killing and consequently no
>   unpopulated event will be generated. The assumption is that kthreads
>   should be in the root cgroup only anyway so this is not an issue.
> - We skip killing tasks that already have pending fatal signals.
> - Freezer doesn't care about tasks in different pid namespaces, i.e. if
>   you have two tasks in different pid namespaces the cgroup would still
>   be frozen. The cgroup.kill mechanism consequently behaves the same
>   way, i.e. we kill all processes and ignore in which pid namespace they
>   exist.
> - If the caller is located in a cgroup that is killed the caller will
>   obviously be killed as well.
> 
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: cgroups@vger.kernel.org
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
> 
> The series can be pulled from
> 
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/cgroup.kill.v5.14
> 
> /* v2 */
> - Roman Gushchin <guro@fb.com>:
>   - Retrieve cgrp->flags only once and check CGRP_* bits on it.
> ---
>  include/linux/cgroup-defs.h |   3 +
>  kernel/cgroup/cgroup.c      | 127 ++++++++++++++++++++++++++++++++----
>  2 files changed, 116 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 559ee05f86b2..43fef771009a 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -71,6 +71,9 @@ enum {
>  
>  	/* Cgroup is frozen. */
>  	CGRP_FROZEN,
> +
> +	/* Control group has to be killed. */
> +	CGRP_KILL,
>  };
>  
>  /* cgroup_root->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 9153b20e5cc6..aee84b99534a 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3654,6 +3654,80 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>  
> +static void __cgroup_kill(struct cgroup *cgrp)
> +{
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +
> +	lockdep_assert_held(&cgroup_mutex);
> +
> +	spin_lock_irq(&css_set_lock);
> +	set_bit(CGRP_KILL, &cgrp->flags);
> +	spin_unlock_irq(&css_set_lock);
> +
> +	css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
> +	while ((task = css_task_iter_next(&it))) {
> +		/* Ignore kernel threads here. */
> +		if (task->flags & PF_KTHREAD)
> +			continue;
> +
> +		/* Skip tasks that are already dying. */
> +		if (__fatal_signal_pending(task))
> +			continue;
> +
> +		send_sig(SIGKILL, task, 0);
> +	}
> +	css_task_iter_end(&it);
> +
> +	spin_lock_irq(&css_set_lock);
> +	clear_bit(CGRP_KILL, &cgrp->flags);
> +	spin_unlock_irq(&css_set_lock);
> +}
> +
> +static void cgroup_kill(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys_state *css;
> +	struct cgroup *dsct;
> +
> +	lockdep_assert_held(&cgroup_mutex);
> +
> +	cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
> +		__cgroup_kill(dsct);
> +}
> +
> +static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
> +				 size_t nbytes, loff_t off)
> +{
> +	ssize_t ret = 0;
> +	int kill;
> +	struct cgroup *cgrp;
> +
> +	ret = kstrtoint(strstrip(buf), 0, &kill);
> +	if (ret)
> +		return ret;
> +
> +	if (kill != 1)
> +		return -ERANGE;
> +
> +	cgrp = cgroup_kn_lock_live(of->kn, false);
> +	if (!cgrp)
> +		return -ENOENT;
> +
> +	/*
> +	 * Killing is a process directed operation, i.e. the whole thread-group
> +	 * is taken down so act like we do for cgroup.procs and only make this
> +	 * writable in non-threaded cgroups.
> +	 */
> +	if (cgroup_is_threaded(cgrp))
> +		ret = -EOPNOTSUPP;
> +	else
> +		cgroup_kill(cgrp);
> +
> +	cgroup_kn_unlock(of->kn);
> +
> +	return ret ?: nbytes;
> +}
> +
>  static int cgroup_file_open(struct kernfs_open_file *of)
>  {
>  	struct cftype *cft = of_cft(of);
> @@ -4846,6 +4920,11 @@ static struct cftype cgroup_base_files[] = {
>  		.seq_show = cgroup_freeze_show,
>  		.write = cgroup_freeze_write,
>  	},
> +	{
> +		.name = "cgroup.kill",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.write = cgroup_kill_write,
> +	},
>  	{
>  		.name = "cpu.stat",
>  		.seq_show = cpu_stat_show,
> @@ -6077,6 +6156,8 @@ void cgroup_post_fork(struct task_struct *child,
>  		      struct kernel_clone_args *kargs)
>  	__releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
>  {
> +	unsigned long cgrp_flags = 0;
> +	bool kill = false;
>  	struct cgroup_subsys *ss;
>  	struct css_set *cset;
>  	int i;
> @@ -6088,6 +6169,11 @@ void cgroup_post_fork(struct task_struct *child,
>  
>  	/* init tasks are special, only link regular threads */
>  	if (likely(child->pid)) {
> +		if (kargs->cgrp)
> +			cgrp_flags = kargs->cgrp->flags;
> +		else
> +			cgrp_flags = cset->dfl_cgrp->flags;
> +
>  		WARN_ON_ONCE(!list_empty(&child->cg_list));
>  		cset->nr_tasks++;
>  		css_set_move_task(child, NULL, cset, false);
> @@ -6096,23 +6182,32 @@ void cgroup_post_fork(struct task_struct *child,
>  		cset = NULL;
>  	}
>  
> -	/*
> -	 * If the cgroup has to be frozen, the new task has too.  Let's set
> -	 * the JOBCTL_TRAP_FREEZE jobctl bit to get the task into the
> -	 * frozen state.
> -	 */
> -	if (unlikely(cgroup_task_freeze(child))) {
> -		spin_lock(&child->sighand->siglock);
> -		WARN_ON_ONCE(child->frozen);
> -		child->jobctl |= JOBCTL_TRAP_FREEZE;
> -		spin_unlock(&child->sighand->siglock);
> +	if (!(child->flags & PF_KTHREAD)) {
> +		if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
> +			/*
> +			 * If the cgroup has to be frozen, the new task has
> +			 * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
> +			 * get the task into the frozen state.
> +			 */
> +			spin_lock(&child->sighand->siglock);
> +			WARN_ON_ONCE(child->frozen);
> +			child->jobctl |= JOBCTL_TRAP_FREEZE;
> +			spin_unlock(&child->sighand->siglock);
> +
> +			/*
> +			 * Calling cgroup_update_frozen() isn't required here,
> +			 * because it will be called anyway a bit later from
> +			 * do_freezer_trap(). So we avoid cgroup's transient
> +			 * switch from the frozen state and back.
> +			 */
> +		}

I think this part can be optimized a bit further:
1) we don't need atomic test_bit() here
2) all PF_KTHREAD, CGRP_FREEZE and CGRP_KILL cases are very unlikely

So something like this could work (completely untested):

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 0965b44ff425..f567ca69119d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6190,13 +6190,15 @@ void cgroup_post_fork(struct task_struct *child,
                cset = NULL;
        }
 
-       if (!(child->flags & PF_KTHREAD)) {
-               if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
-                       /*
-                        * If the cgroup has to be frozen, the new task has
-                        * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
-                        * get the task into the frozen state.
-                        */
+
+       if (unlikely(!(child->flags & PF_KTHREAD) &&
+                    cgrp_flags & (CGRP_FREEZE | CGRP_KILL))) {
+               /*
+                * If the cgroup has to be frozen, the new task has
+                * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
+                * get the task into the frozen state.
+                */
+               if (cgrp_flags & CGRP_FREEZE) {
                        spin_lock(&child->sighand->siglock);
                        WARN_ON_ONCE(child->frozen);
                        child->jobctl |= JOBCTL_TRAP_FREEZE;
@@ -6215,7 +6217,8 @@ void cgroup_post_fork(struct task_struct *child,
                 * child down right after we finished preparing it for
                 * userspace.
                 */
-               kill = test_bit(CGRP_KILL, &cgrp_flags);
+               if (cgrp_flags & CGRP_KILL)
+                       kill = true;
        }
 
        spin_unlock_irq(&css_set_lock);


>  
>  		/*
> -		 * Calling cgroup_update_frozen() isn't required here,
> -		 * because it will be called anyway a bit later from
> -		 * do_freezer_trap(). So we avoid cgroup's transient switch
> -		 * from the frozen state and back.
> +		 * If the cgroup is to be killed notice it now and take the
> +		 * child down right after we finished preparing it for
> +		 * userspace.
>  		 */
> +		kill = test_bit(CGRP_KILL, &cgrp_flags);
>  	}
>  
>  	spin_unlock_irq(&css_set_lock);
> @@ -6135,6 +6230,10 @@ void cgroup_post_fork(struct task_struct *child,
>  		put_css_set(rcset);
>  	}
>  
> +	/* Cgroup has to be killed so take down child immediately. */
> +	if (kill)
> +		send_sig(SIGKILL, child, 0);

I think it's better to use do_send_sig_info() here, which skips the check
of the signal number, since the number is obviously valid.

Thanks!

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/5] cgroup: introduce cgroup.kill
  2021-05-03 14:39 [PATCH v2 1/5] cgroup: introduce cgroup.kill Christian Brauner
                   ` (7 preceding siblings ...)
  2021-05-05 17:57 ` Roman Gushchin
@ 2021-05-05 18:31 ` Eric W. Biederman
  2021-05-05 18:49   ` Christian Brauner
  8 siblings, 1 reply; 18+ messages in thread
From: Eric W. Biederman @ 2021-05-05 18:31 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Roman Gushchin, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups, containers, Christian Brauner


Please see below; this patch uses the wrong function to send SIGKILL.

Eric


Christian Brauner <brauner@kernel.org> writes:

> From: Christian Brauner <christian.brauner@ubuntu.com>
>
> Introduce the cgroup.kill file. It does what it says on the tin and
> allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> The file is available in non-root cgroups.
>
> Killing cgroups is a process directed operation, i.e. the whole
> thread-group is affected. Consequently trying to write to cgroup.kill in
> threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> aligns with cgroup.procs where reads in threaded-cgroups are rejected
> with EOPNOTSUPP.
>
> The cgroup.kill file is write-only since killing a cgroup is an event,
> not a state, which makes it different from e.g. freezer where a cgroup
> transitions between two states.
>
> As with all new cgroup features cgroup.kill is recursive by default.
>
> Killing a cgroup is protected against concurrent migrations through the
> cgroup mutex. To protect against forkbombs and to mitigate the effect of
> racing forks a new CGRP_KILL css set lock protected flag is introduced
> that is set prior to killing a cgroup and unset after the cgroup has
> been killed. We can then check in cgroup_post_fork() where we hold the
> css set lock already whether the cgroup is currently being killed. If so
> we send the child a SIGKILL signal immediately taking it down as soon as
> it returns to userspace. To make the killing of the child semantically
> clean it is killed after all cgroup attachment operations have been
> finalized.
>
> There are various use-cases of this interface:
> - Containers usually have a conservative layout where each container
>   usually has a delegated cgroup. For such layouts there is a 1:1
>   mapping between container and cgroup. If the container in addition
>   uses a separate pid namespace then killing a container usually becomes
>   a simple kill -9 <container-init-pid> from an ancestor pid namespace.
>   However, there are quite a few scenarios where that isn't true. For
>   example, there are containers that share the cgroup with other
>   processes on purpose that are supposed to be bound to the lifetime of
>   the container but are not in the same pidns of the container.
>   Containers that are in a delegated cgroup but share the pid namespace
>   with the host or other containers.
> - Service managers such as systemd use cgroups to group and organize
>   processes belonging to a service. They usually rely on a recursive
>   algorithm now to kill a service. With cgroup.kill this becomes a
>   simple write to cgroup.kill.
> - Userspace OOM implementations can make good use of this feature to
>   efficiently take down whole cgroups quickly.
> - The kill program can gain a new
>   kill --cgroup /sys/fs/cgroup/delegated
>   flag to take down cgroups.
>
> A few observations about the semantics:
> - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
>   not specified we are not taking cgroup mutex meaning the cgroup can be
>   killed while a process in that cgroup is forking.
>   If the kill request happens right before cgroup_can_fork() and before
>   the parent grabs its siglock the parent is guaranteed to see the
>   pending SIGKILL. In addition we perform another check in
>   cgroup_post_fork() whether the cgroup is being killed and if so take
>   down the child (see above). This is robust enough and protects against
>   forkbombs. If userspace really really wants to have stricter
>   protection the simple solution would be to grab the write side of the
>   cgroup threadgroup rwsem which will force all ongoing forks to
>   complete before killing starts. We concluded that this is not
>   necessary as the semantics for concurrent forking should simply align
>   with freezer where a similar check as cgroup_post_fork() is performed.
>
>   For all other cases CLONE_INTO_CGROUP is required. In this case we
>   will grab the cgroup mutex so the cgroup can't be killed while we
>   fork. Once we're done with the fork and have dropped cgroup mutex we
>   are visible and will be found by any subsequent kill request.
> - We obviously don't kill kthreads. This means a cgroup that has a
>   kthread will not become empty after killing and consequently no
>   unpopulated event will be generated. The assumption is that kthreads
>   should be in the root cgroup only anyway so this is not an issue.
> - We skip killing tasks that already have pending fatal signals.
> - Freezer doesn't care about tasks in different pid namespaces, i.e. if
>   you have two tasks in different pid namespaces the cgroup would still
>   be frozen. The cgroup.kill mechanism consequently behaves the same
>   way, i.e. we kill all processes and ignore in which pid namespace they
>   exist.
> - If the caller is located in a cgroup that is killed the caller will
>   obviously be killed as well.
>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: cgroups@vger.kernel.org
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
>
> The series can be pulled from
>
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/cgroup.kill.v5.14
>
> /* v2 */
> - Roman Gushchin <guro@fb.com>:
>   - Retrieve cgrp->flags only once and check CGRP_* bits on it.
> ---
>  include/linux/cgroup-defs.h |   3 +
>  kernel/cgroup/cgroup.c      | 127 ++++++++++++++++++++++++++++++++----
>  2 files changed, 116 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 559ee05f86b2..43fef771009a 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -71,6 +71,9 @@ enum {
>  
>  	/* Cgroup is frozen. */
>  	CGRP_FROZEN,
> +
> +	/* Control group has to be killed. */
> +	CGRP_KILL,
>  };
>  
>  /* cgroup_root->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 9153b20e5cc6..aee84b99534a 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3654,6 +3654,80 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>  
> +static void __cgroup_kill(struct cgroup *cgrp)
> +{
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +
> +	lockdep_assert_held(&cgroup_mutex);
> +
> +	spin_lock_irq(&css_set_lock);
> +	set_bit(CGRP_KILL, &cgrp->flags);
> +	spin_unlock_irq(&css_set_lock);
> +
> +	css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
> +	while ((task = css_task_iter_next(&it))) {
> +		/* Ignore kernel threads here. */
> +		if (task->flags & PF_KTHREAD)
> +			continue;
> +
> +		/* Skip tasks that are already dying. */
> +		if (__fatal_signal_pending(task))
> +			continue;
> +
> +		send_sig(SIGKILL, task, 0);
                ^^^^^^^^
Using send_sig here is wrong.  The function send_sig
is the interface to send a signal to a single task/thread.

The signal SIGKILL can not be sent to a single task/thread.
So it never makes sense to use send_sig with SIGKILL.

As this all happens in the context of the process writing
to the file this can either be:

	group_send_sig_info(SIGKILL, SEND_SIG_NOINFO, task, PIDTYPE_TGID);

Which will check that the caller actually has permissions to kill the
specified task.  Or:

	do_send_sig_info(SIGKILL, SEND_SIG_NOINFO, task, PIDTYPE_TGID);




> +	}
> +	css_task_iter_end(&it);
> +
> +	spin_lock_irq(&css_set_lock);
> +	clear_bit(CGRP_KILL, &cgrp->flags);
> +	spin_unlock_irq(&css_set_lock);
> +}
> +
> +static void cgroup_kill(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys_state *css;
> +	struct cgroup *dsct;
> +
> +	lockdep_assert_held(&cgroup_mutex);
> +
> +	cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
> +		__cgroup_kill(dsct);
> +}
> +
> +static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
> +				 size_t nbytes, loff_t off)
> +{
> +	ssize_t ret = 0;
> +	int kill;
> +	struct cgroup *cgrp;
> +
> +	ret = kstrtoint(strstrip(buf), 0, &kill);
> +	if (ret)
> +		return ret;
> +
> +	if (kill != 1)
> +		return -ERANGE;
> +
> +	cgrp = cgroup_kn_lock_live(of->kn, false);
> +	if (!cgrp)
> +		return -ENOENT;
> +
> +	/*
> +	 * Killing is a process directed operation, i.e. the whole thread-group
> +	 * is taken down so act like we do for cgroup.procs and only make this
> +	 * writable in non-threaded cgroups.
> +	 */
> +	if (cgroup_is_threaded(cgrp))
> +		ret = -EOPNOTSUPP;
> +	else
> +		cgroup_kill(cgrp);
> +
> +	cgroup_kn_unlock(of->kn);
> +
> +	return ret ?: nbytes;
> +}
> +
>  static int cgroup_file_open(struct kernfs_open_file *of)
>  {
>  	struct cftype *cft = of_cft(of);
> @@ -4846,6 +4920,11 @@ static struct cftype cgroup_base_files[] = {
>  		.seq_show = cgroup_freeze_show,
>  		.write = cgroup_freeze_write,
>  	},
> +	{
> +		.name = "cgroup.kill",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.write = cgroup_kill_write,
> +	},
>  	{
>  		.name = "cpu.stat",
>  		.seq_show = cpu_stat_show,
> @@ -6077,6 +6156,8 @@ void cgroup_post_fork(struct task_struct *child,
>  		      struct kernel_clone_args *kargs)
>  	__releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
>  {
> +	unsigned long cgrp_flags = 0;
> +	bool kill = false;
>  	struct cgroup_subsys *ss;
>  	struct css_set *cset;
>  	int i;
> @@ -6088,6 +6169,11 @@ void cgroup_post_fork(struct task_struct *child,
>  
>  	/* init tasks are special, only link regular threads */
>  	if (likely(child->pid)) {
> +		if (kargs->cgrp)
> +			cgrp_flags = kargs->cgrp->flags;
> +		else
> +			cgrp_flags = cset->dfl_cgrp->flags;
> +
>  		WARN_ON_ONCE(!list_empty(&child->cg_list));
>  		cset->nr_tasks++;
>  		css_set_move_task(child, NULL, cset, false);
> @@ -6096,23 +6182,32 @@ void cgroup_post_fork(struct task_struct *child,
>  		cset = NULL;
>  	}
>  
> -	/*
> -	 * If the cgroup has to be frozen, the new task has too.  Let's set
> -	 * the JOBCTL_TRAP_FREEZE jobctl bit to get the task into the
> -	 * frozen state.
> -	 */
> -	if (unlikely(cgroup_task_freeze(child))) {
> -		spin_lock(&child->sighand->siglock);
> -		WARN_ON_ONCE(child->frozen);
> -		child->jobctl |= JOBCTL_TRAP_FREEZE;
> -		spin_unlock(&child->sighand->siglock);
> +	if (!(child->flags & PF_KTHREAD)) {
> +		if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
> +			/*
> +			 * If the cgroup has to be frozen, the new task has
> +			 * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
> +			 * get the task into the frozen state.
> +			 */
> +			spin_lock(&child->sighand->siglock);
> +			WARN_ON_ONCE(child->frozen);
> +			child->jobctl |= JOBCTL_TRAP_FREEZE;
> +			spin_unlock(&child->sighand->siglock);
> +
> +			/*
> +			 * Calling cgroup_update_frozen() isn't required here,
> +			 * because it will be called anyway a bit later from
> +			 * do_freezer_trap(). So we avoid cgroup's transient
> +			 * switch from the frozen state and back.
> +			 */
> +		}
>  
>  		/*
> -		 * Calling cgroup_update_frozen() isn't required here,
> -		 * because it will be called anyway a bit later from
> -		 * do_freezer_trap(). So we avoid cgroup's transient switch
> -		 * from the frozen state and back.
> +		 * If the cgroup is to be killed notice it now and take the
> +		 * child down right after we finished preparing it for
> +		 * userspace.
>  		 */
> +		kill = test_bit(CGRP_KILL, &cgrp_flags);
>  	}
>  
>  	spin_unlock_irq(&css_set_lock);
> @@ -6135,6 +6230,10 @@ void cgroup_post_fork(struct task_struct *child,
>  		put_css_set(rcset);
>  	}
>  
> +	/* Cgroup has to be killed so take down child immediately. */
> +	if (kill)
> +		send_sig(SIGKILL, child, 0);
                ^^^^^^^^
Using send_sig is wrong here for the same reasons as above.

Is a change to cgroup_post_fork necessary?  Fork already
has protections against a signal being delivered effectively
during fork, which may be enough in this case.


> +
>  	cgroup_css_set_put_fork(kargs);
>  }
>  
>
> base-commit: 9f4ad9e425a1d3b6a34617b8ea226d56a119a717

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 5/5] tests/cgroup: test cgroup.kill
  2021-05-03 14:39 ` [PATCH v2 5/5] tests/cgroup: test cgroup.kill Christian Brauner
@ 2021-05-05 18:34   ` Shakeel Butt
  2021-05-05 18:52     ` Christian Brauner
  0 siblings, 1 reply; 18+ messages in thread
From: Shakeel Butt @ 2021-05-05 18:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, Roman Gushchin, Zefan Li, Johannes Weiner, Cgroups,
	containers, Christian Brauner

On Mon, May 3, 2021 at 7:40 AM Christian Brauner <brauner@kernel.org> wrote:
>
[...]
> +
> +static int test_cgkill_simple(const char *root)
> +{
> +       pid_t pids[100];
> +       int ret = KSFT_FAIL;
> +       char *cgroup = NULL;
> +       int i;
> +
> +       cgroup = cg_name(root, "cg_test_simple");
> +       if (!cgroup)
> +               goto cleanup;
> +
> +       if (cg_create(cgroup))
> +               goto cleanup;
> +
> +       for (i = 0; i < 100; i++)
> +               pids[i] = cg_run_nowait(cgroup, child_fn, NULL);
> +
> +       if (cg_wait_for_proc_count(cgroup, 100))
> +               goto cleanup;
> +
> +        if (cg_write(cgroup, "cgroup.kill", "1"))
> +               goto cleanup;

I don't think the above write to cgroup.kill is correct.

> +
> +       if (cg_read_strcmp(cgroup, "cgroup.events", "populated 1\n"))
> +               goto cleanup;
> +
> +       if (cg_kill_wait(cgroup))
> +               goto cleanup;
> +
> +       if (cg_read_strcmp(cgroup, "cgroup.events", "populated 0\n"))
> +               goto cleanup;
> +
> +       ret = KSFT_PASS;
> +
> +cleanup:
> +       for (i = 0; i < 100; i++)
> +               wait_for_pid(pids[i]);
> +
> +       if (cgroup)
> +               cg_destroy(cgroup);
> +       free(cgroup);
> +       return ret;
> +}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/5] cgroup: introduce cgroup.kill
  2021-05-05 17:57 ` Roman Gushchin
@ 2021-05-05 18:46   ` Christian Brauner
  2021-05-05 19:13     ` Roman Gushchin
  0 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2021-05-05 18:46 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christian Brauner, Tejun Heo, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups, containers

On Wed, May 05, 2021 at 10:57:00AM -0700, Roman Gushchin wrote:
> On Mon, May 03, 2021 at 04:39:19PM +0200, Christian Brauner wrote:
> > From: Christian Brauner <christian.brauner@ubuntu.com>
> > 
> > Introduce the cgroup.kill file. It does what it says on the tin and
> > allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> > The file is available in non-root cgroups.
> > 
> > Killing cgroups is a process directed operation, i.e. the whole
> > thread-group is affected. Consequently trying to write to cgroup.kill in
> > threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> > aligns with cgroup.procs where reads in threaded-cgroups are rejected
> > with EOPNOTSUPP.
> > 
> > The cgroup.kill file is write-only since killing a cgroup is an event,
> > not a state, which makes it different from e.g. freezer where a cgroup
> > transitions between two states.
> > 
> > As with all new cgroup features cgroup.kill is recursive by default.
> > 
> > Killing a cgroup is protected against concurrent migrations through the
> > cgroup mutex. To protect against forkbombs and to mitigate the effect of
> > racing forks a new CGRP_KILL css set lock protected flag is introduced
> > that is set prior to killing a cgroup and unset after the cgroup has
> > been killed. We can then check in cgroup_post_fork() where we hold the
> > css set lock already whether the cgroup is currently being killed. If so
> > we send the child a SIGKILL signal immediately taking it down as soon as
> > it returns to userspace. To make the killing of the child semantically
> > clean it is killed after all cgroup attachment operations have been
> > finalized.
> > 
> > There are various use-cases of this interface:
> > - Containers usually have a conservative layout where each container
> >   usually has a delegated cgroup. For such layouts there is a 1:1
> >   mapping between container and cgroup. If the container in addition
> >   uses a separate pid namespace then killing a container usually becomes
> >   a simple kill -9 <container-init-pid> from an ancestor pid namespace.
> >   However, there are quite a few scenarios where that isn't true. For
> >   example, there are containers that share the cgroup with other
> >   processes on purpose that are supposed to be bound to the lifetime of
> >   the container but are not in the same pidns of the container.
> >   Containers that are in a delegated cgroup but share the pid namespace
> >   with the host or other containers.
> > - Service managers such as systemd use cgroups to group and organize
> >   processes belonging to a service. They usually rely on a recursive
> >   algorithm now to kill a service. With cgroup.kill this becomes a
> >   simple write to cgroup.kill.
> > - Userspace OOM implementations can make good use of this feature to
> >   efficiently take down whole cgroups quickly.
> > - The kill program can gain a new
> >   kill --cgroup /sys/fs/cgroup/delegated
> >   flag to take down cgroups.
> > 
> > A few observations about the semantics:
> > - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
> >   not specified we are not taking cgroup mutex meaning the cgroup can be
> >   killed while a process in that cgroup is forking.
> >   If the kill request happens right before cgroup_can_fork() and before
> >   the parent grabs its siglock the parent is guaranteed to see the
> >   pending SIGKILL. In addition we perform another check in
> >   cgroup_post_fork() whether the cgroup is being killed and if so take
> >   down the child (see above). This is robust enough and protects against
> >   forkbombs. If userspace really really wants to have stricter
> >   protection the simple solution would be to grab the write side of the
> >   cgroup threadgroup rwsem which will force all ongoing forks to
> >   complete before killing starts. We concluded that this is not
> >   necessary as the semantics for concurrent forking should simply align
> >   with freezer where a similar check as cgroup_post_fork() is performed.
> > 
> >   For all other cases CLONE_INTO_CGROUP is required. In this case we
> >   will grab the cgroup mutex so the cgroup can't be killed while we
> >   fork. Once we're done with the fork and have dropped cgroup mutex we
> >   are visible and will be found by any subsequent kill request.
> > - We obviously don't kill kthreads. This means a cgroup that has a
> >   kthread will not become empty after killing and consequently no
> >   unpopulated event will be generated. The assumption is that kthreads
> >   should be in the root cgroup only anyway so this is not an issue.
> > - We skip killing tasks that already have pending fatal signals.
> > - Freezer doesn't care about tasks in different pid namespaces, i.e. if
> >   you have two tasks in different pid namespaces the cgroup would still
> >   be frozen. The cgroup.kill mechanism consequently behaves the same
> >   way, i.e. we kill all processes and ignore in which pid namespace they
> >   exist.
> > - If the caller is located in a cgroup that is killed the caller will
> >   obviously be killed as well.
> > 
> > Cc: Shakeel Butt <shakeelb@google.com>
> > Cc: Roman Gushchin <guro@fb.com>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: cgroups@vger.kernel.org
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> > 
> > The series can be pulled from
> > 
> > git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/cgroup.kill.v5.14
> > 
> > /* v2 */
> > - Roman Gushchin <guro@fb.com>:
> >   - Retrieve cgrp->flags only once and check CGRP_* bits on it.
> > ---
> >  include/linux/cgroup-defs.h |   3 +
> >  kernel/cgroup/cgroup.c      | 127 ++++++++++++++++++++++++++++++++----
> >  2 files changed, 116 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> > index 559ee05f86b2..43fef771009a 100644
> > --- a/include/linux/cgroup-defs.h
> > +++ b/include/linux/cgroup-defs.h
> > @@ -71,6 +71,9 @@ enum {
> >  
> >  	/* Cgroup is frozen. */
> >  	CGRP_FROZEN,
> > +
> > +	/* Control group has to be killed. */
> > +	CGRP_KILL,
> >  };
> >  
> >  /* cgroup_root->flags */
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index 9153b20e5cc6..aee84b99534a 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -3654,6 +3654,80 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
> >  	return nbytes;
> >  }
> >  
> > +static void __cgroup_kill(struct cgroup *cgrp)
> > +{
> > +	struct css_task_iter it;
> > +	struct task_struct *task;
> > +
> > +	lockdep_assert_held(&cgroup_mutex);
> > +
> > +	spin_lock_irq(&css_set_lock);
> > +	set_bit(CGRP_KILL, &cgrp->flags);
> > +	spin_unlock_irq(&css_set_lock);
> > +
> > +	css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
> > +	while ((task = css_task_iter_next(&it))) {
> > +		/* Ignore kernel threads here. */
> > +		if (task->flags & PF_KTHREAD)
> > +			continue;
> > +
> > +		/* Skip tasks that are already dying. */
> > +		if (__fatal_signal_pending(task))
> > +			continue;
> > +
> > +		send_sig(SIGKILL, task, 0);
> > +	}
> > +	css_task_iter_end(&it);
> > +
> > +	spin_lock_irq(&css_set_lock);
> > +	clear_bit(CGRP_KILL, &cgrp->flags);
> > +	spin_unlock_irq(&css_set_lock);
> > +}
> > +
> > +static void cgroup_kill(struct cgroup *cgrp)
> > +{
> > +	struct cgroup_subsys_state *css;
> > +	struct cgroup *dsct;
> > +
> > +	lockdep_assert_held(&cgroup_mutex);
> > +
> > +	cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
> > +		__cgroup_kill(dsct);
> > +}
> > +
> > +static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
> > +				 size_t nbytes, loff_t off)
> > +{
> > +	ssize_t ret = 0;
> > +	int kill;
> > +	struct cgroup *cgrp;
> > +
> > +	ret = kstrtoint(strstrip(buf), 0, &kill);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (kill != 1)
> > +		return -ERANGE;
> > +
> > +	cgrp = cgroup_kn_lock_live(of->kn, false);
> > +	if (!cgrp)
> > +		return -ENOENT;
> > +
> > +	/*
> > +	 * Killing is a process directed operation, i.e. the whole thread-group
> > +	 * is taken down so act like we do for cgroup.procs and only make this
> > +	 * writable in non-threaded cgroups.
> > +	 */
> > +	if (cgroup_is_threaded(cgrp))
> > +		ret = -EOPNOTSUPP;
> > +	else
> > +		cgroup_kill(cgrp);
> > +
> > +	cgroup_kn_unlock(of->kn);
> > +
> > +	return ret ?: nbytes;
> > +}
> > +
> >  static int cgroup_file_open(struct kernfs_open_file *of)
> >  {
> >  	struct cftype *cft = of_cft(of);
> > @@ -4846,6 +4920,11 @@ static struct cftype cgroup_base_files[] = {
> >  		.seq_show = cgroup_freeze_show,
> >  		.write = cgroup_freeze_write,
> >  	},
> > +	{
> > +		.name = "cgroup.kill",
> > +		.flags = CFTYPE_NOT_ON_ROOT,
> > +		.write = cgroup_kill_write,
> > +	},
> >  	{
> >  		.name = "cpu.stat",
> >  		.seq_show = cpu_stat_show,
> > @@ -6077,6 +6156,8 @@ void cgroup_post_fork(struct task_struct *child,
> >  		      struct kernel_clone_args *kargs)
> >  	__releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
> >  {
> > +	unsigned long cgrp_flags = 0;
> > +	bool kill = false;
> >  	struct cgroup_subsys *ss;
> >  	struct css_set *cset;
> >  	int i;
> > @@ -6088,6 +6169,11 @@ void cgroup_post_fork(struct task_struct *child,
> >  
> >  	/* init tasks are special, only link regular threads */
> >  	if (likely(child->pid)) {
> > +		if (kargs->cgrp)
> > +			cgrp_flags = kargs->cgrp->flags;
> > +		else
> > +			cgrp_flags = cset->dfl_cgrp->flags;
> > +
> >  		WARN_ON_ONCE(!list_empty(&child->cg_list));
> >  		cset->nr_tasks++;
> >  		css_set_move_task(child, NULL, cset, false);
> > @@ -6096,23 +6182,32 @@ void cgroup_post_fork(struct task_struct *child,
> >  		cset = NULL;
> >  	}
> >  
> > -	/*
> > -	 * If the cgroup has to be frozen, the new task has too.  Let's set
> > -	 * the JOBCTL_TRAP_FREEZE jobctl bit to get the task into the
> > -	 * frozen state.
> > -	 */
> > -	if (unlikely(cgroup_task_freeze(child))) {
> > -		spin_lock(&child->sighand->siglock);
> > -		WARN_ON_ONCE(child->frozen);
> > -		child->jobctl |= JOBCTL_TRAP_FREEZE;
> > -		spin_unlock(&child->sighand->siglock);
> > +	if (!(child->flags & PF_KTHREAD)) {
> > +		if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
> > +			/*
> > +			 * If the cgroup has to be frozen, the new task has
> > +			 * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
> > +			 * get the task into the frozen state.
> > +			 */
> > +			spin_lock(&child->sighand->siglock);
> > +			WARN_ON_ONCE(child->frozen);
> > +			child->jobctl |= JOBCTL_TRAP_FREEZE;
> > +			spin_unlock(&child->sighand->siglock);
> > +
> > +			/*
> > +			 * Calling cgroup_update_frozen() isn't required here,
> > +			 * because it will be called anyway a bit later from
> > +			 * do_freezer_trap(). So we avoid cgroup's transient
> > +			 * switch from the frozen state and back.
> > +			 */
> > +		}
> 
> I think this part can be optimized a bit further:
> 1) we don't need atomic test_bit() here
> 2) all PF_KTHREAD, CGRP_FREEZE and CGRP_KILL cases are very unlikely
> 
> So something like this could work (completely untested):
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 0965b44ff425..f567ca69119d 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6190,13 +6190,15 @@ void cgroup_post_fork(struct task_struct *child,
>                 cset = NULL;
>         }
>  
> -       if (!(child->flags & PF_KTHREAD)) {
> -               if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
> -                       /*
> -                        * If the cgroup has to be frozen, the new task has
> -                        * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
> -                        * get the task into the frozen state.
> -                        */
> +
> +       if (unlikely(!(child->flags & PF_KTHREAD) &&
> +                    cgrp_flags & (CGRP_FREEZE | CGRP_KILL))) {

The unlikely() might make sense.

But hm, I'm not a fan of the CGRP_FREEZE and CGRP_KILL check without
test_bit(). That seems a bit ugly. Especially since nowhere in
kernel/cgroup.c are these bits checked without test_bit().

Also, afaict this wouldn't work, at least not for all values, since
CGRP_FREEZE and CGRP_KILL aren't flags, they're bits defined in an enum.
(In contrast to cgroup_root->flags which are defined as flags _in an
enum_.) So it seems they should really be used with test_bit. Otherwise
this would probably have to be something like

if (unlikely(!(child->flags & PF_KTHREAD) &&
	(cgrp_flags & (BIT_ULL(CGRP_FREEZE) | BIT_ULL(CGRP_KILL))) {
	.
	.
	.

which seems unreadable and makes the rest of cgroup.c inconsistent for
these values.
Note that the check for CGRP_FREEZE was the same before; it was just
hidden in the helper.
I really think we should just leave the test_bit() checks.

> +               /*
> +                * If the cgroup has to be frozen, the new task has
> +                * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
> +                * get the task into the frozen state.
> +                */
> +               if (cgrp_flags & CGRP_FREEZE) {
>                         spin_lock(&child->sighand->siglock);
>                         WARN_ON_ONCE(child->frozen);
>                         child->jobctl |= JOBCTL_TRAP_FREEZE;
> @@ -6215,7 +6217,8 @@ void cgroup_post_fork(struct task_struct *child,
>                  * child down right after we finished preparing it for
>                  * userspace.
>                  */
> -               kill = test_bit(CGRP_KILL, &cgrp_flags);
> +               if (cgrp_flags & CGRP_KILL)
> +                       kill = true;
>         }
>  
>         spin_unlock_irq(&css_set_lock);
> 
> 
> >  
> >  		/*
> > -		 * Calling cgroup_update_frozen() isn't required here,
> > -		 * because it will be called anyway a bit later from
> > -		 * do_freezer_trap(). So we avoid cgroup's transient switch
> > -		 * from the frozen state and back.
> > +		 * If the cgroup is to be killed notice it now and take the
> > +		 * child down right after we finished preparing it for
> > +		 * userspace.
> >  		 */
> > +		kill = test_bit(CGRP_KILL, &cgrp_flags);
> >  	}
> >  
> >  	spin_unlock_irq(&css_set_lock);
> > @@ -6135,6 +6230,10 @@ void cgroup_post_fork(struct task_struct *child,
> >  		put_css_set(rcset);
> >  	}
> >  
> > +	/* Cgroup has to be killed so take down child immediately. */
> > +	if (kill)
> > +		send_sig(SIGKILL, child, 0);
> 
> I think it's better to use do_send_sig_info() here, which skips the check
> for the signal number, which is obviously valid.

sure/shrug

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/5] cgroup: introduce cgroup.kill
  2021-05-05 18:31 ` Eric W. Biederman
@ 2021-05-05 18:49   ` Christian Brauner
  0 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2021-05-05 18:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Tejun Heo, Roman Gushchin, Shakeel Butt,
	Zefan Li, Johannes Weiner, cgroups, containers

On Wed, May 05, 2021 at 01:31:09PM -0500, Eric W. Biederman wrote:
> 
> Please see below: this patch uses the wrong function to send SIGKILL.
> 
> Eric
> 
> 
> Christian Brauner <brauner@kernel.org> writes:
> 
> > From: Christian Brauner <christian.brauner@ubuntu.com>
> >
> > Introduce the cgroup.kill file. It does what it says on the tin and
> > allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> > The file is available in non-root cgroups.
> >
> > Killing cgroups is a process directed operation, i.e. the whole
> > thread-group is affected. Consequently trying to write to cgroup.kill in
> > threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> > aligns with cgroup.procs where reads in threaded-cgroups are rejected
> > with EOPNOTSUPP.
> >
> > The cgroup.kill file is write-only since killing a cgroup is an event,
> > not a state, which makes it different from e.g. freezer where a cgroup
> > transitions between the two states.
> >
> > As with all new cgroup features cgroup.kill is recursive by default.
> >
> > Killing a cgroup is protected against concurrent migrations through the
> > cgroup mutex. To protect against forkbombs and to mitigate the effect of
> > racing forks a new CGRP_KILL css set lock protected flag is introduced
> > that is set prior to killing a cgroup and unset after the cgroup has
> > been killed. We can then check in cgroup_post_fork() where we hold the
> > css set lock already whether the cgroup is currently being killed. If so
> > we send the child a SIGKILL signal immediately taking it down as soon as
> > it returns to userspace. To make the killing of the child semantically
> > clean it is killed after all cgroup attachment operations have been
> > finalized.
> >
> > There are various use-cases of this interface:
> > - Containers usually have a conservative layout where each container
> >   usually has a delegated cgroup. For such layouts there is a 1:1
> >   mapping between container and cgroup. If the container in addition
> >   uses a separate pid namespace then killing a container usually becomes
> >   a simple kill -9 <container-init-pid> from an ancestor pid namespace.
> >   However, there are quite a few scenarios where that isn't true. For
> >   example, there are containers that share the cgroup with other
> >   processes on purpose that are supposed to be bound to the lifetime of
> >   the container but are not in the same pidns as the container.
> >   Containers that are in a delegated cgroup but share the pid namespace
> >   with the host or other containers.
> > - Service managers such as systemd use cgroups to group and organize
> >   processes belonging to a service. They usually rely on a recursive
> >   algorithm now to kill a service. With cgroup.kill this becomes a
> >   simple write to cgroup.kill.
> > - Userspace OOM implementations can make good use of this feature to
> >   efficiently take down whole cgroups quickly.
> > - The kill program can gain a new
> >   kill --cgroup /sys/fs/cgroup/delegated
> >   flag to take down cgroups.
> >
> > A few observations about the semantics:
> > - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
> >   not specified we are not taking cgroup mutex meaning the cgroup can be
> >   killed while a process in that cgroup is forking.
> >   If the kill request happens right before cgroup_can_fork() and before
> >   the parent grabs its siglock the parent is guaranteed to see the
> >   pending SIGKILL. In addition we perform another check in
> >   cgroup_post_fork() whether the cgroup is being killed and if so take
> >   down the child (see above). This is robust enough and protects against
> >   forkbombs. If userspace really really wants to have stricter
> >   protection the simple solution would be to grab the write side of the
> >   cgroup threadgroup rwsem which will force all ongoing forks to
> >   complete before killing starts. We concluded that this is not
> >   necessary as the semantics for concurrent forking should simply align
> >   with freezer where a similar check as cgroup_post_fork() is performed.
> >
> >   For all other cases CLONE_INTO_CGROUP is required. In this case we
> >   will grab the cgroup mutex so the cgroup can't be killed while we
> >   fork. Once we're done with the fork and have dropped cgroup mutex we
> >   are visible and will be found by any subsequent kill request.
> > - We obviously don't kill kthreads. This means a cgroup that has a
> >   kthread will not become empty after killing and consequently no
> >   unpopulated event will be generated. The assumption is that kthreads
> >   should be in the root cgroup only anyway so this is not an issue.
> > - We skip killing tasks that already have pending fatal signals.
> > - Freezer doesn't care about tasks in different pid namespaces, i.e. if
> >   you have two tasks in different pid namespaces the cgroup would still
> >   be frozen. The cgroup.kill mechanism consequently behaves the same
> >   way, i.e. we kill all processes and ignore in which pid namespace they
> >   exist.
> > - If the caller is located in a cgroup that is killed the caller will
> >   obviously be killed as well.
> >
> > Cc: Shakeel Butt <shakeelb@google.com>
> > Cc: Roman Gushchin <guro@fb.com>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: cgroups@vger.kernel.org
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> >
> > The series can be pulled from
> >
> > git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/cgroup.kill.v5.14
> >
> > /* v2 */
> > - Roman Gushchin <guro@fb.com>:
> >   - Retrieve cgrp->flags only once and check CGRP_* bits on it.
> > ---
> >  include/linux/cgroup-defs.h |   3 +
> >  kernel/cgroup/cgroup.c      | 127 ++++++++++++++++++++++++++++++++----
> >  2 files changed, 116 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> > index 559ee05f86b2..43fef771009a 100644
> > --- a/include/linux/cgroup-defs.h
> > +++ b/include/linux/cgroup-defs.h
> > @@ -71,6 +71,9 @@ enum {
> >  
> >  	/* Cgroup is frozen. */
> >  	CGRP_FROZEN,
> > +
> > +	/* Control group has to be killed. */
> > +	CGRP_KILL,
> >  };
> >  
> >  /* cgroup_root->flags */
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index 9153b20e5cc6..aee84b99534a 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -3654,6 +3654,80 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
> >  	return nbytes;
> >  }
> >  
> > +static void __cgroup_kill(struct cgroup *cgrp)
> > +{
> > +	struct css_task_iter it;
> > +	struct task_struct *task;
> > +
> > +	lockdep_assert_held(&cgroup_mutex);
> > +
> > +	spin_lock_irq(&css_set_lock);
> > +	set_bit(CGRP_KILL, &cgrp->flags);
> > +	spin_unlock_irq(&css_set_lock);
> > +
> > +	css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
> > +	while ((task = css_task_iter_next(&it))) {
> > +		/* Ignore kernel threads here. */
> > +		if (task->flags & PF_KTHREAD)
> > +			continue;
> > +
> > +		/* Skip tasks that are already dying. */
> > +		if (__fatal_signal_pending(task))
> > +			continue;
> > +
> > +		send_sig(SIGKILL, task, 0);
>                 ^^^^^^^^
> Using send_sig here is wrong.  The function send_sig
> is the interface to send a signal to a single task/thread.
> 
> The signal SIGKILL can not be sent to a single task/thread.
> So it never makes sense to use send_sig with SIGKILL.
> 
> As this all happens in the context of the process writing
> to the file this can either be:
> 
> 	group_send_sig_info(SIGKILL, SEND_SIG_NOINFO, task, PIDTYPE_TGID);
> 
> Which will check that the caller actually has permissions to kill the
> specified task.  Or:
> 
> 	do_send_sig_info(SIGKILL, SEND_SIG_NOINFO, task, PIDTYPE_TGID);

The result should be the same, but yes, it's better to be explicit about
it. I'll switch to that.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 5/5] tests/cgroup: test cgroup.kill
  2021-05-05 18:34   ` Shakeel Butt
@ 2021-05-05 18:52     ` Christian Brauner
  0 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2021-05-05 18:52 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Christian Brauner, Tejun Heo, Roman Gushchin, Zefan Li,
	Johannes Weiner, Cgroups, containers

On Wed, May 05, 2021 at 11:34:41AM -0700, Shakeel Butt wrote:
> On Mon, May 3, 2021 at 7:40 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> [...]
> > +
> > +static int test_cgkill_simple(const char *root)
> > +{
> > +       pid_t pids[100];
> > +       int ret = KSFT_FAIL;
> > +       char *cgroup = NULL;
> > +       int i;
> > +
> > +       cgroup = cg_name(root, "cg_test_simple");
> > +       if (!cgroup)
> > +               goto cleanup;
> > +
> > +       if (cg_create(cgroup))
> > +               goto cleanup;
> > +
> > +       for (i = 0; i < 100; i++)
> > +               pids[i] = cg_run_nowait(cgroup, child_fn, NULL);
> > +
> > +       if (cg_wait_for_proc_count(cgroup, 100))
> > +               goto cleanup;
> > +
> > +        if (cg_write(cgroup, "cgroup.kill", "1"))
> > +               goto cleanup;
> 
> I don't think the above write to cgroup.kill is correct.

Hm, that's a left-over from the port of the similar freezer test. Thanks
for spotting this. It most likely never failed because of the number of
procs being created and then killed.

Thanks, will remove.
Christian

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/5] cgroup: introduce cgroup.kill
  2021-05-05 18:46   ` Christian Brauner
@ 2021-05-05 19:13     ` Roman Gushchin
  0 siblings, 0 replies; 18+ messages in thread
From: Roman Gushchin @ 2021-05-05 19:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christian Brauner, Tejun Heo, Shakeel Butt, Zefan Li,
	Johannes Weiner, cgroups, containers

On Wed, May 05, 2021 at 08:46:32PM +0200, Christian Brauner wrote:
> On Wed, May 05, 2021 at 10:57:00AM -0700, Roman Gushchin wrote:
> > On Mon, May 03, 2021 at 04:39:19PM +0200, Christian Brauner wrote:
> > > From: Christian Brauner <christian.brauner@ubuntu.com>
> > > 
> > > Introduce the cgroup.kill file. It does what it says on the tin and
> > > allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> > > The file is available in non-root cgroups.
> > > 
> > > Killing cgroups is a process directed operation, i.e. the whole
> > > thread-group is affected. Consequently trying to write to cgroup.kill in
> > > threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> > > aligns with cgroup.procs where reads in threaded-cgroups are rejected
> > > with EOPNOTSUPP.
> > > 
> > > The cgroup.kill file is write-only since killing a cgroup is an event,
> > > not a state, which makes it different from e.g. freezer where a cgroup
> > > transitions between the two states.
> > > 
> > > As with all new cgroup features cgroup.kill is recursive by default.
> > > 
> > > Killing a cgroup is protected against concurrent migrations through the
> > > cgroup mutex. To protect against forkbombs and to mitigate the effect of
> > > racing forks a new CGRP_KILL css set lock protected flag is introduced
> > > that is set prior to killing a cgroup and unset after the cgroup has
> > > been killed. We can then check in cgroup_post_fork() where we hold the
> > > css set lock already whether the cgroup is currently being killed. If so
> > > we send the child a SIGKILL signal immediately taking it down as soon as
> > > it returns to userspace. To make the killing of the child semantically
> > > clean it is killed after all cgroup attachment operations have been
> > > finalized.
> > > 
> > > There are various use-cases of this interface:
> > > - Containers usually have a conservative layout where each container
> > >   usually has a delegated cgroup. For such layouts there is a 1:1
> > >   mapping between container and cgroup. If the container in addition
> > >   uses a separate pid namespace then killing a container usually becomes
> > >   a simple kill -9 <container-init-pid> from an ancestor pid namespace.
> > >   However, there are quite a few scenarios where that isn't true. For
> > >   example, there are containers that share the cgroup with other
> > >   processes on purpose that are supposed to be bound to the lifetime of
> > >   the container but are not in the same pidns as the container.
> > >   Containers that are in a delegated cgroup but share the pid namespace
> > >   with the host or other containers.
> > > - Service managers such as systemd use cgroups to group and organize
> > >   processes belonging to a service. They usually rely on a recursive
> > >   algorithm now to kill a service. With cgroup.kill this becomes a
> > >   simple write to cgroup.kill.
> > > - Userspace OOM implementations can make good use of this feature to
> > >   efficiently take down whole cgroups quickly.
> > > - The kill program can gain a new
> > >   kill --cgroup /sys/fs/cgroup/delegated
> > >   flag to take down cgroups.
> > > 
> > > A few observations about the semantics:
> > > - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
> > >   not specified we are not taking cgroup mutex meaning the cgroup can be
> > >   killed while a process in that cgroup is forking.
> > >   If the kill request happens right before cgroup_can_fork() and before
> > >   the parent grabs its siglock the parent is guaranteed to see the
> > >   pending SIGKILL. In addition we perform another check in
> > >   cgroup_post_fork() whether the cgroup is being killed and if so take
> > >   down the child (see above). This is robust enough and protects against
> > >   forkbombs. If userspace really really wants to have stricter
> > >   protection the simple solution would be to grab the write side of the
> > >   cgroup threadgroup rwsem which will force all ongoing forks to
> > >   complete before killing starts. We concluded that this is not
> > >   necessary as the semantics for concurrent forking should simply align
> > >   with freezer where a similar check as cgroup_post_fork() is performed.
> > > 
> > >   For all other cases CLONE_INTO_CGROUP is required. In this case we
> > >   will grab the cgroup mutex so the cgroup can't be killed while we
> > >   fork. Once we're done with the fork and have dropped cgroup mutex we
> > >   are visible and will be found by any subsequent kill request.
> > > - We obviously don't kill kthreads. This means a cgroup that has a
> > >   kthread will not become empty after killing and consequently no
> > >   unpopulated event will be generated. The assumption is that kthreads
> > >   should be in the root cgroup only anyway so this is not an issue.
> > > - We skip killing tasks that already have pending fatal signals.
> > > - Freezer doesn't care about tasks in different pid namespaces, i.e. if
> > >   you have two tasks in different pid namespaces the cgroup would still
> > >   be frozen. The cgroup.kill mechanism consequently behaves the same
> > >   way, i.e. we kill all processes and ignore in which pid namespace they
> > >   exist.
> > > - If the caller is located in a cgroup that is killed the caller will
> > >   obviously be killed as well.
> > > 
> > > Cc: Shakeel Butt <shakeelb@google.com>
> > > Cc: Roman Gushchin <guro@fb.com>
> > > Cc: Tejun Heo <tj@kernel.org>
> > > Cc: cgroups@vger.kernel.org
> > > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > > ---
> > > 
> > > The series can be pulled from
> > > 
> > > git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/cgroup.kill.v5.14
> > > 
> > > /* v2 */
> > > - Roman Gushchin <guro@fb.com>:
> > >   - Retrieve cgrp->flags only once and check CGRP_* bits on it.
> > > ---
> > >  include/linux/cgroup-defs.h |   3 +
> > >  kernel/cgroup/cgroup.c      | 127 ++++++++++++++++++++++++++++++++----
> > >  2 files changed, 116 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> > > index 559ee05f86b2..43fef771009a 100644
> > > --- a/include/linux/cgroup-defs.h
> > > +++ b/include/linux/cgroup-defs.h
> > > @@ -71,6 +71,9 @@ enum {
> > >  
> > >  	/* Cgroup is frozen. */
> > >  	CGRP_FROZEN,
> > > +
> > > +	/* Control group has to be killed. */
> > > +	CGRP_KILL,
> > >  };
> > >  
> > >  /* cgroup_root->flags */
> > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > > index 9153b20e5cc6..aee84b99534a 100644
> > > --- a/kernel/cgroup/cgroup.c
> > > +++ b/kernel/cgroup/cgroup.c
> > > @@ -3654,6 +3654,80 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
> > >  	return nbytes;
> > >  }
> > >  
> > > +static void __cgroup_kill(struct cgroup *cgrp)
> > > +{
> > > +	struct css_task_iter it;
> > > +	struct task_struct *task;
> > > +
> > > +	lockdep_assert_held(&cgroup_mutex);
> > > +
> > > +	spin_lock_irq(&css_set_lock);
> > > +	set_bit(CGRP_KILL, &cgrp->flags);
> > > +	spin_unlock_irq(&css_set_lock);
> > > +
> > > +	css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
> > > +	while ((task = css_task_iter_next(&it))) {
> > > +		/* Ignore kernel threads here. */
> > > +		if (task->flags & PF_KTHREAD)
> > > +			continue;
> > > +
> > > +		/* Skip tasks that are already dying. */
> > > +		if (__fatal_signal_pending(task))
> > > +			continue;
> > > +
> > > +		send_sig(SIGKILL, task, 0);
> > > +	}
> > > +	css_task_iter_end(&it);
> > > +
> > > +	spin_lock_irq(&css_set_lock);
> > > +	clear_bit(CGRP_KILL, &cgrp->flags);
> > > +	spin_unlock_irq(&css_set_lock);
> > > +}
> > > +
> > > +static void cgroup_kill(struct cgroup *cgrp)
> > > +{
> > > +	struct cgroup_subsys_state *css;
> > > +	struct cgroup *dsct;
> > > +
> > > +	lockdep_assert_held(&cgroup_mutex);
> > > +
> > > +	cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
> > > +		__cgroup_kill(dsct);
> > > +}
> > > +
> > > +static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
> > > +				 size_t nbytes, loff_t off)
> > > +{
> > > +	ssize_t ret = 0;
> > > +	int kill;
> > > +	struct cgroup *cgrp;
> > > +
> > > +	ret = kstrtoint(strstrip(buf), 0, &kill);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	if (kill != 1)
> > > +		return -ERANGE;
> > > +
> > > +	cgrp = cgroup_kn_lock_live(of->kn, false);
> > > +	if (!cgrp)
> > > +		return -ENOENT;
> > > +
> > > +	/*
> > > +	 * Killing is a process directed operation, i.e. the whole thread-group
> > > +	 * is taken down so act like we do for cgroup.procs and only make this
> > > +	 * writable in non-threaded cgroups.
> > > +	 */
> > > +	if (cgroup_is_threaded(cgrp))
> > > +		ret = -EOPNOTSUPP;
> > > +	else
> > > +		cgroup_kill(cgrp);
> > > +
> > > +	cgroup_kn_unlock(of->kn);
> > > +
> > > +	return ret ?: nbytes;
> > > +}
> > > +
> > >  static int cgroup_file_open(struct kernfs_open_file *of)
> > >  {
> > >  	struct cftype *cft = of_cft(of);
> > > @@ -4846,6 +4920,11 @@ static struct cftype cgroup_base_files[] = {
> > >  		.seq_show = cgroup_freeze_show,
> > >  		.write = cgroup_freeze_write,
> > >  	},
> > > +	{
> > > +		.name = "cgroup.kill",
> > > +		.flags = CFTYPE_NOT_ON_ROOT,
> > > +		.write = cgroup_kill_write,
> > > +	},
> > >  	{
> > >  		.name = "cpu.stat",
> > >  		.seq_show = cpu_stat_show,
> > > @@ -6077,6 +6156,8 @@ void cgroup_post_fork(struct task_struct *child,
> > >  		      struct kernel_clone_args *kargs)
> > >  	__releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
> > >  {
> > > +	unsigned long cgrp_flags = 0;
> > > +	bool kill = false;
> > >  	struct cgroup_subsys *ss;
> > >  	struct css_set *cset;
> > >  	int i;
> > > @@ -6088,6 +6169,11 @@ void cgroup_post_fork(struct task_struct *child,
> > >  
> > >  	/* init tasks are special, only link regular threads */
> > >  	if (likely(child->pid)) {
> > > +		if (kargs->cgrp)
> > > +			cgrp_flags = kargs->cgrp->flags;
> > > +		else
> > > +			cgrp_flags = cset->dfl_cgrp->flags;
> > > +
> > >  		WARN_ON_ONCE(!list_empty(&child->cg_list));
> > >  		cset->nr_tasks++;
> > >  		css_set_move_task(child, NULL, cset, false);
> > > @@ -6096,23 +6182,32 @@ void cgroup_post_fork(struct task_struct *child,
> > >  		cset = NULL;
> > >  	}
> > >  
> > > -	/*
> > > -	 * If the cgroup has to be frozen, the new task has too.  Let's set
> > > -	 * the JOBCTL_TRAP_FREEZE jobctl bit to get the task into the
> > > -	 * frozen state.
> > > -	 */
> > > -	if (unlikely(cgroup_task_freeze(child))) {
> > > -		spin_lock(&child->sighand->siglock);
> > > -		WARN_ON_ONCE(child->frozen);
> > > -		child->jobctl |= JOBCTL_TRAP_FREEZE;
> > > -		spin_unlock(&child->sighand->siglock);
> > > +	if (!(child->flags & PF_KTHREAD)) {
> > > +		if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
> > > +			/*
> > > +			 * If the cgroup has to be frozen, the new task has
> > > +			 * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
> > > +			 * get the task into the frozen state.
> > > +			 */
> > > +			spin_lock(&child->sighand->siglock);
> > > +			WARN_ON_ONCE(child->frozen);
> > > +			child->jobctl |= JOBCTL_TRAP_FREEZE;
> > > +			spin_unlock(&child->sighand->siglock);
> > > +
> > > +			/*
> > > +			 * Calling cgroup_update_frozen() isn't required here,
> > > +			 * because it will be called anyway a bit later from
> > > +			 * do_freezer_trap(). So we avoid cgroup's transient
> > > +			 * switch from the frozen state and back.
> > > +			 */
> > > +		}
> > 
> > I think this part can be optimized a bit further:
> > 1) we don't need atomic test_bit() here
> > 2) all PF_KTHREAD, CGRP_FREEZE and CGRP_KILL cases are very unlikely
> > 
> > So something like this could work (completely untested):
> > 
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index 0965b44ff425..f567ca69119d 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -6190,13 +6190,15 @@ void cgroup_post_fork(struct task_struct *child,
> >                 cset = NULL;
> >         }
> >  
> > -       if (!(child->flags & PF_KTHREAD)) {
> > -               if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
> > -                       /*
> > -                        * If the cgroup has to be frozen, the new task has
> > -                        * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
> > -                        * get the task into the frozen state.
> > -                        */
> > +
> > +       if (unlikely(!(child->flags & PF_KTHREAD) &&
> > +                    cgrp_flags & (CGRP_FREEZE | CGRP_KILL))) {
> 
> The unlikely might make sense.
> 
> But hm, I'm not a fan of the CGRP_FREEZE and CGRP_KILL check without
> test_bit(). That seems a bit ugly. Especially since nowhere in
> kernel/cgroup.c are these bits checked without test_bit().
> 
> Also, this wouldn't work afaict at least not for all values since
> CGRP_FREEZE and CGRP_KILL aren't flags, they're bits defined in an enum.
> (In contrast to cgroup_root->flags which are defined as flags _in an
> enum_.) So it seems they should really be used with test_bit. Otherwise
> this would probably have to be sm like
> 
> if (unlikely(!(child->flags & PF_KTHREAD) &&
> 	(cgrp_flags & (BIT_ULL(CGRP_FREEZE) | BIT_ULL(CGRP_KILL))) {
> 	.
> 	.
> 	.
> 
> which seems unreadable and makes the rest of cgroup.c for these values
> inconsistent.
> Note that before the check was the same for CGRP_FREEZE it was just
> hidden in the helper.
> I really think we should just leave the test_bit() checks.

Ok, you're right that it was (hidden) test_bit() here previously, so I
can't blame your patch and should blame my original code instead :)

I don't know how badly we need to optimize this place, and maybe having
two checks is a good price for simpler code. After all, we can optimize
it later.

With this said:
Acked-by: Roman Gushchin <guro@fb.com>

Thank you!

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-05-05 19:30 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-03 14:39 [PATCH v2 1/5] cgroup: introduce cgroup.kill Christian Brauner
2021-05-03 14:39 ` [PATCH v2 2/5] docs/cgroup: add entry for cgroup.kill Christian Brauner
2021-05-05 16:29   ` Shakeel Butt
2021-05-03 14:39 ` [PATCH v2 3/5] tests/cgroup: use cgroup.kill in cg_killall() Christian Brauner
2021-05-05 16:42   ` Shakeel Butt
2021-05-03 14:39 ` [PATCH v2 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait() Christian Brauner
2021-05-05 16:43   ` Shakeel Butt
2021-05-03 14:39 ` [PATCH v2 5/5] tests/cgroup: test cgroup.kill Christian Brauner
2021-05-05 18:34   ` Shakeel Butt
2021-05-05 18:52     ` Christian Brauner
2021-05-03 17:18 ` [PATCH v2 1/5] cgroup: introduce cgroup.kill Shakeel Butt
2021-05-04  1:47 ` Serge E. Hallyn
2021-05-04 18:29 ` Shakeel Butt
2021-05-05 17:57 ` Roman Gushchin
2021-05-05 18:46   ` Christian Brauner
2021-05-05 19:13     ` Roman Gushchin
2021-05-05 18:31 ` Eric W. Biederman
2021-05-05 18:49   ` Christian Brauner
