* [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-23  3:08 ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-02-23  3:08 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

The current state of resource limitation for the number of processes
(as well as the number of open file descriptors) requires you to use
setrlimit(2), which means that you are limited to applying resource
limits to process trees rather than to cgroups (which is the point of
cgroups).

There was a patch to implement this in 2011[1], but it was rejected
because it implemented a general-purpose rlimit subsystem -- which meant
that you couldn't control distinct resource limits in different
hierarchies. This patch implements a resource controller *specifically*
for the number of processes in a cgroup, overcoming this issue.

There has been a similar attempt to implement a resource controller for
the number of open file descriptors[2], which was not merged for
reasons that were dubious. Merely from a "sane interface" perspective,
it should be possible to utilise cgroups to do such rudimentary
resource management (which currently only exists for process trees).

Aleksa Sarai (2):
  cgroups: allow a cgroup subsystem to reject a fork
  cgroups: add an nproc subsystem

 include/linux/cgroup.h        |   9 ++-
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  10 +++
 kernel/Makefile               |   1 +
 kernel/cgroup.c               |  13 ++-
 kernel/cgroup_freezer.c       |   6 +-
 kernel/cgroup_nproc.c         | 181 ++++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                 |   4 +-
 kernel/sched/core.c           |   3 +-
 9 files changed, 221 insertions(+), 10 deletions(-)
 create mode 100644 kernel/cgroup_nproc.c

[1]: https://lkml.org/lkml/2011/6/19/170
[2]: https://lkml.org/lkml/2014/7/2/640

-- 
2.3.0


^ permalink raw reply	[flat|nested] 108+ messages in thread


* [PATCH RFC 1/2] cgroups: allow a cgroup subsystem to reject a fork
  2015-02-23  3:08 ` Aleksa Sarai
  (?)
@ 2015-02-23  3:08 ` Aleksa Sarai
  2015-02-23 14:49   ` Peter Zijlstra
  -1 siblings, 1 reply; 108+ messages in thread
From: Aleksa Sarai @ 2015-02-23  3:08 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

NOTE: I'm not sure if I'm doing enough cleanup inside copy_process(),
because a bunch of stuff happens between the last valid goto to the
bad_fork_free_pid label and cgroup_post_fork().

What is the correct way of doing cleanup this late inside
copy_process()?

8<----------------------------------------------------------------------

Make the cgroup subsystem post fork callback return an error code so
that subsystems can accept or reject a fork from completing with a
custom error value.

This is in preparation for implementing the numtasks cgroup scope.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/cgroup.h  |  9 ++++++---
 kernel/cgroup.c         | 13 ++++++++++---
 kernel/cgroup_freezer.c |  6 ++++--
 kernel/fork.c           |  4 +++-
 kernel/sched/core.c     |  3 ++-
 5 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index da0dae0..91718ff 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -32,7 +32,7 @@ struct cgroup;
 extern int cgroup_init_early(void);
 extern int cgroup_init(void);
 extern void cgroup_fork(struct task_struct *p);
-extern void cgroup_post_fork(struct task_struct *p);
+extern int cgroup_post_fork(struct task_struct *p);
 extern void cgroup_exit(struct task_struct *p);
 extern int cgroupstats_build(struct cgroupstats *stats,
 				struct dentry *dentry);
@@ -649,7 +649,7 @@ struct cgroup_subsys {
 			      struct cgroup_taskset *tset);
 	void (*attach)(struct cgroup_subsys_state *css,
 		       struct cgroup_taskset *tset);
-	void (*fork)(struct task_struct *task);
+	int (*fork)(struct task_struct *task);
 	void (*exit)(struct cgroup_subsys_state *css,
 		     struct cgroup_subsys_state *old_css,
 		     struct task_struct *task);
@@ -946,7 +946,10 @@ struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_fork(struct task_struct *p) {}
-static inline void cgroup_post_fork(struct task_struct *p) {}
+static inline int cgroup_post_fork(struct task_struct *p)
+{
+	return 0;
+}
 static inline void cgroup_exit(struct task_struct *p) {}
 
 static inline int cgroupstats_build(struct cgroupstats *stats,
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 04cfe8a..82ecb6f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5191,7 +5191,7 @@ void cgroup_fork(struct task_struct *child)
  * cgroup_task_iter_start() - to guarantee that the new task ends up on its
  * list.
  */
-void cgroup_post_fork(struct task_struct *child)
+int cgroup_post_fork(struct task_struct *child)
 {
 	struct cgroup_subsys *ss;
 	int i;
@@ -5236,10 +5236,17 @@ void cgroup_post_fork(struct task_struct *child)
 	 * and addition to css_set.
 	 */
 	if (need_forkexit_callback) {
+		int ret;
+
 		for_each_subsys(ss, i)
-			if (ss->fork)
-				ss->fork(child);
+			if (ss->fork) {
+				ret = ss->fork(child);
+				if (ret)
+					return ret;
+			}
 	}
+
+	return 0;
 }
 
 /**
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 92b98cc..f5906b7 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -203,7 +203,7 @@ static void freezer_attach(struct cgroup_subsys_state *new_css,
  * to do anything as freezer_attach() will put @task into the appropriate
  * state.
  */
-static void freezer_fork(struct task_struct *task)
+static int freezer_fork(struct task_struct *task)
 {
 	struct freezer *freezer;
 
@@ -215,7 +215,7 @@ static void freezer_fork(struct task_struct *task)
 	 * right thing to do.
 	 */
 	if (task_css_is_root(task, freezer_cgrp_id))
-		return;
+		return 0;
 
 	mutex_lock(&freezer_mutex);
 	rcu_read_lock();
@@ -226,6 +226,8 @@ static void freezer_fork(struct task_struct *task)
 
 	rcu_read_unlock();
 	mutex_unlock(&freezer_mutex);
+
+	return 0;
 }
 
 /**
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2dda..ff12e23 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1541,7 +1541,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	write_unlock_irq(&tasklist_lock);
 
 	proc_fork_connector(p);
-	cgroup_post_fork(p);
+	retval = cgroup_post_fork(p);
+	if (retval)
+		goto bad_fork_free_pid;
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_change_end(current);
 	perf_event_fork(p);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5eab11d..9b9f970 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8010,9 +8010,10 @@ static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
 	sched_offline_group(tg);
 }
 
-static void cpu_cgroup_fork(struct task_struct *task)
+static int cpu_cgroup_fork(struct task_struct *task)
 {
 	sched_move_task(task);
+	return 0;
 }
 
 static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
-- 
2.3.0


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC 2/2] cgroups: add an nproc subsystem
  2015-02-23  3:08 ` Aleksa Sarai
  (?)
  (?)
@ 2015-02-23  3:08 ` Aleksa Sarai
  -1 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-02-23  3:08 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

Adds a new single-purpose nproc subsystem to limit the number of
tasks that can run inside a cgroup. Essentially this is an
implementation of RLIMIT_NPROC that applies to a cgroup rather than a
process tree.

This is a step towards being able to limit the global impact of a fork
bomb inside a cgroup, allowing cgroups to perform fairly basic resource
limitation, which they currently cannot do.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  10 +++
 kernel/Makefile               |   1 +
 kernel/cgroup_nproc.c         | 181 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 196 insertions(+)
 create mode 100644 kernel/cgroup_nproc.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 98c4f9b..e83e0ac 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -47,6 +47,10 @@ SUBSYS(net_prio)
 SUBSYS(hugetlb)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_NPROC)
+SUBSYS(nproc)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index 9afb971..d6315fe 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1047,6 +1047,16 @@ config CGROUP_HUGETLB
 	  control group is tracked in the third page lru pointer. This means
 	  that we cannot use the controller with huge page less than 3 pages.
 
+config CGROUP_NPROC
+	bool "Process number limiting on cgroups"
+	depends on PAGE_COUNTER
+	help
+	  This option enables the setting of process number limits in the scope
+	  of a cgroup. Any attempt to fork more processes than is allowed in the
+	  cgroup will fail. This allows for more basic resource limitation that
+	  applies to a cgroup, similar to RLIMIT_NPROC (except that instead of
+	  applying to a process tree it applies to a cgroup).
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/kernel/Makefile b/kernel/Makefile
index a59481a..10c4b40 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_NPROC) += cgroup_nproc.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_nproc.c b/kernel/cgroup_nproc.c
new file mode 100644
index 0000000..414d8d5
--- /dev/null
+++ b/kernel/cgroup_nproc.c
@@ -0,0 +1,181 @@
+/*
+ * Process number limiting subsys for cgroups.
+ *
+ * Copyright (C) 2015 Aleksa Sarai <cyphar@cyphar.com>
+ *
+ * Thanks to Frederic Weisbecker for creating the seminal patches which lead to
+ * this being written.
+ *
+ */
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/page_counter.h>
+
+struct nproc {
+	struct page_counter		proc_counter;
+	struct cgroup_subsys_state	css;
+};
+
+static inline struct nproc *css_nproc(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct nproc, css) : NULL;
+}
+
+static inline struct nproc *task_nproc(struct task_struct *task)
+{
+	return css_nproc(task_css(task, nproc_cgrp_id));
+}
+
+static struct nproc *parent_nproc(struct nproc *nproc)
+{
+	return css_nproc(nproc->css.parent);
+}
+
+static struct cgroup_subsys_state *nproc_css_alloc(struct cgroup_subsys_state *parent)
+{
+	struct nproc *nproc;
+
+	nproc = kzalloc(sizeof(struct nproc), GFP_KERNEL);
+	if (!nproc)
+		return ERR_PTR(-ENOMEM);
+
+	return &nproc->css;
+}
+
+static int nproc_css_online(struct cgroup_subsys_state *css)
+{
+	struct nproc *nproc = css_nproc(css);
+	struct nproc *parent = parent_nproc(nproc);
+
+	if (!parent) {
+		page_counter_init(&nproc->proc_counter, NULL);
+		return 0;
+	}
+
+	page_counter_init(&nproc->proc_counter, &parent->proc_counter);
+	return page_counter_limit(&nproc->proc_counter, parent->proc_counter.limit);
+}
+
+static void nproc_css_free(struct cgroup_subsys_state *css)
+{
+	kfree(css_nproc(css));
+}
+
+static inline void nproc_remove_procs(struct nproc *nproc, int num_procs)
+{
+	page_counter_uncharge(&nproc->proc_counter, num_procs);
+}
+
+static inline int nproc_add_procs(struct nproc *nproc, int num_procs)
+{
+	struct page_counter *fail_at;
+	int errcode;
+
+	errcode = page_counter_try_charge(&nproc->proc_counter, num_procs, &fail_at);
+	if (errcode)
+		return -EAGAIN;
+
+	return 0;
+}
+
+static void nproc_cancel_attach(struct cgroup_subsys_state *css,
+				struct cgroup_taskset *tset)
+{
+	struct nproc *nproc = css_nproc(css);
+	unsigned long num_tasks = 0;
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, tset)
+		num_tasks++;
+
+	nproc_remove_procs(nproc, num_tasks);
+}
+
+static int nproc_can_attach(struct cgroup_subsys_state *css,
+			    struct cgroup_taskset *tset)
+{
+	struct nproc *nproc = css_nproc(css);
+	unsigned long num_tasks = 0;
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, tset)
+		num_tasks++;
+
+	return nproc_add_procs(nproc, num_tasks);
+}
+
+static int nproc_fork(struct task_struct *task)
+{
+	struct nproc *nproc = task_nproc(task);
+
+	return nproc_add_procs(nproc, 1);
+}
+
+static void nproc_exit(struct cgroup_subsys_state *css,
+		       struct cgroup_subsys_state *old_css,
+		       struct task_struct *task)
+{
+	struct nproc *nproc = css_nproc(old_css);
+
+	nproc_remove_procs(nproc, 1);
+}
+
+static int nproc_write_limit(struct cgroup_subsys_state *css,
+			     struct cftype *cft, u64 val)
+{
+	struct nproc *nproc = css_nproc(css);
+
+	return page_counter_limit(&nproc->proc_counter, val);
+}
+
+static u64 nproc_read_limit(struct cgroup_subsys_state *css,
+			    struct cftype *cft)
+{
+	struct nproc *nproc = css_nproc(css);
+
+	return nproc->proc_counter.limit;
+}
+
+static u64 nproc_read_max_limit(struct cgroup_subsys_state *css,
+				       struct cftype *cft)
+{
+	return PAGE_COUNTER_MAX;
+}
+
+static u64 nproc_read_usage(struct cgroup_subsys_state *css,
+			    struct cftype *cft)
+{
+	struct nproc *nproc = css_nproc(css);
+
+	return page_counter_read(&nproc->proc_counter);
+}
+
+static struct cftype files[] = {
+	{
+		.name = "limit",
+		.write_u64 = nproc_write_limit,
+		.read_u64 = nproc_read_limit,
+	},
+	{
+		.name = "max_limit",
+		.read_u64 = nproc_read_max_limit,
+	},
+	{
+		.name = "usage",
+		.read_u64 = nproc_read_usage,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys nproc_cgrp_subsys = {
+	.css_alloc	= nproc_css_alloc,
+	.css_online	= nproc_css_online,
+	.css_free	= nproc_css_free,
+	.can_attach	= nproc_can_attach,
+	.cancel_attach	= nproc_cancel_attach,
+	.fork		= nproc_fork,
+	.exit		= nproc_exit,
+	.legacy_cftypes	= files,
+	.early_init	= 0,
+};
-- 
2.3.0


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 1/2] cgroups: allow a cgroup subsystem to reject a fork
  2015-02-23  3:08 ` [PATCH RFC 1/2] cgroups: allow a cgroup subsystem to reject a fork Aleksa Sarai
@ 2015-02-23 14:49   ` Peter Zijlstra
  0 siblings, 0 replies; 108+ messages in thread
From: Peter Zijlstra @ 2015-02-23 14:49 UTC (permalink / raw)
  To: Aleksa Sarai; +Cc: tj, lizefan, mingo, richard, fweisbec, linux-kernel, cgroups

On Mon, Feb 23, 2015 at 02:08:10PM +1100, Aleksa Sarai wrote:
> NOTE: I'm not sure if I'm doing enough cleanup inside copy_process(),
> because a bunch of stuff happens between the last valid goto to the
> bad_fork_free_pid label and cgroup_post_fork().
> 
> What is the correct way of doing cleanup this late inside
> copy_process()?

It's not; you're past the point of fail. You've already exposed the new
process.

If you want to allow fail, you'll have to do it earlier.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [RFC PATCH v2 0/2] add nproc cgroup subsystem
@ 2015-02-27  4:17   ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-02-27  4:17 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

This is an updated version of the nproc patchset[1], in which the forking
cleanup issue has been resolved by adding can_fork and cancel_fork
callbacks to cgroup subsystems. The can_fork callback is run early
enough that it is not called after the "point of no return" where the
process is exposed (which is when the fork() callback is called), and
cancel_fork is run during the cleanup of copy_process() if the fork
fails for other reasons.

[1]: https://lkml.org/lkml/2015/2/22/204

Aleksa Sarai (2):
  cgroups: allow a cgroup subsystem to reject a fork
  cgroups: add an nproc subsystem

 include/linux/cgroup.h        |   9 ++
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  10 +++
 kernel/Makefile               |   1 +
 kernel/cgroup.c               |  80 +++++++++++++----
 kernel/cgroup_nproc.c         | 198 ++++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                 |  12 ++-
 7 files changed, 296 insertions(+), 18 deletions(-)
 create mode 100644 kernel/cgroup_nproc.c

-- 
2.3.1


^ permalink raw reply	[flat|nested] 108+ messages in thread


* [PATCH v2 1/2] cgroups: allow a cgroup subsystem to reject a fork
@ 2015-02-27  4:17     ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-02-27  4:17 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

Add a new cgroup subsystem callback can_fork that conditionally
states whether or not the fork is accepted or rejected with a cgroup
policy.

Make the cgroup subsystem can_fork callback return an error code so
that subsystems can accept or reject a fork from completing with a
custom error value, before the process is exposed.

In addition, add a cancel_fork callback so that if an error occurs later
in the forking process, any state modified by can_fork can be reverted.

In order for can_fork to deal with a task that has an accurate css_set,
move the css_set updating to cgroup_fork (where it belongs).

This is in preparation for implementing the nproc cgroup subsystem.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/cgroup.h |  9 ++++++
 kernel/cgroup.c        | 80 +++++++++++++++++++++++++++++++++++++++-----------
 kernel/fork.c          | 12 +++++++-
 3 files changed, 83 insertions(+), 18 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index da0dae0..9897533 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -32,6 +32,8 @@ struct cgroup;
 extern int cgroup_init_early(void);
 extern int cgroup_init(void);
 extern void cgroup_fork(struct task_struct *p);
+extern int cgroup_can_fork(struct task_struct *p);
+extern void cgroup_cancel_fork(struct task_struct *p);
 extern void cgroup_post_fork(struct task_struct *p);
 extern void cgroup_exit(struct task_struct *p);
 extern int cgroupstats_build(struct cgroupstats *stats,
@@ -649,6 +651,8 @@ struct cgroup_subsys {
 			      struct cgroup_taskset *tset);
 	void (*attach)(struct cgroup_subsys_state *css,
 		       struct cgroup_taskset *tset);
+	int (*can_fork)(struct task_struct *task);
+	void (*cancel_fork)(struct task_struct *task);
 	void (*fork)(struct task_struct *task);
 	void (*exit)(struct cgroup_subsys_state *css,
 		     struct cgroup_subsys_state *old_css,
@@ -946,6 +950,11 @@ struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_fork(struct task_struct *p) {}
+static inline int cgroup_can_fork(struct task_struct *p)
+{
+	return 0;
+}
+static inline void cgroup_cancel_fork(struct task_struct *p) {}
 static inline void cgroup_post_fork(struct task_struct *p) {}
 static inline void cgroup_exit(struct task_struct *p) {}
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 04cfe8a..f062350 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4928,7 +4928,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	 * init_css_set is in the subsystem's root cgroup. */
 	init_css_set.subsys[ss->id] = css;
 
-	need_forkexit_callback |= ss->fork || ss->exit;
+	need_forkexit_callback |= ss->can_fork || ss->cancel_fork || ss->fork || ss->exit;
 
 	/* At system boot, before all subsystems have been
 	 * registered, no tasks have been forked, so we don't
@@ -5179,22 +5179,6 @@ void cgroup_fork(struct task_struct *child)
 {
 	RCU_INIT_POINTER(child->cgroups, &init_css_set);
 	INIT_LIST_HEAD(&child->cg_list);
-}
-
-/**
- * cgroup_post_fork - called on a new task after adding it to the task list
- * @child: the task in question
- *
- * Adds the task to the list running through its css_set if necessary and
- * call the subsystem fork() callbacks.  Has to be after the task is
- * visible on the task list in case we race with the first call to
- * cgroup_task_iter_start() - to guarantee that the new task ends up on its
- * list.
- */
-void cgroup_post_fork(struct task_struct *child)
-{
-	struct cgroup_subsys *ss;
-	int i;
 
 	/*
 	 * This may race against cgroup_enable_task_cg_lists().  As that
@@ -5229,6 +5213,68 @@ void cgroup_post_fork(struct task_struct *child)
 		}
 		up_write(&css_set_rwsem);
 	}
+}
+
+/**
+ * cgroup_can_fork - called on a new task before the process is exposed.
+ * @child: the task in question.
+ *
+ * This calls the subsystem can_fork() callbacks. If the can_fork() callback
+ * returns an error, the fork aborts with that error code. This allows for
+ * a cgroup subsystem to conditionally allow or deny new forks.
+ */
+int cgroup_can_fork(struct task_struct *child)
+{
+	struct cgroup_subsys *ss;
+	int i;
+
+	if (need_forkexit_callback) {
+		int retval;
+
+		for_each_subsys(ss, i)
+			if (ss->can_fork) {
+				retval = ss->can_fork(child);
+				if (retval)
+					return retval;
+			}
+	}
+
+	return 0;
+}
+
+/**
+ * cgroup_cancel_fork - called if a fork failed after cgroup_can_fork()
+ * @child: the task in question
+ *
+ * This calls the cancel_fork() callbacks if a fork failed *after*
+ * cgroup_can_fork() succeeded.
+ */
+void cgroup_cancel_fork(struct task_struct *child)
+{
+	struct cgroup_subsys *ss;
+	int i;
+
+	if (need_forkexit_callback) {
+		for_each_subsys(ss, i)
+			if (ss->cancel_fork)
+				ss->cancel_fork(child);
+	}
+}
+
+/**
+ * cgroup_post_fork - called on a new task after adding it to the task list
+ * @child: the task in question
+ *
+ * Adds the task to the list running through its css_set if necessary and
+ * call the subsystem fork() callbacks.  Has to be after the task is
+ * visible on the task list in case we race with the first call to
+ * cgroup_task_iter_start() - to guarantee that the new task ends up on its
+ * list.
+ */
+void cgroup_post_fork(struct task_struct *child)
+{
+	struct cgroup_subsys *ss;
+	int i;
 
 	/*
 	 * Call ss->fork().  This must happen after @child is linked on
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2dda..e84ce86 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1464,6 +1464,14 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->task_works = NULL;
 
 	/*
+	 * Ensure that the cgroup subsystem policies allow the new process to be
+	 * forked.
+	 */
+	retval = cgroup_can_fork(p);
+	if (retval)
+		goto bad_fork_free_pid;
+
+	/*
 	 * Make it visible to the rest of the system, but dont wake it up yet.
 	 * Need tasklist lock for parent etc handling!
 	 */
@@ -1499,7 +1507,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		spin_unlock(&current->sighand->siglock);
 		write_unlock_irq(&tasklist_lock);
 		retval = -ERESTARTNOINTR;
-		goto bad_fork_free_pid;
+		goto bad_fork_cgroup_cancel;
 	}
 
 	if (likely(p->pid)) {
@@ -1551,6 +1559,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 
 	return p;
 
+bad_fork_cgroup_cancel:
+	cgroup_cancel_fork(p);
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
-- 
2.3.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH v2 1/2] cgroups: allow a cgroup subsystem to reject a fork
@ 2015-02-27  4:17     ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-02-27  4:17 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ
  Cc: richard-/L3Ra7n9ekc, fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Aleksa Sarai

Add a new cgroup subsystem callback can_fork that conditionally
states whether or not the fork is accepted or rejected with a cgroup
policy.

Make the cgroup subsystem can_fork callback return an error code so
that subsystems can accept or reject a fork from completing with a
custom error value, before the process is exposed.

In addition, add a cancel_fork callback so that if an error occurs later
in the forking process, any state modified by can_fork can be reverted.

In order for can_fork to deal with a task that has an accurate css_set,
move the css_set updating to cgroup_fork (where it belongs).

This is in preparation for implementing the nproc cgroup subsystem.

Signed-off-by: Aleksa Sarai <cyphar-gVpy/LI/lHzQT0dZR+AlfA@public.gmane.org>
---
 include/linux/cgroup.h |  9 ++++++
 kernel/cgroup.c        | 80 +++++++++++++++++++++++++++++++++++++++-----------
 kernel/fork.c          | 12 +++++++-
 3 files changed, 83 insertions(+), 18 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index da0dae0..9897533 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -32,6 +32,8 @@ struct cgroup;
 extern int cgroup_init_early(void);
 extern int cgroup_init(void);
 extern void cgroup_fork(struct task_struct *p);
+extern int cgroup_can_fork(struct task_struct *p);
+extern void cgroup_cancel_fork(struct task_struct *p);
 extern void cgroup_post_fork(struct task_struct *p);
 extern void cgroup_exit(struct task_struct *p);
 extern int cgroupstats_build(struct cgroupstats *stats,
@@ -649,6 +651,8 @@ struct cgroup_subsys {
 			      struct cgroup_taskset *tset);
 	void (*attach)(struct cgroup_subsys_state *css,
 		       struct cgroup_taskset *tset);
+	int (*can_fork)(struct task_struct *task);
+	void (*cancel_fork)(struct task_struct *task);
 	void (*fork)(struct task_struct *task);
 	void (*exit)(struct cgroup_subsys_state *css,
 		     struct cgroup_subsys_state *old_css,
@@ -946,6 +950,11 @@ struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_fork(struct task_struct *p) {}
+static inline int cgroup_can_fork(struct task_struct *p)
+{
+	return 0;
+}
+static inline void cgroup_cancel_fork(struct task_struct *p) {}
 static inline void cgroup_post_fork(struct task_struct *p) {}
 static inline void cgroup_exit(struct task_struct *p) {}
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 04cfe8a..f062350 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4928,7 +4928,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	 * init_css_set is in the subsystem's root cgroup. */
 	init_css_set.subsys[ss->id] = css;
 
-	need_forkexit_callback |= ss->fork || ss->exit;
+	need_forkexit_callback |= ss->can_fork || ss->cancel_fork || ss->fork || ss->exit;
 
 	/* At system boot, before all subsystems have been
 	 * registered, no tasks have been forked, so we don't
@@ -5179,22 +5179,6 @@ void cgroup_fork(struct task_struct *child)
 {
 	RCU_INIT_POINTER(child->cgroups, &init_css_set);
 	INIT_LIST_HEAD(&child->cg_list);
-}
-
-/**
- * cgroup_post_fork - called on a new task after adding it to the task list
- * @child: the task in question
- *
- * Adds the task to the list running through its css_set if necessary and
- * call the subsystem fork() callbacks.  Has to be after the task is
- * visible on the task list in case we race with the first call to
- * cgroup_task_iter_start() - to guarantee that the new task ends up on its
- * list.
- */
-void cgroup_post_fork(struct task_struct *child)
-{
-	struct cgroup_subsys *ss;
-	int i;
 
 	/*
 	 * This may race against cgroup_enable_task_cg_lists().  As that
@@ -5229,6 +5213,68 @@ void cgroup_post_fork(struct task_struct *child)
 		}
 		up_write(&css_set_rwsem);
 	}
+}
+
+/**
+ * cgroup_can_fork - called on a new task before the process is exposed.
+ * @child: the task in question.
+ *
+ * This calls the subsystem can_fork() callbacks. If the can_fork() callback
+ * returns an error, the fork aborts with that error code. This allows
+ * a cgroup subsystem to conditionally allow or deny new forks.
+ */
+int cgroup_can_fork(struct task_struct *child)
+{
+	struct cgroup_subsys *ss;
+	int i;
+
+	if (need_forkexit_callback) {
+		int retval;
+
+		for_each_subsys(ss, i)
+			if (ss->can_fork) {
+				retval = ss->can_fork(child);
+				if (retval)
+					return retval;
+			}
+	}
+
+	return 0;
+}
+
+/**
+ * cgroup_cancel_fork - called if a fork failed after cgroup_can_fork()
+ * @child: the task in question
+ *
+ * This calls the cancel_fork() callbacks if a fork failed *after*
+ * cgroup_can_fork() succeeded.
+ */
+void cgroup_cancel_fork(struct task_struct *child)
+{
+	struct cgroup_subsys *ss;
+	int i;
+
+	if (need_forkexit_callback) {
+		for_each_subsys(ss, i)
+			if (ss->cancel_fork)
+				ss->cancel_fork(child);
+	}
+}
+
+/**
+ * cgroup_post_fork - called on a new task after adding it to the task list
+ * @child: the task in question
+ *
+ * Adds the task to the list running through its css_set if necessary and
+ * call the subsystem fork() callbacks.  Has to be after the task is
+ * visible on the task list in case we race with the first call to
+ * cgroup_task_iter_start() - to guarantee that the new task ends up on its
+ * list.
+ */
+void cgroup_post_fork(struct task_struct *child)
+{
+	struct cgroup_subsys *ss;
+	int i;
 
 	/*
 	 * Call ss->fork().  This must happen after @child is linked on
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2dda..e84ce86 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1464,6 +1464,14 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->task_works = NULL;
 
 	/*
+	 * Ensure that the cgroup subsystem policies allow the new process to be
+	 * forked.
+	 */
+	retval = cgroup_can_fork(p);
+	if (retval)
+		goto bad_fork_free_pid;
+
+	/*
 	 * Make it visible to the rest of the system, but dont wake it up yet.
 	 * Need tasklist lock for parent etc handling!
 	 */
@@ -1499,7 +1507,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		spin_unlock(&current->sighand->siglock);
 		write_unlock_irq(&tasklist_lock);
 		retval = -ERESTARTNOINTR;
-		goto bad_fork_free_pid;
+		goto bad_fork_cgroup_cancel;
 	}
 
 	if (likely(p->pid)) {
@@ -1551,6 +1559,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 
 	return p;
 
+bad_fork_cgroup_cancel:
+	cgroup_cancel_fork(p);
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
-- 
2.3.1

^ permalink raw reply related	[flat|nested] 108+ messages in thread
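The two-phase admission protocol this patch wires into copy_process() -- every subsystem's can_fork() may charge and veto, and cancel_fork() unwinds those charges if the fork later fails -- can be sketched in userspace as follows. This is an illustrative model with invented names (`Subsys`, `make_limiter`), not kernel code:

```python
class Subsys:
    """Toy stand-in for struct cgroup_subsys with optional fork hooks."""
    def __init__(self, name, can_fork=None, cancel_fork=None):
        self.name = name
        self.can_fork = can_fork        # returns 0 on success, an errno otherwise
        self.cancel_fork = cancel_fork  # undoes whatever can_fork charged

def cgroup_can_fork(subsystems, task):
    """Mirror of cgroup_can_fork(): stop at the first subsystem that refuses."""
    for ss in subsystems:
        if ss.can_fork:
            err = ss.can_fork(task)
            if err:
                return err
    return 0

def cgroup_cancel_fork(subsystems, task):
    """Mirror of cgroup_cancel_fork(): undo charges after a late fork failure."""
    for ss in subsystems:
        if ss.cancel_fork:
            ss.cancel_fork(task)

def make_limiter(limit):
    """A counting subsystem in the spirit of nproc: admit at most `limit` tasks."""
    state = {"usage": 0}
    def can_fork(task):
        if state["usage"] + 1 > limit:
            return 11  # stands in for the patch's -EAGAIN
        state["usage"] += 1
        return 0
    def cancel_fork(task):
        state["usage"] -= 1
    return Subsys("nproc", can_fork, cancel_fork), state
```

In the patch, copy_process() calls cgroup_can_fork() before the task becomes visible on the task list and jumps to a cgroup_cancel_fork() label on any later failure path, so a charge is never leaked.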

* [PATCH v2 2/2] cgroups: add an nproc subsystem
  2015-02-27  4:17   ` Aleksa Sarai
  (?)
  (?)
@ 2015-02-27  4:17   ` Aleksa Sarai
  2015-03-02 15:22       ` Tejun Heo
  -1 siblings, 1 reply; 108+ messages in thread
From: Aleksa Sarai @ 2015-02-27  4:17 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

Adds a new single-purpose nproc subsystem to limit the number of
tasks that can run inside a cgroup. Essentially this is an
implementation of RLIMIT_NPROC that applies to a cgroup rather than
a process tree.

This is a step toward being able to limit the global impact of a fork
bomb inside a cgroup, allowing cgroups to perform a fairly basic kind
of resource limitation which they currently cannot.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  10 +++
 kernel/Makefile               |   1 +
 kernel/cgroup_nproc.c         | 198 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 213 insertions(+)
 create mode 100644 kernel/cgroup_nproc.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 98c4f9b..e83e0ac 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -47,6 +47,10 @@ SUBSYS(net_prio)
 SUBSYS(hugetlb)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_NPROC)
+SUBSYS(nproc)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index 9afb971..d6315fe 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1047,6 +1047,16 @@ config CGROUP_HUGETLB
 	  control group is tracked in the third page lru pointer. This means
 	  that we cannot use the controller with huge page less than 3 pages.
 
+config CGROUP_NPROC
+	bool "Process number limiting on cgroups"
+	depends on PAGE_COUNTER
+	help
+	  This option enables the setting of process number limits in the scope
+	  of a cgroup. Any attempt to fork more processes than are allowed in
+	  the cgroup will fail. This allows for basic resource limitation that
+	  applies to a cgroup, similar to RLIMIT_NPROC (except that instead of
+	  applying to a process tree it applies to a cgroup).
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/kernel/Makefile b/kernel/Makefile
index a59481a..10c4b40 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_NPROC) += cgroup_nproc.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_nproc.c b/kernel/cgroup_nproc.c
new file mode 100644
index 0000000..86de0fe
--- /dev/null
+++ b/kernel/cgroup_nproc.c
@@ -0,0 +1,198 @@
+/*
+ * Process number limiting subsys for cgroups.
+ *
+ * Copyright (C) 2015 Aleksa Sarai <cyphar@cyphar.com>
+ *
+ * Thanks to Frederic Weisbecker for creating the seminal patches which led to
+ * this being written.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/page_counter.h>
+
+struct nproc {
+	struct page_counter		proc_counter;
+	struct cgroup_subsys_state	css;
+};
+
+static inline struct nproc *css_nproc(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct nproc, css) : NULL;
+}
+
+static inline struct nproc *task_nproc(struct task_struct *task)
+{
+	return css_nproc(task_css(task, nproc_cgrp_id));
+}
+
+static struct nproc *parent_nproc(struct nproc *nproc)
+{
+	return css_nproc(nproc->css.parent);
+}
+
+static struct cgroup_subsys_state *nproc_css_alloc(struct cgroup_subsys_state *parent)
+{
+	struct nproc *nproc;
+
+	nproc = kzalloc(sizeof(struct nproc), GFP_KERNEL);
+	if (!nproc)
+		return ERR_PTR(-ENOMEM);
+
+	return &nproc->css;
+}
+
+static int nproc_css_online(struct cgroup_subsys_state *css)
+{
+	struct nproc *nproc = css_nproc(css);
+	struct nproc *parent = parent_nproc(nproc);
+
+	if (!parent) {
+		page_counter_init(&nproc->proc_counter, NULL);
+		return 0;
+	}
+
+	page_counter_init(&nproc->proc_counter, &parent->proc_counter);
+	return page_counter_limit(&nproc->proc_counter, parent->proc_counter.limit);
+}
+
+static void nproc_css_free(struct cgroup_subsys_state *css)
+{
+	kfree(css_nproc(css));
+}
+
+static inline void nproc_remove_procs(struct nproc *nproc, int num_procs)
+{
+	page_counter_uncharge(&nproc->proc_counter, num_procs);
+}
+
+static inline int nproc_add_procs(struct nproc *nproc, int num_procs)
+{
+	struct page_counter *fail_at;
+	int errcode;
+
+	errcode = page_counter_try_charge(&nproc->proc_counter, num_procs, &fail_at);
+	if (errcode)
+		return -EAGAIN;
+
+	return 0;
+}
+
+static int nproc_can_attach(struct cgroup_subsys_state *css,
+			    struct cgroup_taskset *tset)
+{
+	struct nproc *nproc = css_nproc(css);
+	unsigned long num_tasks = 0;
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, tset)
+		num_tasks++;
+
+	return nproc_add_procs(nproc, num_tasks);
+}
+
+static void nproc_cancel_attach(struct cgroup_subsys_state *css,
+				struct cgroup_taskset *tset)
+{
+	struct nproc *nproc = css_nproc(css);
+	unsigned long num_tasks = 0;
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, tset)
+		num_tasks++;
+
+	nproc_remove_procs(nproc, num_tasks);
+}
+
+static int nproc_can_fork(struct task_struct *task)
+{
+	struct nproc *nproc = task_nproc(task);
+
+	return nproc_add_procs(nproc, 1);
+}
+
+static void nproc_cancel_fork(struct task_struct *task)
+{
+	struct nproc *nproc = task_nproc(task);
+
+	nproc_remove_procs(nproc, 1);
+}
+
+static void nproc_exit(struct cgroup_subsys_state *css,
+		       struct cgroup_subsys_state *old_css,
+		       struct task_struct *task)
+{
+	struct nproc *nproc = css_nproc(old_css);
+
+	/*
+	 * cgroup_exit() gets called as part of the cleanup code when
+	 * copy_process() fails. This should be ignored, because the
+	 * nproc_cancel_fork callback already deals with the failed fork case.
+	 */
+	if (!(task->flags & PF_EXITING))
+		return;
+
+	nproc_remove_procs(nproc, 1);
+}
+
+static int nproc_write_limit(struct cgroup_subsys_state *css,
+			     struct cftype *cft, u64 val)
+{
+	struct nproc *nproc = css_nproc(css);
+
+	return page_counter_limit(&nproc->proc_counter, val);
+}
+
+static u64 nproc_read_limit(struct cgroup_subsys_state *css,
+			    struct cftype *cft)
+{
+	struct nproc *nproc = css_nproc(css);
+
+	return nproc->proc_counter.limit;
+}
+
+static u64 nproc_read_max_limit(struct cgroup_subsys_state *css,
+				       struct cftype *cft)
+{
+	return PAGE_COUNTER_MAX;
+}
+
+static u64 nproc_read_usage(struct cgroup_subsys_state *css,
+			    struct cftype *cft)
+{
+	struct nproc *nproc = css_nproc(css);
+
+	return page_counter_read(&nproc->proc_counter);
+}
+
+static struct cftype files[] = {
+	{
+		.name = "limit",
+		.write_u64 = nproc_write_limit,
+		.read_u64 = nproc_read_limit,
+	},
+	{
+		.name = "max_limit",
+		.read_u64 = nproc_read_max_limit,
+	},
+	{
+		.name = "usage",
+		.read_u64 = nproc_read_usage,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys nproc_cgrp_subsys = {
+	.css_alloc	= nproc_css_alloc,
+	.css_online	= nproc_css_online,
+	.css_free	= nproc_css_free,
+	.can_attach	= nproc_can_attach,
+	.cancel_attach	= nproc_cancel_attach,
+	.can_fork	= nproc_can_fork,
+	.cancel_fork	= nproc_cancel_fork,
+	.exit		= nproc_exit,
+	.legacy_cftypes	= files,
+	.early_init	= 0,
+};
-- 
2.3.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread
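The controller's accounting rides on page_counter, which charges hierarchically: nproc_can_fork() charges the task's cgroup and, through the parent links set up in nproc_css_online(), every ancestor, and a charge that hits any ancestor's limit unwinds whatever was already taken. A standalone sketch of that behaviour (`Counter` is an invented stand-in for struct page_counter, not the kernel API):

```python
class Counter:
    """Minimal model of struct page_counter's hierarchical charging."""
    def __init__(self, parent=None, limit=float("inf")):
        self.parent = parent
        self.limit = limit
        self.usage = 0

    def try_charge(self, n):
        """Charge self and every ancestor; roll back fully on any limit hit."""
        charged = []
        c = self
        while c is not None:
            if c.usage + n > c.limit:
                for done in charged:  # unwind partial charges on failure
                    done.usage -= n
                return False
            c.usage += n
            charged.append(c)
            c = c.parent
        return True

    def uncharge(self, n):
        """Uncharge self and every ancestor (nproc_remove_procs())."""
        c = self
        while c is not None:
            c.usage -= n
            c = c.parent

# nproc.limit = 2 on the parent; the child is capped by its ancestor.
root = Counter(limit=2)
child = Counter(parent=root)
```

With this model, two forks in `child` succeed and a third fails against the parent's limit, which is the check page_counter_try_charge() performs in nproc_add_procs().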

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
  2015-02-23  3:08 ` Aleksa Sarai
                   ` (3 preceding siblings ...)
  (?)
@ 2015-02-27 11:49 ` Tejun Heo
  2015-02-27 13:46     ` Richard Weinberger
  2015-02-27 16:42     ` Austin S Hemmelgarn
  -1 siblings, 2 replies; 108+ messages in thread
From: Tejun Heo @ 2015-02-27 11:49 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, fweisbec, linux-kernel, cgroups

Hello,

On Mon, Feb 23, 2015 at 02:08:09PM +1100, Aleksa Sarai wrote:
> The current state of resource limitation for the number of open
> processes (as well as the number of open file descriptors) requires you
> to use setrlimit(2), which means that you are limited to resource
> limiting process trees rather than resource limiting cgroups (which is
> the point of cgroups).
> 
> There was a patch to implement this in 2011[1], but that was rejected
> because it implemented a general-purpose rlimit subsystem -- which meant
> that you couldn't control distinct resource limits in different
> heirarchies. This patch implements a resource controller *specifically*
> for the number of processes in a cgroup, overcoming this issue.
> 
> There has been a similar attempt to implement a resource controller for
> the number of open file descriptors[2], which has not been merged
> becasue the reasons were dubious. Merely from a "sane interface"
> perspective, it should be possible to utilise cgroups to do such
> rudimentary resource management (which currently only exists for process
> trees).

This isn't a proper resource to control.  kmemcg just grew proper
reclaim support and will be usable to control the kernel side of memory
consumption.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-27 13:46     ` Richard Weinberger
  0 siblings, 0 replies; 108+ messages in thread
From: Richard Weinberger @ 2015-02-27 13:46 UTC (permalink / raw)
  To: Tejun Heo, Aleksa Sarai
  Cc: lizefan, mingo, peterz, fweisbec, linux-kernel, cgroups

Tejun,

Am 27.02.2015 um 12:49 schrieb Tejun Heo:
> This isn't a proper resource to control.  kmemcg just grew proper
> reclaim support and will be useable to control kernel side of memory
> consumption.

just to make sure that I understand the big picture.
The plan is to limit kernel memory per cgroup such that fork bombs and
stuff cannot harm other groups of processes?

Thanks,
//richard

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-27 13:52       ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-02-27 13:52 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Aleksa Sarai, lizefan, mingo, peterz, fweisbec, linux-kernel, cgroups

Hello,

On Fri, Feb 27, 2015 at 02:46:13PM +0100, Richard Weinberger wrote:
> just to make sure that I understand the big picture.
> The plan is to limit kernel memory per cgroup such that fork bombs and
> stuff cannot harm other groups of processes?

Yes, the kmem part of memcg hasn't really been functional because the
reclaim part was broken and (partly as a consequence) the kmem config
was siloed from the rest, but we're very close to solving that at this
point.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-27 16:42     ` Austin S Hemmelgarn
  0 siblings, 0 replies; 108+ messages in thread
From: Austin S Hemmelgarn @ 2015-02-27 16:42 UTC (permalink / raw)
  To: Tejun Heo, Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, fweisbec, linux-kernel, cgroups

On 2015-02-27 06:49, Tejun Heo wrote:
> Hello,
>
> On Mon, Feb 23, 2015 at 02:08:09PM +1100, Aleksa Sarai wrote:
>> The current state of resource limitation for the number of open
>> processes (as well as the number of open file descriptors) requires you
>> to use setrlimit(2), which means that you are limited to resource
>> limiting process trees rather than resource limiting cgroups (which is
>> the point of cgroups).
>>
>> There was a patch to implement this in 2011[1], but that was rejected
>> because it implemented a general-purpose rlimit subsystem -- which meant
>> that you couldn't control distinct resource limits in different
>> heirarchies. This patch implements a resource controller *specifically*
>> for the number of processes in a cgroup, overcoming this issue.
>>
>> There has been a similar attempt to implement a resource controller for
>> the number of open file descriptors[2], which has not been merged
>> becasue the reasons were dubious. Merely from a "sane interface"
>> perspective, it should be possible to utilise cgroups to do such
>> rudimentary resource management (which currently only exists for process
>> trees).
>
> This isn't a proper resource to control.  kmemcg just grew proper
> reclaim support and will be useable to control kernel side of memory
> consumption.
>
> Thanks.
>
Kernel memory consumption isn't the only valid reason to want to limit 
the number of processes in a cgroup.  Limiting the number of processes 
is very useful to ensure that a program is working correctly (for 
example, the NTP daemon should (usually) have an _exact_ number of 
children if it is functioning correctly, and rpcbind shouldn't (AFAIK) 
ever have _any_ children), to prevent PID number exhaustion, to head off 
DoS attacks against forking network servers before they get to the point 
of causing kmem exhaustion, and to limit the number of processes in a 
cgroup that uses lots of kernel memory very infrequently.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-27 17:06       ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-02-27 17:06 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Aleksa Sarai, lizefan, mingo, peterz, richard, fweisbec,
	linux-kernel, cgroups

Hello,

On Fri, Feb 27, 2015 at 11:42:10AM -0500, Austin S Hemmelgarn wrote:
> Kernel memory consumption isn't the only valid reason to want to limit the
> number of processes in a cgroup.  Limiting the number of processes is very
> useful to ensure that a program is working correctly (for example, the NTP
> daemon should (usually) have an _exact_ number of children if it is
> functioning correctly, and rpcbind shouldn't (AFAIK) ever have _any_
> children), to prevent PID number exhaustion, to head off DoS attacks against
> forking network servers before they get to the point of causing kmem
> exhaustion, and to limit the number of processes in a cgroup that uses lots
> of kernel memory very infrequently.

All the use cases you're listing are extremely niche and can be
trivially achieved without introducing another cgroup controller.  Not
only that, they're actually pretty silly.  Let's say NTP daemon is
misbehaving (or its code changed w/o you knowing or there are corner
cases which trigger extremely infrequently).  What do you exactly
achieve by rejecting its fork call?  It's just adding another
variation to the misbehavior.  It was misbehaving before and would now
be continuing to misbehave after a failed fork.

In general, I'm pretty strongly against adding controllers for things
which aren't fundamental resources in the system.  What's next?  Open
files?  Pipe buffer?  Number of flocks?  Number of session leaders or
program groups?

If you want to prevent a certain class of jobs from exhausting a given
resource, protecting that resource is the obvious thing to do.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
  2015-02-27 16:42     ` Austin S Hemmelgarn
  (?)
  (?)
@ 2015-02-27 17:12     ` Tim Hockin
  2015-02-27 17:15         ` Tejun Heo
  -1 siblings, 1 reply; 108+ messages in thread
From: Tim Hockin @ 2015-02-27 17:12 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Tejun Heo, Aleksa Sarai, lizefan, mingo, peterz, richard,
	Frédéric Weisbecker, Linux Kernel Mailing List,
	cgroups

On Fri, Feb 27, 2015 at 8:42 AM, Austin S Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2015-02-27 06:49, Tejun Heo wrote:
>>
>> Hello,
>>
>> On Mon, Feb 23, 2015 at 02:08:09PM +1100, Aleksa Sarai wrote:
>>>
>>> The current state of resource limitation for the number of open
>>> processes (as well as the number of open file descriptors) requires you
>>> to use setrlimit(2), which means that you are limited to resource
>>> limiting process trees rather than resource limiting cgroups (which is
>>> the point of cgroups).
>>>
>>> There was a patch to implement this in 2011[1], but that was rejected
>>> because it implemented a general-purpose rlimit subsystem -- which meant
>>> that you couldn't control distinct resource limits in different
>>> heirarchies. This patch implements a resource controller *specifically*
>>> for the number of processes in a cgroup, overcoming this issue.
>>>
>>> There has been a similar attempt to implement a resource controller for
>>> the number of open file descriptors[2], which has not been merged
>>> becasue the reasons were dubious. Merely from a "sane interface"
>>> perspective, it should be possible to utilise cgroups to do such
>>> rudimentary resource management (which currently only exists for process
>>> trees).
>>
>>
>> This isn't a proper resource to control.  kmemcg just grew proper
>> reclaim support and will be useable to control kernel side of memory
>> consumption.

I was told that the plan was to use kmemcg - but I was told that YEARS
AGO.  In the meantime we all either do our own thing or we do nothing
and suffer.

Something like this is long overdue, IMO, and is still more
appropriate and obvious than kmemcg anyway.


>> Thanks.
>>
> Kernel memory consumption isn't the only valid reason to want to limit the
> number of processes in a cgroup.  Limiting the number of processes is very
> useful to ensure that a program is working correctly (for example, the NTP
> daemon should (usually) have an _exact_ number of children if it is
> functioning correctly, and rpcbind shouldn't (AFAIK) ever have _any_
> children), to prevent PID number exhaustion, to head off DoS attacks against
> forking network servers before they get to the point of causing kmem
> exhaustion, and to limit the number of processes in a cgroup that uses lots
> of kernel memory very infrequently.
>

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-27 17:15         ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-02-27 17:15 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Austin S Hemmelgarn, Aleksa Sarai, lizefan, mingo, peterz,
	richard, Frédéric Weisbecker,
	Linux Kernel Mailing List, cgroups

On Fri, Feb 27, 2015 at 09:12:45AM -0800, Tim Hockin wrote:
> I was told that the plan was to use kmemcg - but I was told that YEARS
> AGO.  In the mean time we all either do our own thing or we do nothing
> and suffer.

Wasn't it like a year ago?  Yeah, it's taking longer than everybody
hoped, but seriously, the kmemcg reclaimer just got merged, as did the
new memcg interface which will tie kmemcg and memcg together.

> Something like this is long overdue, IMO, and is still more
> appropriate and obvious than kmemcg anyway.

Thanks for chiming in again but if you aren't bringing out anything
new to the table (I don't remember you doing that last time either),
I'm not sure why the decision would be different this time.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-27 17:25         ` Tim Hockin
  0 siblings, 0 replies; 108+ messages in thread
From: Tim Hockin @ 2015-02-27 17:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Austin S Hemmelgarn, Aleksa Sarai, Li Zefan, mingo,
	Peter Zijlstra, richard, Frédéric Weisbecker,
	linux-kernel, Cgroups

On Fri, Feb 27, 2015 at 9:06 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Fri, Feb 27, 2015 at 11:42:10AM -0500, Austin S Hemmelgarn wrote:
>> Kernel memory consumption isn't the only valid reason to want to limit the
>> number of processes in a cgroup.  Limiting the number of processes is very
>> useful to ensure that a program is working correctly (for example, the NTP
>> daemon should (usually) have an _exact_ number of children if it is
>> functioning correctly, and rpcbind shouldn't (AFAIK) ever have _any_
>> children), to prevent PID number exhaustion, to head off DoS attacks against
>> forking network servers before they get to the point of causing kmem
>> exhaustion, and to limit the number of processes in a cgroup that uses lots
>> of kernel memory very infrequently.
>
> All the use cases you're listing are extremely niche and can be
> trivially achieved without introducing another cgroup controller.  Not
> only that, they're actually pretty silly.  Let's say NTP daemon is
> misbehaving (or its code changed w/o you knowing or there are corner
> cases which trigger extremely infrequently).  What do you exactly
> achieve by rejecting its fork call?  It's just adding another
> variation to the misbehavior.  It was misbehaving before and would now
> be continuing to misbehave after a failed fork.
>
> In general, I'm pretty strongly against adding controllers for things
> which aren't fundamental resources in the system.  What's next?  Open
> files?  Pipe buffer?  Number of flocks?  Number of session leaders or
> program groups?

Yes to some or all of those.  We do exactly this internally and it has
greatly added to the stability of our overall container management
system.  And while you have been telling everyone to wait for kmemcg,
we have had an extra 3+ years of stability.

> If you want to prevent a certain class of jobs from exhausting a given
> resource, protecting that resource is the obvious thing to do.

I don't follow your argument - isn't this exactly what this patch set
is doing - protecting resources?

> Wasn't it like a year ago?  Yeah, it's taking longer than everybody
> hoped but seriously kmemcg reclaimer just got merged and also did the
> new memcg interface which will tie kmemcg and memcg together.

By my email it was almost 2 years ago, and that was the second or
third incarnation of this patch.

>> Something like this is long overdue, IMO, and is still more
>> appropriate and obvious than kmemcg anyway.
>
> Thanks for chiming in again but if you aren't bringing out anything
> new to the table (I don't remember you doing that last time either),
> I'm not sure why the decision would be different this time.

I'm just vocalizing my support for this idea in defense of practical
solutions that work NOW instead of "engineering ideals" that never
actually arrive.

As containers take the server world by storm, stuff like this gets
more and more important.

Tim

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
  2015-02-27 17:25         ` Tim Hockin
  (?)
@ 2015-02-27 17:45         ` Tejun Heo
  2015-02-27 17:56             ` Tejun Heo
  2015-02-27 21:45             ` Tim Hockin
  -1 siblings, 2 replies; 108+ messages in thread
From: Tejun Heo @ 2015-02-27 17:45 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Austin S Hemmelgarn, Aleksa Sarai, Li Zefan, mingo,
	Peter Zijlstra, richard, Frédéric Weisbecker,
	linux-kernel, Cgroups

On Fri, Feb 27, 2015 at 09:25:10AM -0800, Tim Hockin wrote:
> > In general, I'm pretty strongly against adding controllers for things
> > which aren't fundamental resources in the system.  What's next?  Open
> > files?  Pipe buffer?  Number of flocks?  Number of session leaders or
> > program groups?
> 
> Yes to some or all of those.  We do exactly this internally and it has
> greatly added to the stability of our overall container management
> system.  and while you have been telling everyone to wait for kmemcg,
> we have had an extra 3+ years of stability.

Yeah, good job.  I totally get why kernel part of memory consumption
needs protection.  I'm not arguing against that at all.

> > If you want to prevent a certain class of jobs from exhausting a given
> > resource, protecting that resource is the obvious thing to do.
> 
> I don't follow your argument - isn't this exactly what this patch set
> is doing - protecting resources?

If you have proper protection over kernel memory consumption, this is
completely covered because memory is the fundamental resource here.
Controlling distribution of those fundamental resources is what
cgroups are primarily about.

> > Wasn't it like a year ago?  Yeah, it's taking longer than everybody
> > hoped but seriously kmemcg reclaimer just got merged and also did the
> > new memcg interface which will tie kmemcg and memcg together.
> 
> By my email it was almost 2 years ago, and that was the second or
> third incarnation of this patch.

Again, I agree this is taking a while.  Memory people had to retool
the whole reclamation path to make this work, which is the pattern
being repeated across the different controllers - we're refactoring a
lot of infrastructure code so that resource control can integrate with
the regular operation of the kernel, which BTW is what we should have
been doing from the beginning.

If your complaint is that this is taking too long, I hear you, and
there's a certain amount of validity in arguing that upstreaming a
temporary measure is the better trade-off, but the rationale for nproc
(or nfds, or virtual memory, whatever) has been pretty weak otherwise.

As for the different incarnations of this patchset: reposting the
same stuff repeatedly doesn't really change anything.  Why would it?

> >> Something like this is long overdue, IMO, and is still more
> >> appropriate and obvious than kmemcg anyway.
> >
> > Thanks for chiming in again but if you aren't bringing out anything
> > new to the table (I don't remember you doing that last time either),
> > I'm not sure why the decision would be different this time.
> 
> I'm just vocalizing my support for this idea in defense of practical
> solutions that work NOW instead of "engineering ideals" that never
> actually arrive.
> 
> As containers take the server world by storm, stuff like this gets
> more and more important.

Again, protection of kernel side memory consumption is important.
There's no question about that.  As for the never-arriving part, well,
it is arriving.  If you still don't believe it, just take a look at
the code.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-27 17:56             ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-02-27 17:56 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Austin S Hemmelgarn, Aleksa Sarai, Li Zefan, mingo,
	Peter Zijlstra, richard, Frédéric Weisbecker,
	linux-kernel, Cgroups

On Fri, Feb 27, 2015 at 12:45:03PM -0500, Tejun Heo wrote:
> If your complaint is that this is taking too long, I hear you, and
> there's a certain amount of validity in arguing that upstreaming a
> temporary measure is the better trade-off, but the rationale for nproc
> (or nfds, or virtual memory, whatever) has been pretty weak otherwise.

Also, note that this is a subset of a larger problem.  e.g. there's a
patchset trying to implement writeback IO control from the filesystem
layer.  cgroup control of writeback has been a thorny issue for over
three years now and the rationale for implementing this reversed
controlling scheme is about the same - doing it properly is too
difficult, let's bolt something on the top as a practical measure.

I think it'd be seriously short-sighted to give in and merge all
those.  These sorts of shortcuts are crippling in the long term.
Again, similarly, proper cgroup writeback support is literally right
around the corner.

The situation sure can be frustrating if you need something now but we
can't make decisions solely on that.  This is a much longer-term
project and we'd better, for once, get things right.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-27 18:49         ` Austin S Hemmelgarn
  0 siblings, 0 replies; 108+ messages in thread
From: Austin S Hemmelgarn @ 2015-02-27 18:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Aleksa Sarai, lizefan, mingo, peterz, richard, fweisbec,
	linux-kernel, cgroups

On 2015-02-27 12:06, Tejun Heo wrote:
> Hello,
>
> On Fri, Feb 27, 2015 at 11:42:10AM -0500, Austin S Hemmelgarn wrote:
>> Kernel memory consumption isn't the only valid reason to want to limit the
>> number of processes in a cgroup.  Limiting the number of processes is very
>> useful to ensure that a program is working correctly (for example, the NTP
>> daemon should (usually) have an _exact_ number of children if it is
>> functioning correctly, and rpcbind shouldn't (AFAIK) ever have _any_
>> children), to prevent PID number exhaustion, to head off DoS attacks against
>> forking network servers before they get to the point of causing kmem
>> exhaustion, and to limit the number of processes in a cgroup that uses lots
>> of kernel memory very infrequently.
>
> All the use cases you're listing are extremely niche and can be
> trivially achieved without introducing another cgroup controller.  Not
> only that, they're actually pretty silly.  Let's say NTP daemon is
> misbehaving (or its code changed w/o you knowing or there are corner
> cases which trigger extremely infrequently).  What do you exactly
> achieve by rejecting its fork call?  It's just adding another
> variation to the misbehavior.  It was misbehaving before and would now
> be continuing to misbehave after a failed fork.
>
I wouldn't think that preventing PID exhaustion is all that much 
of a niche case.  It's entirely possible for it to happen without using 
excessive amounts of kernel memory (think about BIG server systems with 
terabytes of memory running (arguably poorly written) forking servers 
that handle tens of thousands of client requests per second, each 
lasting multiple tens of seconds), and it's not necessarily as trivial 
as you might think to handle sanely (especially if you want callbacks 
when the limits get hit).
As far as being trivial to achieve, I'm assuming you are referring to 
rlimit and PAM's limits module, both of which have their own issues. 
Using pam_limits.so to limit processes isn't trivial because it requires 
calling through PAM to begin with, which almost no software that isn't 
login related does, and rlimits are tricky to set up properly with the 
granularity that having a cgroup would provide.
> In general, I'm pretty strongly against adding controllers for things
> which aren't fundamental resources in the system.  What's next?  Open
> files?  Pipe buffer?  Number of flocks?  Number of session leaders or
> program groups?
>
PIDs are a fundamental resource: you run out, and it's an only 
marginally better situation than OOM.  Namely, if you don't already have 
a shell open which has kill as a builtin (because you can't fork), or 
have some other reliable way to terminate processes without forking, you 
are stuck either waiting for the problem to resolve itself or having to 
reset the system.
> If you want to prevent a certain class of jobs from exhausting a given
> resource, protecting that resource is the obvious thing to do.
>
Which is why I'm advocating something that provides a more robust method 
of preventing the system from exhausting PID numbers.
> Thanks.
>


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-27 19:35           ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-02-27 19:35 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Aleksa Sarai, lizefan, mingo, peterz, richard, fweisbec,
	linux-kernel, cgroups

Hello, Austin.

On Fri, Feb 27, 2015 at 01:49:53PM -0500, Austin S Hemmelgarn wrote:
> As far as being trivial to achieve, I'm assuming you are referring to rlimit
> and PAM's limits module, both of which have their own issues. Using
> pam_limits.so to limit processes isn't trivial because it requires calling
> through PAM to begin with, which almost no software that isn't login related
> does, and rlimits are tricky to set up properly with the granularity that
> having a cgroup would provide.
...
> PID's are a fundamental resource, you run out and it's an only marginally
> better situation than OOM, namely, if you don't already have a shell open
> which has kill builtin (because you can't fork), or have some other reliable
> way to terminate processes without forking, you are stuck either waiting for
> the problem to resolve itself, or have to reset the system.

Right, this is a much more valid argument.  Currently, we're capping
max pid at 4M, which translates to some tens of gigs of memory, which
isn't a crazy amount on modern machines.  The hard(er) barrier would
be around 2^30 (2^29 from the futex side, apparently), which would also
be reachable on configurations w/ terabytes of memory.

I'll think more about it and get back.

Thanks a lot.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-27 21:45             ` Tim Hockin
  0 siblings, 0 replies; 108+ messages in thread
From: Tim Hockin @ 2015-02-27 21:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Austin S Hemmelgarn, Aleksa Sarai, Li Zefan, mingo,
	Peter Zijlstra, richard, Frédéric Weisbecker,
	linux-kernel, Cgroups

On Fri, Feb 27, 2015 at 9:45 AM, Tejun Heo <tj@kernel.org> wrote:
> On Fri, Feb 27, 2015 at 09:25:10AM -0800, Tim Hockin wrote:
>> > In general, I'm pretty strongly against adding controllers for things
>> > which aren't fundamental resources in the system.  What's next?  Open
>> > files?  Pipe buffer?  Number of flocks?  Number of session leaders or
>> > program groups?
>>
>> Yes to some or all of those.  We do exactly this internally and it has
>> greatly added to the stability of our overall container management
>> system.  and while you have been telling everyone to wait for kmemcg,
>> we have had an extra 3+ years of stability.
>
> Yeah, good job.  I totally get why kernel part of memory consumption
> needs protection.  I'm not arguing against that at all.

You keep shifting the focus to be about memory, but that's not what
people are asking for.  You're letting the desire for a perfect
solution (which is years late) block good solutions that exist NOW.

>> > If you want to prevent a certain class of jobs from exhausting a given
>> > resource, protecting that resource is the obvious thing to do.
>>
>> I don't follow your argument - isn't this exactly what this patch set
>> is doing - protecting resources?
>
> If you have proper protection over kernel memory consumption, this is
> completely covered because memory is the fundamental resource here.
> Controlling distribution of those fundamental resources is what
> cgroups are primarily about.

You say that's what cgroups are about, but it's not at all obvious
that you are right.  What users, admins, systems people want is
building blocks that are usable and make sense.  Limiting kernel
memory is NOT the logical building block here.  It's not something
people can reason about or quantify easily.  If you need to implement
the interfaces in terms of memory, go nuts, but making users think
like that is just not right.

>> > Wasn't it like a year ago?  Yeah, it's taking longer than everybody
>> > hoped but seriously kmemcg reclaimer just got merged and also did the
>> > new memcg interface which will tie kmemcg and memcg together.
>>
>> By my email it was almost 2 years ago, and that was the second or
>> third incarnation of this patch.
>
> Again, I agree this is taking a while.  Memory people had to retool
> the whole reclamation path to make this work, which is the pattern
> being repeated across the different controllers - we're refactoring a
> lot of infrastructure code so that resource control can integrate with
> the regular operation of the kernel, which BTW is what we should have
> been doing from the beginning.
>
> If your complaint is that this is taking too long, I hear you, and
> there's a certain amount of validity in arguing that upstreaming a
> temporary measure is the better trade-off, but the rationale for nproc
> (or nfds, or virtual memory, whatever) has been pretty weak otherwise.

The fact that at least 3 or 4 people have INDEPENDENTLY decided this
is what is causing them pain, tried to fix it, and invested the time
to send a patch says that it is actually a thing.  There exists a
problem that you are disallowing to be fixed.  Do you recognize that
users are
you are disallowing to be fixed.  Do you recognize that users are
experiencing pain?  Why do you hate your users? :)

> And as for the different incarnations of this patchset.  Reposting the
> same stuff repeatedly doesn't really change anything.  Why would it?

Because reasonable people might survey the ecosystem and say "humm,
things have changed over the years - isolation has become a pretty
serious topic".  Or maybe they hope that you'll finally agree that
fixing the problem NOW is worthwhile, even if the solution is
imperfect, and that a more perfect solution will arrive.

>> >> Something like this is long overdue, IMO, and is still more
>> >> appropriate and obvious than kmemcg anyway.
>> >
>> > Thanks for chiming in again but if you aren't bringing out anything
>> > new to the table (I don't remember you doing that last time either),
>> > I'm not sure why the decision would be different this time.
>>
>> I'm just vocalizing my support for this idea in defense of practical
>> solutions that work NOW instead of "engineering ideals" that never
>> actually arrive.
>>
>> As containers take the server world by storm, stuff like this gets
>> more and more important.
>
> Again, protection of kernel side memory consumption is important.
> There's no question about that.  As for the never-arriving part, well,
> it is arriving.  If you still can't believe, just take a look at the
> code.

Are you willing to put a drop-dead date on it?  If we don't have
kmemcg working well enough to _actually_ bound PID usage and FD usage
by, say, June 1st, will you then accept a patch to this effect?  If
the answer is no, then I have zero faith that it's coming any time
soon - I heard this 2 years ago.  I believed you then.

I see further downthread that you said you'll think about it.  Thank
you.  Just because our use cases are not normal does not mean we're
not valid :)

Tim

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
  2015-02-27 21:45             ` Tim Hockin
  (?)
@ 2015-02-27 21:49             ` Tejun Heo
       [not found]               ` <CAAAKZwsCc8BtFx58KMFpRTohU81oCBeGVOPGMJrjJt9q5upKfQ@mail.gmail.com>
  -1 siblings, 1 reply; 108+ messages in thread
From: Tejun Heo @ 2015-02-27 21:49 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Austin S Hemmelgarn, Aleksa Sarai, Li Zefan, mingo,
	Peter Zijlstra, richard, Frédéric Weisbecker,
	linux-kernel, Cgroups

On Fri, Feb 27, 2015 at 01:45:09PM -0800, Tim Hockin wrote:
> Are you willing to put a drop-dead date on it?  If we don't have
> kmemcg working well enough to _actually_ bound PID usage and FD usage
> by, say, June 1st, will you then accept a patch to this effect?  If
> the answer is no, then I have zero faith that it's coming any time
> soon - I heard this 2 years ago.  I believed you then.

Tim, cut this bullshit.  That's not how kernel development works.
Contribute to technical discussion or shut it.  I'm really getting
tired of your whining without any useful substance.

> I see further downthread that you said you'll think about it.  Thank
> you.  Just because our use cases are not normal does not mean we're
> not valid :)

And can you even see why that made progress?

-- 
tejun


* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-28  9:26           ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-02-28  9:26 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Tejun Heo, lizefan, mingo, peterz, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

> I wouldn't think that preventing PID exhaustion would be all that much of a
> niche case, it's fully possible for it to happen without using excessive
> amounts of kernel memory (think about BIG server systems with terabytes of
> memory running (arguably poorly written) forking servers that handle tens of
> thousands of client requests per second, each lasting multiple tens of
> seconds), and not necessarily as trivial as you might think to handle sanely
> (especially if you want callbacks when the limits get hit).
> As far as being trivial to achieve, I'm assuming you are referring to rlimit
> and PAM's limits module, both of which have their own issues. Using
> pam_limits.so to limit processes isn't trivial because it requires calling
> through PAM to begin with, which almost no software that isn't login related
> does, and rlimits are tricky to set up properly with the granularity that
> having a cgroup would provide.

I just want to quickly echo my support for this statement. Process IDs
aren't limited by kernel memory; they're capped by a hard-set limit.
Thus they are a resource like other global resources (open files, etc.).
Now, while you can argue that it is possible to limit the number of
*effective* processes in a cgroup through kmemcg (by limiting the amount
of memory spent storing task_struct data), that isn't limiting the
usage of the *actual* resource (the fact that you're limiting the number
of PIDs is little more than a by-product).

Also, if it wasn't an actual resource, then why is RLIMIT_NPROC a thing?
To me, that indicates that PID limiting is not an esoteric use case and
it should be possible to use the Linux kernel's home-grown accounting
system to limit the number of PIDs in a cgroup. Otherwise you're stuck
in a weird world where you *can* limit the number of processes in a
process tree but *not* the number of processes in a cgroup.

>> In general, I'm pretty strongly against adding controllers for things
>> which aren't fundamental resources in the system.  What's next?  Open
>> files?  Pipe buffer?  Number of flocks?  Number of session leaders or
>> program groups?
>>
> PIDs are a fundamental resource: you run out and it's an only marginally
> better situation than OOM.  If you don't already have a shell open
> which has a kill builtin (because you can't fork), or some other reliable
> way to terminate processes without forking, you are stuck either waiting
> for the problem to resolve itself or having to reset the system.

I couldn't agree more. PIDs are a fundamental resource because there is
a hard limit on the number of PIDs you can have on any one system. Once
you've exhausted that limit, there's not much you can do apart from
doing the SYSRQ dance.

-- 
Aleksa Sarai (cyphar)
www.cyphar.com



* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-28 11:59             ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-02-28 11:59 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Austin S Hemmelgarn, lizefan, mingo, peterz, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

Hello, Aleksa.

On Sat, Feb 28, 2015 at 08:26:34PM +1100, Aleksa Sarai wrote:
> I just want to quickly echo my support for this statement. Process IDs
> aren't limited by kernel memory, they're a hard-set limit. Thus they are

Process IDs become a hard global resource because we didn't switch to
long during the 64bit transition and put an artificial global limit on
them, which allows them to affect system-wide operation while their
memory consumption stays within a practical range.

> a resource like other global resources (open files, etc). Now, while you

Unlike open files.

> can argue that it is possible to limit the amount of *effective*
> processes you can use in a cgroup through kmemcg (by limiting the amount
> of memory spent in storing task_struct data) -- that isn't limiting the
> usage of the *actual* resource (the fact you're limiting the number of
> PIDs is little more than a by-product).

No, the problem is not that.  The problem is that pid_t, as a
resource, is decoupled from its backing resource - memory - by the
extra artificial and difficult-to-overcome limit put on it.  You are
saying something completely different from what Austin was arguing.

> Also, if it wasn't an actual resource, then why is RLIMIT_NPROC a thing?

One strong reason would be because we didn't have a way to account for
and limit the fundamental resources.  If you can fully contain and
control the consumption via rationing the underlying resource, there
isn't much point in controlling the upper layer constructs.

> To me, that indicates that PID limiting is not an esoteric use case and it
> should be possible to use the Linux kernel's home-grown accounting
> system to limit the number of PIDs in a cgroup. Otherwise you're stuck

Again, I think it's a lot more indicative of the fact that we didn't
have any way to control kernel-side memory consumption, and that pids
and open files were among the things for which limits are relatively
easy to implement policy-wise.

> in a weird world where you *can* limit the number of processes in a
> process tree but *not* the number of processes in a cgroup.

I'm not sold on the idea of replicating the features of ulimit in
cgroups.  ulimit is a mixed bag of relatively easily implementable
resource limits whose behaviors are a combination of resource
limits, per-user usage policies, and per-process behavior safety nets.
The only part translatable to cgroups is the actual resource-related
part, and even among those we should identify which are actual
resources that can't be mapped to consumption of other fundamental
resources.

> >> In general, I'm pretty strongly against adding controllers for things
> >> which aren't fundamental resources in the system.  What's next?  Open
> >> files?  Pipe buffer?  Number of flocks?  Number of session leaders or
> >> program groups?
> >>
> > PIDs are a fundamental resource: you run out and it's an only marginally
> > better situation than OOM.  If you don't already have a shell open
> > which has a kill builtin (because you can't fork), or some other reliable
> > way to terminate processes without forking, you are stuck either waiting
> > for the problem to resolve itself or having to reset the system.
> 
> I couldn't agree more. PIDs are a fundamental resource because there is
> a hard limit on the amount of PIDs you can have in any one system. Once
> you've exhausted that limit, there's not much you can do apart from
> doing the SYSRQ dance.

The reason this holds is that we can hit the global limit way earlier
than practically sized kmem consumption limits can kick in.

Thanks.

-- 
tejun



* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-28 16:43                 ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-02-28 16:43 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Frederic Weisbecker, Austin S Hemmelgarn, lizefan, richard,
	mingo, Aleksa Sarai, cgroups, peterz, linux-kernel

Hello, Tim.

On Sat, Feb 28, 2015 at 08:38:07AM -0800, Tim Hockin wrote:
> I know there is not much concern for legacy-system problems, but it is
> worth adding this case - there are systems that limit PIDs for other
> reasons, e.g. broken infrastructure that assumes PIDs fit in a short int,
> hypothetically.  Given such a system, PIDs become precious and limiting
> them per job is important.
>
> My main point being that there are less obvious considerations in play than
> just memory usage.

Sure, there are those cases but it'd be unwise to hinge long-term
decisions on them.  It's hard to argue that 16bit pids in legacy code
are a significant contributing factor at this point.  At any rate, it
seems that pid is a global resource which needs to be provisioned for
reasonable isolation, which is a good reason to consider controlling it
via cgroups.

Thanks.

-- 
tejun



* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
       [not found]               ` <CAAAKZwsCc8BtFx58KMFpRTohU81oCBeGVOPGMJrjJt9q5upKfQ@mail.gmail.com>
@ 2015-02-28 16:57                 ` Tejun Heo
  2015-02-28 22:26                     ` Tim Hockin
  0 siblings, 1 reply; 108+ messages in thread
From: Tejun Heo @ 2015-02-28 16:57 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Frederic Weisbecker, Austin S Hemmelgarn, Li Zefan,
	Peter Zijlstra, linux-kernel, Cgroups, Aleksa Sarai, richard,
	mingo

On Sat, Feb 28, 2015 at 08:48:12AM -0800, Tim Hockin wrote:
> I am sorry that real-user problems are not perceived as substantial.  This
> was/is a real issue for us.  Being in limbo for years on end might not be a
> technical point, but I do think it matters, and that was my point.

It's a problem which is localized to you and caused by the specific
problems of your setup.  This isn't a wide-spread problem at all and
the world doesn't revolve around you.  If your setup is so messed up
as to require sticking to 16bit pids, handle that locally.  If
something at larger scale eases that handling, you get lucky.  If not,
it's *your* predicament to deal with.  The rest of the world doesn't
exist to wipe your ass.

-- 
tejun


* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-28 22:26                     ` Tim Hockin
  0 siblings, 0 replies; 108+ messages in thread
From: Tim Hockin @ 2015-02-28 22:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Austin S Hemmelgarn, Li Zefan,
	Peter Zijlstra, linux-kernel, Cgroups, Aleksa Sarai, richard,
	mingo

On Sat, Feb 28, 2015 at 8:57 AM, Tejun Heo <tj@kernel.org> wrote:
>
> On Sat, Feb 28, 2015 at 08:48:12AM -0800, Tim Hockin wrote:
> > I am sorry that real-user problems are not perceived as substantial.  This
> > was/is a real issue for us.  Being in limbo for years on end might not be a
> > technical point, but I do think it matters, and that was my point.
>
> It's a problem which is localized to you and caused by the specific
> problems of your setup.  This isn't a wide-spread problem at all and
> the world doesn't revolve around you.  If your setup is so messed up
> as to require sticking to 16bit pids, handle that locally.  If
> something at larger scale eases that handling, you get lucky.  If not,
> it's *your* predicament to deal with.  The rest of the world doesn't
> exist to wipe your ass.

Wow, so much anger.  I'm not even sure how to respond, so I'll just
say this and sign off.  All I want is a better, friendlier, more
useful system overall.  We clearly have different ways of looking at
the problem.

No antagonism intended

Tim



* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-28 22:50                       ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-02-28 22:50 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Frederic Weisbecker, Austin S Hemmelgarn, Li Zefan,
	Peter Zijlstra, linux-kernel, Cgroups, Aleksa Sarai, richard,
	mingo

On Sat, Feb 28, 2015 at 02:26:58PM -0800, Tim Hockin wrote:
> Wow, so much anger.  I'm not even sure how to respond, so I'll just
> say this and sign off.  All I want is a better, friendlier, more
> useful system overall.  We clearly have different ways of looking at
> the problem.

Can you communicate anything w/o passive aggression?  If you have a
technical point, just state that.  Can you at least agree that we
shouldn't be making design decisions based on 16bit pid_t?

-- 
tejun



* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-02-28 23:11                       ` Johannes Weiner
  0 siblings, 0 replies; 108+ messages in thread
From: Johannes Weiner @ 2015-02-28 23:11 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Tejun Heo, Frederic Weisbecker, Austin S Hemmelgarn, Li Zefan,
	Peter Zijlstra, linux-kernel, Cgroups, Aleksa Sarai, richard,
	mingo

On Sat, Feb 28, 2015 at 02:26:58PM -0800, Tim Hockin wrote:
> On Sat, Feb 28, 2015 at 8:57 AM, Tejun Heo <tj@kernel.org> wrote:
> >
> > On Sat, Feb 28, 2015 at 08:48:12AM -0800, Tim Hockin wrote:
> > > I am sorry that real-user problems are not perceived as substantial.  This
> > > was/is a real issue for us.  Being in limbo for years on end might not be a
> > > technical point, but I do think it matters, and that was my point.
> >
> > It's a problem which is localized to you and caused by the specific
> > problems of your setup.  This isn't a wide-spread problem at all and
> > the world doesn't revolve around you.  If your setup is so messed up
> > as to require sticking to 16bit pids, handle that locally.  If
> > something at larger scale eases that handling, you get lucky.  If not,
> > it's *your* predicament to deal with.  The rest of the world doesn't
> > exist to wipe your ass.
> 
> Wow, so much anger.

Yeah, quite surprising after such an intellectually honest discussion:

: On Fri, Feb 27, 2015 at 01:45:09PM -0800, Tim Hockin wrote:
: > At least 3 or 4 people have INDEPENDENTLY decided this is what is
: > causing them pain, tried to fix it, and invested the time to send a
: > patch; that says it is actually a thing.  There exists a problem that
: > you are not allowing to be fixed.  Do you recognize that users are
: > experiencing pain?  Why do you hate your users? :)

[...]

: > Are you willing to put a drop-dead date on it?  If we don't have
: > kmemcg working well enough to _actually_ bound PID usage and FD usage
: > by, say, June 1st, will you then accept a patch to this effect?  If
: > the answer is no, then I have zero faith that it's coming any time
: > soon - I heard this 2 years ago.  I believed you then.

> I'm not even sure how to respond, so I'll just say this and sign
> off.  All I want is a better, friendlier, more useful system
> overall.  We clearly have different ways of looking at the problem.

Overlapping features and inconsistent userspace interfaces are only
better for the people that pick the hacks.  They are the opposite of
friendly and useful.  They are also horrible to maintain, which could
be a reason why you constantly disagree with the people that cleaned
up this unholy mess and are now trying to keep a balance between your
short-term interests and the long-term health of the Linux kernel.



* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-03-01  4:46                         ` Tim Hockin
  0 siblings, 0 replies; 108+ messages in thread
From: Tim Hockin @ 2015-03-01  4:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Austin S Hemmelgarn, Li Zefan, Peter Zijlstra,
	Frederic Weisbecker, linux-kernel, Aleksa Sarai, Cgroups, mingo,
	richard

On Feb 28, 2015 2:50 PM, "Tejun Heo" <tj@kernel.org> wrote:
>
> On Sat, Feb 28, 2015 at 02:26:58PM -0800, Tim Hockin wrote:
> > Wow, so much anger.  I'm not even sure how to respond, so I'll just
> > say this and sign off.  All I want is a better, friendlier, more
> > useful system overall.  We clearly have different ways of looking at
> > the problem.
>
> Can you communicate anything w/o passive aggression?  If you have a
> technical point, just state that.  Can you at least agree that we
> shouldn't be making design decisions based on 16bit pid_t?

Hmm, I have screwed this thread up, I think.  I've made some remarks
that did not come through with the proper tongue-in-cheek slant.  I'm
not being passive aggressive - we DO look at this problem differently.
OF COURSE we should not make decisions based on ancient artifacts of
history.  My point was that there are secondary considerations here -
PIDs are more than just the memory that backs them.  They _ARE_ a
constrained resource, and you shouldn't assume the constraint is just
physical memory.  It is a piece of policy that is outside the control
of the kernel proper - we handed those keys to userspace a long time
ago.

Given that, I believe and have believed that the solution should model
the problem as the user perceives it - limiting PIDs - rather than
attaching to a solution-by-proxy.

Yes a solution here partially overlaps with kmemcg, but I don't think
that is a significant problem.  They are different policies governing
behavior that may result in the same condition, but for very different
reasons.  I do not think that is particularly bad for overall
comprehension, and I think the fact that this popped up yet again
indicates the existence of some nugget of user experience that is
worth paying consideration to.

I appreciate your promised consideration through a slightly refocused
lens.  I will go back to my cave and do something I hope is more
productive and less antagonistic.  I did not mean to bring out so much
vitriol.

Tim

^ permalink raw reply	[flat|nested] 108+ messages in thread


* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
  2015-02-28 16:43                 ` Tejun Heo
  (?)
@ 2015-03-02 13:13                 ` Austin S Hemmelgarn
  2015-03-02 13:31                     ` Aleksa Sarai
  2015-03-02 13:49                   ` Tejun Heo
  -1 siblings, 2 replies; 108+ messages in thread
From: Austin S Hemmelgarn @ 2015-03-02 13:13 UTC (permalink / raw)
  To: Tejun Heo, Tim Hockin
  Cc: Frederic Weisbecker, lizefan, richard, mingo, Aleksa Sarai,
	cgroups, peterz, linux-kernel

On 2015-02-28 11:43, Tejun Heo wrote:
> Hello, Tim.
>
> On Sat, Feb 28, 2015 at 08:38:07AM -0800, Tim Hockin wrote:
>> I know there is not much concern for legacy-system problems, but it is
>> worth adding this case - there are systems that limit PIDs for other
>> reasons, eg broken infrastructure that assumes PIDs fit in a short int,
>> hypothetically.  Given such a system, PIDs become precious and limiting
>> them per job is important.
>>
>> My main point being that there are less obvious considerations in play than
>> just memory usage.
>
> Sure, there are those cases but it'd be unwise to hinge long term
> decisions on them.  It's hard to even argue 16bit pid in legacy code
> as a significant contributing factor at this point.  At any rate, it
> seems that pid is a global resource which needs to be provisioned for
> reasonable isolation which is a good reason to consider controlling it
> via cgroups.
If 16-bit PID's aren't a concern anymore, then why do we still default 
to treating it like a 16-bit signed int (the default for 
/proc/sys/kernel/pid_max is 32768)?


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-03-02 13:31                     ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-02 13:31 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Tejun Heo, Tim Hockin, Frederic Weisbecker, lizefan, richard,
	mingo, cgroups, peterz, linux-kernel

> If 16-bit PID's aren't a concern anymore, then why do we still default to
> treating it like a 16-bit signed int (the default for
> /proc/sys/kernel/pid_max is 32768)?

I just want to emphasise that *even if* we changed to another default
limit, the mere existence of a system-wide pid_max makes PIDs a
resource.

--
Aleksa Sarai (cyphar)
www.cyphar.com

^ permalink raw reply	[flat|nested] 108+ messages in thread


* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
  2015-03-02 13:13                 ` Austin S Hemmelgarn
  2015-03-02 13:31                     ` Aleksa Sarai
@ 2015-03-02 13:49                   ` Tejun Heo
  1 sibling, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-02 13:49 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Tim Hockin, Frederic Weisbecker, lizefan, richard, mingo,
	Aleksa Sarai, cgroups, peterz, linux-kernel

On Mon, Mar 02, 2015 at 08:13:23AM -0500, Austin S Hemmelgarn wrote:
> If 16-bit PID's aren't a concern anymore, then why do we still default to
> treating it like a 16-bit signed int (the default for
> /proc/sys/kernel/pid_max is 32768)?

Inertia.  It has to start there for backward compatibility.  Now it's
trivial to adjust dynamically and the majority of users don't need to
worry about it, so there's no pressing reason to bump it up by
default.
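
The dynamic adjustment mentioned above amounts to a bounds check on the
sysctl value.  A minimal sketch of that check follows; the floor of 301
(the ~300 reserved PIDs plus one) and the 64-bit ceiling of 4 * 1024 * 1024
are assumptions about the kernel defaults of this era, not values quoted
anywhere in this thread:

```c
#include <stdbool.h>

/*
 * Sketch of the range a write to /proc/sys/kernel/pid_max is clamped to.
 * PID_MAX_FLOOR (assumed: RESERVED_PIDS + 1) and PID_MAX_CEILING
 * (assumed: the 64-bit PID_MAX_LIMIT) are illustrative assumptions.
 */
#define PID_MAX_DEFAULT 32768L                 /* historical default */
#define PID_MAX_FLOOR   301L                   /* assumed floor */
#define PID_MAX_CEILING (4L * 1024 * 1024)     /* assumed 64-bit ceiling */

bool pid_max_setting_valid(long val)
{
	return val >= PID_MAX_FLOOR && val <= PID_MAX_CEILING;
}
```

So on a 64-bit kernel a value as large as 4194304 would be accepted,
while values near or below the reserved range would be rejected.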

16bit pid_t was already a dying breed on 32bit config and it never was
an option on 64bit.  Any remotely modern distros in the past decade,
whether 32 or 64bit, wouldn't have any problem with it.  The only
possibly problematic case would be legacy code which for some reason
explicitly used 16bit integer types instead of pid_t, but at this
point, we shouldn't be basing any design decisions on that.  If
anybody is still depending on that, there are different ways to deal
with the issue on their end including namespacing its pid space.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC 0/2] add nproc cgroup subsystem
@ 2015-03-02 13:54                       ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-02 13:54 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Austin S Hemmelgarn, Tim Hockin, Frederic Weisbecker, lizefan,
	richard, mingo, cgroups, peterz, linux-kernel

On Tue, Mar 03, 2015 at 12:31:19AM +1100, Aleksa Sarai wrote:
> > If 16-bit PID's aren't a concern anymore, then why do we still default to
> > treating it like a 16-bit signed int (the default for
> > /proc/sys/kernel/pid_max is 32768)?
> 
> I just want to emphasise that *even if* we changed to another default
> limit, the mere existence of a system-wide pid_max makes PIDs a
> resource.

We seem to fail to communicate.  The primary reason why pid promotes
itself to a global resource status is because it's globally capped way
below its backing resource's (kernel memory) limit and it is very
difficult to make it not so due to direct userland dependencies on it.
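
The gap described above can be quantified with back-of-the-envelope
arithmetic: divide a kmemcg limit by a per-task kernel footprint.  The
~50 KiB/task figure below (task_struct, kernel stack, etc.) is purely an
illustrative assumption:

```c
/*
 * Rough illustration of why pid_max binds before kernel memory does:
 * how many tasks fit under a given kmemcg limit?  The per-task kernel
 * footprint passed in is an assumption for illustration only.
 */
long tasks_fitting_in_kmem(long kmem_limit_kib, long per_task_kib)
{
	return kmem_limit_kib / per_task_kib;
}
```

With a 4 GiB kernel-memory limit and ~50 KiB per task this allows on the
order of 80k tasks, well past the default pid_max of 32768, so the
global PID cap is exhausted long before the memory limit is reached.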

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread


* Re: [PATCH v2 2/2] cgroups: add an nproc subsystem
@ 2015-03-02 15:22       ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-02 15:22 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, fweisbec, linux-kernel, cgroups

Hello,

On Fri, Feb 27, 2015 at 03:17:19PM +1100, Aleksa Sarai wrote:
> +config CGROUP_NPROC
> +	bool "Process number limiting on cgroups"
> +	depends on PAGE_COUNTER
> +	help
> +	  This options enables the setting of process number limits in the scope
> +	  of a cgroup. Any attempt to fork more processes than is allowed in the
> +	  cgroup will fail. This allows for more basic resource limitation that
> +	  applies to a cgroup, similar to RLIMIT_NPROC (except that instead of
> +	  applying to a process tree it applies to a cgroup).

Please reflect the rationale from this discussion thread in the commit
message and help text.  Also, I'd much prefer to name it pids
controller after the resource it's controlling.

> +struct nproc {
> +	struct page_counter		proc_counter;

I don't think it's a good idea to use page_counter outside memcg.
This is pretty much an implementation detail of memcg.  The only
reason that file is out there is because of the wacky tcp controller
which is somewhat part of memcg (and to be replaced by proper kmemcg).
Either use plain atomic_t or percpu_counter with controlled batch
value (e.g. up to 10% deviation allowed from the target or sth).  Given
that fork/exit is pretty heavy path, just plain atomic_t is prolly
enough.
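
The plain-atomic scheme suggested here can be sketched in userspace with
C11 atomics.  The names and the -1 return value (standing in for
-EAGAIN) are illustrative, not taken from any patch in this thread:

```c
#include <stdatomic.h>

/*
 * Sketch of the suggested plain-atomic approach: charge one pid on
 * fork, and roll the optimistic add back if the limit would be
 * exceeded.  A real controller would return -EAGAIN instead of -1.
 */
static long nproc_limit = 4;
static atomic_long nproc_count;

int nproc_charge(void)
{
	long newval = atomic_fetch_add(&nproc_count, 1) + 1;

	if (newval > nproc_limit) {
		atomic_fetch_sub(&nproc_count, 1);	/* undo the add */
		return -1;
	}
	return 0;
}

void nproc_uncharge(void)
{
	atomic_fetch_sub(&nproc_count, 1);
}
```

Since fork/exit is already a heavy path, the single contended cache line
of one atomic is unlikely to matter, which is the point being made above.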

> +static int nproc_can_attach(struct cgroup_subsys_state *css,
> +			    struct cgroup_taskset *tset)
> +{
> +	struct nproc *nproc = css_nproc(css);
> +	unsigned long num_tasks = 0;
> +	struct task_struct *task;
> +
> +	cgroup_taskset_for_each(task, tset)
> +		num_tasks++;
> +
> +	return nproc_add_procs(nproc, num_tasks);
> +}

can_attach() can't fail in the unified hierarchy.  Circumvention of
configuration by moving processes to children is prevented through
hierarchical limit enforcement.

> +static int nproc_write_limit(struct cgroup_subsys_state *css,
> +			     struct cftype *cft, u64 val)
> +{
> +	struct nproc *nproc = css_nproc(css);
> +
> +	return page_counter_limit(&nproc->proc_counter, val);
> +}

Please make it handle "max".

> +static u64 nproc_read_limit(struct cgroup_subsys_state *css,
> +			    struct cftype *cft)
> +{
> +	struct nproc *nproc = css_nproc(css);
> +
> +	return nproc->proc_counter.limit;
> +}

Ditto when reading back.

> +static u64 nproc_read_max_limit(struct cgroup_subsys_state *css,
> +				       struct cftype *cft)
> +{
> +	return PAGE_COUNTER_MAX;
> +}

And drop this file.

> +static u64 nproc_read_usage(struct cgroup_subsys_state *css,
> +			    struct cftype *cft)
> +{
> +	struct nproc *nproc = css_nproc(css);
> +
> +	return page_counter_read(&nproc->proc_counter);
> +}
> +
> +static struct cftype files[] = {
> +	{
> +		.name = "limit",
> +		.write_u64 = nproc_write_limit,
> +		.read_u64 = nproc_read_limit,
> +	},

pids.max

> +	{
> +		.name = "max_limit",
> +		.read_u64 = nproc_read_max_limit,
> +	},
> +	{
> +		.name = "usage",
> +		.read_u64 = nproc_read_usage,
> +	},

pids.current

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread


* [PATCH v3 0/2] cgroup: add pids subsystem
  2015-02-23  3:08 ` Aleksa Sarai
                   ` (4 preceding siblings ...)
  (?)
@ 2015-03-04 20:23 ` Aleksa Sarai
  2015-03-04 20:23   ` [PATCH v3 1/2] cgroups: allow a cgroup subsystem to reject a fork Aleksa Sarai
  2015-03-04 20:23   ` [PATCH v3 2/2] cgroups: add a pids subsystem Aleksa Sarai
  -1 siblings, 2 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-04 20:23 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

This is a further updated version of the nproc v2 patchset[1] from
advice given by Tejun Heo[2]. The main changes include:

* Switching from mm/page_counter (which is a memcg implementation
  detail) to a pids-controller-specific hierarchical charge/uncharge
  counter with limits, implemented using atomic_long_t (and, like
  page_counter, lockless).

* Updates to the user-space interface to allow for the setting of no
  limit to the number of pids in a cgroup (-1 == unlimited) as well as
  renaming of the files and the removal of nproc.max_limit.

* The controller was renamed to `pids`.
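
The hierarchical atomic_long_t charge/uncharge counter described in the
first bullet can be sketched in userspace as follows.  The structure
mirrors the design summarized above, but the names and rollback details
are illustrative rather than copied from the patch:

```c
#include <stdatomic.h>
#include <stddef.h>

#define PIDS_UNLIMITED	(-1L)

/*
 * Each level holds an atomic count and a limit; a charge must fit at
 * every ancestor, or the whole operation is rolled back.
 */
struct pids_node {
	struct pids_node *parent;
	atomic_long counter;
	long limit;		/* PIDS_UNLIMITED or a cap */
};

int pids_try_charge(struct pids_node *pids, long num)
{
	struct pids_node *p, *fail;

	for (p = pids; p; p = p->parent) {
		long newval = atomic_fetch_add(&p->counter, num) + num;

		if (p->limit != PIDS_UNLIMITED && newval > p->limit) {
			fail = p;
			goto revert;
		}
	}
	return 0;

revert:
	/* undo every charge made so far, including the failing level */
	for (p = pids; p != fail; p = p->parent)
		atomic_fetch_sub(&p->counter, num);
	atomic_fetch_sub(&fail->counter, num);
	return -1;		/* stands in for -EAGAIN */
}

void pids_uncharge(struct pids_node *pids, long num)
{
	struct pids_node *p;

	for (p = pids; p; p = p->parent)
		atomic_fetch_sub(&p->counter, num);
}
```

Charging optimistically and reverting on failure keeps each per-level
update a single atomic op, which is what makes the counter lockless.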

[1]: https://lkml.org/lkml/2015/2/26/787
[2]: https://lkml.org/lkml/2015/3/2/437

Aleksa Sarai (2):
  cgroups: allow a cgroup subsystem to reject a fork
  cgroups: add a pids subsystem

 include/linux/cgroup.h        |   9 ++
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  12 ++
 kernel/Makefile               |   1 +
 kernel/cgroup.c               |  80 +++++++++---
 kernel/cgroup_pids.c          | 281 ++++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                 |  12 +-
 7 files changed, 381 insertions(+), 18 deletions(-)
 create mode 100644 kernel/cgroup_pids.c

-- 
2.3.1


^ permalink raw reply	[flat|nested] 108+ messages in thread

* [PATCH v3 1/2] cgroups: allow a cgroup subsystem to reject a fork
  2015-03-04 20:23 ` [PATCH v3 0/2] cgroup: add pids subsystem Aleksa Sarai
@ 2015-03-04 20:23   ` Aleksa Sarai
  2015-03-04 20:23   ` [PATCH v3 2/2] cgroups: add a pids subsystem Aleksa Sarai
  1 sibling, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-04 20:23 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

Add a new cgroup subsystem callback, can_fork, that determines whether
a fork is accepted or rejected according to cgroup policy.

Make the cgroup subsystem can_fork callback return an error code so
that subsystems can accept or reject a fork from completing with a
custom error value, before the process is exposed.

In addition, add a cancel_fork callback so that if an error occurs later
in the forking process, any state modified by can_fork can be reverted.

In order for can_fork to deal with a task that has an accurate css_set,
move the css_set updating to cgroup_fork (where it belongs).

This is in preparation for implementing the pids cgroup subsystem.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/cgroup.h |  9 ++++++
 kernel/cgroup.c        | 80 +++++++++++++++++++++++++++++++++++++++-----------
 kernel/fork.c          | 12 +++++++-
 3 files changed, 83 insertions(+), 18 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b9cb94c..43ed1ee 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -32,6 +32,8 @@ struct cgroup;
 extern int cgroup_init_early(void);
 extern int cgroup_init(void);
 extern void cgroup_fork(struct task_struct *p);
+extern int cgroup_can_fork(struct task_struct *p);
+extern void cgroup_cancel_fork(struct task_struct *p);
 extern void cgroup_post_fork(struct task_struct *p);
 extern void cgroup_exit(struct task_struct *p);
 extern int cgroupstats_build(struct cgroupstats *stats,
@@ -649,6 +651,8 @@ struct cgroup_subsys {
 			      struct cgroup_taskset *tset);
 	void (*attach)(struct cgroup_subsys_state *css,
 		       struct cgroup_taskset *tset);
+	int (*can_fork)(struct task_struct *task);
+	void (*cancel_fork)(struct task_struct *task);
 	void (*fork)(struct task_struct *task);
 	void (*exit)(struct cgroup_subsys_state *css,
 		     struct cgroup_subsys_state *old_css,
@@ -948,6 +952,11 @@ struct cgroup_subsys_state;
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_fork(struct task_struct *p) {}
+static inline int cgroup_can_fork(struct task_struct *p)
+{
+	return 0;
+}
+static inline void cgroup_cancel_fork(struct task_struct *p) {}
 static inline void cgroup_post_fork(struct task_struct *p) {}
 static inline void cgroup_exit(struct task_struct *p) {}
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 29a7b2c..3e09284 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4932,7 +4932,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	 * init_css_set is in the subsystem's root cgroup. */
 	init_css_set.subsys[ss->id] = css;
 
-	need_forkexit_callback |= ss->fork || ss->exit;
+	need_forkexit_callback |= ss->can_fork || ss->cancel_fork || ss->fork || ss->exit;
 
 	/* At system boot, before all subsystems have been
 	 * registered, no tasks have been forked, so we don't
@@ -5183,22 +5183,6 @@ void cgroup_fork(struct task_struct *child)
 {
 	RCU_INIT_POINTER(child->cgroups, &init_css_set);
 	INIT_LIST_HEAD(&child->cg_list);
-}
-
-/**
- * cgroup_post_fork - called on a new task after adding it to the task list
- * @child: the task in question
- *
- * Adds the task to the list running through its css_set if necessary and
- * call the subsystem fork() callbacks.  Has to be after the task is
- * visible on the task list in case we race with the first call to
- * cgroup_task_iter_start() - to guarantee that the new task ends up on its
- * list.
- */
-void cgroup_post_fork(struct task_struct *child)
-{
-	struct cgroup_subsys *ss;
-	int i;
 
 	/*
 	 * This may race against cgroup_enable_task_cg_lists().  As that
@@ -5233,6 +5217,68 @@ void cgroup_post_fork(struct task_struct *child)
 		}
 		up_write(&css_set_rwsem);
 	}
+}
+
+/**
+ * cgroup_can_fork - called on a new task before the process is exposed.
+ * @child: the task in question.
+ *
+ * This calls the subsystem can_fork() callbacks. If the can_fork() callback
+ * returns an error, the fork aborts with that error code. This allows for
+ * a cgroup subsystem to conditionally allow or deny new forks.
+ */
+int cgroup_can_fork(struct task_struct *child)
+{
+	struct cgroup_subsys *ss;
+	int i;
+
+	if (need_forkexit_callback) {
+		int retval;
+
+		for_each_subsys(ss, i)
+			if (ss->can_fork) {
+				retval = ss->can_fork(child);
+				if (retval)
+					return retval;
+			}
+	}
+
+	return 0;
+}
+
+/**
+ * cgroup_cancel_fork - called if a fork failed after cgroup_can_fork()
+ * @child: the task in question
+ *
+ * This calls the cancel_fork() callbacks if a fork failed *after*
+ * cgroup_can_fork() succeeded.
+ */
+void cgroup_cancel_fork(struct task_struct *child)
+{
+	struct cgroup_subsys *ss;
+	int i;
+
+	if (need_forkexit_callback) {
+		for_each_subsys(ss, i)
+			if (ss->cancel_fork)
+				ss->cancel_fork(child);
+	}
+}
+
+/**
+ * cgroup_post_fork - called on a new task after adding it to the task list
+ * @child: the task in question
+ *
+ * Adds the task to the list running through its css_set if necessary and
+ * call the subsystem fork() callbacks.  Has to be after the task is
+ * visible on the task list in case we race with the first call to
+ * cgroup_task_iter_start() - to guarantee that the new task ends up on its
+ * list.
+ */
+void cgroup_post_fork(struct task_struct *child)
+{
+	struct cgroup_subsys *ss;
+	int i;
 
 	/*
 	 * Call ss->fork().  This must happen after @child is linked on
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..35850a9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1469,6 +1469,14 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->task_works = NULL;
 
 	/*
+	 * Ensure that the cgroup subsystem policies allow the new process to be
+	 * forked.
+	 */
+	retval = cgroup_can_fork(p);
+	if (retval)
+		goto bad_fork_free_pid;
+
+	/*
 	 * Make it visible to the rest of the system, but dont wake it up yet.
 	 * Need tasklist lock for parent etc handling!
 	 */
@@ -1504,7 +1512,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		spin_unlock(&current->sighand->siglock);
 		write_unlock_irq(&tasklist_lock);
 		retval = -ERESTARTNOINTR;
-		goto bad_fork_free_pid;
+		goto bad_fork_cgroup_cancel;
 	}
 
 	if (likely(p->pid)) {
@@ -1556,6 +1564,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 
 	return p;
 
+bad_fork_cgroup_cancel:
+	cgroup_cancel_fork(p);
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
-- 
2.3.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH v3 2/2] cgroups: add a pids subsystem
  2015-03-04 20:23 ` [PATCH v3 0/2] cgroup: add pids subsystem Aleksa Sarai
  2015-03-04 20:23   ` [PATCH v3 1/2] cgroups: allow a cgroup subsystem to reject a fork Aleksa Sarai
@ 2015-03-04 20:23   ` Aleksa Sarai
  2015-03-05  8:39     ` Aleksa Sarai
  2015-03-05 14:37     ` Marian Marinov
  1 sibling, 2 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-04 20:23 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

Adds a new single-purpose pids subsystem to limit the number of
tasks that can run inside a cgroup. Essentially this is an
implementation of RLIMIT_NPROC that applies to a cgroup rather than
a process tree.

PIDs are fundamentally a global resource, and it is possible to reach
PID exhaustion inside a cgroup without hitting any reasonable kmemcg
policy. Once you've hit PID exhaustion, you're only in a marginally
better state than OOM. This subsystem allows PID exhaustion inside a
cgroup to be prevented.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  12 ++
 kernel/Makefile               |   1 +
 kernel/cgroup_pids.c          | 281 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 298 insertions(+)
 create mode 100644 kernel/cgroup_pids.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index e4a96fb..a198822 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -47,6 +47,10 @@ SUBSYS(net_prio)
 SUBSYS(hugetlb)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_PIDS)
+SUBSYS(pids)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..58f104a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1054,6 +1054,18 @@ config CGROUP_HUGETLB
 	  control group is tracked in the third page lru pointer. This means
 	  that we cannot use the controller with huge page less than 3 pages.
 
+config CGROUP_PIDS
+	bool "Process number limiting on cgroups"
+	depends on PAGE_COUNTER
+	help
+	  This option enables the setting of process number limits in the scope
+	  of a cgroup. Any attempt to fork more processes than is allowed in the
+	  cgroup will fail. PIDs are fundamentally a global resource because it
+	  is fairly trivial to reach PID exhaustion before you reach even a
+	  conservative kmemcg limit. As a result, it is possible to grind a
+	  system to a halt without being limited by other cgroup policies. The
+	  pids cgroup subsystem is designed to stop this from happening.
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/kernel/Makefile b/kernel/Makefile
index 1408b33..e823592 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
new file mode 100644
index 0000000..65cbab3
--- /dev/null
+++ b/kernel/cgroup_pids.c
@@ -0,0 +1,281 @@
+/*
+ * Process number limiting subsys for cgroups.
+ *
+ * Copyright (C) 2015 Aleksa Sarai <cyphar@cyphar.com>
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+
+#define PIDS_UNLIMITED -1
+
+struct pids {
+	struct pids *parent;
+	struct cgroup_subsys_state css;
+
+	atomic_long_t counter;
+	long limit;
+};
+
+static inline struct pids *css_pids(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct pids, css) : NULL;
+}
+
+static inline struct pids *task_pids(struct task_struct *task)
+{
+	return css_pids(task_css(task, pids_cgrp_id));
+}
+
+static struct pids *parent_pids(struct pids *pids)
+{
+	return css_pids(pids->css.parent);
+}
+
+static struct cgroup_subsys_state *
+pids_css_alloc(struct cgroup_subsys_state *parent)
+{
+	struct pids *pids;
+
+	pids = kzalloc(sizeof(struct pids), GFP_KERNEL);
+	if (!pids)
+		return ERR_PTR(-ENOMEM);
+
+	return &pids->css;
+}
+
+static int pids_css_online(struct cgroup_subsys_state *css)
+{
+	struct pids *pids = css_pids(css);
+	long limit = -1;
+
+	pids->parent = parent_pids(pids);
+	if (pids->parent)
+		limit = pids->parent->limit;
+
+	pids->limit = limit;
+	atomic_long_set(&pids->counter, 0);
+	return 0;
+}
+
+static void pids_css_free(struct cgroup_subsys_state *css)
+{
+	kfree(css_pids(css));
+}
+
+/**
+ * pids_cancel - uncharge the local pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to cancel
+ *
+ * This function will WARN if the pid count goes under 0,
+ * but will not prevent it.
+ */
+static void pids_cancel(struct pids *pids, int num)
+{
+	long new;
+
+	new = atomic_long_sub_return(num, &pids->counter);
+
+	/*
+	 * A negative count is invalid, but pids_cancel() can't fail.
+	 * So just emit a WARN.
+	 */
+	WARN_ON(new < 0);
+}
+
+/**
+ * pids_uncharge - hierarchically uncharge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to uncharge
+ *
+ * This function cannot fail, but it will WARN (via pids_cancel())
+ * if the pid count goes under 0 at any level.
+ */
+static void pids_uncharge(struct pids *pids, int num)
+{
+	struct pids *p;
+
+	for (p = pids; p; p = p->parent)
+		pids_cancel(p, num);
+}
+
+/**
+ * pids_charge - hierarchically charge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to charge
+ *
+ * This function does *not* follow the pid limit set. It will not
+ * fail and the new pid count may exceed the limit.
+ */
+static void pids_charge(struct pids *pids, int num)
+{
+	struct pids *p;
+
+	for (p = pids; p; p = p->parent)
+		atomic_long_add(num, &p->counter);
+}
+
+/**
+ * pids_try_charge - hierarchically try to charge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to charge
+ *
+ * This function follows the set limit. It can fail if the charge
+ * would cause the new value to exceed the limit.
+ * Returns 0 if the charge succeeded, otherwise -EAGAIN.
+ */
+static int pids_try_charge(struct pids *pids, int num)
+{
+	struct pids *p, *fail;
+
+	for (p = pids; p; p = p->parent) {
+		long new;
+
+		new = atomic_long_add_return(num, &p->counter);
+
+		if (p->limit == PIDS_UNLIMITED)
+			continue;
+
+		if (new > p->limit) {
+			atomic_long_sub(num, &p->counter);
+			fail = p;
+			goto revert;
+		}
+	}
+
+	return 0;
+
+revert:
+	for (p = pids; p != fail; p = p->parent)
+		pids_cancel(p, num);
+
+	return -EAGAIN;
+}
+
+static int pids_can_attach(struct cgroup_subsys_state *css,
+			   struct cgroup_taskset *tset)
+{
+	struct pids *pids = css_pids(css);
+	unsigned long num_tasks = 0;
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, tset)
+		num_tasks++;
+
+	/*
+	 * Attaching to a cgroup is allowed to overcome the
+	 * PID limit, so that organisation operations aren't
+	 * blocked by the `pids` cgroup controller.
+	 */
+	pids_charge(pids, num_tasks);
+	return 0;
+}
+
+static void pids_cancel_attach(struct cgroup_subsys_state *css,
+			       struct cgroup_taskset *tset)
+{
+	struct pids *pids = css_pids(css);
+	unsigned long num_tasks = 0;
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, tset)
+		num_tasks++;
+
+	pids_uncharge(pids, num_tasks);
+}
+
+static int pids_can_fork(struct task_struct *task)
+{
+	struct pids *pids = task_pids(task);
+
+	return pids_try_charge(pids, 1);
+}
+
+static void pids_cancel_fork(struct task_struct *task)
+{
+	struct pids *pids = task_pids(task);
+
+	pids_uncharge(pids, 1);
+}
+
+static void pids_exit(struct cgroup_subsys_state *css,
+		      struct cgroup_subsys_state *old_css,
+		      struct task_struct *task)
+{
+	struct pids *pids = css_pids(old_css);
+
+	/*
+	 * cgroup_exit() gets called as part of the cleanup code when copy_process()
+	 * fails. This should be ignored, because the pids_cancel_fork callback already
+	 * deals with the cgroup failed fork case.
+	 */
+	if (!(task->flags & PF_EXITING))
+		return;
+
+	pids_uncharge(pids, 1);
+}
+
+static int pids_write_max(struct cgroup_subsys_state *css,
+			  struct cftype *cft, s64 val)
+{
+	struct pids *pids = css_pids(css);
+
+	/* PIDS_UNLIMITED is the only legal negative value. */
+	if (val < 0 && val != PIDS_UNLIMITED)
+		return -EINVAL;
+
+	/*
+	 * Limit updates don't need to be mutex'd, since they
+	 * are more of a "soft" limit in the sense that you can
+	 * set a limit which is smaller than the current count
+	 * to stop any *new* processes from spawning.
+	 */
+	pids->limit = val;
+	return 0;
+}
+
+static s64 pids_read_max(struct cgroup_subsys_state *css,
+			 struct cftype *cft)
+{
+	struct pids *pids = css_pids(css);
+
+	return pids->limit;
+}
+
+static s64 pids_read_current(struct cgroup_subsys_state *css,
+			     struct cftype *cft)
+{
+	struct pids *pids = css_pids(css);
+
+	return atomic_long_read(&pids->counter);
+}
+
+static struct cftype files[] = {
+	{
+		.name = "max",
+		.write_s64 = pids_write_max,
+		.read_s64 = pids_read_max,
+	},
+	{
+		.name = "current",
+		.read_s64 = pids_read_current,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys pids_cgrp_subsys = {
+	.css_alloc	= pids_css_alloc,
+	.css_online	= pids_css_online,
+	.css_free	= pids_css_free,
+	.can_attach	= pids_can_attach,
+	.cancel_attach	= pids_cancel_attach,
+	.can_fork	= pids_can_fork,
+	.cancel_fork	= pids_cancel_fork,
+	.exit		= pids_exit,
+	.legacy_cftypes	= files,
+	.early_init	= 0,
+};
-- 
2.3.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [PATCH v3 2/2] cgroups: add a pids subsystem
  2015-03-04 20:23   ` [PATCH v3 2/2] cgroups: add a pids subsystem Aleksa Sarai
@ 2015-03-05  8:39     ` Aleksa Sarai
  2015-03-05 14:37     ` Marian Marinov
  1 sibling, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-05  8:39 UTC (permalink / raw)
  To: Tejun Heo, lizefan, mingo, peterz
  Cc: richard, Frédéric Weisbecker, linux-kernel, cgroups,
	Aleksa Sarai

> +       depends on PAGE_COUNTER
Whoops. I forgot to remove this line. Should I submit a revised
patchset or can you just change the patch?

--
Aleksa Sarai (cyphar)
www.cyphar.com

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v3 2/2] cgroups: add a pids subsystem
  2015-03-04 20:23   ` [PATCH v3 2/2] cgroups: add a pids subsystem Aleksa Sarai
  2015-03-05  8:39     ` Aleksa Sarai
@ 2015-03-05 14:37     ` Marian Marinov
  1 sibling, 0 replies; 108+ messages in thread
From: Marian Marinov @ 2015-03-05 14:37 UTC (permalink / raw)
  To: Aleksa Sarai, tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups

Hi Aleksa,
would you be willing to put your patches online in a repo, like what Dwight Engen did three years ago?
  https://github.com/dwengen/linux/tree/cpuacct-task-limit-3.14

I've been using his patchset for more than a year now. However, I would be happy to experiment with your patches as well.

And having a public repo that I can pull from is extremely helpful for people like me. It makes testing and
long-term support easier.

Thank you for your work on this matter,
Marian

On 03/04/2015 10:23 PM, Aleksa Sarai wrote:
> Adds a new single-purpose pids subsystem to limit the number of
> tasks that can run inside a cgroup. Essentially this is an
> implementation of RLIMIT_NPROC that applies to a cgroup rather than
> a process tree.
>
> PIDs are fundamentally a global resource, and it is possible to reach
> PID exhaustion inside a cgroup without hitting any reasonable kmemcg
> policy. Once you've hit PID exhaustion, you're only in a marginally
> better state than OOM. This subsystem allows PID exhaustion inside a
> cgroup to be prevented.
>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  include/linux/cgroup_subsys.h |   4 +
>  init/Kconfig                  |  12 ++
>  kernel/Makefile               |   1 +
>  kernel/cgroup_pids.c          | 281 ++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 298 insertions(+)
>  create mode 100644 kernel/cgroup_pids.c
>
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index e4a96fb..a198822 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -47,6 +47,10 @@ SUBSYS(net_prio)
>  SUBSYS(hugetlb)
>  #endif
>  
> +#if IS_ENABLED(CONFIG_CGROUP_PIDS)
> +SUBSYS(pids)
> +#endif
> +
>  /*
>   * The following subsystems are not supported on the default hierarchy.
>   */
> diff --git a/init/Kconfig b/init/Kconfig
> index f5dbc6d..58f104a 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1054,6 +1054,18 @@ config CGROUP_HUGETLB
>  	  control group is tracked in the third page lru pointer. This means
>  	  that we cannot use the controller with huge page less than 3 pages.
>  
> +config CGROUP_PIDS
> +	bool "Process number limiting on cgroups"
> +	depends on PAGE_COUNTER
> +	help
> +	  This option enables the setting of process number limits in the scope
> +	  of a cgroup. Any attempt to fork more processes than is allowed in the
> +	  cgroup will fail. PIDs are fundamentally a global resource because it
> +	  is fairly trivial to reach PID exhaustion before you reach even a
> +	  conservative kmemcg limit. As a result, it is possible to grind a
> +	  system to a halt without being limited by other cgroup policies. The pids
> +	  cgroup subsystem is designed to stop this from happening.
> +
>  config CGROUP_PERF
>  	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>  	depends on PERF_EVENTS && CGROUPS
> [... remainder of the quoted patch trimmed; it duplicates the kernel/Makefile and kernel/cgroup_pids.c hunks quoted earlier in this thread ...]


^ permalink raw reply	[flat|nested] 108+ messages in thread

* [PATCH v4 0/2] cgroup: add pids subsystem
@ 2015-03-06  1:45   ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-06  1:45 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

This is a checkpatch'd version of the pids patchset[1]. It fixes some
style problems, switches to using need_canfork_callback inside
kernel/cgroup.c, and removes the Kconfig dependency on PAGE_COUNTER
(because pids now uses an internal hierarchical counter).

[1]: https://lkml.org/lkml/2015/3/4/1198
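For readers following along without the tree: the "internal hierarchical counter" mentioned above relies on a speculative charge-then-revert pattern on an atomic counter. A minimal userspace C11 sketch of that pattern (illustrative only; the single flat counter, the names and the limit value are simplifications of the in-kernel code):

```c
#include <assert.h>
#include <stdatomic.h>

/* A single pids-style counter with a fixed limit (hypothetical values). */
static atomic_long counter;
static long limit = 4;

/*
 * Speculatively add, then check: concurrent callers may transiently
 * overshoot the limit, but every failed caller subtracts exactly what
 * it added, so the counter never drifts.  This mirrors the
 * atomic_long_add_return()/atomic_long_sub() pairing in the patch.
 */
static int try_charge(int num)
{
	long newval = atomic_fetch_add(&counter, num) + num;

	if (newval > limit) {
		atomic_fetch_sub(&counter, num);
		return -1;	/* stands in for -EAGAIN */
	}
	return 0;
}
```

Because the overshoot is transient and self-correcting, no lock is needed around the counter; only the limit check races, and it races safely.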

Aleksa Sarai (2):
  cgroups: allow a cgroup subsystem to reject a fork
  cgroups: add a pids subsystem

 include/linux/cgroup.h        |   9 ++
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  11 ++
 kernel/Makefile               |   1 +
 kernel/cgroup.c               |  82 +++++++++---
 kernel/cgroup_pids.c          | 282 ++++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                 |  12 +-
 7 files changed, 384 insertions(+), 17 deletions(-)
 create mode 100644 kernel/cgroup_pids.c

-- 
2.3.1


^ permalink raw reply	[flat|nested] 108+ messages in thread


* [PATCH v4 1/2] cgroups: allow a cgroup subsystem to reject a fork
  2015-03-06  1:45   ` Aleksa Sarai
  (?)
@ 2015-03-06  1:45   ` Aleksa Sarai
  -1 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-06  1:45 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

Add a new cgroup subsystem callback, can_fork, that determines
whether a fork is accepted or rejected by a cgroup policy.

Make the can_fork callback return an error code so that a subsystem
can reject a fork with a custom error value before the process is
exposed.

In addition, add a cancel_fork callback so that if an error occurs later
in the forking process, any state modified by can_fork can be reverted.

In order for can_fork to deal with a task that has an accurate css_set,
move the css_set updating to cgroup_fork (where it belongs).

This is in preparation for implementing the pids cgroup subsystem.
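The intended ordering of the new hooks can be modeled in plain userspace C (a simplified, single-threaded sketch; the function names mirror the callbacks, but the counter, the limit and fork_attempt() are all illustrative):

```c
#include <assert.h>

/* A toy subsystem that charges one token per fork (hypothetical limit). */
static long counter;
static long limit = 2;

static int can_fork(void)
{
	if (counter + 1 > limit)
		return -1;	/* stands in for an errno such as -EAGAIN */
	counter++;		/* speculative charge */
	return 0;
}

static void cancel_fork(void)
{
	counter--;		/* revert the speculative charge */
}

/*
 * copy_process()-style flow: charge before the task becomes visible,
 * and revert if a later step fails.
 */
static int fork_attempt(int later_step_fails)
{
	if (can_fork())
		return -1;
	if (later_step_fails) {
		cancel_fork();
		return -1;
	}
	return 0;	/* the post_fork callback would run here */
}
```

The key invariant is that every can_fork() success is balanced by exactly one of cancel_fork() (on a later failure) or an eventual exit-path uncharge.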

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/cgroup.h |  9 ++++++
 kernel/cgroup.c        | 82 ++++++++++++++++++++++++++++++++++++++++----------
 kernel/fork.c          | 12 +++++++-
 3 files changed, 86 insertions(+), 17 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b9cb94c..43ed1ee 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -32,6 +32,8 @@ struct cgroup;
 extern int cgroup_init_early(void);
 extern int cgroup_init(void);
 extern void cgroup_fork(struct task_struct *p);
+extern int cgroup_can_fork(struct task_struct *p);
+extern void cgroup_cancel_fork(struct task_struct *p);
 extern void cgroup_post_fork(struct task_struct *p);
 extern void cgroup_exit(struct task_struct *p);
 extern int cgroupstats_build(struct cgroupstats *stats,
@@ -649,6 +651,8 @@ struct cgroup_subsys {
 			      struct cgroup_taskset *tset);
 	void (*attach)(struct cgroup_subsys_state *css,
 		       struct cgroup_taskset *tset);
+	int (*can_fork)(struct task_struct *task);
+	void (*cancel_fork)(struct task_struct *task);
 	void (*fork)(struct task_struct *task);
 	void (*exit)(struct cgroup_subsys_state *css,
 		     struct cgroup_subsys_state *old_css,
@@ -948,6 +952,11 @@ struct cgroup_subsys_state;
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_fork(struct task_struct *p) {}
+static inline int cgroup_can_fork(struct task_struct *p)
+{
+	return 0;
+}
+static inline void cgroup_cancel_fork(struct task_struct *p) {}
 static inline void cgroup_post_fork(struct task_struct *p) {}
 static inline void cgroup_exit(struct task_struct *p) {}
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 29a7b2c..378badb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -182,6 +182,9 @@ static u64 css_serial_nr_next = 1;
  */
 static int need_forkexit_callback __read_mostly;
 
+/* Ditto for the can_fork/cancel_fork callbacks. */
+static int need_canfork_callback __read_mostly;
+
 static struct cftype cgroup_dfl_base_files[];
 static struct cftype cgroup_legacy_base_files[];
 
@@ -4933,6 +4936,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	init_css_set.subsys[ss->id] = css;
 
 	need_forkexit_callback |= ss->fork || ss->exit;
+	need_canfork_callback |= ss->can_fork || ss->cancel_fork;
 
 	/* At system boot, before all subsystems have been
 	 * registered, no tasks have been forked, so we don't
@@ -5183,22 +5187,6 @@ void cgroup_fork(struct task_struct *child)
 {
 	RCU_INIT_POINTER(child->cgroups, &init_css_set);
 	INIT_LIST_HEAD(&child->cg_list);
-}
-
-/**
- * cgroup_post_fork - called on a new task after adding it to the task list
- * @child: the task in question
- *
- * Adds the task to the list running through its css_set if necessary and
- * call the subsystem fork() callbacks.  Has to be after the task is
- * visible on the task list in case we race with the first call to
- * cgroup_task_iter_start() - to guarantee that the new task ends up on its
- * list.
- */
-void cgroup_post_fork(struct task_struct *child)
-{
-	struct cgroup_subsys *ss;
-	int i;
 
 	/*
 	 * This may race against cgroup_enable_task_cg_lists().  As that
@@ -5233,6 +5221,68 @@ void cgroup_post_fork(struct task_struct *child)
 		}
 		up_write(&css_set_rwsem);
 	}
+}
+
+/**
+ * cgroup_can_fork - called on a new task before the process is exposed.
+ * @child: the task in question.
+ *
+ * This calls the subsystem can_fork() callbacks. If the can_fork() callback
+ * returns an error, the fork aborts with that error code. This allows for
+ * a cgroup subsystem to conditionally allow or deny new forks.
+ */
+int cgroup_can_fork(struct task_struct *child)
+{
+	struct cgroup_subsys *ss;
+	int i;
+
+	if (need_canfork_callback) {
+		int retval;
+
+		for_each_subsys(ss, i)
+			if (ss->can_fork) {
+				retval = ss->can_fork(child);
+				if (retval)
+					return retval;
+			}
+	}
+
+	return 0;
+}
+
+/**
+ * cgroup_cancel_fork - called if a fork failed after cgroup_can_fork()
+ * @child: the task in question
+ *
+ * This calls the cancel_fork() callbacks if a fork failed *after*
+ * cgroup_can_fork() succeeded.
+ */
+void cgroup_cancel_fork(struct task_struct *child)
+{
+	struct cgroup_subsys *ss;
+	int i;
+
+	if (need_canfork_callback) {
+		for_each_subsys(ss, i)
+			if (ss->cancel_fork)
+				ss->cancel_fork(child);
+	}
+}
+
+/**
+ * cgroup_post_fork - called on a new task after adding it to the task list
+ * @child: the task in question
+ *
+ * Adds the task to the list running through its css_set if necessary and
+ * call the subsystem fork() callbacks.  Has to be after the task is
+ * visible on the task list in case we race with the first call to
+ * cgroup_task_iter_start() - to guarantee that the new task ends up on its
+ * list.
+ */
+void cgroup_post_fork(struct task_struct *child)
+{
+	struct cgroup_subsys *ss;
+	int i;
 
 	/*
 	 * Call ss->fork().  This must happen after @child is linked on
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..35850a9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1469,6 +1469,14 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->task_works = NULL;
 
 	/*
+	 * Ensure that the cgroup subsystem policies allow the new process to be
+	 * forked.
+	 */
+	retval = cgroup_can_fork(p);
+	if (retval)
+		goto bad_fork_free_pid;
+
+	/*
 	 * Make it visible to the rest of the system, but dont wake it up yet.
 	 * Need tasklist lock for parent etc handling!
 	 */
@@ -1504,7 +1512,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		spin_unlock(&current->sighand->siglock);
 		write_unlock_irq(&tasklist_lock);
 		retval = -ERESTARTNOINTR;
-		goto bad_fork_free_pid;
+		goto bad_fork_cgroup_cancel;
 	}
 
 	if (likely(p->pid)) {
@@ -1556,6 +1564,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 
 	return p;
 
+bad_fork_cgroup_cancel:
+	cgroup_cancel_fork(p);
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
-- 
2.3.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH v4 2/2] cgroups: add a pids subsystem
@ 2015-03-06  1:45     ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-06  1:45 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

Adds a new single-purpose pids subsystem to limit the number of
tasks that can run inside a cgroup. Essentially this is an
implementation of RLIMIT_NPROC that applies to a cgroup rather than
a process tree.

PIDs are fundamentally a global resource, and it is possible to reach
PID exhaustion inside a cgroup without hitting any reasonable kmemcg
policy. Once you've hit PID exhaustion, you're only in a marginally
better state than OOM. This subsystem allows PID exhaustion inside a
cgroup to be prevented.
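The core of the controller is a hierarchical charge with revert-on-failure. A single-threaded userspace C sketch of that logic (the struct is stripped down from the patch; the limits and hierarchy below are illustrative):

```c
#include <assert.h>
#include <stddef.h>

#define PIDS_UNLIMITED -1

struct pids {
	struct pids *parent;
	long counter;
	long limit;
};

/*
 * Walk up the hierarchy charging @num at each level; if any level
 * would exceed its limit, uncharge the levels already charged and
 * fail.  Note that the revert loop must uncharge each intermediate
 * level p, not the leaf repeatedly.
 */
static int pids_try_charge(struct pids *pids, int num)
{
	struct pids *p, *fail = NULL;

	for (p = pids; p; p = p->parent) {
		p->counter += num;
		if (p->limit != PIDS_UNLIMITED && p->counter > p->limit) {
			p->counter -= num;
			fail = p;
			goto revert;
		}
	}
	return 0;

revert:
	for (p = pids; p != fail; p = p->parent)
		p->counter -= num;
	return -1;	/* stands in for -EAGAIN */
}
```

A failed charge leaves every level's counter exactly as it found it, including the ancestors above the level that rejected the charge (which were never charged in the first place).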

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  11 ++
 kernel/Makefile               |   1 +
 kernel/cgroup_pids.c          | 282 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 298 insertions(+)
 create mode 100644 kernel/cgroup_pids.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index e4a96fb..a198822 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -47,6 +47,10 @@ SUBSYS(net_prio)
 SUBSYS(hugetlb)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_PIDS)
+SUBSYS(pids)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..88364c9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1054,6 +1054,17 @@ config CGROUP_HUGETLB
 	  control group is tracked in the third page lru pointer. This means
 	  that we cannot use the controller with huge page less than 3 pages.
 
+config CGROUP_PIDS
+	bool "Process number limiting on cgroups"
+	help
+	  This option enables the setting of process number limits in the scope
+	  of a cgroup. Any attempt to fork more processes than is allowed in the
+	  cgroup will fail. PIDs are fundamentally a global resource because it
+	  is fairly trivial to reach PID exhaustion before you reach even a
+	  conservative kmemcg limit. As a result, it is possible to grind a
+	  system to a halt without being limited by other cgroup policies. The pids
+	  cgroup subsystem is designed to stop this from happening.
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/kernel/Makefile b/kernel/Makefile
index 1408b33..e823592 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
new file mode 100644
index 0000000..a97fd0e
--- /dev/null
+++ b/kernel/cgroup_pids.c
@@ -0,0 +1,282 @@
+/*
+ * Process number limiting subsys for cgroups.
+ *
+ * Copyright (C) 2015 Aleksa Sarai <cyphar@cyphar.com>
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+
+#define PIDS_UNLIMITED -1
+
+struct pids {
+	struct pids *parent;
+	struct cgroup_subsys_state css;
+
+	atomic_long_t counter;
+	long limit;
+};
+
+static inline struct pids *css_pids(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct pids, css) : NULL;
+}
+
+static inline struct pids *task_pids(struct task_struct *task)
+{
+	return css_pids(task_css(task, pids_cgrp_id));
+}
+
+static struct pids *parent_pids(struct pids *pids)
+{
+	return css_pids(pids->css.parent);
+}
+
+static struct cgroup_subsys_state *
+pids_css_alloc(struct cgroup_subsys_state *parent)
+{
+	struct pids *pids;
+
+	pids = kzalloc(sizeof(struct pids), GFP_KERNEL);
+	if (!pids)
+		return ERR_PTR(-ENOMEM);
+
+	return &pids->css;
+}
+
+static int pids_css_online(struct cgroup_subsys_state *css)
+{
+	struct pids *pids = css_pids(css);
+	long limit = -1;
+
+	pids->parent = parent_pids(pids);
+	if (pids->parent)
+		limit = pids->parent->limit;
+
+	pids->limit = limit;
+	atomic_long_set(&pids->counter, 0);
+	return 0;
+}
+
+static void pids_css_free(struct cgroup_subsys_state *css)
+{
+	kfree(css_pids(css));
+}
+
+/**
+ * pids_cancel - uncharge the local pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to cancel
+ *
+ * This function will WARN if the pid count goes under 0,
+ * but will not prevent it.
+ */
+static void pids_cancel(struct pids *pids, int num)
+{
+	long new;
+
+	new = atomic_long_sub_return(num, &pids->counter);
+
+	/*
+	 * A negative count is invalid, but pids_cancel() can't fail.
+	 * So just emit a WARN.
+	 */
+	WARN_ON(new < 0);
+}
+
+/**
+ * pids_uncharge - hierarchically uncharge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to uncharge
+ *
+ * This function cannot fail, but it will WARN (via pids_cancel())
+ * if the pid count goes under 0 at any level.
+ */
+static void pids_uncharge(struct pids *pids, int num)
+{
+	struct pids *p;
+
+	for (p = pids; p; p = p->parent)
+		pids_cancel(p, num);
+}
+
+/**
+ * pids_charge - hierarchically charge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to charge
+ *
+ * This function does *not* follow the pid limit set. It will not
+ * fail and the new pid count may exceed the limit.
+ */
+static void pids_charge(struct pids *pids, int num)
+{
+	struct pids *p;
+
+	for (p = pids; p; p = p->parent)
+		atomic_long_add(num, &p->counter);
+}
+
+/**
+ * pids_try_charge - hierarchically try to charge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to charge
+ *
+ * This function follows the set limit. It can fail if the charge
+ * would cause the new value to exceed the limit.
+ * Returns 0 if the charge succeded, otherwise -EAGAIN.
+ */
+static int pids_try_charge(struct pids *pids, int num)
+{
+	struct pids *p, *fail;
+
+	for (p = pids; p; p = p->parent) {
+		long new;
+
+		new = atomic_long_add_return(num, &p->counter);
+
+		if (p->limit == PIDS_UNLIMITED)
+			continue;
+
+		if (new > p->limit) {
+			atomic_long_sub(num, &p->counter);
+			fail = p;
+			goto revert;
+		}
+	}
+
+	return 0;
+
+revert:
+	for (p = pids; p != fail; p = p->parent)
+		pids_cancel(pids, num);
+
+	return -EAGAIN;
+}
+
+static int pids_can_attach(struct cgroup_subsys_state *css,
+			   struct cgroup_taskset *tset)
+{
+	struct pids *pids = css_pids(css);
+	unsigned long num_tasks = 0;
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, tset)
+		num_tasks++;
+
+	/*
+	 * Attaching to a cgroup is allowed to overcome the
+	 * the PID limit, so that organisation operations aren't
+	 * blocked by the `pids` cgroup controller.
+	 */
+	pids_charge(pids, num_tasks);
+	return 0;
+}
+
+static void pids_cancel_attach(struct cgroup_subsys_state *css,
+			       struct cgroup_taskset *tset)
+{
+	struct pids *pids = css_pids(css);
+	unsigned long num_tasks = 0;
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, tset)
+		num_tasks++;
+
+	pids_uncharge(pids, num_tasks);
+}
+
+static int pids_can_fork(struct task_struct *task)
+{
+	struct pids *pids = task_pids(task);
+
+	return pids_try_charge(pids, 1);
+}
+
+static void pids_cancel_fork(struct task_struct *task)
+{
+	struct pids *pids = task_pids(task);
+
+	pids_uncharge(pids, 1);
+}
+
+static void pids_exit(struct cgroup_subsys_state *css,
+		      struct cgroup_subsys_state *old_css,
+		      struct task_struct *task)
+{
+	struct pids *pids = css_pids(old_css);
+
+	/*
+	 * cgroup_exit() gets called as part of the cleanup code when
+	 * copy_process() fails. This should ignored, because the
+	 * pids_cancel_fork callback already deals with the cgroup failed fork
+	 * case.
+	 */
+	if (!(task->flags & PF_EXITING))
+		return;
+
+	pids_uncharge(pids, 1);
+}
+
+static int pids_write_max(struct cgroup_subsys_state *css,
+			  struct cftype *cft, s64 val)
+{
+	struct pids *pids = css_pids(css);
+
+	/* PIDS_UNLIMITED is the only legal negative value. */
+	if (val < 0 && val != PIDS_UNLIMITED)
+		return -EINVAL;
+
+	/*
+	 * Limit updates don't need to be mutex'd, since they
+	 * are more of a "soft" limit in the sense that you can
+	 * set a limit which is smaller than the current count
+	 * to stop any *new* processes from spawning.
+	 */
+	pids->limit = val;
+	return 0;
+}
+
+static s64 pids_read_max(struct cgroup_subsys_state *css,
+			 struct cftype *cft)
+{
+	struct pids *pids = css_pids(css);
+
+	return pids->limit;
+}
+
+static s64 pids_read_current(struct cgroup_subsys_state *css,
+			     struct cftype *cft)
+{
+	struct pids *pids = css_pids(css);
+
+	return atomic_long_read(&pids->counter);
+}
+
+static struct cftype files[] = {
+	{
+		.name = "max",
+		.write_s64 = pids_write_max,
+		.read_s64 = pids_read_max,
+	},
+	{
+		.name = "current",
+		.read_s64 = pids_read_current,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys pids_cgrp_subsys = {
+	.css_alloc	= pids_css_alloc,
+	.css_online	= pids_css_online,
+	.css_free	= pids_css_free,
+	.can_attach	= pids_can_attach,
+	.cancel_attach	= pids_cancel_attach,
+	.can_fork	= pids_can_fork,
+	.cancel_fork	= pids_cancel_fork,
+	.exit		= pids_exit,
+	.legacy_cftypes	= files,
+	.early_init	= 0,
+};
-- 
2.3.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH v4 2/2] cgroups: add a pids subsystem
@ 2015-03-06  1:45     ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-06  1:45 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ
  Cc: richard-/L3Ra7n9ekc, fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Aleksa Sarai

Adds a new single-purpose pids subsystem to limit the number of
tasks that can run inside a cgroup. Essentially this is an
implementation of RLIMIT_NPROC that applies to a cgroup rather than
a process tree.

PIDs are fundamentally a global resource, and it is possible to reach
PID exhaustion inside a cgroup without hitting any reasonable kmemcg
policy. Once you've hit PID exhaustion, you're only in a marginally
better state than OOM. This subsystem allows PID exhaustion inside a
cgroup to be prevented.

Signed-off-by: Aleksa Sarai <cyphar-gVpy/LI/lHzQT0dZR+AlfA@public.gmane.org>
---
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  11 ++
 kernel/Makefile               |   1 +
 kernel/cgroup_pids.c          | 282 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 298 insertions(+)
 create mode 100644 kernel/cgroup_pids.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index e4a96fb..a198822 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -47,6 +47,10 @@ SUBSYS(net_prio)
 SUBSYS(hugetlb)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_PIDS)
+SUBSYS(pids)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..88364c9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1054,6 +1054,17 @@ config CGROUP_HUGETLB
 	  control group is tracked in the third page lru pointer. This means
 	  that we cannot use the controller with huge page less than 3 pages.
 
+config CGROUP_PIDS
+	bool "Process number limiting on cgroups"
+	help
+	  This option enables the setting of process number limits in the scope
+	  of a cgroup. Any attempt to fork more processes than is allowed in the
+	  cgroup will fail. PIDs are fundamentally a global resource because it
+	  is fairly trivial to reach PID exhaustion before you reach even a
+	  conservative kmemcg limit. As a result, it is possible to grind a
+	  system to a halt without being limited by other cgroup policies. The pids
+	  cgroup subsystem is designed to stop this from happening.
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/kernel/Makefile b/kernel/Makefile
index 1408b33..e823592 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
new file mode 100644
index 0000000..a97fd0e
--- /dev/null
+++ b/kernel/cgroup_pids.c
@@ -0,0 +1,282 @@
+/*
+ * Process number limiting subsys for cgroups.
+ *
+ * Copyright (C) 2015 Aleksa Sarai <cyphar-gVpy/LI/lHzQT0dZR+AlfA@public.gmane.org>
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+
+#define PIDS_UNLIMITED -1
+
+struct pids {
+	struct pids *parent;
+	struct cgroup_subsys_state css;
+
+	atomic_long_t counter;
+	long limit;
+};
+
+static inline struct pids *css_pids(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct pids, css) : NULL;
+}
+
+static inline struct pids *task_pids(struct task_struct *task)
+{
+	return css_pids(task_css(task, pids_cgrp_id));
+}
+
+static struct pids *parent_pids(struct pids *pids)
+{
+	return css_pids(pids->css.parent);
+}
+
+static struct cgroup_subsys_state *
+pids_css_alloc(struct cgroup_subsys_state *parent)
+{
+	struct pids *pids;
+
+	pids = kzalloc(sizeof(struct pids), GFP_KERNEL);
+	if (!pids)
+		return ERR_PTR(-ENOMEM);
+
+	return &pids->css;
+}
+
+static int pids_css_online(struct cgroup_subsys_state *css)
+{
+	struct pids *pids = css_pids(css);
+	long limit = -1;
+
+	pids->parent = parent_pids(pids);
+	if (pids->parent)
+		limit = pids->parent->limit;
+
+	pids->limit = limit;
+	atomic_long_set(&pids->counter, 0);
+	return 0;
+}
+
+static void pids_css_free(struct cgroup_subsys_state *css)
+{
+	kfree(css_pids(css));
+}
+
+/**
+ * pids_cancel - uncharge the local pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to cancel
+ *
+ * This function will WARN if the pid count goes under 0,
+ * but will not prevent it.
+ */
+static void pids_cancel(struct pids *pids, int num)
+{
+	long new;
+
+	new = atomic_long_sub_return(num, &pids->counter);
+
+	/*
+	 * A negative count is invalid, but pids_cancel() can't fail.
+	 * So just emit a WARN.
+	 */
+	WARN_ON(new < 0);
+}
+
+/**
+ * pids_uncharge - hierarchically uncharge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to uncharge
+ *
+ * This function will WARN if the pid count goes under 0,
+ * but will not prevent it.
+ */
+static void pids_uncharge(struct pids *pids, int num)
+{
+	struct pids *p;
+
+	for (p = pids; p; p = p->parent)
+		pids_cancel(p, num);
+}
+
+/**
+ * pids_charge - hierarchically charge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to charge
+ *
+ * This function does *not* follow the pid limit set. It will not
+ * fail and the new pid count may exceed the limit.
+ */
+static void pids_charge(struct pids *pids, int num)
+{
+	struct pids *p;
+
+	for (p = pids; p; p = p->parent)
+		atomic_long_add(num, &p->counter);
+}
+
+/**
+ * pids_try_charge - hierarchically try to charge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to charge
+ *
+ * This function follows the set limit. It can fail if the charge
+ * would cause the new value to exceed the limit.
+ * Returns 0 if the charge succeeded, otherwise -EAGAIN.
+ */
+static int pids_try_charge(struct pids *pids, int num)
+{
+	struct pids *p, *fail;
+
+	for (p = pids; p; p = p->parent) {
+		long new;
+
+		new = atomic_long_add_return(num, &p->counter);
+
+		if (p->limit == PIDS_UNLIMITED)
+			continue;
+
+		if (new > p->limit) {
+			atomic_long_sub(num, &p->counter);
+			fail = p;
+			goto revert;
+		}
+	}
+
+	return 0;
+
+revert:
+	for (p = pids; p != fail; p = p->parent)
+		pids_cancel(p, num);
+
+	return -EAGAIN;
+}
+
+static int pids_can_attach(struct cgroup_subsys_state *css,
+			   struct cgroup_taskset *tset)
+{
+	struct pids *pids = css_pids(css);
+	unsigned long num_tasks = 0;
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, tset)
+		num_tasks++;
+
+	/*
+	 * Attaching to a cgroup is allowed to overcome the
+	 * PID limit, so that organisational operations aren't
+	 * blocked by the `pids` cgroup controller.
+	 */
+	pids_charge(pids, num_tasks);
+	return 0;
+}
+
+static void pids_cancel_attach(struct cgroup_subsys_state *css,
+			       struct cgroup_taskset *tset)
+{
+	struct pids *pids = css_pids(css);
+	unsigned long num_tasks = 0;
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, tset)
+		num_tasks++;
+
+	pids_uncharge(pids, num_tasks);
+}
+
+static int pids_can_fork(struct task_struct *task)
+{
+	struct pids *pids = task_pids(task);
+
+	return pids_try_charge(pids, 1);
+}
+
+static void pids_cancel_fork(struct task_struct *task)
+{
+	struct pids *pids = task_pids(task);
+
+	pids_uncharge(pids, 1);
+}
+
+static void pids_exit(struct cgroup_subsys_state *css,
+		      struct cgroup_subsys_state *old_css,
+		      struct task_struct *task)
+{
+	struct pids *pids = css_pids(old_css);
+
+	/*
+	 * cgroup_exit() gets called as part of the cleanup code when
+	 * copy_process() fails. This case should be ignored, because the
+	 * pids_cancel_fork callback already deals with the failed cgroup fork
+	 * case.
+	 */
+	if (!(task->flags & PF_EXITING))
+		return;
+
+	pids_uncharge(pids, 1);
+}
+
+static int pids_write_max(struct cgroup_subsys_state *css,
+			  struct cftype *cft, s64 val)
+{
+	struct pids *pids = css_pids(css);
+
+	/* PIDS_UNLIMITED is the only legal negative value. */
+	if (val < 0 && val != PIDS_UNLIMITED)
+		return -EINVAL;
+
+	/*
+	 * Limit updates don't need to be mutex'd, since they
+	 * are more of a "soft" limit in the sense that you can
+	 * set a limit which is smaller than the current count
+	 * to stop any *new* processes from spawning.
+	 */
+	pids->limit = val;
+	return 0;
+}
+
+static s64 pids_read_max(struct cgroup_subsys_state *css,
+			 struct cftype *cft)
+{
+	struct pids *pids = css_pids(css);
+
+	return pids->limit;
+}
+
+static s64 pids_read_current(struct cgroup_subsys_state *css,
+			     struct cftype *cft)
+{
+	struct pids *pids = css_pids(css);
+
+	return atomic_long_read(&pids->counter);
+}
+
+static struct cftype files[] = {
+	{
+		.name = "max",
+		.write_s64 = pids_write_max,
+		.read_s64 = pids_read_max,
+	},
+	{
+		.name = "current",
+		.read_s64 = pids_read_current,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys pids_cgrp_subsys = {
+	.css_alloc	= pids_css_alloc,
+	.css_online	= pids_css_online,
+	.css_free	= pids_css_free,
+	.can_attach	= pids_can_attach,
+	.cancel_attach	= pids_cancel_attach,
+	.can_fork	= pids_can_fork,
+	.cancel_fork	= pids_cancel_fork,
+	.exit		= pids_exit,
+	.legacy_cftypes	= files,
+	.early_init	= 0,
+};
-- 
2.3.1

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 2/2] cgroups: add an nproc subsystem
@ 2015-03-09  1:49         ` Zefan Li
  0 siblings, 0 replies; 108+ messages in thread
From: Zefan Li @ 2015-03-09  1:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Aleksa Sarai, mingo, peterz, richard, fweisbec, linux-kernel, cgroups

Hi Tejun,

On 2015/3/2 23:22, Tejun Heo wrote:
> Hello,
> 
> On Fri, Feb 27, 2015 at 03:17:19PM +1100, Aleksa Sarai wrote:
>> +config CGROUP_NPROC
>> +	bool "Process number limiting on cgroups"
>> +	depends on PAGE_COUNTER
>> +	help
>> +	  This options enables the setting of process number limits in the scope
>> +	  of a cgroup. Any attempt to fork more processes than is allowed in the
>> +	  cgroup will fail. This allows for more basic resource limitation that
>> +	  applies to a cgroup, similar to RLIMIT_NPROC (except that instead of
>> +	  applying to a process tree it applies to a cgroup).
> 
> Please reflect the rationale from this discussion thread in the commit
> message and help text.  Also, I'd much prefer to name it pids
> controller after the resource it's controlling.
> 

Seems we are going to accept this feature. Is it because kmemcg won't be able
to fulfill this requirement? And that's because kmemcg can only and will only
be able to control global kernel memory usage? 

I thought there will be some control file like kmem.pids.max, which will
translate the number of processes to the kernel memory needed, and fail memory
allocation if we reach the limit, for example make task_struct slab return
NULL.


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 2/2] cgroups: add an nproc subsystem
@ 2015-03-09  2:34           ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-09  2:34 UTC (permalink / raw)
  To: Zefan Li
  Cc: Aleksa Sarai, mingo, peterz, richard, fweisbec, linux-kernel, cgroups

Hello, Li.

On Mon, Mar 09, 2015 at 09:49:45AM +0800, Zefan Li wrote:
> Seems we are going to accept this feature. Is it because kmemcg won't be able
> to fullfill this requirement? And that's because kmemcg can only and will only

Yeah, pretty much.

> be able to control global kernel memory usage? 

I'm not sure what you mean by global kernel memory usage.

> I thought there will be some control file like kmem.pids.max, which will
> translate the number of processes to the kernel memory needed, and fail memory
> allocation if we reach the limit, for example make task_struct slab return
> NULL.

Hmm... I don't think so.  memcg will only be concerned with actual
memory in bytes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 1/2] cgroups: allow a cgroup subsystem to reject a fork
@ 2015-03-09  3:06       ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-09  3:06 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, fweisbec, linux-kernel, cgroups

On Fri, Feb 27, 2015 at 03:17:18PM +1100, Aleksa Sarai wrote:
...
> In order for can_fork to deal with a task that has an accurate css_set,
> move the css_set updating to cgroup_fork (where it belongs).

Hmmm?  So, now the task is visible on cgroup side before the point of
no return?  What happens if fork fails afterwards?  Also, why is this
non-trivial change happening in tandem in this patch?

> @@ -946,6 +950,11 @@ struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
>  static inline int cgroup_init_early(void) { return 0; }
>  static inline int cgroup_init(void) { return 0; }
>  static inline void cgroup_fork(struct task_struct *p) {}
> +static inline int cgroup_can_fork(struct task_struct *p)
> +{
> +	return 0;
> +}

Please follow the surrounding style.

> @@ -4928,7 +4928,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
>  	 * init_css_set is in the subsystem's root cgroup. */
>  	init_css_set.subsys[ss->id] = css;
>  
> -	need_forkexit_callback |= ss->fork || ss->exit;
> +	need_forkexit_callback |= ss->can_fork || ss->cancel_fork || ss->fork || ss->exit;

Your patch isn't the culprit but this is silly given that this flag is
set pretty much whenever cgroups are enabled.  Per-callback subsys
mask would make far more sense.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 0/2] cgroup: add pids subsystem
@ 2015-03-09  3:08     ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-09  3:08 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, fweisbec, linux-kernel, cgroups

On Fri, Mar 06, 2015 at 12:45:55PM +1100, Aleksa Sarai wrote:
> This is a checkpatch'd version of the pids patchset[1]. It fixes some
> style problems, as well as switch to using need_canfork_callback inside
> kernel/cgroup.c. Also remove the dependency on PAGE_COUNTER (because
> pids now uses an internal hierarchical counter) in Kconfig.

Oops... reviewed the wrong patches.  Can you please not chain
different versions of the patchset?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 2/2] cgroups: add a pids subsystem
@ 2015-03-09  3:34       ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-09  3:34 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, fweisbec, linux-kernel, cgroups

On Fri, Mar 06, 2015 at 12:45:57PM +1100, Aleksa Sarai wrote:
> +struct pids {

This name is way too generic.  Please make it clear it's part of a
cgroup controller.

> +	struct pids *parent;
> +	struct cgroup_subsys_state css;

Please make css the first element.  The above prevents css <-> pids
pointer conversions from being noop.

> +
> +	atomic_long_t counter;
> +	long limit;

Why are these long?

> +};
> +
> +static inline struct pids *css_pids(struct cgroup_subsys_state *css)

No need for explicit inlines.

> +{
> +	return css ? container_of(css, struct pids, css) : NULL;
> +}
> +
> +static inline struct pids *task_pids(struct task_struct *task)
> +{
> +	return css_pids(task_css(task, pids_cgrp_id));
> +}
> +
> +static struct pids *parent_pids(struct pids *pids)
> +{
> +	return css_pids(pids->css.parent);
> +}

For all the above functions.

> +static int pids_css_online(struct cgroup_subsys_state *css)
> +{
> +	struct pids *pids = css_pids(css);
> +	long limit = -1;
> +
> +	pids->parent = parent_pids(pids);
> +	if (pids->parent)
> +		limit = pids->parent->limit;
> +
> +	pids->limit = limit;

Why would a child inherit the setting of the parent?  It's already
hierarchically limited by the parent.  There's no point in inheriting
the setting itself.

> +	atomic_long_set(&pids->counter, 0);
> +	return 0;
> +}
> +
> +static void pids_css_free(struct cgroup_subsys_state *css)
> +{
> +	kfree(css_pids(css));
> +}
> +
> +/**
> + * pids_cancel - uncharge the local pid count
> + * @pids: the pid cgroup state
> + * @num: the number of pids to cancel
> + *
> + * This function will WARN if the pid count goes under 0,
> + * but will not prevent it.
> + */
> +static void pids_cancel(struct pids *pids, int num)
> +{
> +	long new;
> +
> +	new = atomic_long_sub_return(num, &pids->counter);
> +
> +	/*
> +	 * A negative count is invalid, but pids_cancel() can't fail.
> +	 * So just emit a WARN.
> +	 */
> +	WARN_ON(new < 0);

WARN_ON_ONCE() would be better.  Also, if you're gonna warn against
underflow, why not warn about overflow?  Just use
WARN_ON_ONCE(atomic_add_negative()).

> +}
> +
> +/**
> + * pids_charge - hierarchically uncharge the pid count
> + * @pids: the pid cgroup state
> + * @num: the number of pids to uncharge
> + *
> + * This function will not allow the pid count to go under 0,
> + * and will WARN if a caller attempts to do so.
> + */
> +static void pids_uncharge(struct pids *pids, int num)
> +{
> +	struct pids *p;
> +
> +	for (p = pids; p; p = p->parent)
> +		pids_cancel(p, num);
> +}

Does pids limit make sense in the root cgroup?

> +static int pids_try_charge(struct pids *pids, int num)
> +{
> +	struct pids *p, *fail;
> +
> +	for (p = pids; p; p = p->parent) {
> +		long new;
> +
> +		new = atomic_long_add_return(num, &p->counter);
> +
> +		if (p->limit == PIDS_UNLIMITED)
> +			continue;

Huh?  So, the counter stays out of sync if unlimited?  What happens
when it gets set to something else later?

> +
> +		if (new > p->limit) {
> +			atomic_long_sub(num, &p->counter);
> +			fail = p;
> +			goto revert;
> +		}
> +	}
> +
> +	return 0;
> +
> +revert:
> +	for (p = pids; p != fail; p = p->parent)
> +		pids_cancel(pids, num);
> +
> +	return -EAGAIN;
> +}
...
> +static void pids_exit(struct cgroup_subsys_state *css,
> +		      struct cgroup_subsys_state *old_css,
> +		      struct task_struct *task)
> +{
> +	struct pids *pids = css_pids(old_css);
> +
> +	/*
> +	 * cgroup_exit() gets called as part of the cleanup code when
> +	 * copy_process() fails. This should ignored, because the
> +	 * pids_cancel_fork callback already deals with the cgroup failed fork
> +	 * case.
> +	 */

Do we even need cancel call then?

> +	if (!(task->flags & PF_EXITING))
> +		return;
> +
> +	pids_uncharge(pids, 1);
> +}
> +
> +static int pids_write_max(struct cgroup_subsys_state *css,
> +			  struct cftype *cft, s64 val)
> +{
> +	struct pids *pids = css_pids(css);
> +
> +	/* PIDS_UNLIMITED is the only legal negative value. */
> +	if (val < 0 && val != PIDS_UNLIMITED)
> +		return -EINVAL;

Ugh... let's please not do negatives.  Please input and output "max"
for no limit conditions.

> +	/*
> +	 * Limit updates don't need to be mutex'd, since they
> +	 * are more of a "soft" limit in the sense that you can
> +	 * set a limit which is smaller than the current count
> +	 * to stop any *new* processes from spawning.
> +	 */
> +	pids->limit = val;

So, on 32bit machines, we're assigning a 64bit int to 32bit after
ensuring the 64bit is a positive number?

Overall, I'm not too confident this is going well.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 2/2] cgroups: add a pids subsystem
@ 2015-03-09  3:34       ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-09  3:34 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ, richard-/L3Ra7n9ekc,
	fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Mar 06, 2015 at 12:45:57PM +1100, Aleksa Sarai wrote:
> +struct pids {

This name is way too generic.  Please make it clear it's part of a
cgroup controller.

> +	struct pids *parent;
> +	struct cgroup_subsys_state css;

Please make css the first element.  The above prevents css <-> pids
pointer conversions from being noop.

> +
> +	atomic_long_t counter;
> +	long limit;

Why are these long?

> +};
> +
> +static inline struct pids *css_pids(struct cgroup_subsys_state *css)

No need for explicit inlines.

> +{
> +	return css ? container_of(css, struct pids, css) : NULL;
> +}
> +
> +static inline struct pids *task_pids(struct task_struct *task)
> +{
> +	return css_pids(task_css(task, pids_cgrp_id));
> +}
> +
> +static struct pids *parent_pids(struct pids *pids)
> +{
> +	return css_pids(pids->css.parent);
> +}

For all the above functions.

> +static int pids_css_online(struct cgroup_subsys_state *css)
> +{
> +	struct pids *pids = css_pids(css);
> +	long limit = -1;
> +
> +	pids->parent = parent_pids(pids);
> +	if (pids->parent)
> +		limit = pids->parent->limit;
> +
> +	pids->limit = limit;

Why would a child inherit the setting of the parent?  It's already
hierarchically limited by the parent.  There's no point in inheriting
the setting itself.

> +	atomic_long_set(&pids->counter, 0);
> +	return 0;
> +}
> +
> +static void pids_css_free(struct cgroup_subsys_state *css)
> +{
> +	kfree(css_pids(css));
> +}
> +
> +/**
> + * pids_cancel - uncharge the local pid count
> + * @pids: the pid cgroup state
> + * @num: the number of pids to cancel
> + *
> + * This function will WARN if the pid count goes under 0,
> + * but will not prevent it.
> + */
> +static void pids_cancel(struct pids *pids, int num)
> +{
> +	long new;
> +
> +	new = atomic_long_sub_return(num, &pids->counter);
> +
> +	/*
> +	 * A negative count is invalid, but pids_cancel() can't fail.
> +	 * So just emit a WARN.
> +	 */
> +	WARN_ON(new < 0);

WARN_ON_ONCE() would be better.  Also, if you're gonna warn against
underflow, why not warn about overflow?  Just use
WARN_ON_ONCE(atomic_add_negative()).

> +}
> +
> +/**
> + * pids_uncharge - hierarchically uncharge the pid count
> + * @pids: the pid cgroup state
> + * @num: the number of pids to uncharge
> + *
> + * This function will WARN if the pid count goes below 0,
> + * but will not prevent it.
> + */
> +static void pids_uncharge(struct pids *pids, int num)
> +{
> +	struct pids *p;
> +
> +	for (p = pids; p; p = p->parent)
> +		pids_cancel(p, num);
> +}

Does pids limit make sense in the root cgroup?

> +static int pids_try_charge(struct pids *pids, int num)
> +{
> +	struct pids *p, *fail;
> +
> +	for (p = pids; p; p = p->parent) {
> +		long new;
> +
> +		new = atomic_long_add_return(num, &p->counter);
> +
> +		if (p->limit == PIDS_UNLIMITED)
> +			continue;

Huh?  So, the counter stays out of sync if unlimited?  What happens
when it gets set to something else later?

> +
> +		if (new > p->limit) {
> +			atomic_long_sub(num, &p->counter);
> +			fail = p;
> +			goto revert;
> +		}
> +	}
> +
> +	return 0;
> +
> +revert:
> +	for (p = pids; p != fail; p = p->parent)
> +		pids_cancel(p, num);
> +
> +	return -EAGAIN;
> +}
...
> +static void pids_exit(struct cgroup_subsys_state *css,
> +		      struct cgroup_subsys_state *old_css,
> +		      struct task_struct *task)
> +{
> +	struct pids *pids = css_pids(old_css);
> +
> +	/*
> +	 * cgroup_exit() gets called as part of the cleanup code when
> +	 * copy_process() fails. This should be ignored, because the
> +	 * pids_cancel_fork callback already deals with the failed
> +	 * cgroup fork case.
> +	 */

Do we even need cancel call then?

> +	if (!(task->flags & PF_EXITING))
> +		return;
> +
> +	pids_uncharge(pids, 1);
> +}
> +
> +static int pids_write_max(struct cgroup_subsys_state *css,
> +			  struct cftype *cft, s64 val)
> +{
> +	struct pids *pids = css_pids(css);
> +
> +	/* PIDS_UNLIMITED is the only legal negative value. */
> +	if (val < 0 && val != PIDS_UNLIMITED)
> +		return -EINVAL;

Ugh... let's please not do negatives.  Please input and output "max"
for no limit conditions.

> +	/*
> +	 * Limit updates don't need to be mutex'd, since they
> +	 * are more of a "soft" limit in the sense that you can
> +	 * set a limit which is smaller than the current count
> +	 * to stop any *new* processes from spawning.
> +	 */
> +	pids->limit = val;

So, on 32bit machines, we're assigning a 64bit integer to a 32bit one
after only ensuring that the 64bit value is non-negative?

Overall, I'm not too confident this is going well.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v4 2/2] cgroups: add a pids subsystem
@ 2015-03-09  3:39         ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-09  3:39 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, fweisbec, linux-kernel, cgroups

On Sun, Mar 08, 2015 at 11:34:05PM -0400, Tejun Heo wrote:
> > +	for (p = pids; p; p = p->parent) {
> > +		long new;
> > +
> > +		new = atomic_long_add_return(num, &p->counter);
> > +
> > +		if (p->limit == PIDS_UNLIMITED)
> > +			continue;
> 
> Huh?  So, the counter stays out of sync if unlimited?  What happens
> when it gets set to something else later?

Oops, I misread the code, but why is PIDS_UNLIMITED a special case?
Just make it a number which always makes the condition true?

Thanks.

-- 
tejun

* Re: [PATCH v4 2/2] cgroups: add a pids subsystem
@ 2015-03-09 18:58         ` Austin S Hemmelgarn
  0 siblings, 0 replies; 108+ messages in thread
From: Austin S Hemmelgarn @ 2015-03-09 18:58 UTC (permalink / raw)
  To: Tejun Heo, Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, fweisbec, linux-kernel, cgroups

On 2015-03-08 23:34, Tejun Heo wrote:
>
> Does pids limit make sense in the root cgroup?
>
I would say it kind of does, although I would just expect it to track 
/proc/sys/kernel/pid_max (either as a read-only value, or as an 
alternative way to set it).

* Re: [PATCH v4 2/2] cgroups: add a pids subsystem
@ 2015-03-09 19:51           ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-09 19:51 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Aleksa Sarai, lizefan, mingo, peterz, richard, fweisbec,
	linux-kernel, cgroups

Hello, Austin.

On Mon, Mar 09, 2015 at 02:58:11PM -0400, Austin S Hemmelgarn wrote:
> On 2015-03-08 23:34, Tejun Heo wrote:
> >
> >Does pids limit make sense in the root cgroup?
> >
> I would say it kind of does, although I would just expect it to track
> /proc/sys/kernel/pid_max (either as a read-only value, or as an alternative
> way to set it).

I don't think that's a good idea.  It doesn't add anything while
putting the pids controller in conflict with how other controllers
handle the root cgroup.  Furthermore, I don't think it's generally a
good idea to add things just because they may be convenient in some
cases, which is exactly the case here.  Why add a non-orthogonal
component when the only reason is "yeah, it may be a bit more
convenient in some imaginary cases"?  We'd be restricting the design
space we can move in for no good reason.

Thanks.

-- 
tejun

* Re: [PATCH v4 2/2] cgroups: add a pids subsystem
@ 2015-03-10  8:10           ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-10  8:10 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Tejun Heo, lizefan, mingo, peterz, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

Hi Austin,

>> Does pids limit make sense in the root cgroup?
>>
> I would say it kind of does, although I would just expect it to track
> /proc/sys/kernel/pid_max (either as a read-only value, or as an alternative
> way to set it).

Personally, that seems unintuitive. /proc/sys/kernel/pid_max and the pids
cgroup controller are orthogonal features, why should they be able to affect
each other (or even be aware of each other)?

--
Aleksa Sarai (cyphar)
www.cyphar.com

* Re: [PATCH v2 1/2] cgroups: allow a cgroup subsystem to reject a fork
@ 2015-03-10  8:19             ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-10  8:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, mingo, peterz, richard, Frédéric Weisbecker,
	linux-kernel, cgroups, Aleksa Sarai

Hi Tejun,

>> The reason is that when cgroup_can_fork() is called, the css_set doesn't
>> contain the pids cgroup it's forking to. You can verify this by moving that
>> segment of code back to its original position and compiling/rebooting/testing
>> the pids cgroup. You will get a WARN each time you try to attach to a new pids
>> cgroup, because when your shell forks the counter for the original pids cgroup
>> hierarchy gets incremented but when the command exits the counter for the *new*
>> pids cgroup hierarchy gets decremented. Essentially it's because we need to
>> reference the css_set of the newly forked process (as it exists when
>> cgroup_fork() runs) when incrementing the hierarchy -- otherwise we will
>> increment/decrement a different hierarchy between fork() and exit().
>>
>> If there's a correct way of doing this, I'm all ears. I had a bad feeling that
>> moving that section would break the whole visibility issue in the same fashion
>> as I did in v1 of this patchset. The thing is, I'm not sure there's a way to
>> access the new css_set of the task as though it is attached without making it
>> visible prematurely (because we need the css_set to refer to the right
>> hierarchy in order to conditionally decide the fork).
>
> You can charge the parent's at can_attach(), remember which one you
> charged, and at post_fork() if the parent's has changed inbetween, fix
> it up.  Again, you may end up breaking the hard limit of the new pids
> cgroup at this point but that's fine.  The cgroup membership changing
> inbetween pretty much implies that organization operation happened
> inbetween.

I'm not sure how to check for equality between two `css_set`s (or just two
`css`s). Is there a function to do so? Also, I'm not sure if there's a nice way
of doing a charge that stops if you hit a certain `css` (unless we start
passing `css_set`s to the fork/exit callbacks -- and then we can uncharge the
old css_set and charge the new one).

--
Aleksa Sarai (cyphar)
www.cyphar.com

* Re: [PATCH v4 2/2] cgroups: add a pids subsystem
  2015-03-10  8:10           ` Aleksa Sarai
@ 2015-03-10 11:32           ` Austin S Hemmelgarn
  2015-03-10 12:31               ` Aleksa Sarai
  -1 siblings, 1 reply; 108+ messages in thread
From: Austin S Hemmelgarn @ 2015-03-10 11:32 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Tejun Heo, lizefan, mingo, peterz, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

On 2015-03-10 04:10, Aleksa Sarai wrote:
> Hi Austin,
>
>>> Does pids limit make sense in the root cgroup?
>>>
>> I would say it kind of does, although I would just expect it to track
>> /proc/sys/kernel/pid_max (either as a read-only value, or as an alternative
>> way to set it).
>
> Personally, that seems unintuitive. /proc/sys/kernel/pid_max and the pids
> cgroup controller are orthogonal features, why should they be able to affect
> each other (or even be aware of each other)?
I wouldn't consider them entirely orthogonal; the sysctl value is the
limiting factor for the maximal value that can be set in a given pids
cgroup.  Setting an unlimited value in a cgroup is functionally
identical to setting it equal to /proc/sys/kernel/pid_max, and the
root cgroup is functionally equivalent to /proc/sys/kernel/pid_max,
because all tasks that aren't in another cgroup get put in the root.

My only thought is that having the file that would set the limit there 
might make things much simpler for software that expects the entire 
cgroup structure to be hierarchical.


* Re: [PATCH v4 2/2] cgroups: add a pids subsystem
@ 2015-03-10 12:31               ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-10 12:31 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Tejun Heo, lizefan, mingo, peterz, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

Hi Austin,

>>>> Does pids limit make sense in the root cgroup?
>>>
>>> I would say it kind of does, although I would just expect it to track
>>> /proc/sys/kernel/pid_max (either as a read-only value, or as an
>>> alternative way to set it).
>>
>> Personally, that seems unintuitive. /proc/sys/kernel/pid_max and the pids
>> cgroup controller are orthogonal features, why should they be able to
>> affect each other (or even be aware of each other)?
>
> I wouldn't consider them entirely orthogonal, the sysctl value is the
> limiting factor for the maximal value that can be set in a given pids
> cgroup.  Setting an unlimited value in the cgroup is functionally identical
> to setting it to be equal to /proc/sys/kernel/pid_max, and the root cgroup
> is functionally equivalent to /proc/sys/kernel/pid_max, because all tasks
> that aren't in another cgroup get put in the root.

While it is true that /proc/sys/kernel/pid_max would be functionally equivalent
to setting pids.max to the value of /proc/sys/kernel/pid_max (and thus the pids
root cgroup is functionally equivalent to the parent), it is untrue that the
sysctl value is the limiting factor on what "max" is defined as. "max" is
defined as the maximum possible pid_t value (really the only sane maximum,
because tying it to /proc/sys/kernel/pid_max would be problematic: that
limit would keep changing, and the line between "max" and some arbitrary
value would be blurred). In addition, the sysctl value limits the number
of pids in the system in a separate part of the kernel -- it has nothing
to do with cgroups, and cgroups have nothing to do with it.

> My only thought is that having the file that would set the limit there might
> make things much simpler for software that expects the entire cgroup
> structure to be hierarchical.

The only valid value for pids.max in the root cgroup would be "max". And "max"
is defined as (PID_MAX_LIMIT + 1), not as the current setting of
/proc/sys/kernel/pid_max, because the only *real* maximum value of pid_t is
PID_MAX_LIMIT so the only reasonable way to represent "max" is a number greater
than that.

There is an issue with both of the behaviours you describe. The root-level
pids.max could either:

a) be read-only (which breaks the idea of it being "simpler" because now you
   have a special case where you can't write to the limit); or (even worse)
b) modify some other aspect of the kernel in a way that is unique compared to
   children of the root hierarchy (which IMO sounds like trouble).

In either of those two cases, the idea of it being "simpler" for software that
makes the (wrong) assumption that you can limit the global maximum number of
pids through the root cgroup is broken because it has either weird side effects
(b) or is just an odd feature (a).

--
Aleksa Sarai (cyphar)
www.cyphar.com

* Re: [PATCH v2 1/2] cgroups: allow a cgroup subsystem to reject a fork
@ 2015-03-10 12:47               ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-10 12:47 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, Frédéric Weisbecker,
	linux-kernel, cgroups

Hello, Aleksa.

On Tue, Mar 10, 2015 at 07:19:06PM +1100, Aleksa Sarai wrote:
> I'm not sure how to check for equality between two `css_set`s (or just two
> `css`s). Is there a function to do so? Also, I'm not sure if there's a nice way

You can compare the css pointers for equality.

> of doing a charge that stops if you hit a certain `css` (unless we start
> passing `css_set`s to the fork/exit callbacks -- and then we can uncharge the
> old css_set and charge the new one).

We'll have to pass the pointer from can_fork side to post_fork.  I'm
not sure how that should be done either.

Thanks.

-- 
tejun

* Re: [PATCH v2 1/2] cgroups: allow a cgroup subsystem to reject a fork
@ 2015-03-10 14:51                 ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-10 14:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, mingo, peterz, richard, Frédéric Weisbecker,
	linux-kernel, cgroups

Hello Tejun,

On Tue, Mar 10, 2015 at 11:47 PM, Tejun Heo <tj@kernel.org> wrote:
>> of doing a charge that stops if you hit a certain `css` (unless we start
>> passing `css_set`s to the fork/exit callbacks -- and then we can uncharge the
>> old css_set and charge the new one).
>
> We'll have to pass the pointer from can_fork side to post_fork.  I'm
> not sure how that should be done either.

Actually, I'm fairly sure we can do it all inside cgroup_post_fork(),
since there we have access to both the old css_set and the new one. Then
it's just a matter of reverting and re-applying the charge to the
hierarchies.

I'll send a patch in a few days, I need to make sure that this method
*actually* works. :P

--
Aleksa Sarai (cyphar)
www.cyphar.com

* Re: [PATCH v2 1/2] cgroups: allow a cgroup subsystem to reject a fork
@ 2015-03-10 15:17                   ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-10 15:17 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, Frédéric Weisbecker,
	linux-kernel, cgroups

Hello,

On Wed, Mar 11, 2015 at 01:51:06AM +1100, Aleksa Sarai wrote:
> Actually, I'm fairly sure we can do it all inside cgroup_post_fork() because
> inside cgroup_post_fork() we have access to both the old css_set and the new
> one. Then it's just a matter of reverting and re-applying the charge to the
> hierarchies.

But the problem isn't whether we know both the old and new ones.  The
problem is that we can only abort before the fork commit point and the
"old" one may change between the abort point and post-commit point so
we need to trycharge the old one at the possible abort point, remember
to which css it got charged and then check whether the association has
changed inbetween at the post commit point and readjust if so.

Thanks.

-- 
tejun

* Re: [PATCH v2 1/2] cgroups: allow a cgroup subsystem to reject a fork
  2015-03-10 15:17                   ` Tejun Heo
@ 2015-03-11  5:16                   ` Aleksa Sarai
  2015-03-11 11:46                     ` Tejun Heo
  -1 siblings, 1 reply; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-11  5:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, mingo, peterz, richard, Frédéric Weisbecker,
	linux-kernel, cgroups

Hello Tejun,

On Wed, Mar 11, 2015 at 2:17 AM, Tejun Heo <tj@kernel.org> wrote:
> On Wed, Mar 11, 2015 at 01:51:06AM +1100, Aleksa Sarai wrote:
>> Actually, I'm fairly sure we can do it all inside cgroup_post_fork() because
>> inside cgroup_post_fork() we have access to both the old css_set and the new
>> one. Then it's just a matter of reverting and re-applying the charge to the
>> hierarchies.
>
> But the problem isn't whether we know both the old and new ones.  The
> problem is that we can only abort before the fork commit point and the
> "old" one may change between the abort point and post-commit point so
> we need to trycharge the old one at the possible abort point, remember
> to which css it got charged and then check whether the association has
> changed inbetween at the post commit point and readjust if so.

Actually, it appears I was wrong. Until we hit cgroup_post_fork()'s setting up
of the task's css_set, cgroup_can_fork() ends up charging init_css_set *every
time*. Which means a check to see if it changed will always show that it had
changed. The issue is that we need to access the css_set which is going to be
saved as the task's css_set in order to decide if the task should fork.

We know that the task will have its css_set set to task_css_set(current), and
we could just use that in cgroup_can_fork(). The only question is, can
task_css_set(current) change between cgroup_can_fork() and cgroup_post_fork()?

If it can change between the two calls, then we're in trouble -- there'd be no
reliable way of checking that the future css_set allows for the fork without
going through the registration of the css_set *proper* in cgroup_post_fork()
unless we hold css_set_rwsem for the entirety of the can_fork() to post_fork()
segment (which I can't imagine is a good idea).

--
Aleksa Sarai (cyphar)
www.cyphar.com

* Re: [PATCH v2 1/2] cgroups: allow a cgroup subsystem to reject a fork
  2015-03-11  5:16                   ` Aleksa Sarai
@ 2015-03-11 11:46                     ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-11 11:46 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, Frédéric Weisbecker,
	linux-kernel, cgroups

On Wed, Mar 11, 2015 at 04:16:30PM +1100, Aleksa Sarai wrote:
> We know that the task will have its css_set set to task_css_set(current), and
> we could just use that in cgroup_can_fork(). The only question is, can
> task_css_set(current) change between cgroup_can_fork() and cgroup_post_fork()?

Yes, that's what I've been writing in the previous messages.  It can change.

> If it can change between the two calls, then we're in trouble -- there'd be no
> reliable way of checking that the future css_set allows for the fork without
> going through the registration of the css_set *proper* in cgroup_post_fork()
> unless we hold css_set_rwsem for the entirety of the can_fork() to post_fork()
> segment (which I can't imagine is a good idea).

Please re-read my previous messages.

Thanks.

-- 
tejun

* Re: [PATCH v4 2/2] cgroups: add a pids subsystem
@ 2015-03-11 15:13                 ` Austin S Hemmelgarn
  0 siblings, 0 replies; 108+ messages in thread
From: Austin S Hemmelgarn @ 2015-03-11 15:13 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Tejun Heo, lizefan, mingo, peterz, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

On 2015-03-10 08:31, Aleksa Sarai wrote:
> Hi Austin,
>
>>>>> Does pids limit make sense in the root cgroup?
>>>>
>>>> I would say it kind of does, although I would just expect it to track
>>>> /proc/sys/kernel/pid_max (either as a read-only value, or as an
>>>> alternative way to set it).
>>>
>>> Personally, that seems unintuitive. /proc/sys/kernel/pid_max and the pids
>>> cgroup controller are orthogonal features, why should they be able to
>>> affect each other (or even be aware of each other)?
>>
>> I wouldn't consider them entirely orthogonal, the sysctl value is the
>> limiting factor for the maximal value that can be set in a given pids
>> cgroup.  Setting an unlimited value in the cgroup is functionally identical
>> to setting it to be equal to /proc/sys/kernel/pid_max, and the root cgroup
>> is functionally equivalent to /proc/sys/kernel/pid_max, because all tasks
>> that aren't in another cgroup get put in the root.
>
> While it is true that setting pids.max to "max" is functionally equivalent
> to setting it to the value of /proc/sys/kernel/pid_max (and thus the pids
> root cgroup is effectively bounded by the sysctl), it is untrue that the
> sysctl value is the limiting factor on what "max" is defined as. "max" is
> defined as the maximum possible pid_t value (it's really the only sane
> maximum value: trying to track /proc/sys/kernel/pid_max would be
> problematic, because the limit would keep changing and the line between
> "max" and some arbitrary value would be blurred). In addition, the sysctl
> value limits the number of pids in the system in a separate part of the
> kernel -- it has nothing to do with cgroups, and cgroups have nothing to
> do with it.
>
I did not necessarily word this very clearly.  What I meant is that
/proc/sys/kernel/pid_max is essentially an external limiting factor that
caps the total number of pids that can be under the root cgroup and its
children, not that the cgroup in any way paid attention to it.  It
might be useful to be able to just disable the sysctl option and set the
value through the root cgroup, solely for consistency, although such
usage isn't something I would consider essential in any way.


* Re: [PATCH v2 1/2] cgroups: allow a cgroup subsystem to reject a fork
@ 2015-03-11 23:47             ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-11 23:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, mingo, peterz, richard, Frédéric Weisbecker,
	linux-kernel, cgroups, Aleksa Sarai

Hi Tejun,

> You can charge the parent's at can_attach(), remember which one you
> charged, and at post_fork() if the parent's has changed inbetween, fix
> it up. [...]

Did you mean can_fork() instead of can_attach() here?

--
Aleksa Sarai (cyphar)
www.cyphar.com

* Re: [PATCH v2 1/2] cgroups: allow a cgroup subsystem to reject a fork
@ 2015-03-12  1:25               ` Tejun Heo
  0 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-12  1:25 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, Frédéric Weisbecker,
	linux-kernel, cgroups

On Thu, Mar 12, 2015 at 10:47:11AM +1100, Aleksa Sarai wrote:
> Hi Tejun,
> 
> > You can charge the parent's at can_attach(), remember which one you
> > charged, and at post_fork() if the parent's has changed inbetween, fix
> > it up. [...]
> 
> Did you mean can_fork() instead of can_attach() here?

Ah, yeah, can_fork().  Sorry about that.

-- 
tejun

* Re: [PATCH v4 2/2] cgroups: add a pids subsystem
@ 2015-03-12  2:28                   ` Aleksa Sarai
  0 siblings, 0 replies; 108+ messages in thread
From: Aleksa Sarai @ 2015-03-12  2:28 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Tejun Heo, lizefan, mingo, peterz, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

> I did not necessarily word this very clearly.  What I meant is that
> /proc/sys/kernel/pid_max is essentially an external limiting factor that
> caps the total number of pids that can be under the root cgroup and its
> children, not that the cgroup in any way paid attention to it.  It might be
> useful to be able to just disable the sysctl option and set the value
> through the root cgroup, solely for consistency, although such usage isn't
> something I would consider essential in any way.

Maybe this is something that could be reviewed as a patchset separate from
this one? I'd much prefer that we get per-cgroup process limiting merged
first, and then deal with such features separately.

--
Aleksa Sarai (cyphar)
www.cyphar.com

* Re: [PATCH v4 2/2] cgroups: add a pids subsystem
  2015-03-11 15:13                 ` Austin S Hemmelgarn
  (?)
  (?)
@ 2015-03-12  3:47                 ` Tejun Heo
  -1 siblings, 0 replies; 108+ messages in thread
From: Tejun Heo @ 2015-03-12  3:47 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Aleksa Sarai, lizefan, mingo, peterz, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

On Wed, Mar 11, 2015 at 11:13:48AM -0400, Austin S Hemmelgarn wrote:
> I did not necessarily word this very clearly.  What I meant is that
> /proc/sys/kernel/pid_max is essentially an external limiting factor that
> caps the total number of pids that can be under the root cgroup and its
> children, not that the cgroup in any way paid attention to it.  It might be
> useful to be able to just disable the sysctl option and set the value
> through the root cgroup, solely for consistency, although such usage isn't
> something I would consider essential in any way.

Unless there's a compelling reason to implement it, I don't think it's
a good idea to add it.  The reasons against it have been mentioned a
couple of times in the thread and AFAICS none has been refuted.

Thanks.

-- 
tejun

* Re: [PATCH v4 2/2] cgroups: add a pids subsystem
  2015-03-12  2:28                   ` Aleksa Sarai
  (?)
@ 2015-03-12 15:35                   ` Austin S Hemmelgarn
  -1 siblings, 0 replies; 108+ messages in thread
From: Austin S Hemmelgarn @ 2015-03-12 15:35 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Tejun Heo, lizefan, mingo, peterz, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

On 2015-03-11 22:28, Aleksa Sarai wrote:
>> I did not necessarily word this very clearly.  What I meant is that
>> /proc/sys/kernel/pid_max is essentially an external limiting factor that
>> caps the total number of pids that can be under the root cgroup and its
>> children, not that the cgroup in any way paid attention to it.  It might be
>> useful to be able to just disable the sysctl option and set the value
>> through the root cgroup, solely for consistency, although such usage isn't
>> something I would consider essential in any way.
>
> Maybe this is something that can be reviewed as a separate patchset to this
> one? I'd much prefer that we actually get per-cgroup process limiting merged
> first, then deal with such features separately.
My thought exactly.


Thread overview: 108+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-23  3:08 [PATCH RFC 0/2] add nproc cgroup subsystem Aleksa Sarai
2015-02-23  3:08 ` Aleksa Sarai
2015-02-23  3:08 ` [PATCH RFC 1/2] cgroups: allow a cgroup subsystem to reject a fork Aleksa Sarai
2015-02-23 14:49   ` Peter Zijlstra
2015-02-23  3:08 ` [PATCH RFC 2/2] cgroups: add an nproc subsystem Aleksa Sarai
2015-02-27  4:17 ` [RFC PATCH v2 0/2] add nproc cgroup subsystem Aleksa Sarai
2015-02-27  4:17   ` Aleksa Sarai
2015-02-27  4:17   ` [PATCH v2 1/2] cgroups: allow a cgroup subsystem to reject a fork Aleksa Sarai
2015-02-27  4:17     ` Aleksa Sarai
2015-03-09  3:06     ` Tejun Heo
2015-03-09  3:06       ` Tejun Heo
     [not found]       ` <CAOviyaip7Faz98YWzGoTaXGYVb72sfD+ZL4Xa89reU9+=43jFA@mail.gmail.com>
     [not found]         ` <20150309065902.GP13283@htj.duckdns.org>
2015-03-10  8:19           ` Aleksa Sarai
2015-03-10  8:19             ` Aleksa Sarai
2015-03-10 12:47             ` Tejun Heo
2015-03-10 12:47               ` Tejun Heo
2015-03-10 14:51               ` Aleksa Sarai
2015-03-10 14:51                 ` Aleksa Sarai
2015-03-10 15:17                 ` Tejun Heo
2015-03-10 15:17                   ` Tejun Heo
2015-03-11  5:16                   ` Aleksa Sarai
2015-03-11 11:46                     ` Tejun Heo
2015-03-11 23:47           ` Aleksa Sarai
2015-03-11 23:47             ` Aleksa Sarai
2015-03-12  1:25             ` Tejun Heo
2015-03-12  1:25               ` Tejun Heo
2015-02-27  4:17   ` [PATCH v2 2/2] cgroups: add an nproc subsystem Aleksa Sarai
2015-03-02 15:22     ` Tejun Heo
2015-03-02 15:22       ` Tejun Heo
2015-03-09  1:49       ` Zefan Li
2015-03-09  1:49         ` Zefan Li
2015-03-09  2:34         ` Tejun Heo
2015-03-09  2:34           ` Tejun Heo
2015-02-27 11:49 ` [PATCH RFC 0/2] add nproc cgroup subsystem Tejun Heo
2015-02-27 13:46   ` Richard Weinberger
2015-02-27 13:46     ` Richard Weinberger
2015-02-27 13:52     ` Tejun Heo
2015-02-27 13:52       ` Tejun Heo
2015-02-27 16:42   ` Austin S Hemmelgarn
2015-02-27 16:42     ` Austin S Hemmelgarn
2015-02-27 17:06     ` Tejun Heo
2015-02-27 17:06       ` Tejun Heo
2015-02-27 17:25       ` Tim Hockin
2015-02-27 17:25         ` Tim Hockin
2015-02-27 17:45         ` Tejun Heo
2015-02-27 17:56           ` Tejun Heo
2015-02-27 17:56             ` Tejun Heo
2015-02-27 21:45           ` Tim Hockin
2015-02-27 21:45             ` Tim Hockin
2015-02-27 21:49             ` Tejun Heo
     [not found]               ` <CAAAKZwsCc8BtFx58KMFpRTohU81oCBeGVOPGMJrjJt9q5upKfQ@mail.gmail.com>
2015-02-28 16:57                 ` Tejun Heo
2015-02-28 22:26                   ` Tim Hockin
2015-02-28 22:26                     ` Tim Hockin
2015-02-28 22:50                     ` Tejun Heo
2015-02-28 22:50                       ` Tejun Heo
2015-03-01  4:46                       ` Tim Hockin
2015-03-01  4:46                         ` Tim Hockin
2015-02-28 23:11                     ` Johannes Weiner
2015-02-28 23:11                       ` Johannes Weiner
2015-02-27 18:49       ` Austin S Hemmelgarn
2015-02-27 18:49         ` Austin S Hemmelgarn
2015-02-27 19:35         ` Tejun Heo
2015-02-27 19:35           ` Tejun Heo
2015-02-28  9:26         ` Aleksa Sarai
2015-02-28  9:26           ` Aleksa Sarai
2015-02-28 11:59           ` Tejun Heo
2015-02-28 11:59             ` Tejun Heo
     [not found]             ` <CAAAKZws45c3PhFQMGrm_K+OZV+KOyGV9sXTakHcTfNP1kHxzOQ@mail.gmail.com>
2015-02-28 16:43               ` Tejun Heo
2015-02-28 16:43                 ` Tejun Heo
2015-03-02 13:13                 ` Austin S Hemmelgarn
2015-03-02 13:31                   ` Aleksa Sarai
2015-03-02 13:31                     ` Aleksa Sarai
2015-03-02 13:54                     ` Tejun Heo
2015-03-02 13:54                       ` Tejun Heo
2015-03-02 13:49                   ` Tejun Heo
2015-02-27 17:12     ` Tim Hockin
2015-02-27 17:15       ` Tejun Heo
2015-02-27 17:15         ` Tejun Heo
2015-03-04 20:23 ` [PATCH v3 0/2] cgroup: add pids subsystem Aleksa Sarai
2015-03-04 20:23   ` [PATCH v3 1/2] cgroups: allow a cgroup subsystem to reject a fork Aleksa Sarai
2015-03-04 20:23   ` [PATCH v3 2/2] cgroups: add a pids subsystem Aleksa Sarai
2015-03-05  8:39     ` Aleksa Sarai
2015-03-05 14:37     ` Marian Marinov
2015-03-06  1:45 ` [PATCH v4 0/2] cgroup: add " Aleksa Sarai
2015-03-06  1:45   ` Aleksa Sarai
2015-03-06  1:45   ` [PATCH v4 1/2] cgroups: allow a cgroup subsystem to reject a fork Aleksa Sarai
2015-03-06  1:45   ` [PATCH v4 2/2] cgroups: add a pids subsystem Aleksa Sarai
2015-03-06  1:45     ` Aleksa Sarai
2015-03-09  3:34     ` Tejun Heo
2015-03-09  3:34       ` Tejun Heo
2015-03-09  3:39       ` Tejun Heo
2015-03-09  3:39         ` Tejun Heo
2015-03-09 18:58       ` Austin S Hemmelgarn
2015-03-09 18:58         ` Austin S Hemmelgarn
2015-03-09 19:51         ` Tejun Heo
2015-03-09 19:51           ` Tejun Heo
2015-03-10  8:10         ` Aleksa Sarai
2015-03-10  8:10           ` Aleksa Sarai
2015-03-10 11:32           ` Austin S Hemmelgarn
2015-03-10 12:31             ` Aleksa Sarai
2015-03-10 12:31               ` Aleksa Sarai
2015-03-11 15:13               ` Austin S Hemmelgarn
2015-03-11 15:13                 ` Austin S Hemmelgarn
2015-03-12  2:28                 ` Aleksa Sarai
2015-03-12  2:28                   ` Aleksa Sarai
2015-03-12 15:35                   ` Austin S Hemmelgarn
2015-03-12  3:47                 ` Tejun Heo
2015-03-09  3:08   ` [PATCH v4 0/2] cgroup: add " Tejun Heo
2015-03-09  3:08     ` Tejun Heo
