* [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem)
@ 2011-07-29 16:13 Frederic Weisbecker
  2011-07-29 16:13 ` [PATCH 1/8] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
                   ` (8 more replies)
  0 siblings, 9 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-07-29 16:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton

Reminder:

This patchset is aimed at confining the impact of a forkbomb to the
boundaries of a cgroup, thus minimizing the consequences of such an
attack against the rest of the system.

This can be useful when cgroups are used to stage processes or to run
untrusted code.


Changes in v3:

- acks added
- simplify resource counter inheritance (patch 2/8)
- rename the "max_tasks" subsystem to "tasks" (patch 7/8)
- general renaming of "max tasks" to "task counter" in filenames and config (patch 7/8)
- simplifications in accessors (patch 7/8)
- drop the root task counter object, keep only a root task counter css (patch 7/8)
- rename the cgroup files to "usage" and "limit" (patch 7/8)
- drop the need for the limit_fail_at pointer on charge (patch 8/8)

This can be pulled from:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
	cgroup/nr_proc-v2

Frederic Weisbecker (8):
  cgroups: Add res_counter_write_u64() API
  cgroups: New resource counter inheritance API
  cgroups: Add previous cgroup in can_attach_task/attach_task callbacks
  cgroups: New cancel_attach_task subsystem callback
  cgroups: Ability to stop res charge propagation on bounded ancestor
  cgroups: Add res counter common ancestor searching
  cgroups: Add a task counter subsystem
  res_counter: Allow charge failure pointer to be null

 block/blk-cgroup.c            |   10 ++-
 include/linux/cgroup.h        |   15 +++-
 include/linux/cgroup_subsys.h |    8 ++
 include/linux/res_counter.h   |   12 +++
 init/Kconfig                  |    7 ++
 kernel/Makefile               |    1 +
 kernel/cgroup.c               |   25 ++++--
 kernel/cgroup_freezer.c       |    3 +-
 kernel/cgroup_task_counter.c  |  176 +++++++++++++++++++++++++++++++++++++++++
 kernel/cpuset.c               |    6 +-
 kernel/events/core.c          |    5 +-
 kernel/fork.c                 |    4 +
 kernel/res_counter.c          |   81 ++++++++++++++++---
 kernel/sched.c                |    6 +-
 14 files changed, 326 insertions(+), 33 deletions(-)
 create mode 100644 kernel/cgroup_task_counter.c

-- 
1.7.5.4



* [PATCH 1/8] cgroups: Add res_counter_write_u64() API
  2011-07-29 16:13 [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Frederic Weisbecker
@ 2011-07-29 16:13 ` Frederic Weisbecker
  2011-08-09 15:17   ` Oleg Nesterov
  2011-07-29 16:13 ` [PATCH 2/8] cgroups: New resource counter inheritance API Frederic Weisbecker
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2011-07-29 16:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov

Extend the resource counter API with a write-side mirror of
res_counter_read_u64(), making it easy to update a resource
counter value from a cgroup subsystem's u64 value file.
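
For illustration, a cgroup subsystem could back a writable u64 control
file with this helper along the following lines (a minimal sketch, not
part of this patch; foo_counter_res() is a hypothetical accessor
returning the cgroup's res_counter):

/* Sketch only: cft->private holds a res_counter member id such as
 * RES_LIMIT; foo_counter_res() is a hypothetical accessor. */
static int foo_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
{
	res_counter_write_u64(foo_counter_res(cgrp), cft->private, val);
	return 0;
}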

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   21 +++++++++++++++------
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..1b3fe05 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -82,6 +82,8 @@ int res_counter_memparse_write_strategy(const char *buf,
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buffer, write_strategy_fn write_strategy);
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 34683ef..806d041 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -168,12 +168,22 @@ int res_counter_memparse_write_strategy(const char *buf,
 	return 0;
 }
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	unsigned long long *target;
+	unsigned long flags;
+
+	spin_lock_irqsave(&counter->lock, flags);
+	target = res_counter_member(counter, member);
+	*target = val;
+	spin_unlock_irqrestore(&counter->lock, flags);
+}
+
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buf, write_strategy_fn write_strategy)
 {
 	char *end;
-	unsigned long flags;
-	unsigned long long tmp, *val;
+	unsigned long long tmp;
 
 	if (write_strategy) {
 		if (write_strategy(buf, &tmp))
@@ -183,9 +193,8 @@ int res_counter_write(struct res_counter *counter, int member,
 		if (*end != '\0')
 			return -EINVAL;
 	}
-	spin_lock_irqsave(&counter->lock, flags);
-	val = res_counter_member(counter, member);
-	*val = tmp;
-	spin_unlock_irqrestore(&counter->lock, flags);
+
+	res_counter_write_u64(counter, member, tmp);
+
 	return 0;
 }
-- 
1.7.5.4



* [PATCH 2/8] cgroups: New resource counter inheritance API
  2011-07-29 16:13 [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Frederic Weisbecker
  2011-07-29 16:13 ` [PATCH 1/8] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
@ 2011-07-29 16:13 ` Frederic Weisbecker
  2011-07-29 16:13 ` [PATCH 3/8] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks Frederic Weisbecker
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-07-29 16:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov

Provide an API to inherit a counter value from a parent.
This can be useful to implement cgroup.clone_children on
a resource counter.

The children's resources are still limited by those of the
parent, so this only provides a default setting behaviour
when clone_children is set.
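
For example, a subsystem could propagate the parent's limit to a newly
created child from its post_clone() callback, roughly as below (a
sketch, not part of this patch; foo_counter_res() is a hypothetical
accessor):

/* Sketch only: copy the parent's RES_LIMIT into the fresh child
 * counter when cgroup.clone_children is set. */
static void foo_post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
{
	res_counter_inherit(foo_counter_res(cgrp), RES_LIMIT);
}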

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   12 ++++++++++++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 1b3fe05..109d118 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -84,6 +84,8 @@ int res_counter_write(struct res_counter *counter, int member,
 
 void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
 
+void res_counter_inherit(struct res_counter *counter, int member);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 806d041..10a901b 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -198,3 +198,15 @@ int res_counter_write(struct res_counter *counter, int member,
 
 	return 0;
 }
+
+void res_counter_inherit(struct res_counter *counter, int member)
+{
+	struct res_counter *parent;
+	unsigned long long val;
+
+	parent = counter->parent;
+	if (parent) {
+		val = res_counter_read_u64(parent, member);
+		res_counter_write_u64(counter, member, val);
+	}
+}
-- 
1.7.5.4



* [PATCH 3/8] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks
  2011-07-29 16:13 [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Frederic Weisbecker
  2011-07-29 16:13 ` [PATCH 1/8] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
  2011-07-29 16:13 ` [PATCH 2/8] cgroups: New resource counter inheritance API Frederic Weisbecker
@ 2011-07-29 16:13 ` Frederic Weisbecker
  2011-08-17  2:40   ` Li Zefan
  2011-07-29 16:13 ` [PATCH 4/8] cgroups: New cancel_attach_task subsystem callback Frederic Weisbecker
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2011-07-29 16:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov

This prepares for the integration of the new task counter cgroup
subsystem: we'll need to release some resources charged to the
previous cgroup.
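
A subsystem can then use the previous cgroup in these callbacks, for
example to release a charge held on the old cgroup once the attachment
is committed (a sketch, not part of this patch; foo_counter_res() is a
hypothetical accessor):

/* Sketch only: uncharge the cgroup the task is leaving. */
static void foo_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
			    struct task_struct *tsk)
{
	res_counter_uncharge(foo_counter_res(old_cgrp), 1);
}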

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 block/blk-cgroup.c      |   10 ++++++----
 include/linux/cgroup.h  |    5 +++--
 kernel/cgroup.c         |   10 ++++++----
 kernel/cgroup_freezer.c |    3 ++-
 kernel/cpuset.c         |    6 ++++--
 kernel/events/core.c    |    5 +++--
 kernel/sched.c          |    6 ++++--
 7 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bcaf16e..d1bfe88 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,8 +30,8 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
 						  struct cgroup *);
-static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
-static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
+static int blkiocg_can_attach_task(struct cgroup *, struct cgroup *, struct task_struct *);
+static void blkiocg_attach_task(struct cgroup *, struct cgroup *, struct task_struct *);
 static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
 static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 
@@ -1614,7 +1614,8 @@ done:
  * of the main cic data structures.  For now we allow a task to change
  * its cgroup only if it's the only owner of its ioc.
  */
-static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static int blkiocg_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				   struct task_struct *tsk)
 {
 	struct io_context *ioc;
 	int ret = 0;
@@ -1629,7 +1630,8 @@ static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	return ret;
 }
 
-static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static void blkiocg_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				struct task_struct *tsk)
 {
 	struct io_context *ioc;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ab4ac0c..e8288a0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -468,11 +468,12 @@ struct cgroup_subsys {
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			  struct task_struct *tsk);
-	int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
+	int (*can_attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			       struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			      struct task_struct *tsk);
 	void (*pre_attach)(struct cgroup *cgrp);
-	void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
+	void (*attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		       struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2731d11..c3ee4cf 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1841,7 +1841,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 			}
 		}
 		if (ss->can_attach_task) {
-			retval = ss->can_attach_task(cgrp, tsk);
+			retval = ss->can_attach_task(cgrp, oldcgrp, tsk);
 			if (retval) {
 				failed_ss = ss;
 				goto out;
@@ -1857,7 +1857,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		if (ss->pre_attach)
 			ss->pre_attach(cgrp);
 		if (ss->attach_task)
-			ss->attach_task(cgrp, tsk);
+			ss->attach_task(cgrp, oldcgrp, tsk);
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
@@ -2072,7 +2072,9 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			/* run on each task in the threadgroup. */
 			for (i = 0; i < group_size; i++) {
 				tsk = flex_array_get_ptr(group, i);
-				retval = ss->can_attach_task(cgrp, tsk);
+				oldcgrp = task_cgroup_from_root(tsk, root);
+
+				retval = ss->can_attach_task(cgrp, oldcgrp, tsk);
 				if (retval) {
 					failed_ss = ss;
 					cancel_failed_ss = true;
@@ -2135,7 +2137,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		/* attach each task to each subsystem */
 		for_each_subsys(root, ss) {
 			if (ss->attach_task)
-				ss->attach_task(cgrp, tsk);
+				ss->attach_task(cgrp, oldcgrp, tsk);
 		}
 		/* if the thread is PF_EXITING, it can just get skipped. */
 		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e691818..c1421a1 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -175,7 +175,8 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	return 0;
 }
 
-static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static int freezer_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				   struct task_struct *tsk)
 {
 	rcu_read_lock();
 	if (__cgroup_freezing_or_frozen(tsk)) {
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9c9b754..f66c9b4 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1390,7 +1390,8 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	return 0;
 }
 
-static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
+static int cpuset_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				  struct task_struct *task)
 {
 	return security_task_setscheduler(task);
 }
@@ -1418,7 +1419,8 @@ static void cpuset_pre_attach(struct cgroup *cont)
 }
 
 /* Per-thread attachment work. */
-static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
+static void cpuset_attach_task(struct cgroup *cont, struct cgroup *old,
+			       struct task_struct *tsk)
 {
 	int err;
 	struct cpuset *cs = cgroup_cs(cont);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9efe710..3daf0eb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7403,7 +7403,8 @@ static int __perf_cgroup_move(void *info)
 }
 
 static void
-perf_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *task)
+perf_cgroup_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			struct task_struct *task)
 {
 	task_function_call(task, __perf_cgroup_move, task);
 }
@@ -7419,7 +7420,7 @@ static void perf_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	if (!(task->flags & PF_EXITING))
 		return;
 
-	perf_cgroup_attach_task(cgrp, task);
+	perf_cgroup_attach_task(cgrp, old_cgrp, task);
 }
 
 struct cgroup_subsys perf_subsys = {
diff --git a/kernel/sched.c b/kernel/sched.c
index fde6ff9..cbe0556 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8943,7 +8943,8 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 }
 
 static int
-cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			   struct task_struct *tsk)
 {
 #ifdef CONFIG_RT_GROUP_SCHED
 	if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
@@ -8957,7 +8958,8 @@ cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 }
 
 static void
-cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+cpu_cgroup_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+		       struct task_struct *tsk)
 {
 	sched_move_task(tsk);
 }
-- 
1.7.5.4



* [PATCH 4/8] cgroups: New cancel_attach_task subsystem callback
  2011-07-29 16:13 [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2011-07-29 16:13 ` [PATCH 3/8] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks Frederic Weisbecker
@ 2011-07-29 16:13 ` Frederic Weisbecker
  2011-08-17  2:40   ` Li Zefan
  2011-07-29 16:13 ` [PATCH 5/8] cgroups: Ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2011-07-29 16:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov

To cancel a process attachment on a subsystem, we currently only
call the cancel_attach() callback once on the leader; we have no
way to cancel the attachment individually for each member of
the process group.

This is going to be needed for the upcoming task counter subsystem.

To prepare for this integration, call a new cancel_attach_task()
callback on each task of the group until we reach the member that
failed to attach.
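
The intended shape of such a callback is a per-task rollback of
whatever can_attach_task() charged, for example (a sketch, not part of
this patch; foo_counter_res() is a hypothetical accessor):

/* Sketch only: undo the charge taken against the destination cgroup
 * by ->can_attach_task() for this particular task. */
static void foo_cancel_attach_task(struct cgroup *cgrp,
				   struct task_struct *tsk)
{
	res_counter_uncharge(foo_counter_res(cgrp), 1);
}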

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/cgroup.h |    1 +
 kernel/cgroup.c        |   15 ++++++++++++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index e8288a0..94454143 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -472,6 +472,7 @@ struct cgroup_subsys {
 			       struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			      struct task_struct *tsk);
+	void (*cancel_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*pre_attach)(struct cgroup *cgrp);
 	void (*attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c3ee4cf..210dc05 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1989,7 +1989,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 {
 	int retval, i, group_size;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
-	bool cancel_failed_ss = false;
+	struct task_struct *failed_task = NULL;
 	/* guaranteed to be initialized later, but the compiler needs this */
 	struct cgroup *oldcgrp = NULL;
 	struct css_set *oldcg;
@@ -2077,7 +2077,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 				retval = ss->can_attach_task(cgrp, oldcgrp, tsk);
 				if (retval) {
 					failed_ss = ss;
-					cancel_failed_ss = true;
+					failed_task = tsk;
 					goto out_cancel_attach;
 				}
 			}
@@ -2172,8 +2172,17 @@ out_cancel_attach:
 	/* same deal as in cgroup_attach_task */
 	if (retval) {
 		for_each_subsys(root, ss) {
+			if (ss->cancel_attach_task && (ss != failed_ss || failed_task)) {
+				for (i = 0; i < group_size; i++) {
+					tsk = flex_array_get_ptr(group, i);
+					if (tsk == failed_task)
+						break;
+					ss->cancel_attach_task(cgrp, tsk);
+				}
+			}
+
 			if (ss == failed_ss) {
-				if (cancel_failed_ss && ss->cancel_attach)
+				if (failed_task && ss->cancel_attach)
 					ss->cancel_attach(ss, cgrp, leader);
 				break;
 			}
-- 
1.7.5.4



* [PATCH 5/8] cgroups: Ability to stop res charge propagation on bounded ancestor
  2011-07-29 16:13 [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2011-07-29 16:13 ` [PATCH 4/8] cgroups: New cancel_attach_task subsystem callback Frederic Weisbecker
@ 2011-07-29 16:13 ` Frederic Weisbecker
  2011-08-17  2:41   ` Li Zefan
  2011-07-29 16:13 ` [PATCH 6/8] cgroups: Add res counter common ancestor searching Frederic Weisbecker
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2011-07-29 16:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov

Moving a task from one cgroup to another may require subtracting
its resource charge from the old cgroup and adding it to the new one.

For this to happen, the uncharge/charge propagation can simply stop
when we reach the common ancestor of the two cgroups. Beyond the
performance benefit, we also want to avoid temporarily overloading
the common ancestors with an inaccurate resource counter usage,
which would happen if we charged the new cgroup first and uncharged
the old one thereafter. This is going to be a requirement for the
upcoming task counter subsystem.

To solve this, provide a pair of new APIs that can charge/uncharge
a resource counter until we reach a given ancestor.
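
The expected migration pattern then looks like this (a sketch assuming
the common ancestor of the two counters is already known, e.g. through
the helper added in the next patch):

/* Sketch only: move one unit of charge from old_res to res without
 * ever touching the counters above their common ancestor. */
static int foo_move_charge(struct res_counter *res,
			   struct res_counter *old_res,
			   struct res_counter *ancestor)
{
	struct res_counter *fail_at;
	int err;

	err = res_counter_charge_until(res, ancestor, 1, &fail_at);
	if (err)
		return err;
	res_counter_uncharge_until(old_res, ancestor, 1);
	return 0;
}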

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/res_counter.h |    6 ++++++
 kernel/res_counter.c        |   23 ++++++++++++++++++-----
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 109d118..8c421ac 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -117,6 +117,10 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
 
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
+int __must_check res_counter_charge_until(struct res_counter *counter,
+					  struct res_counter *limit,
+					  unsigned long val,
+					  struct res_counter **limit_fail_at);
 int __must_check res_counter_charge(struct res_counter *counter,
 		unsigned long val, struct res_counter **limit_fail_at);
 
@@ -131,6 +135,8 @@ int __must_check res_counter_charge(struct res_counter *counter,
  */
 
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
+void res_counter_uncharge_until(struct res_counter *counter, struct res_counter *limit,
+				unsigned long val);
 void res_counter_uncharge(struct res_counter *counter, unsigned long val);
 
 /**
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 10a901b..3b48e64 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -35,8 +35,9 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 	return 0;
 }
 
-int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
+int res_counter_charge_until(struct res_counter *counter,
+			     struct res_counter *limit, unsigned long val,
+			     struct res_counter **limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
@@ -44,7 +45,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
 
 	*limit_fail_at = NULL;
 	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
+	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
 		spin_unlock(&c->lock);
@@ -66,6 +67,12 @@ done:
 	return ret;
 }
 
+int res_counter_charge(struct res_counter *counter, unsigned long val,
+			struct res_counter **limit_fail_at)
+{
+	return res_counter_charge_until(counter, NULL, val, limit_fail_at);
+}
+
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
 {
 	if (WARN_ON(counter->usage < val))
@@ -74,13 +81,15 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
 	counter->usage -= val;
 }
 
-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *limit,
+				unsigned long val)
 {
 	unsigned long flags;
 	struct res_counter *c;
 
 	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
+	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		res_counter_uncharge_locked(c, val);
 		spin_unlock(&c->lock);
@@ -88,6 +97,10 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
 	local_irq_restore(flags);
 }
 
+void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+{
+	res_counter_uncharge_until(counter, NULL, val);
+}
 
 static inline unsigned long long *
 res_counter_member(struct res_counter *counter, int member)
-- 
1.7.5.4



* [PATCH 6/8] cgroups: Add res counter common ancestor searching
  2011-07-29 16:13 [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2011-07-29 16:13 ` [PATCH 5/8] cgroups: Ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
@ 2011-07-29 16:13 ` Frederic Weisbecker
  2011-07-29 16:13 ` [PATCH 7/8] cgroups: Add a task counter subsystem Frederic Weisbecker
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-07-29 16:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov

Add a new API to find the common ancestor of two resource
counters. The search includes the passed resource counters themselves.
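
Combined with the _until() helpers from the previous patch, this makes
it possible to charge only the part of a hierarchy that two counters
do not share (a sketch; note that when the counters have no common
ancestor the helper returns NULL, which makes
res_counter_charge_until() walk the whole chain, like a plain
res_counter_charge()):

/* Sketch only: charge "res", stopping at whatever it shares
 * with "old_res". */
static int foo_charge_disjoint(struct res_counter *res,
			       struct res_counter *old_res)
{
	struct res_counter *ancestor, *fail_at;

	ancestor = res_counter_common_ancestor(res, old_res);
	return res_counter_charge_until(res, ancestor, 1, &fail_at);
}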

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   19 +++++++++++++++++++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 8c421ac..354ed30 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -139,6 +139,8 @@ void res_counter_uncharge_until(struct res_counter *counter, struct res_counter
 				unsigned long val);
 void res_counter_uncharge(struct res_counter *counter, unsigned long val);
 
+struct res_counter *res_counter_common_ancestor(struct res_counter *l, struct res_counter *r);
+
 /**
  * res_counter_margin - calculate chargeable space of a counter
  * @cnt: the counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 3b48e64..725dfa6 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -102,6 +102,25 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
 	res_counter_uncharge_until(counter, NULL, val);
 }
 
+struct res_counter *
+res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
+{
+	struct res_counter *iter;
+
+	while (r1) {
+		iter = r2;
+		while (iter) {
+			if (iter == r1)
+				return iter;
+			iter = iter->parent;
+		}
+
+		r1 = r1->parent;
+	}
+
+	return NULL;
+}
+
 static inline unsigned long long *
 res_counter_member(struct res_counter *counter, int member)
 {
-- 
1.7.5.4



* [PATCH 7/8] cgroups: Add a task counter subsystem
  2011-07-29 16:13 [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2011-07-29 16:13 ` [PATCH 6/8] cgroups: Add res counter common ancestor searching Frederic Weisbecker
@ 2011-07-29 16:13 ` Frederic Weisbecker
  2011-08-01 23:13   ` Andrew Morton
                     ` (2 more replies)
  2011-07-29 16:13 ` [PATCH 8/8] res_counter: Allow charge failure pointer to be null Frederic Weisbecker
  2011-08-01 23:19 ` [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Andrew Morton
  8 siblings, 3 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-07-29 16:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov

Add a new subsystem to limit the number of running tasks,
similar to RLIMIT_NPROC but in the scope of a cgroup.

This is a step toward isolating a cgroup a bit more from the
rest of the system and limiting the global impact of a fork bomb
inside a given cgroup.
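
From userspace, the resulting interface is expected to look roughly
like this (an illustrative sketch based on the subsystem and file
names defined below; the mount point is arbitrary):

# Sketch only: mount the subsystem, cap a cgroup at 100 tasks,
# then read back the current usage.
mount -t cgroup -o tasks none /sys/fs/cgroup/tasks
mkdir /sys/fs/cgroup/tasks/sandbox
echo 100 > /sys/fs/cgroup/tasks/sandbox/tasks.limit
cat /sys/fs/cgroup/tasks/sandbox/tasks.usage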

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/cgroup.h        |    9 ++
 include/linux/cgroup_subsys.h |    8 ++
 init/Kconfig                  |    7 ++
 kernel/Makefile               |    1 +
 kernel/cgroup_task_counter.c  |  178 +++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                 |    4 +
 6 files changed, 207 insertions(+), 0 deletions(-)
 create mode 100644 kernel/cgroup_task_counter.c

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 94454143..1012169 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -660,4 +660,13 @@ static inline int cgroup_attach_task_current_cg(struct task_struct *t)
 
 #endif /* !CONFIG_CGROUPS */
 
+#ifdef CONFIG_CGROUP_TASK_COUNTER
+int cgroup_task_counter_fork(struct task_struct *child);
+#else
+static inline int cgroup_task_counter_fork(struct task_struct *child)
+{
+	return 0;
+}
+#endif /* CONFIG_CGROUP_TASK_COUNTER */
+
 #endif /* _LINUX_CGROUP_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ac663c1..5425822 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -59,8 +59,16 @@ SUBSYS(net_cls)
 SUBSYS(blkio)
 #endif
 
+/* */
+
 #ifdef CONFIG_CGROUP_PERF
 SUBSYS(perf)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_TASK_COUNTER
+SUBSYS(tasks)
+#endif
+
+/* */
diff --git a/init/Kconfig b/init/Kconfig
index 412c21b..bea98d2d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -690,6 +690,13 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
 	  select this option (if, for some reason, they need to disable it
 	  then noswapaccount does the trick).
 
+config CGROUP_TASK_COUNTER
+        bool "Control number of tasks in a cgroup"
+	depends on RESOURCE_COUNTERS
+	help
+	  This option let the user to set up an upper bound allowed number
+	  of tasks inside a cgroup.
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/kernel/Makefile b/kernel/Makefile
index 2d64cfc..983440ae 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_TASK_COUNTER) += cgroup_task_counter.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_task_counter.c b/kernel/cgroup_task_counter.c
new file mode 100644
index 0000000..7f12ac9
--- /dev/null
+++ b/kernel/cgroup_task_counter.c
@@ -0,0 +1,178 @@
+/*
+ * Limits on number of tasks subsystem for cgroups
+ *
+ * Copyright (C) 2011 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
+ *
+ * Thanks to Johannes Weiner, Li Zefan, Paul Menage for their suggestions.
+ *
+ */
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/res_counter.h>
+
+
+struct task_counter {
+	struct res_counter		res;
+	struct cgroup_subsys_state	css;
+};
+
+/*
+ * The root task counter doesn't exist: the root cgroup is excluded
+ * from task counting in order to optimize the trivial case where
+ * only the root cgroup exists.
+ */
+static struct cgroup_subsys_state root_css;
+
+
+static inline struct task_counter *cgroup_task_counter(struct cgroup *cgrp)
+{
+	if (!cgrp->parent)
+		return NULL;
+
+	return container_of(cgroup_subsys_state(cgrp, tasks_subsys_id),
+			    struct task_counter, css);
+}
+
+static inline struct res_counter *cgroup_task_counter_res(struct cgroup *cgrp)
+{
+	struct task_counter *cnt;
+
+	cnt = cgroup_task_counter(cgrp);
+	if (!cnt)
+		return NULL;
+
+	return &cnt->res;
+}
+
+static struct cgroup_subsys_state *
+task_counter_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct task_counter *cnt;
+	struct res_counter *parent_res;
+
+	if (!cgrp->parent)
+		return &root_css;
+
+	cnt = kzalloc(sizeof(*cnt), GFP_KERNEL);
+	if (!cnt)
+		return ERR_PTR(-ENOMEM);
+
+	parent_res = cgroup_task_counter_res(cgrp->parent);
+
+	res_counter_init(&cnt->res, parent_res);
+
+	return &cnt->css;
+}
+
+static void task_counter_post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	res_counter_inherit(cgroup_task_counter_res(cgrp), RES_LIMIT);
+}
+
+static void task_counter_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct task_counter *cnt = cgroup_task_counter(cgrp);
+
+	kfree(cnt);
+}
+
+static void task_counter_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			      struct cgroup *old_cgrp, struct task_struct *task)
+{
+	/* Optimize for the root cgroup case */
+	if (old_cgrp->parent)
+		res_counter_uncharge(cgroup_task_counter_res(old_cgrp), 1);
+}
+
+/* Protected amongst can_attach_task/attach_task/cancel_attach_task by cgroup mutex */
+static struct res_counter *common_ancestor;
+
+static int task_counter_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+					struct task_struct *tsk)
+{
+	struct res_counter *res = cgroup_task_counter_res(cgrp);
+	struct res_counter *old_res = cgroup_task_counter_res(old_cgrp);
+	struct res_counter *limit_fail_at;
+
+	common_ancestor = res_counter_common_ancestor(res, old_res);
+
+	return res_counter_charge_until(res, common_ancestor, 1, &limit_fail_at);
+}
+
+static void task_counter_cancel_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+{
+	res_counter_uncharge_until(cgroup_task_counter_res(cgrp), common_ancestor, 1);
+}
+
+static void task_counter_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				     struct task_struct *tsk)
+{
+	res_counter_uncharge_until(cgroup_task_counter_res(old_cgrp), common_ancestor, 1);
+}
+
+static u64 task_counter_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	int type = cft->private;
+
+	return res_counter_read_u64(cgroup_task_counter_res(cgrp), type);
+}
+
+static int task_counter_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	int type = cft->private;
+
+	res_counter_write_u64(cgroup_task_counter_res(cgrp), type, val);
+
+	return 0;
+}
+
+static struct cftype files[] = {
+	{
+		.name		= "limit",
+		.read_u64	= task_counter_read_u64,
+		.write_u64	= task_counter_write_u64,
+		.private	= RES_LIMIT,
+	},
+
+	{
+		.name		= "usage",
+		.read_u64	= task_counter_read_u64,
+		.private	= RES_USAGE,
+	},
+};
+
+static int task_counter_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	if (!cgrp->parent)
+		return 0;
+
+	return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
+}
+
+int cgroup_task_counter_fork(struct task_struct *child)
+{
+	struct cgroup_subsys_state *css = child->cgroups->subsys[tasks_subsys_id];
+	struct cgroup *cgrp = css->cgroup;
+	struct res_counter *limit_fail_at;
+
+	/* Optimize for the root cgroup case, which doesn't have a limit */
+	if (!cgrp->parent)
+		return 0;
+
+	return res_counter_charge(cgroup_task_counter_res(cgrp), 1, &limit_fail_at);
+}
+
+struct cgroup_subsys tasks_subsys = {
+	.name			= "tasks",
+	.subsys_id		= tasks_subsys_id,
+	.create			= task_counter_create,
+	.post_clone		= task_counter_post_clone,
+	.destroy		= task_counter_destroy,
+	.exit			= task_counter_exit,
+	.can_attach_task	= task_counter_can_attach_task,
+	.cancel_attach_task	= task_counter_cancel_attach_task,
+	.attach_task		= task_counter_attach_task,
+	.populate		= task_counter_populate,
+	.early_init		= 1,
+};
diff --git a/kernel/fork.c b/kernel/fork.c
index 0276c30..293eb42 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1295,6 +1295,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->group_leader = p;
 	INIT_LIST_HEAD(&p->thread_group);
 
+	retval = cgroup_task_counter_fork(p);
+	if (retval)
+		goto bad_fork_free_pid;
+
 	/* Now that the task is set up, run cgroup callbacks if
 	 * necessary. We need to run them before the task is visible
 	 * on the tasklist. */
-- 
1.7.5.4



* [PATCH 8/8] res_counter: Allow charge failure pointer to be null
  2011-07-29 16:13 [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2011-07-29 16:13 ` [PATCH 7/8] cgroups: Add a task counter subsystem Frederic Weisbecker
@ 2011-07-29 16:13 ` Frederic Weisbecker
  2011-08-17  2:44   ` Li Zefan
  2011-08-01 23:19 ` [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Andrew Morton
  8 siblings, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2011-07-29 16:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov

So that callers of res_counter_charge() don't have to create and
pass this pointer even if they aren't interested in it.
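
The resulting call shape, for a charge site that ignores the failure
point, is then simply (a sketch; foo_charge_one() is hypothetical):

/* Sketch only: a caller not interested in which counter hit its
 * limit now passes NULL instead of a dummy pointer. */
static int foo_charge_one(struct res_counter *res)
{
	return res_counter_charge(res, 1, NULL);
}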

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 kernel/cgroup_task_counter.c |    6 ++----
 kernel/res_counter.c         |    6 ++++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup_task_counter.c b/kernel/cgroup_task_counter.c
index 7f12ac9..dacc4a9 100644
--- a/kernel/cgroup_task_counter.c
+++ b/kernel/cgroup_task_counter.c
@@ -93,11 +93,10 @@ static int task_counter_can_attach_task(struct cgroup *cgrp, struct cgroup *old_
 {
 	struct res_counter *res = cgroup_task_counter_res(cgrp);
 	struct res_counter *old_res = cgroup_task_counter_res(old_cgrp);
-	struct res_counter *limit_fail_at;
 
 	common_ancestor = res_counter_common_ancestor(res, old_res);
 
-	return res_counter_charge_until(res, common_ancestor, 1, &limit_fail_at);
+	return res_counter_charge_until(res, common_ancestor, 1, NULL);
 }
 
 static void task_counter_cancel_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
@@ -154,13 +153,12 @@ int cgroup_task_counter_fork(struct task_struct *child)
 {
 	struct cgroup_subsys_state *css = child->cgroups->subsys[tasks_subsys_id];
 	struct cgroup *cgrp = css->cgroup;
-	struct res_counter *limit_fail_at;
 
 	/* Optimize for the root cgroup case, which doesn't have a limit */
 	if (!cgrp->parent)
 		return 0;
 
-	return res_counter_charge(cgroup_task_counter_res(cgrp), 1, &limit_fail_at);
+	return res_counter_charge(cgroup_task_counter_res(cgrp), 1, NULL);
 }
 
 struct cgroup_subsys tasks_subsys = {
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 725dfa6..7eac803 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -43,14 +43,16 @@ int res_counter_charge_until(struct res_counter *counter,
 	unsigned long flags;
 	struct res_counter *c, *u;
 
-	*limit_fail_at = NULL;
+	if (limit_fail_at)
+		*limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
 		spin_unlock(&c->lock);
 		if (ret < 0) {
-			*limit_fail_at = c;
+			if (limit_fail_at)
+				*limit_fail_at = c;
 			goto undo;
 		}
 	}
-- 
1.7.5.4



* Re: [PATCH 7/8] cgroups: Add a task counter subsystem
  2011-07-29 16:13 ` [PATCH 7/8] cgroups: Add a task counter subsystem Frederic Weisbecker
@ 2011-08-01 23:13   ` Andrew Morton
  2011-08-04 14:05     ` Frederic Weisbecker
  2011-08-09 15:11   ` Oleg Nesterov
  2011-08-17  3:18   ` Li Zefan
  2 siblings, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2011-08-01 23:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov

On Fri, 29 Jul 2011 18:13:29 +0200
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> Add a new subsystem to limit the number of running tasks,
> similar to RLIMIT_NPROC but in the scope of a cgroup.
> 
> This is a step toward isolating a cgroup a bit more from the
> rest of the system and limiting the global impact of a fork bomb
> inside a given cgroup.
> 
> ...
>
> +config CGROUP_TASK_COUNTER
> +        bool "Control number of tasks in a cgroup"
> +	depends on RESOURCE_COUNTERS
> +	help
> +	  This option let the user to set up an upper bound allowed number
> +	  of tasks inside a cgroup.

whitespace went weird.

> 
> ...
>
 +
> +static void task_counter_post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> +	res_counter_inherit(cgroup_task_counter_res(cgrp), RES_LIMIT);

cgroup_task_counter_res() has code in it to carefully return NULL in
one situation, but if it does this, res_counter_inherit() will then
cheerily oops.  This makes no sense.

> +}
> +
> 
> ...
>
> +/* Protected amongst can_attach_task/attach_task/cancel_attach_task by cgroup mutex */
> +static struct res_counter *common_ancestor;
> +
> +static int task_counter_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +					struct task_struct *tsk)
> +{
> +	struct res_counter *res = cgroup_task_counter_res(cgrp);
> +	struct res_counter *old_res = cgroup_task_counter_res(old_cgrp);
> +	struct res_counter *limit_fail_at;
> +
> +	common_ancestor = res_counter_common_ancestor(res, old_res);

This might oops too?

> +	return res_counter_charge_until(res, common_ancestor, 1, &limit_fail_at);
> +}
> +
> 
> ...
>
> +int cgroup_task_counter_fork(struct task_struct *child)
> +{
> +	struct cgroup_subsys_state *css = child->cgroups->subsys[tasks_subsys_id];
> +	struct cgroup *cgrp = css->cgroup;
> +	struct res_counter *limit_fail_at;
> +
> +	/* Optimize for the root cgroup case, which doesn't have a limit */
> +	if (!cgrp->parent)
> +		return 0;
> +
> +	return res_counter_charge(cgroup_task_counter_res(cgrp), 1, &limit_fail_at);
> +}

It took a while for me to work out the meaning of the return value from
this function.  Some documentation would be nice?




* Re: [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem)
  2011-07-29 16:13 [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Frederic Weisbecker
                   ` (7 preceding siblings ...)
  2011-07-29 16:13 ` [PATCH 8/8] res_counter: Allow charge failure pointer to be null Frederic Weisbecker
@ 2011-08-01 23:19 ` Andrew Morton
  2011-08-03 14:29   ` Frederic Weisbecker
  2011-08-12 21:11   ` Tim Hockin
  8 siblings, 2 replies; 44+ messages in thread
From: Andrew Morton @ 2011-08-01 23:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov

On Fri, 29 Jul 2011 18:13:22 +0200
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> Reminder:
> 
> This patchset is aimed at confining the impact of a forkbomb to the
> boundaries of a cgroup, thus minimizing the consequences of such an
> attack against the rest of the system.
> 
> This can be useful when cgroups are used to stage processes or to run
> untrusted code.

Really?  How useful?  Why is it useful enough to justify adding code
such as this to the kernel?

Is forkbomb-prevention the only use?  Others have proposed different
ways of preventing forkbombs which were independent of cgroups - is
this way better and if so, why?

>  block/blk-cgroup.c            |   10 ++-
>  include/linux/cgroup.h        |   15 +++-
>  include/linux/cgroup_subsys.h |    8 ++
>  include/linux/res_counter.h   |   12 +++
>  init/Kconfig                  |    7 ++
>  kernel/Makefile               |    1 +
>  kernel/cgroup.c               |   25 ++++--
>  kernel/cgroup_freezer.c       |    3 +-
>  kernel/cgroup_task_counter.c  |  176 +++++++++++++++++++++++++++++++++++++++++
>  kernel/cpuset.c               |    6 +-
>  kernel/events/core.c          |    5 +-
>  kernel/fork.c                 |    4 +
>  kernel/res_counter.c          |   81 ++++++++++++++++---
>  kernel/sched.c                |    6 +-

The patch forgot to document the feature: how it works, what it's
useful for, what behaviour users can expect to see, when they should
consider using it, what the userspace control interface is and how to
configure it, etc.  Documentation/cgroups/ is the place for that.


* Re: [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem)
  2011-08-01 23:19 ` [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Andrew Morton
@ 2011-08-03 14:29   ` Frederic Weisbecker
  2011-08-12 21:11   ` Tim Hockin
  1 sibling, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-03 14:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov

On Mon, Aug 01, 2011 at 04:19:00PM -0700, Andrew Morton wrote:
> On Fri, 29 Jul 2011 18:13:22 +0200
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > Reminder:
> > 
> > This patchset is aimed at confining the impact of a forkbomb to the
> > boundaries of a cgroup, thus minimizing the consequences of such an
> > attack against the rest of the system.
> > 
> > This can be useful when cgroups are used to stage processes or to run
> > untrusted code.
> 
> Really?  How useful?  Why is it useful enough to justify adding code
> such as this to the kernel?
> 
> Is forkbomb-prevention the only use?  Others have proposed different
> ways of preventing forkbombs which were independent of cgroups - is
> this way better and if so, why?

I should have given more details.

So this is not intended to replace existing solutions that protect against
forkbombs at the whole-machine or per-user scope, like RLIMIT_NPROC.

But RLIMIT_NPROC is sometimes not well suited, as in the case of containers
implemented using cgroups. If we service many containers for sandboxing
applications, the traditional per-user rlimit doesn't work anymore:
if all the containers run under the same user, which should typically
be the case, then one container can starve all the others by spawning
too many processes, since the limit is per user and not per cgroup.

> 
> >  block/blk-cgroup.c            |   10 ++-
> >  include/linux/cgroup.h        |   15 +++-
> >  include/linux/cgroup_subsys.h |    8 ++
> >  include/linux/res_counter.h   |   12 +++
> >  init/Kconfig                  |    7 ++
> >  kernel/Makefile               |    1 +
> >  kernel/cgroup.c               |   25 ++++--
> >  kernel/cgroup_freezer.c       |    3 +-
> >  kernel/cgroup_task_counter.c  |  176 +++++++++++++++++++++++++++++++++++++++++
> >  kernel/cpuset.c               |    6 +-
> >  kernel/events/core.c          |    5 +-
> >  kernel/fork.c                 |    4 +
> >  kernel/res_counter.c          |   81 ++++++++++++++++---
> >  kernel/sched.c                |    6 +-
> 
> The patch forgot to document the feature: how it works, what it's
> useful for, what behaviour users can expect to see, when they should
> consider using it, what the userspace control interface is and how to
> configure it, etc.  Documentation/cgroups/ is the place for that.

Right, I'll add that in the next take. I did not do it until now because
the ABI was still being settled.

Thanks.


* Re: [PATCH 7/8] cgroups: Add a task counter subsystem
  2011-08-01 23:13   ` Andrew Morton
@ 2011-08-04 14:05     ` Frederic Weisbecker
  0 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-04 14:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov

On Mon, Aug 01, 2011 at 04:13:47PM -0700, Andrew Morton wrote:
> On Fri, 29 Jul 2011 18:13:29 +0200
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > Add a new subsystem to limit the number of running tasks,
> > similar to RLIMIT_NPROC but in the scope of a cgroup.
> > 
> > This is a step toward isolating a cgroup a bit more from the
> > rest of the system and limiting the global impact of a fork bomb
> > inside a given cgroup.
> > 
> > ...
> >
> > +config CGROUP_TASK_COUNTER
> > +        bool "Control number of tasks in a cgroup"
> > +	depends on RESOURCE_COUNTERS
> > +	help
> > +	  This option let the user to set up an upper bound allowed number
> > +	  of tasks inside a cgroup.
> 
> whitespace went weird.

Yep, will fix.
 
> > 
> > ...
> >
>  +
> > +static void task_counter_post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
> > +{
> > +	res_counter_inherit(cgroup_task_counter_res(cgrp), RES_LIMIT);
> 
> cgroup_task_counter_res() has code in it to carefully return NULL in
> one situation, but if it does this, res_counter_inherit() will then
> cheerily oops.  This makes no sense.

Right, but the only cgroup for which it returns NULL is the root cgroup,
and we don't post-clone the root cgroup itself since it has no parent.

So this can't happen, but I can still add a WARN_ON() to catch the case
if it ever does.

> > +}
> > +
> > 
> > ...
> >
> > +/* Protected amongst can_attach_task/attach_task/cancel_attach_task by cgroup mutex */
> > +static struct res_counter *common_ancestor;
> > +
> > +static int task_counter_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> > +					struct task_struct *tsk)
> > +{
> > +	struct res_counter *res = cgroup_task_counter_res(cgrp);
> > +	struct res_counter *old_res = cgroup_task_counter_res(old_cgrp);
> > +	struct res_counter *limit_fail_at;
> > +
> > +	common_ancestor = res_counter_common_ancestor(res, old_res);
> 
> This might oops too?

Nope, if either res or old_res is NULL, then the common ancestor returned
is NULL. Afterward the charge_until() below will simply charge res over
the whole hierarchy if old_res is NULL, or it will do nothing if res itself
is NULL.

I should probably comment on that behaviour.

> 
> > +	return res_counter_charge_until(res, common_ancestor, 1, &limit_fail_at);
> > +}
> > +
> > 
> > ...
> >
> > +int cgroup_task_counter_fork(struct task_struct *child)
> > +{
> > +	struct cgroup_subsys_state *css = child->cgroups->subsys[tasks_subsys_id];
> > +	struct cgroup *cgrp = css->cgroup;
> > +	struct res_counter *limit_fail_at;
> > +
> > +	/* Optimize for the root cgroup case, which doesn't have a limit */
> > +	if (!cgrp->parent)
> > +		return 0;
> > +
> > +	return res_counter_charge(cgroup_task_counter_res(cgrp), 1, &limit_fail_at);
> > +}
> 
> It took a while for me to work out the meaning of the return value from
> this function.  Some documentation would be nice?

Yes, and moreover I'm not at all sure about the default return value in
case of failure. -ENOMEM probably matches the needs of the memory limit
subsystem, but not those of the task counter subsystem.

Probably the res_counter API should return -1 when the limit is reached
and let the calling subsystem decide which error to return. -ENOMEM is
too specific to one use case.

I guess we should return -EINVAL when the task counter limit is reached?

Once we agree on this I'll document it.

> 


* Re: [PATCH 7/8] cgroups: Add a task counter subsystem
  2011-07-29 16:13 ` [PATCH 7/8] cgroups: Add a task counter subsystem Frederic Weisbecker
  2011-08-01 23:13   ` Andrew Morton
@ 2011-08-09 15:11   ` Oleg Nesterov
  2011-08-09 17:27     ` Frederic Weisbecker
  2011-08-17  3:18   ` Li Zefan
  2 siblings, 1 reply; 44+ messages in thread
From: Oleg Nesterov @ 2011-08-09 15:11 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali

On 07/29, Frederic Weisbecker wrote:
>
> +static int task_counter_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +					struct task_struct *tsk)
> +{
> +	struct res_counter *res = cgroup_task_counter_res(cgrp);
> +	struct res_counter *old_res = cgroup_task_counter_res(old_cgrp);
> +	struct res_counter *limit_fail_at;
> +
> +	common_ancestor = res_counter_common_ancestor(res, old_res);
> +
> +	return res_counter_charge_until(res, common_ancestor, 1, &limit_fail_at);
> +}
>
> ...
>
> +static void task_counter_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +				     struct task_struct *tsk)
> +{
> +	res_counter_uncharge_until(cgroup_task_counter_res(old_cgrp), common_ancestor, 1);
> +}

This doesn't look right or I missed something.

What if tsk exits in between? Afaics this can happen with both
cgroup_attach_task() and cgroup_attach_proc().

Let's look at cgroup_attach_task(). Suppose that
task_counter_can_attach_task() succeeds and charges the new cgrp,
and then cgroup_task_migrate() returns -ESRCH. Who will uncharge
the new cgrp?


cgroup_attach_proc() is different, it calls cgroup_task_migrate()
after ->attach_task(). Cough.

In this case old_cgrp can be uncharged twice, no? And again, nobody
will uncharge the new cgrp?



->attach_task() can be skipped if cgrp == oldcgrp... Probably this
is fine, in this case can_attach_task() shouldn't actually charge.


> @@ -1295,6 +1295,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	p->group_leader = p;
>  	INIT_LIST_HEAD(&p->thread_group);
>
> +	retval = cgroup_task_counter_fork(p);
> +	if (retval)
> +		goto bad_fork_free_pid;
> +

Well, imho this is not good. You are adding yet another cgroup hook.
Why can you not reuse cgroup_fork_callbacks()?

Yes, it returns void. Can't we change ->fork() to return the error and
make it boolean?

Better yet,

	-	cgroup_fork_callbacks(p);
	-	cgroup_callbacks_done = 1;
	+	failed_ss = cgroup_fork_callbacks(p);
	+	if (failed_ss)
	+		goto bad_fork_free_pid;

	...

	-	cgroup_exit(p, cgroup_callbacks_done);
	+	cgroup_exit(p, failed_ss);

What do you think?

Oleg.



* Re: [PATCH 1/8] cgroups: Add res_counter_write_u64() API
  2011-07-29 16:13 ` [PATCH 1/8] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
@ 2011-08-09 15:17   ` Oleg Nesterov
  2011-08-09 17:31     ` Frederic Weisbecker
  0 siblings, 1 reply; 44+ messages in thread
From: Oleg Nesterov @ 2011-08-09 15:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Li Zefan, Johannes Weiner, Aditya Kali

On 07/29, Frederic Weisbecker wrote:
>
> +void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
> +{
> +	unsigned long long *target;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&counter->lock, flags);
> +	target = res_counter_member(counter, member);
> +	*target = val;
> +	spin_unlock_irqrestore(&counter->lock, flags);
> +}

Hmm. Why do we need counter->lock?

OK, probably it is for BITS_PER_LONG < 64. Maybe it makes sense
to ifdef it, just to make the code more understandable.
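
For illustration, one way to read that suggestion (a sketch, not a
tested patch; res_counter_member() is the existing helper in
kernel/res_counter.c):

/* Sketch only: make it explicit that the lock is there to protect
 * the 64-bit store on 32-bit architectures. */
void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
{
#if BITS_PER_LONG < 64
	unsigned long flags;

	spin_lock_irqsave(&counter->lock, flags);
	*res_counter_member(counter, member) = val;
	spin_unlock_irqrestore(&counter->lock, flags);
#else
	*res_counter_member(counter, member) = val;
#endif
}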

Oleg.



* Re: [PATCH 7/8] cgroups: Add a task counter subsystem
  2011-08-09 15:11   ` Oleg Nesterov
@ 2011-08-09 17:27     ` Frederic Weisbecker
  2011-08-09 17:57       ` Oleg Nesterov
  0 siblings, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-09 17:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Andrew Morton, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali

On Tue, Aug 09, 2011 at 05:11:55PM +0200, Oleg Nesterov wrote:
> On 07/29, Frederic Weisbecker wrote:
> >
> > +static int task_counter_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> > +					struct task_struct *tsk)
> > +{
> > +	struct res_counter *res = cgroup_task_counter_res(cgrp);
> > +	struct res_counter *old_res = cgroup_task_counter_res(old_cgrp);
> > +	struct res_counter *limit_fail_at;
> > +
> > +	common_ancestor = res_counter_common_ancestor(res, old_res);
> > +
> > +	return res_counter_charge_until(res, common_ancestor, 1, &limit_fail_at);
> > +}
> >
> > ...
> >
> > +static void task_counter_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> > +				     struct task_struct *tsk)
> > +{
> > +	res_counter_uncharge_until(cgroup_task_counter_res(old_cgrp), common_ancestor, 1);
> > +}
> 
> This doesn't look right or I missed something.
> 
> What if tsk exits in between? Afaics this can happen with both
> cgroup_attach_task() and cgroup_attach_proc().
> 
> Let's look at cgroup_attach_task(). Suppose that
> task_counter_can_attach_task() succeeds and charges the new cgrp,
> Then cgroup_task_migrate() returns -ESRCH. Who will uncharge the
> new cgrp?
> 

I may totally be missing something but that looks safe to me.
If the task has exited, then cgroup_task_migrate() fails and we
roll back with ->cancel_attach_task().

Let me enumerate the possible scenarios (may not be exhaustive):

* The task exits (calls cgroup_exit()) before cgroup_task_migrate()
switches the cgroup. In this case we roll back the charge we pushed
on the new cgroup and we return an error.

* The task exits after cgroup_task_migrate(), in which case the cgroup
core called ->exit() on the task with the new cgroup, uncharging that
task from it. At the same time we call ->attach_task() to uncharge the
old cgroup, which is still what we want as we confirmed the cgroup
migration.
 
> cgroup_attach_proc() is different, it calls cgroup_task_migrate()
> after ->attach_task(). Cough.

That's bad. I need to fix that.

So if it returns -ESRCH, I shall not call attach_task() on it
but cancel_attach_task().

Other than that it should be safe as in the single task case.

> In this case old_cgrp can be uncharged twice, no? And again, nobody
> will uncharge the new cgrp?

(see above)

> ->attach_task() can be skipped if cgrp == oldcgrp... Probably this
> is fine, in this case can_attach_task() shouldn't actually charge.

In fact in this case it simply doesn't charge. res_counter_common_ancestor()
returns the res_counter for cgrp as a limit and thus charging stops as soon
as it starts.
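
For reference, a sketch of that lookup (not the exact patch code): walk
both parent chains until they meet, so when cgrp == old_cgrp the common
ancestor is the counter itself and the charge loop exits immediately:

	static int res_counter_depth(struct res_counter *res)
	{
		int depth = 0;

		while (res) {
			depth++;
			res = res->parent;
		}
		return depth;
	}

	struct res_counter *
	res_counter_common_ancestor(struct res_counter *l, struct res_counter *r)
	{
		int ld = res_counter_depth(l);
		int rd = res_counter_depth(r);

		/* Align both counters at the same depth... */
		while (ld > rd) {
			l = l->parent;
			ld--;
		}
		while (rd > ld) {
			r = r->parent;
			rd--;
		}
		/* ...then walk up in lockstep until the chains meet. */
		while (l != r) {
			l = l->parent;
			r = r->parent;
		}
		return l;	/* NULL when the counters share no ancestor */
	}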
 
> 
> > @@ -1295,6 +1295,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> >  	p->group_leader = p;
> >  	INIT_LIST_HEAD(&p->thread_group);
> >
> > +	retval = cgroup_task_counter_fork(p);
> > +	if (retval)
> > +		goto bad_fork_free_pid;
> > +
> 
> Well, imho this is not good. You are adding yet another cgroup hook.
> Why can't you reuse cgroup_fork_callbacks()?
> 
> Yes, it returns void. Can't we change ->fork() to return the error and
> make it boolean?

That was my first proposition (minus the rollback with exit() that I forgot)
but Paul Menage said that it added unnecessary complexity to the fork callbacks.

> 
> Better yet,
> 
> 	-	cgroup_fork_callbacks(p);
> 	-	cgroup_callbacks_done = 1;
> 	+	failed_ss = cgroup_fork_callbacks(p);
> 	+	if (failed_ss)
> 	+		goto bad_fork_free_pid;
> 
> 	...
> 
> 	-	cgroup_exit(p, cgroup_callbacks_done);
> 	+	cgroup_exit(p, failed_ss);
> 
> What do you think?

I would personally prefer that.

> Oleg.
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/8] cgroups: Add res_counter_write_u64() API
  2011-08-09 15:17   ` Oleg Nesterov
@ 2011-08-09 17:31     ` Frederic Weisbecker
  0 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-09 17:31 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: LKML, Andrew Morton, Li Zefan, Johannes Weiner, Aditya Kali

On Tue, Aug 09, 2011 at 05:17:39PM +0200, Oleg Nesterov wrote:
> On 07/29, Frederic Weisbecker wrote:
> >
> > +void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
> > +{
> > +	unsigned long long *target;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&counter->lock, flags);
> > +	target = res_counter_member(counter, member);
> > +	*target = val;
> > +	spin_unlock_irqrestore(&counter->lock, flags);
> > +}
> 
> Hmm. Why do we need counter->lock?
> 
> OK, probably it is for BITS_PER_LONG < 64. Maybe it makes sense
> to ifdef it, just to make the code more understandable.

Ok, will add that.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 7/8] cgroups: Add a task counter subsystem
  2011-08-09 17:27     ` Frederic Weisbecker
@ 2011-08-09 17:57       ` Oleg Nesterov
  2011-08-09 18:09         ` Frederic Weisbecker
  0 siblings, 1 reply; 44+ messages in thread
From: Oleg Nesterov @ 2011-08-09 17:57 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali

On 08/09, Frederic Weisbecker wrote:
>
> On Tue, Aug 09, 2011 at 05:11:55PM +0200, Oleg Nesterov wrote:
> >
> > This doesn't look right or I missed something.
> >
> > What if tsk exits in between? Afaics this can happen with both
> > cgroup_attach_task() and cgroup_attach_proc().
> >
> > Let's look at cgroup_attach_task(). Suppose that
> > task_counter_can_attach_task() succeeds and charges the new cgrp,
> > Then cgroup_task_migrate() returns -ESRCH. Who will uncharge the
> > new cgrp?
> >
>
> I may totally be missing something but that looks safe to me.
> If the task has exited, then cgroup_task_migrate() fails and we
> roll back with ->cancel_attach_task().

cgroup_attach_task() doesn't call ->cancel_attach_task() ;)

> > cgroup_attach_proc() is different, it calls cgroup_task_migrate()
> > after ->attach_task(). Cough.
>
> That's bad. I need to fix that.
>
> So if it returns -ESRCH, I shall not call attach_task() on it
> but cancel_attach_task().

Afaics this can't help, or I misunderstood. Probably attach_task()
can check PF_EXITING...

> > ->attach_task() can be skipped if cgrp == oldcgrp... Probably this
> > is fine, in this case can_attach_task() shouldn't actually charge.
>
> In fact in this case it simply doesn't charge. res_counter_common_ancestor()
> returns the res_counter for cgrp as a limit and thus charging stops as soon
> as it starts.

Yes, this is what I meant. It just wasn't immediately obvious to me;
initially I thought this was buggy.

> > > @@ -1295,6 +1295,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> > >  	p->group_leader = p;
> > >  	INIT_LIST_HEAD(&p->thread_group);
> > >
> > > +	retval = cgroup_task_counter_fork(p);
> > > +	if (retval)
> > > +		goto bad_fork_free_pid;
> > > +
> >
> > Well, imho this is not good. You are adding yet another cgroup hook.
> > Why can't you reuse cgroup_fork_callbacks()?
> >
> > Yes, it returns void. Can't we change ->fork() to return the error and
> > make it boolean?
>
> That was my first proposition (minus the rollback with exit() that I forgot)
> but Paul Menage said that it added unnecessary complexity to the fork callbacks.

Hmm. Yes, this adds some complexity in the fork callbacks.

But yet another cgroup_task_counter_fork() hook complicates the core kernel
code. And since I personally do not care about cgroups at all, I think this
is much worse ;)

> I would personally prefer that.

I strongly agree.

OK. Let's do it this way. Perhaps we can convince Paul later and clean up.

Oleg.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 7/8] cgroups: Add a task counter subsystem
  2011-08-09 17:57       ` Oleg Nesterov
@ 2011-08-09 18:09         ` Frederic Weisbecker
  2011-08-09 18:19           ` Oleg Nesterov
  0 siblings, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-09 18:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Andrew Morton, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali

On Tue, Aug 09, 2011 at 07:57:08PM +0200, Oleg Nesterov wrote:
> On 08/09, Frederic Weisbecker wrote:
> >
> > On Tue, Aug 09, 2011 at 05:11:55PM +0200, Oleg Nesterov wrote:
> > >
> > > This doesn't look right or I missed something.
> > >
> > > What if tsk exits in between? Afaics this can happen with both
> > > cgroup_attach_task() and cgroup_attach_proc().
> > >
> > > Let's look at cgroup_attach_task(). Suppose that
> > > task_counter_can_attach_task() succeeds and charges the new cgrp,
> > > Then cgroup_task_migrate() returns -ESRCH. Who will uncharge the
> > > new cgrp?
> > >
> >
> > I may totally be missing something but that looks safe to me.
> > If the task has exited, then cgroup_task_migrate() fails and we
> > roll back with ->cancel_attach_task().
> 
> cgroup_attach_task() doesn't call ->cancel_attach_task() ;)

Oops right, that's missing :)
 
> > > cgroup_attach_proc() is different, it calls cgroup_task_migrate()
> > > after ->attach_task(). Cough.
> >
> > That's bad. I need to fix that.
> >
> > So if it returns -ESRCH, I shall not call attach_task() on it
> > but cancel_attach_task().
> 
> Afaics this can't help, or I misunderstood. Probably attach_task()
> can check PF_EXITING...

But cgroup_task_migrate() already checks that before migrating
cgroups (it checks PF_EXITING under task_lock() so that it's
synchronized against cgroup_exit()).
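
Roughly, the relevant part of cgroup_task_migrate() looks like this
(just a sketch):

	task_lock(tsk);
	if (tsk->flags & PF_EXITING) {
		task_unlock(tsk);
		return -ESRCH;		/* raced with cgroup_exit() */
	}
	rcu_assign_pointer(tsk->cgroups, newcg);
	task_unlock(tsk);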

> > > ->attach_task() can be skipped if cgrp == oldcgrp... Probably this
> > > is fine, in this case can_attach_task() shouldn't actually charge.
> >
> > In fact in this case it simply doesn't charge. res_counter_common_ancestor()
> > returns the res_counter for cgrp as a limit and thus charging stops as soon
> > as it starts.
> 
> Yes, this is what I meant. It just wasn't immediately obvious to me;
> initially I thought this was buggy.
> 
> > @@ -1295,6 +1295,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> > > >  	p->group_leader = p;
> > > >  	INIT_LIST_HEAD(&p->thread_group);
> > > >
> > > > +	retval = cgroup_task_counter_fork(p);
> > > > +	if (retval)
> > > > +		goto bad_fork_free_pid;
> > > > +
> > >
> > > Well, imho this is not good. You are adding yet another cgroup hook.
> > > Why can't you reuse cgroup_fork_callbacks()?
> > >
> > > Yes, it returns void. Can't we change ->fork() to return the error and
> > > make it boolean?
> >
> > That was my first proposition (minus the rollback with exit() that I forgot)
> > but Paul Menage said that it added unnecessary complexity to the fork callbacks.
> 
> Hmm. Yes, this adds some complexity in the fork callbacks.
> 
> But yet another cgroup_task_counter_fork() hook complicates the core kernel
> code. And since I personally do not care about cgroups at all, I think this
> is much worse ;)
> 
> > I would personally prefer that.
> 
> I strongly agree.
> 
> OK. Let's do it this way. Perhaps we can convince Paul later and clean up.

Fine, I'll rewind to that. I would really prefer it that way.

Thanks!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 7/8] cgroups: Add a task counter subsystem
  2011-08-09 18:09         ` Frederic Weisbecker
@ 2011-08-09 18:19           ` Oleg Nesterov
  2011-08-09 18:34             ` Frederic Weisbecker
  0 siblings, 1 reply; 44+ messages in thread
From: Oleg Nesterov @ 2011-08-09 18:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali

On 08/09, Frederic Weisbecker wrote:
>
> On Tue, Aug 09, 2011 at 07:57:08PM +0200, Oleg Nesterov wrote:
> > > > cgroup_attach_proc() is different, it calls cgroup_task_migrate()
> > > > after ->attach_task(). Cough.
> > >
> > > That's bad. I need to fix that.
> > >
> > > So if it returns -ESRCH, I shall not call attach_task() on it
> > > but cancel_attach_task().
> >
> > > Afaics this can't help, or I misunderstood. Probably attach_task()
> > > can check PF_EXITING...
>
> But cgroup_task_migrate() checks that already before migrating
> cgroups (checks PF_EXITING under task_lock() so that it's
> synchronized against cgroup_exit())

Yes, but how can this help?

->attach_task() is called before cgroup_task_migrate(). Suppose
that the task exits before ->attach_task(). In this case we shouldn't
uncharge the old cgroup, it was already uncharged by cgroup_exit.

Oleg.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 7/8] cgroups: Add a task counter subsystem
  2011-08-09 18:19           ` Oleg Nesterov
@ 2011-08-09 18:34             ` Frederic Weisbecker
  2011-08-09 18:39               ` Oleg Nesterov
  0 siblings, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-09 18:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Andrew Morton, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali

On Tue, Aug 09, 2011 at 08:19:47PM +0200, Oleg Nesterov wrote:
> On 08/09, Frederic Weisbecker wrote:
> >
> > On Tue, Aug 09, 2011 at 07:57:08PM +0200, Oleg Nesterov wrote:
> > > > > cgroup_attach_proc() is different, it calls cgroup_task_migrate()
> > > > > after ->attach_task(). Cough.
> > > >
> > > > That's bad. I need to fix that.
> > > >
> > > > So if it returns -ESRCH, I shall not call attach_task() on it
> > > > but cancel_attach_task().
> > >
> > > Afaics this can't help, or I misunderstood. Probably attach_task()
> > > can check PF_EXITING...
> >
> > But cgroup_task_migrate() checks that already before migrating
> > cgroups (checks PF_EXITING under task_lock() so that it's
> > synchronized against cgroup_exit())
> 
> Yes, but how can this help?
> 
> ->attach_task() is called before cgroup_task_migrate(). Suppose
> that the task exits before ->attach_task(). In this case we shouldn't
> uncharge the old cgroup, it was already uncharged by cgroup_exit.

Yeah that's why in cgroup_attach_proc() we need to call cgroup_task_migrate()
before ->attach_task(), as it's done in cgroup_attach_task(). And also call
cancel_attach_task() if it fails there with -ESRCH without cancelling
the whole group attachment.
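
Something along these lines for the per-task loop (just a sketch of a
fragment from the middle of cgroup_attach_proc(); group, group_size,
root and cgrp come from the surrounding function, and helper names
are approximate):

	struct task_struct *tsk;
	struct cgroup *oldcgrp;
	struct cgroup_subsys *ss;
	int i, retval;

	for (i = 0; i < group_size; i++) {
		tsk = flex_array_get_ptr(group, i);
		oldcgrp = task_cgroup_from_root(tsk, root);

		/* Migrate first, so an exited task is detected here... */
		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
		if (retval == -ESRCH) {
			/* ...and only its own charge is rolled back. */
			for_each_subsys(root, ss)
				if (ss->cancel_attach_task)
					ss->cancel_attach_task(cgrp, tsk);
			continue;
		}
		/* ...then confirm the move per subsystem. */
		for_each_subsys(root, ss)
			if (ss->attach_task)
				ss->attach_task(cgrp, oldcgrp, tsk);
	}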

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 7/8] cgroups: Add a task counter subsystem
  2011-08-09 18:34             ` Frederic Weisbecker
@ 2011-08-09 18:39               ` Oleg Nesterov
  0 siblings, 0 replies; 44+ messages in thread
From: Oleg Nesterov @ 2011-08-09 18:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali

On 08/09, Frederic Weisbecker wrote:
>
> On Tue, Aug 09, 2011 at 08:19:47PM +0200, Oleg Nesterov wrote:
> >
> > Yes, but how can this help?
> >
> > ->attach_task() is called before cgroup_task_migrate(). Suppose
> > that the task exits before ->attach_task(). In this case we shouldn't
> > uncharge the old cgroup, it was already uncharged by cgroup_exit.
>
> Yeah that's why in cgroup_attach_proc() we need to call cgroup_task_migrate()
> before ->attach_task(), as it's done in cgroup_attach_task().

Agreed. I do not understand why attach task/proc should differ. This
doesn't look good even if (currently) correct.

> And also call
> cancel_attach_task() if it fails there with -ESRCH without cancelling
> the whole group attachment.

Probably... and attach_task() should do the same.

Oleg.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem)
  2011-08-01 23:19 ` [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Andrew Morton
  2011-08-03 14:29   ` Frederic Weisbecker
@ 2011-08-12 21:11   ` Tim Hockin
  2011-08-16 16:01     ` Kay Sievers
  1 sibling, 1 reply; 44+ messages in thread
From: Tim Hockin @ 2011-08-12 21:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, LKML, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov

On Mon, Aug 1, 2011 at 4:19 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri, 29 Jul 2011 18:13:22 +0200
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
>> Reminder:
>>
>> This patchset is aimed at reducing the impact of a forkbomb to a
>> cgroup boundaries, thus minimizing the consequences of such an attack
>> against the rest of the system.
>>
>> This can be useful when cgroups are used to stage some processes or run
>> untrustees.
>
> Really?  How useful?  Why is it useful enough to justify adding code
> such as this to the kernel?
>
> Is forkbomb-prevention the only use?  Others have proposed different
> ways of preventing forkbombs which were independent of cgroups - is
> this way better and if so, why?

I certainly want this for exactly the proposed use - putting a bound
on threads/tasks per container.  It's rlimits but more useful.

IMHO, most every limit that can be set at a system level should be
settable at a cgroup level.  This is just one more isolation leak.

>>  block/blk-cgroup.c            |   10 ++-
>>  include/linux/cgroup.h        |   15 +++-
>>  include/linux/cgroup_subsys.h |    8 ++
>>  include/linux/res_counter.h   |   12 +++
>>  init/Kconfig                  |    7 ++
>>  kernel/Makefile               |    1 +
>>  kernel/cgroup.c               |   25 ++++--
>>  kernel/cgroup_freezer.c       |    3 +-
>>  kernel/cgroup_task_counter.c  |  176 +++++++++++++++++++++++++++++++++++++++++
>>  kernel/cpuset.c               |    6 +-
>>  kernel/events/core.c          |    5 +-
>>  kernel/fork.c                 |    4 +
>>  kernel/res_counter.c          |   81 ++++++++++++++++---
>>  kernel/sched.c                |    6 +-
>
> The patch forgot to document the feature: how it works, what it's
> useful for, what behaviour users can expect to see, when they should
> consider using it, what the userspace control interface is and how to
> configure it, etc.  Documentation/cgroups/ is the place for that.

+1 - I am not very familiar with the cgroups code, so I am disinclined
to learn it all just to evaluate the functionality and API of this
patch.  Design doc, please?

Tim

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem)
  2011-08-12 21:11   ` Tim Hockin
@ 2011-08-16 16:01     ` Kay Sievers
  2011-08-18 14:33       ` [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem) Frederic Weisbecker
  0 siblings, 1 reply; 44+ messages in thread
From: Kay Sievers @ 2011-08-16 16:01 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Andrew Morton, Frederic Weisbecker, LKML, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov

On Fri, Aug 12, 2011 at 23:11, Tim Hockin <thockin@hockin.org> wrote:
> On Mon, Aug 1, 2011 at 4:19 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>> On Fri, 29 Jul 2011 18:13:22 +0200
>> Frederic Weisbecker <fweisbec@gmail.com> wrote:
>>
>>> Reminder:
>>>
>>> This patchset is aimed at reducing the impact of a forkbomb to a
>>> cgroup boundaries, thus minimizing the consequences of such an attack
>>> against the rest of the system.
>>>
>>> This can be useful when cgroups are used to stage some processes or run
>>> untrustees.
>>
>> Really?  How useful?  Why is it useful enough to justify adding code
>> such as this to the kernel?
>>
>> Is forkbomb-prevention the only use?  Others have proposed different
>> ways of preventing forkbombs which were independent of cgroups - is
>> this way better and if so, why?
>
> I certainly want this for exactly the proposed use - putting a bound
> on threads/tasks per container.  It's rlimits but more useful.
>
> IMHO, most every limit that can be set at a system level should be
> settable at a cgroup level.  This is just one more isolation leak.

Such functionality in general sounds useful. System management tools
want to be able to stop a service race-free. A 'service' in the sense
of 'a group of processes and all the future processes it creates'.

A common problem here are user sessions that a login creates. For
some systems it is required that, after the user logs out, all
processes the user has started are properly cleaned up. A common
example of such enforcement are servers at schools and universities
that do not want to allow users to leave things like file sharing
programs running in the background after they log out.

We currently do that in systemd by tracking these sessions in a cgroup
and killing all pids in that group. This currently requires some
cooperation from the services to be successful. If they forked
faster than we kill them, we would never be able to finish the task.

Such user sessions are generally untrusted code and processes, and the
system management that cleans up after the end of the session runs
privileged. It would be nice to allow trusted code to race-free
kill all remaining processes of such an untrusted session. This is not
so much about fork bombs; things might not even have bad things in
mind. This would be more like an rlimit for a 'group of pids' that
allows race-free resource management of the services.

For the actual implementation, I think it would be nicer to
have such functionality at the core of cgroups, and not require a
specific controller to be set up. We already track every single
service in its own cgroup in a custom hierarchy. These groups just act
as the container for all the pids belonging to the service, so we can
track the service properly.

Naively looking at it as a user, we would like to be able to
apply these limits for every cgroup right away, without needing to create
another controller/subsystem/hierarchy.

Thanks,
Kay

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/8] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks
  2011-07-29 16:13 ` [PATCH 3/8] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks Frederic Weisbecker
@ 2011-08-17  2:40   ` Li Zefan
  2011-08-27 13:58     ` Frederic Weisbecker
  0 siblings, 1 reply; 44+ messages in thread
From: Li Zefan @ 2011-08-17  2:40 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Johannes Weiner, Aditya Kali, Oleg Nesterov

Frederic Weisbecker wrote:
> This is to prepare the integration of a new max number of proc
> cgroup subsystem. We'll need to release some resources from the
> previous cgroup.
> 

Documentation/cgroups/cgroups.txt needs update accordingly.

--
Li Zefan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/8] cgroups: New cancel_attach_task subsystem callback
  2011-07-29 16:13 ` [PATCH 4/8] cgroups: New cancel_attach_task subsystem callback Frederic Weisbecker
@ 2011-08-17  2:40   ` Li Zefan
  2011-08-27 13:58     ` Frederic Weisbecker
  0 siblings, 1 reply; 44+ messages in thread
From: Li Zefan @ 2011-08-17  2:40 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Johannes Weiner, Aditya Kali, Oleg Nesterov

Frederic Weisbecker wrote:
> To cancel a process attachment on a subsystem, we only call the
> cancel_attach() callback once on the leader but we have no
> way to cancel the attachment individually for each member of
> the process group.
> 
> This is going to be needed for the max number of tasks subsystem
> that is coming.
> 
> To prepare for this integration, call a new cancel_attach_task()
> callback on each task of the group until we reach the member that
> failed to attach.
> 

Document this new callback in Documentation/cgroups/cgroups.txt

--
Li Zefan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/8] cgroups: Ability to stop res charge propagation on bounded ancestor
  2011-07-29 16:13 ` [PATCH 5/8] cgroups: Ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
@ 2011-08-17  2:41   ` Li Zefan
  2011-08-27 13:59     ` Frederic Weisbecker
  0 siblings, 1 reply; 44+ messages in thread
From: Li Zefan @ 2011-08-17  2:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Johannes Weiner, Aditya Kali, Oleg Nesterov

> +int res_counter_charge(struct res_counter *counter, unsigned long val,
> +			struct res_counter **limit_fail_at)
> +{
> +	return res_counter_charge_until(counter, NULL, val, limit_fail_at);
> +}
> +

static inline in res_counter.h ?

...
> +void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> +{
> +	res_counter_uncharge_until(counter, NULL, val);
> +}
>  

ditto.
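
I.e. something like this sketch in res_counter.h:

	static inline int res_counter_charge(struct res_counter *counter,
					     unsigned long val,
					     struct res_counter **limit_fail_at)
	{
		return res_counter_charge_until(counter, NULL, val, limit_fail_at);
	}

	static inline void res_counter_uncharge(struct res_counter *counter,
						unsigned long val)
	{
		res_counter_uncharge_until(counter, NULL, val);
	}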

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 8/8] res_counter: Allow charge failure pointer to be null
  2011-07-29 16:13 ` [PATCH 8/8] res_counter: Allow charge failure pointer to be null Frederic Weisbecker
@ 2011-08-17  2:44   ` Li Zefan
  2011-08-27 14:05     ` Frederic Weisbecker
  0 siblings, 1 reply; 44+ messages in thread
From: Li Zefan @ 2011-08-17  2:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov

Frederic Weisbecker wrote:
> So that callers of res_counter_charge() don't have to create and
> pass this pointer even if they aren't interested in it.
> 

Why not make this change earlier, or even do it in "[PATCH 5/8]"?

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Paul Menage <menage@google.com>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Aditya Kali <adityakali@google.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> ---
>  kernel/cgroup_task_counter.c |    6 ++----
>  kernel/res_counter.c         |    6 ++++--
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/cgroup_task_counter.c b/kernel/cgroup_task_counter.c
> index 7f12ac9..dacc4a9 100644
> --- a/kernel/cgroup_task_counter.c
> +++ b/kernel/cgroup_task_counter.c
> @@ -93,11 +93,10 @@ static int task_counter_can_attach_task(struct cgroup *cgrp, struct cgroup *old_
>  {
>  	struct res_counter *res = cgroup_task_counter_res(cgrp);
>  	struct res_counter *old_res = cgroup_task_counter_res(old_cgrp);
> -	struct res_counter *limit_fail_at;
>  
>  	common_ancestor = res_counter_common_ancestor(res, old_res);
>  
> -	return res_counter_charge_until(res, common_ancestor, 1, &limit_fail_at);
> +	return res_counter_charge_until(res, common_ancestor, 1, NULL);
>  }
>  
>  static void task_counter_cancel_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> @@ -154,13 +153,12 @@ int cgroup_task_counter_fork(struct task_struct *child)
>  {
>  	struct cgroup_subsys_state *css = child->cgroups->subsys[tasks_subsys_id];
>  	struct cgroup *cgrp = css->cgroup;
> -	struct res_counter *limit_fail_at;
>  
>  	/* Optimize for the root cgroup case, which doesn't have a limit */
>  	if (!cgrp->parent)
>  		return 0;
>  
> -	return res_counter_charge(cgroup_task_counter_res(cgrp), 1, &limit_fail_at);
> +	return res_counter_charge(cgroup_task_counter_res(cgrp), 1, NULL);
>  }
>  
>  struct cgroup_subsys tasks_subsys = {
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 725dfa6..7eac803 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -43,14 +43,16 @@ int res_counter_charge_until(struct res_counter *counter,
>  	unsigned long flags;
>  	struct res_counter *c, *u;
>  
> -	*limit_fail_at = NULL;
> +	if (limit_fail_at)
> +		*limit_fail_at = NULL;
>  	local_irq_save(flags);
>  	for (c = counter; c != limit; c = c->parent) {
>  		spin_lock(&c->lock);
>  		ret = res_counter_charge_locked(c, val);
>  		spin_unlock(&c->lock);
>  		if (ret < 0) {
> -			*limit_fail_at = c;
> +			if (limit_fail_at)
> +				*limit_fail_at = c;
>  			goto undo;
>  		}
>  	}

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 7/8] cgroups: Add a task counter subsystem
  2011-07-29 16:13 ` [PATCH 7/8] cgroups: Add a task counter subsystem Frederic Weisbecker
  2011-08-01 23:13   ` Andrew Morton
  2011-08-09 15:11   ` Oleg Nesterov
@ 2011-08-17  3:18   ` Li Zefan
  2011-08-27 14:16     ` Frederic Weisbecker
  2 siblings, 1 reply; 44+ messages in thread
From: Li Zefan @ 2011-08-17  3:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov

> diff --git a/init/Kconfig b/init/Kconfig
> index 412c21b..bea98d2d 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -690,6 +690,13 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>  	  select this option (if, for some reason, they need to disable it
>  	  then noswapaccount does the trick).
>  
> +config CGROUP_TASK_COUNTER
> +        bool "Control number of tasks in a cgroup"

TAB

> +	depends on RESOURCE_COUNTERS
> +	help
> +	  This option let the user to set up an upper bound allowed number

will let?

> +	  of tasks inside a cgroup.
> +
>  config CGROUP_PERF
>  	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>  	depends on PERF_EVENTS && CGROUPS

...

> +int cgroup_task_counter_fork(struct task_struct *child)
> +{
> +	struct cgroup_subsys_state *css = child->cgroups->subsys[tasks_subsys_id];
> +	struct cgroup *cgrp = css->cgroup;
> +	struct res_counter *limit_fail_at;
> +
> +	/* Optimize for the root cgroup case, which doesn't have a limit */
> +	if (!cgrp->parent)
> +		return 0;
> +
> +	return res_counter_charge(cgroup_task_counter_res(cgrp), 1, &limit_fail_at);

I think we should change the return value of res_counter_charge.
Currently it returns -ENOMEM when we exceed the limit.

> +}
> +
> +struct cgroup_subsys tasks_subsys = {
> +	.name			= "tasks",
> +	.subsys_id		= tasks_subsys_id,
> +	.create			= task_counter_create,
> +	.post_clone		= task_counter_post_clone,
> +	.destroy		= task_counter_destroy,
> +	.exit			= task_counter_exit,
> +	.can_attach_task	= task_counter_can_attach_task,
> +	.cancel_attach_task	= task_counter_cancel_attach_task,
> +	.attach_task		= task_counter_attach_task,
> +	.populate		= task_counter_populate,
> +	.early_init		= 1,

Just set early_init to 0, since this subsystem doesn't have to be
initialized early during kernel boot.

> +};

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem)
  2011-08-16 16:01     ` Kay Sievers
@ 2011-08-18 14:33       ` Frederic Weisbecker
  2011-08-23 16:07         ` Paul Menage
  0 siblings, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-18 14:33 UTC (permalink / raw)
  To: Kay Sievers, Paul Menage, Li Zefan
  Cc: Tim Hockin, Andrew Morton, LKML, Johannes Weiner, Aditya Kali,
	Oleg Nesterov

On Tue, Aug 16, 2011 at 06:01:48PM +0200, Kay Sievers wrote:
> On Fri, Aug 12, 2011 at 23:11, Tim Hockin <thockin@hockin.org> wrote:
> > On Mon, Aug 1, 2011 at 4:19 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> >> On Fri, 29 Jul 2011 18:13:22 +0200
> >> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> >>
> >>> Reminder:
> >>>
> >>> This patchset is aimed at reducing the impact of a forkbomb to a
> >>> cgroup boundaries, thus minimizing the consequences of such an attack
> >>> against the rest of the system.
> >>>
> >>> This can be useful when cgroups are used to stage some processes or run
> >>> untrustees.
> >>
> >> Really?  How useful?  Why is it useful enough to justify adding code
> >> such as this to the kernel?
> >>
> >> Is forkbomb-prevention the only use?  Others have proposed different
> >> ways of preventing forkbombs which were independent of cgroups - is
> >> this way better and if so, why?
> >
> > I certainly want this for exactly the proposed use - putting a bound
> > on threads/tasks per container.  It's rlimits but more useful.
> >
> > IMHO, most every limit that can be set at a system level should be
> > settable at a cgroup level.  This is just one more isolation leak.
> 
> Such functionality in general sounds useful. System management tools
> want to be able to stop a service race-free. A 'service' in the sense
> of 'a group of processes and all the future processes it creates'.

Some background here: we had an offlist discussion where we debated
how to safely kill all tasks in a cgroup in a race-free way.
This is also needed for containers. So that's how we found a secondary
purpose for this task counter subsystem. Setting the value 0 in the
tasks.limit file would reject any future fork in the cgroup, making the
whole group of tasks killable without worrying about concurrent forks,
which otherwise might induce an unbounded number of iterations.
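
For illustration, the kill sequence seen from userspace could look like
this hypothetical sketch (file names as in this patchset, error
handling trimmed):

	#include <stdio.h>
	#include <signal.h>

	/* Hypothetical helper: block forks, then one kill pass suffices. */
	static int kill_cgroup(const char *cgrp_dir)
	{
		char path[4096];
		FILE *f;
		int pid;

		/* 1) Forbid any further fork in the group. */
		snprintf(path, sizeof(path), "%s/tasks.limit", cgrp_dir);
		f = fopen(path, "w");
		if (!f)
			return -1;
		fprintf(f, "0\n");
		fclose(f);

		/* 2) The task list can now only shrink. */
		snprintf(path, sizeof(path), "%s/tasks", cgrp_dir);
		f = fopen(path, "r");
		if (!f)
			return -1;
		while (fscanf(f, "%d", &pid) == 1)
			kill(pid, SIGKILL);
		fclose(f);
		return 0;
	}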

So there are now two possible uses of that task counter subsystem:

- protection against fork bombs in a container
- race-free killing of a cgroup

And this secondary purpose is also potentially useful for systemd:


> A common problem here are user sessions that a login creates. For
> some systems it is required that, after the user logs out, all
> processes the user has started are properly cleaned up. A common
> example of such enforcement are servers at schools and universities
> that do not want to allow users to leave things like file sharing
> programs running in the background after they log out.
> 
> We currently do that in systemd by tracking these sessions in a cgroup
> and killing all pids in that group. This currently requires some
> cooperation from the services to be successful. If they forked
> faster than we kill them, we would never be able to finish the task.
> 
> Such user sessions are generally untrusted code and processes, and the
> system management that cleans up after the end of the session runs
> privileged. It would be nice to allow trusted code to race-free
> kill all remaining processes of such an untrusted session. This is not
> so much about fork bombs; things might not even have bad things in
> mind. This would be more like an rlimit for a 'group of pids' that
> allows race-free resource management of the services.
> 
> For the actual implementation, I think it would be nicer to
> have such functionality at the core of cgroups, and not require a
> specific controller to be set up. We already track every single
> service in its own cgroup in a custom hierarchy. These groups just act
> as the container for all the pids belonging to the service, so we can
> track the service properly.
> 
> Naively looking at it as a user, we would like to be able to
> apply these limits for every cgroup right away, without needing to create
> another controller/subsystem/hierarchy.

So the problem with the task counter as a subsystem is that you could
mount it in your systemd cgroups hierarchy, but then it's no longer
available for those who want to use containers.

It would indeed be handy to have that task counter as a cgroup core
feature so that it's usable on any hierarchy. It would also allow
safely killing all tasks in a cgroup, and that sounds like something
that should be a cgroup core feature.

Now, as a counter-argument, bringing this in at the cgroup core level
would bring some more overhead and complication. It implies iterating,
on fork and exit, through all cgroups the task belongs to in every
hierarchy, and then charging/uncharging through all ancestors of these
cgroups.
With the subsystem, we only iterate through one cgroup and its
ancestors.
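
Roughly, a core-level hook would have to do something like this on
every fork (a sketch of a fragment; charge_one() is a hypothetical
stand-in for the actual charging):

	/* sketch: a core-level counter has to walk every hierarchy */
	for_each_active_root(root) {		/* all mounted hierarchies */
		cgrp = task_cgroup_from_root(tsk, root);
		for (; cgrp; cgrp = cgrp->parent)
			charge_one(cgrp);	/* hypothetical helper */
	}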

Now there are alternate ways to solve your issue. One could be
to mount a /sys/kernel/cgroups/task_counter mountpoint that anybody
interested in task counter features can use. And systemd
could move all its task gathering there (without maintaining
a secondary mountpoint).

The other way is to use the cgroup freezer to kill your tasks.
I'm not aware of the overhead it implies, though.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem)
  2011-08-18 14:33       ` [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem) Frederic Weisbecker
@ 2011-08-23 16:07         ` Paul Menage
  2011-08-24 17:54           ` Frederic Weisbecker
  0 siblings, 1 reply; 44+ messages in thread
From: Paul Menage @ 2011-08-23 16:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Kay Sievers, Li Zefan, Tim Hockin, Andrew Morton, LKML,
	Johannes Weiner, Aditya Kali, Oleg Nesterov

On Thu, Aug 18, 2011 at 7:33 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
> So the problem with the task counter as a subsystem is that you could
> mount it in your systemd cgroups hierarchy, but then it's no longer
> available for those who want to use containers.

Another possible option is something that I prototyped a couple of
years ago, but dropped due to lack of compelling need and demand - the
ability to have subsystems that can be bound on multiple hierarchies at
once. See

http://lkml.indiana.edu/hypermail/linux/kernel/0907.0/00574.html
http://lkml.indiana.edu/hypermail/linux/kernel/0907.0/00576.html
http://lkml.indiana.edu/hypermail/linux/kernel/0907.0/00577.html

It's applicable to subsystems whose state isn't tied to any specific
single resource in the kernel outside of cgroups (so e.g. the CPU
scheduler couldn't be usefully multi-bindable, since the CPU cgroup
state is tied to the machine's single CPU scheduler).

In the end I didn't work further on it, since it seemed that most
things that needed to be available to multiple hierarchies could more
simply be added to the core cgroups subsystem and automatically be
available on all hierarchies. But the point about tracking overhead
for fork/exit is certainly something that could make this worthwhile.

Paul

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem)
  2011-08-23 16:07         ` Paul Menage
@ 2011-08-24 17:54           ` Frederic Weisbecker
  2011-08-26  7:28             ` Li Zefan
  2011-08-26 15:16             ` Paul Menage
  0 siblings, 2 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-24 17:54 UTC (permalink / raw)
  To: Paul Menage
  Cc: Kay Sievers, Li Zefan, Tim Hockin, Andrew Morton, LKML,
	Johannes Weiner, Aditya Kali, Oleg Nesterov

On Tue, Aug 23, 2011 at 09:07:59AM -0700, Paul Menage wrote:
> On Thu, Aug 18, 2011 at 7:33 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> >
> > So the problem with the task counter as a subsystem is that you could
> > mount it in your systemd cgroups hierarchy, but then it's no longer
> > available for those who want to use containers.
> 
> Another possible option is something that I prototyped a couple of
> years ago, but dropped due to lack of compelling need and demand - the
> ability to have subsystems that can be bound on multiple hierarchies at
> once. See
> 
> http://lkml.indiana.edu/hypermail/linux/kernel/0907.0/00574.html
> http://lkml.indiana.edu/hypermail/linux/kernel/0907.0/00576.html
> http://lkml.indiana.edu/hypermail/linux/kernel/0907.0/00577.html
> 
> It's applicable to subsystems whose state isn't tied to any specific
> single resource in the kernel outside of cgroups (so e.g. the CPU
> scheduler couldn't be usefully multi-bindable, since the CPU cgroup
> state is tied to the machine's single CPU scheduler).
> 
> In the end I didn't work further on it, since it seemed that most
> things that needed to be available to multiple hierarchies could more
> simply be added to the core cgroups subsystem and automatically be
> available on all hierarchies. But the point about tracking overhead
> for fork/exit is certainly something that could make this worthwhile.

That sounds like a perfect fit. I like that much better because there
should be no noticeable overhead when the task counter subsys is
not mounted anywhere, compared to a pure core feature.

So I'm going to continue to work on that task counter subsystem and
I will unearth your old patch afterward to make that work on several
mountpoints once we are sure this is needed for systemd.

It seems your patch doesn't handle the ->fork() and ->exit() calls.
We probably need quick access to the states of multi-bindable subsystems
from the task, e.g. some lists available from task->cgroups; I don't know yet.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem)
  2011-08-24 17:54           ` Frederic Weisbecker
@ 2011-08-26  7:28             ` Li Zefan
  2011-08-26 14:58               ` Paul Menage
  2011-08-26 15:16             ` Paul Menage
  1 sibling, 1 reply; 44+ messages in thread
From: Li Zefan @ 2011-08-26  7:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul Menage, Kay Sievers, Tim Hockin, Andrew Morton, LKML,
	Johannes Weiner, Aditya Kali, Oleg Nesterov

Frederic Weisbecker wrote:
> On Tue, Aug 23, 2011 at 09:07:59AM -0700, Paul Menage wrote:
>> On Thu, Aug 18, 2011 at 7:33 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>>>
>>> So the problem with the task counter as a subsystem is that you could
>>> mount it in your systemd cgroups hierarchy, but then it's no longer
>>> available for those who want to use containers.
>>
>> Another possible option is something that I prototyped a couple of
>> years ago, but dropped due to lack of compelling need and demand - the
>> ability to have subsystems that can be bound on multiple hierarchies at
>> once. See
>>
>> http://lkml.indiana.edu/hypermail/linux/kernel/0907.0/00574.html
>> http://lkml.indiana.edu/hypermail/linux/kernel/0907.0/00576.html
>> http://lkml.indiana.edu/hypermail/linux/kernel/0907.0/00577.html
>>
>> It's applicable to subsystems whose state isn't tied to any specific
>> single resource in the kernel outside of cgroups (so e.g. the CPU
>> scheduler couldn't be usefully multi-bindable, since the CPU cgroup
>> state is tied to the machine's single CPU scheduler).
>>
>> In the end I didn't work further on it, since it seemed that most
>> things that needed to be available to multiple hierarchies could more
>> simply be added to the core cgroups subsystem and automatically be
>> available on all hierarchies. But the point about tracking overhead
>> for fork/exit is certainly something that could make this worthwhile.
> 
> That sounds like a perfect fit. I like that much better because there
> should be no noticeable overhead when the task counter subsys is
> not mounted anywhere, compared to a pure core feature.
> 

I had a patchset that can make things even better. The patchset is
to bind/unbind a subsystem to/from an existing cgroup hierarchy.

So if you find you need to use the task counter but it's not available
in the hierarchy you've set up, you can do this:

	# mount -o remount,task_num xxx /cgroup

Currently the above operation is supported only if there are no child
cgroups in /cgroup.

> So I'm going to continue to work on that task counter subsystem and
> I will unearth your old patch afterward to make that work on several
> mountpoints once we are sure this is needed for systemd.
> 
> It seems your patch doesn't handle the ->fork() and ->exit() calls.
> We probably need quick access to the states of multi-bindable subsystems
> from the task, e.g. some lists available from task->cgroups; I don't know yet.
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem)
  2011-08-26  7:28             ` Li Zefan
@ 2011-08-26 14:58               ` Paul Menage
  2011-09-06  9:06                 ` Li Zefan
  0 siblings, 1 reply; 44+ messages in thread
From: Paul Menage @ 2011-08-26 14:58 UTC (permalink / raw)
  To: Li Zefan
  Cc: Frederic Weisbecker, Kay Sievers, Tim Hockin, Andrew Morton,
	LKML, Johannes Weiner, Aditya Kali, Oleg Nesterov

On Fri, Aug 26, 2011 at 12:28 AM, Li Zefan <lizf@cn.fujitsu.com> wrote:
>
> I had a patchset that can make things even better. The patchset is
> to bind/unbind a subsystem to/from an existing cgroup hierarchy.

Sigh, I'd been meaning to comment on that great patchset for ages, but
they kind of slipped out of my attention :-( I'll try to look over
them again - do they still apply cleanly after all the recent changes?

The bindability is orthogonal to the multi-subsystem support, I think
- they definitely complement each other.

Paul

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem)
  2011-08-24 17:54           ` Frederic Weisbecker
  2011-08-26  7:28             ` Li Zefan
@ 2011-08-26 15:16             ` Paul Menage
  2011-08-27 13:40               ` Frederic Weisbecker
  2011-08-31 21:54               ` Frederic Weisbecker
  1 sibling, 2 replies; 44+ messages in thread
From: Paul Menage @ 2011-08-26 15:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Kay Sievers, Li Zefan, Tim Hockin, Andrew Morton, LKML,
	Johannes Weiner, Aditya Kali, Oleg Nesterov

On Wed, Aug 24, 2011 at 10:54 AM, Frederic Weisbecker
<fweisbec@gmail.com> wrote:
>
> It seems your patch doesn't handle the ->fork() and ->exit() calls.
> We probably need quick access to the states of multi-bindable subsystems
> from the task, e.g. some lists available from task->cgroups; I don't know yet.
>

That state is available, but currently only while holding cgroup_mutex
- at least, that's what task_cgroup_from_root() requires.

It might be the case that we could achieve the same effect by just
locking the task, so the pre-condition for task_cgroup_from_root()
would be either that cgroup_mutex is held or the task lock is held.

We could extend the signature of cgroup_subsys.fork to include a
reference to the cgroup; for the singly-bindable subsystems this would
be trivially available via task->cgroups; for the multi-bindable
subsystems then for each hierarchy that the subsystem is mounted on
we'd call task_cgroup_from_root() to get the cgroup for that
hierarchy. So multi-bindable subsystems with fork/exit callbacks would
get called once for each mounted instance of the subsystem.

This would still make the task counter subsystem a bit painful - it
would read_lock a global rwlock (css_set_lock) on every fork/exit in
order to find the cgroup to charge/uncharge. I'm not sure how painful
that would be on a big system. If that were a noticeable performance
problem, we could have a variable-length extension on the end of
css_set that contains a list of hierarchy_index/cgroup pairs for any
hierarchies that had multi-bindable subsystems (or maybe for all
hierarchies, for simplicity). This would make creating a css_set a
little bit more complicated, but overall shouldn't be too painful, and
would make the problem of finding a cgroup for a given hierarchy
trivial.
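
In other words, something like this (a sketch, field names invented):

	struct cgrp_hier_link {
		int hierarchy_id;	/* which hierarchy */
		struct cgroup *cgrp;	/* this css_set's cgroup there */
	};

	struct css_set {
		/* ... existing fields (refcount, subsys[], cg_links, ...) ... */
		int nr_hier_links;
		struct cgrp_hier_link hier_links[];	/* one entry per hierarchy */
	};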

Paul

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem)
  2011-08-26 15:16             ` Paul Menage
@ 2011-08-27 13:40               ` Frederic Weisbecker
  2011-08-31 22:36                 ` Paul Menage
  2011-08-31 21:54               ` Frederic Weisbecker
  1 sibling, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-27 13:40 UTC (permalink / raw)
  To: Paul Menage
  Cc: Kay Sievers, Li Zefan, Tim Hockin, Andrew Morton, LKML,
	Johannes Weiner, Aditya Kali, Oleg Nesterov

On Fri, Aug 26, 2011 at 08:16:32AM -0700, Paul Menage wrote:
> On Wed, Aug 24, 2011 at 10:54 AM, Frederic Weisbecker
> <fweisbec@gmail.com> wrote:
> >
> > It seems your patch doesn't handle the ->fork() and ->exit() calls.
> > We probably need quick access to the states of multi-bindable subsystems
> > from the task, e.g. some lists available from task->cgroups; I don't know yet.
> >
> 
> That state is available, but currently only while holding cgroup_mutex
> - at least, that's what task_cgroup_from_root() requires.
> 
> It might be the case that we could achieve the same effect by just
> locking the task, so the pre-condition for task_cgroup_from_root()
> would be either that cgroup_mutex is held or the task lock is held.
> 
> We could extend the signature of cgroup_subsys.fork to include a
> reference to the cgroup; for the singly-bindable subsystems this would
> be trivially available via task->cgroups; for the multi-bindable
> subsystems then for each hierarchy that the subsystem is mounted on
> we'd call task_cgroup_from_root() to get the cgroup for that
> hierarchy. So multi-bindable subsystems with fork/exit callbacks would
> get called once for each mounted instance of the subsystem.
> 
> This would still make the task counter subsystem a bit painful - it
> would read_lock a global rwlock (css_set_lock) on every fork/exit in
> order to find the cgroup to charge/uncharge. I'm not sure how painful
> that would be on a big system. If that were a noticeable performance
> problem, we could have a variable-length extension on the end of
> css_set that contains a list of hierarchy_index/cgroup pairs for any
> hierarchies that had multi-bindable subsystems (or maybe for all
> hierarchies, for simplicity). This would make creating a css_set a
> little bit more complicated, but overall shouldn't be too painful, and
> would make the problem of finding a cgroup for a given hierarchy
> trivial.

Oh, you're right. My first idea was to reference multi-bindable
subsystem states in cgroup_subsys_state, like it's done currently
for singleton subsystems. But this indeed requires cgroup_mutex
or task_lock, and only the latter looks sensible in the fork/exit path.
And if that becomes a scalability problem, we can still have a
dedicated lock for cgroup attach/detach on tasks.

Whatever we do, we need that lock. So we can pick your
solution that references cgroups that belong to multi-bindable
subsystems for a given task in css_set, or we can make tsk->cgroups->subsys[]
a variable-size array that references 1 * singleton and N * multi-bindable
subsystems, N being the number of hierarchies that use
a given subsystem.

What do you think?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/8] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks
  2011-08-17  2:40   ` Li Zefan
@ 2011-08-27 13:58     ` Frederic Weisbecker
  0 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-27 13:58 UTC (permalink / raw)
  To: Li Zefan; +Cc: LKML, Andrew Morton, Johannes Weiner, Aditya Kali, Oleg Nesterov

On Wed, Aug 17, 2011 at 10:40:35AM +0800, Li Zefan wrote:
> Frederic Weisbecker wrote:
> > This is to prepare the integration of a new max number of proc
> > cgroup subsystem. We'll need to release some resources from the
> > previous cgroup.
> > 
> 
> Documentation/cgroups/cgroups.txt needs update accordingly.

Yup, will update.

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/8] cgroups: New cancel_attach_task subsystem callback
  2011-08-17  2:40   ` Li Zefan
@ 2011-08-27 13:58     ` Frederic Weisbecker
  0 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-27 13:58 UTC (permalink / raw)
  To: Li Zefan; +Cc: LKML, Andrew Morton, Johannes Weiner, Aditya Kali, Oleg Nesterov

On Wed, Aug 17, 2011 at 10:40:51AM +0800, Li Zefan wrote:
> Frederic Weisbecker wrote:
> > To cancel a process attachment on a subsystem, we only call the
> > cancel_attach() callback once on the leader but we have no
> > way to cancel the attachment individually for each member of
> > the process group.
> > 
> > This is going to be needed for the max number of tasks subsystem
> > that is coming.
> > 
> > To prepare for this integration, call a new cancel_attach_task()
> > callback on each task of the group until we reach the member that
> > failed to attach.
> > 
> 
> Document this new callback in Documentation/cgroups/cgroups.txt

I will.

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/8] cgroups: Ability to stop res charge propagation on bounded ancestor
  2011-08-17  2:41   ` Li Zefan
@ 2011-08-27 13:59     ` Frederic Weisbecker
  0 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-27 13:59 UTC (permalink / raw)
  To: Li Zefan; +Cc: LKML, Andrew Morton, Johannes Weiner, Aditya Kali, Oleg Nesterov

On Wed, Aug 17, 2011 at 10:41:57AM +0800, Li Zefan wrote:
> > +int res_counter_charge(struct res_counter *counter, unsigned long val,
> > +			struct res_counter **limit_fail_at)
> > +{
> > +	return res_counter_charge_until(counter, NULL, val, limit_fail_at);
> > +}
> > +
> 
> static inline in res_counter.h ?
> 
> ...
> > +void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> > +{
> > +	res_counter_uncharge_until(counter, NULL, val);
> > +}
> >  
> 
> ditto.

Right, sounds better inlined.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 8/8] res_counter: Allow charge failure pointer to be null
  2011-08-17  2:44   ` Li Zefan
@ 2011-08-27 14:05     ` Frederic Weisbecker
  0 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-27 14:05 UTC (permalink / raw)
  To: Li Zefan
  Cc: LKML, Andrew Morton, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov

On Wed, Aug 17, 2011 at 10:44:22AM +0800, Li Zefan wrote:
> Frederic Weisbecker wrote:
> > So that callers of res_counter_charge() don't have to create and
> > pass this pointer even if they aren't interested in it.
> > 
> 
> Why not make this change earlier, or even do it in "[PATCH 5/8]"?

Would be simpler, yeah.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 7/8] cgroups: Add a task counter subsystem
  2011-08-17  3:18   ` Li Zefan
@ 2011-08-27 14:16     ` Frederic Weisbecker
  0 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-27 14:16 UTC (permalink / raw)
  To: Li Zefan
  Cc: LKML, Andrew Morton, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov

On Wed, Aug 17, 2011 at 11:18:58AM +0800, Li Zefan wrote:
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 412c21b..bea98d2d 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -690,6 +690,13 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
> >  	  select this option (if, for some reason, they need to disable it
> >  	  then noswapaccount does the trick).
> >  
> > +config CGROUP_TASK_COUNTER
> > +        bool "Control number of tasks in a cgroup"
> 
> TAB

I suck at configuring emacs correctly, it always makes me do some crazy things
on Kconfig files.

> 
> > +	depends on RESOURCE_COUNTERS
> > +	help
> > +	  This option let the user to set up an upper bound allowed number
> 
> will let?

First of all I should just not start with "This option ..." :)

> 
> > +	  of tasks inside a cgroup.
> > +
> >  config CGROUP_PERF
> >  	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
> >  	depends on PERF_EVENTS && CGROUPS
> 
> ...
> 
> > +int cgroup_task_counter_fork(struct task_struct *child)
> > +{
> > +	struct cgroup_subsys_state *css = child->cgroups->subsys[tasks_subsys_id];
> > +	struct cgroup *cgrp = css->cgroup;
> > +	struct res_counter *limit_fail_at;
> > +
> > +	/* Optimize for the root cgroup case, which doesn't have a limit */
> > +	if (!cgrp->parent)
> > +		return 0;
> > +
> > +	return res_counter_charge(cgroup_task_counter_res(cgrp), 1, &limit_fail_at);
> 
> I think we should change the return value of res_counter_charge.
> Currently it returns -ENOMEM when we exceed the limit.

Yeah, like I said previously in the thread, I need to tweak that in
the res_counter API.


> 
> > +}
> > +
> > +struct cgroup_subsys tasks_subsys = {
> > +	.name			= "tasks",
> > +	.subsys_id		= tasks_subsys_id,
> > +	.create			= task_counter_create,
> > +	.post_clone		= task_counter_post_clone,
> > +	.destroy		= task_counter_destroy,
> > +	.exit			= task_counter_exit,
> > +	.can_attach_task	= task_counter_can_attach_task,
> > +	.cancel_attach_task	= task_counter_cancel_attach_task,
> > +	.attach_task		= task_counter_attach_task,
> > +	.populate		= task_counter_populate,
> > +	.early_init		= 1,
> 
> Just set early_init to 0, since this subsystem doesn't have to be
> initialized early during kernel boot.

Ah right, now that we don't touch the root cgroup anymore, we can do that
later.

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem)
  2011-08-26 15:16             ` Paul Menage
  2011-08-27 13:40               ` Frederic Weisbecker
@ 2011-08-31 21:54               ` Frederic Weisbecker
  1 sibling, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-08-31 21:54 UTC (permalink / raw)
  To: Paul Menage
  Cc: Kay Sievers, Li Zefan, Tim Hockin, Andrew Morton, LKML,
	Johannes Weiner, Aditya Kali, Oleg Nesterov

On Fri, Aug 26, 2011 at 08:16:32AM -0700, Paul Menage wrote:
> On Wed, Aug 24, 2011 at 10:54 AM, Frederic Weisbecker
> <fweisbec@gmail.com> wrote:
> >
> > It seems your patch doesn't handle the ->fork() and ->exit() calls.
> > We probably need quick access to the states of the multi-bindable
> > subsystems from the task, maybe some lists reachable from
> > task->cgroups; I don't know yet.
> >
> 
> That state is available, but currently only while holding cgroup_mutex
> - at least, that's what task_cgroup_from_root() requires.
> 
> It might be the case that we could achieve the same effect by just
> locking the task, so the pre-condition for task_cgroup_from_root()
> would be either that cgroup_mutex is held or the task lock is held.

Now I wonder: is that locking necessary if we only want to access a
subsystem state through task->cgroups->subsys[i] from the ->fork()
callback?

The task is not yet added to the thread_group and its pid is not yet
hashed, so I guess it cannot concurrently be moved to another cgroup.

We would need a special flavour of task->cgroups->subsys for
multi-bindable subsystems, but the reasoning is the same.
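
Something like the following should then be safe without cgroup_mutex
(a sketch with a hypothetical helper name, and only valid if the
invariant above really holds):

/* Runs from the ->fork() callback path, before the child is linked
 * into the thread group and before its pid is hashed, so it cannot
 * be migrated to another cgroup under us. */
static struct cgroup_subsys_state *
task_counter_fork_css(struct task_struct *child)
{
        return child->cgroups->subsys[tasks_subsys_id];
}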


* Re: [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem)
  2011-08-27 13:40               ` Frederic Weisbecker
@ 2011-08-31 22:36                 ` Paul Menage
  0 siblings, 0 replies; 44+ messages in thread
From: Paul Menage @ 2011-08-31 22:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Kay Sievers, Li Zefan, Tim Hockin, Andrew Morton, LKML,
	Johannes Weiner, Aditya Kali, Oleg Nesterov

On Sat, Aug 27, 2011 at 6:40 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
> Whatever we do, we need that lock. So we can either pick your
> solution, which references the cgroups belonging to multi-bindable
> subsystems for a given task in css_set, or make tsk->cgroups->subsys[]
> a variable-sized array referencing 1 * singletons and N * multi-bindable
> subsystems, N being the number of hierarchies that use a given
> subsystem.

I think that the two approaches give access to the same amount of
information, and would need to be constructed and used in
approximately the same way.

I think that adding the hierarchy->cgroup mapping as a variable-sized
array is conceptually simpler than making the existing fixed-size
subsys array variable-sized and shifting it whenever a hierarchy is
added/removed.

But I think the first approach to try would be:

- make task_cgroup_from_root callable with either the task being
locked or cgroup_mutex being held
- expose this in some way outside of the cgroups framework - maybe a
function that returns an array of cgroups for a task, one for each
hierarchy.

This wouldn't involve adding any extra data structures, and we could
later make css_set cache more information if it turned out that it
needed better performance.
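
Roughly this shape (the second prototype is purely hypothetical, just to
make the idea concrete; only task_cgroup_from_root() exists today, as a
static helper in kernel/cgroup.c):

/* Pre-condition: caller holds either cgroup_mutex or task_lock(task). */
struct cgroup *task_cgroup_from_root(struct task_struct *task,
                                     struct cgroupfs_root *root);

/* Hypothetical: fill @cgrps with @task's cgroup in each of the first
 * @max hierarchies; returns the number of entries written. */
int task_all_cgroups(struct task_struct *task,
                     struct cgroup **cgrps, int max);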

Paul


* Re: [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem)
  2011-08-26 14:58               ` Paul Menage
@ 2011-09-06  9:06                 ` Li Zefan
  0 siblings, 0 replies; 44+ messages in thread
From: Li Zefan @ 2011-09-06  9:06 UTC (permalink / raw)
  To: Paul Menage
  Cc: Frederic Weisbecker, Kay Sievers, Tim Hockin, Andrew Morton,
	LKML, Johannes Weiner, Aditya Kali, Oleg Nesterov

On 2011-08-26 22:58, Paul Menage wrote:
> On Fri, Aug 26, 2011 at 12:28 AM, Li Zefan <lizf@cn.fujitsu.com> wrote:
>>
>> I had a patchset that can make things even better. The patchset is
>> to bind/unbind a subsystem to/from an existing cgroup hierarchy.
> 
> Sigh, I'd been meaning to comment on that great patchset for ages, but
> it kind of slipped out of my attention :-( I'll try to look over the
> patches again - do they still apply cleanly after all the recent changes?
> 

I was interrupted by other work at the time and didn't push the
patchset any further. I'll rebase it and send it out.

> The bindability is orthogonal to the multi-subsystem support, I think
> - they definitely complement each other.
> 
> Paul
> 


end of thread

Thread overview: 44+ messages
2011-07-29 16:13 [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 1/8] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
2011-08-09 15:17   ` Oleg Nesterov
2011-08-09 17:31     ` Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 2/8] cgroups: New resource counter inheritance API Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 3/8] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks Frederic Weisbecker
2011-08-17  2:40   ` Li Zefan
2011-08-27 13:58     ` Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 4/8] cgroups: New cancel_attach_task subsystem callback Frederic Weisbecker
2011-08-17  2:40   ` Li Zefan
2011-08-27 13:58     ` Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 5/8] cgroups: Ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
2011-08-17  2:41   ` Li Zefan
2011-08-27 13:59     ` Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 6/8] cgroups: Add res counter common ancestor searching Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 7/8] cgroups: Add a task counter subsystem Frederic Weisbecker
2011-08-01 23:13   ` Andrew Morton
2011-08-04 14:05     ` Frederic Weisbecker
2011-08-09 15:11   ` Oleg Nesterov
2011-08-09 17:27     ` Frederic Weisbecker
2011-08-09 17:57       ` Oleg Nesterov
2011-08-09 18:09         ` Frederic Weisbecker
2011-08-09 18:19           ` Oleg Nesterov
2011-08-09 18:34             ` Frederic Weisbecker
2011-08-09 18:39               ` Oleg Nesterov
2011-08-17  3:18   ` Li Zefan
2011-08-27 14:16     ` Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 8/8] res_counter: Allow charge failure pointer to be null Frederic Weisbecker
2011-08-17  2:44   ` Li Zefan
2011-08-27 14:05     ` Frederic Weisbecker
2011-08-01 23:19 ` [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Andrew Morton
2011-08-03 14:29   ` Frederic Weisbecker
2011-08-12 21:11   ` Tim Hockin
2011-08-16 16:01     ` Kay Sievers
2011-08-18 14:33       ` [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem) Frederic Weisbecker
2011-08-23 16:07         ` Paul Menage
2011-08-24 17:54           ` Frederic Weisbecker
2011-08-26  7:28             ` Li Zefan
2011-08-26 14:58               ` Paul Menage
2011-09-06  9:06                 ` Li Zefan
2011-08-26 15:16             ` Paul Menage
2011-08-27 13:40               ` Frederic Weisbecker
2011-08-31 22:36                 ` Paul Menage
2011-08-31 21:54               ` Frederic Weisbecker
