* [PATCH 00/11 v5] cgroups: Task counter subsystem
@ 2011-09-12 23:11 Frederic Weisbecker
  2011-09-12 23:11 ` [PATCH 01/11] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
                   ` (13 more replies)
  0 siblings, 14 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-12 23:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Containers

No functional changes. Only documentation and comments added.
Checkpatch.pl fixes, etc...

This mostly addresses Andrew's reviews on v4.

Changes in v5:

- commented res_counter_common_ancestor (patch 6)
- commented res_counter_inherit (patch 2)
- merged documentation and task counter subsystem patch (patch 9)
- rename cgroup_task_counter_res to cgroup_task_res_counter (patch 9)
- more comments / fixed comments in cgroup_task_counter.c (patch 9)
- document res_counter_charge return value (patch 8)
- document res_counter_charge_until (patch 5)
- more comment in kconfig help (patch 9)
- fix various checkpatch reported issues

Frederic Weisbecker (11):
  cgroups: Add res_counter_write_u64() API
  cgroups: New resource counter inheritance API
  cgroups: Add previous cgroup in can_attach_task/attach_task callbacks
  cgroups: New cancel_attach_task subsystem callback
  cgroups: Ability to stop res charge propagation on bounded ancestor
  cgroups: Add res counter common ancestor searching
  res_counter: Allow charge failure pointer to be null
  cgroups: Pull up res counter charge failure interpretation to caller
  cgroups: Add a task counter subsystem
  cgroups: Allow subsystems to cancel a fork
  cgroups: Convert task counter to use the subsys fork callback

 Documentation/cgroups/cgroups.txt          |   13 ++-
 Documentation/cgroups/resource_counter.txt |   20 +++-
 Documentation/cgroups/task_counter.txt     |  126 +++++++++++++++
 block/blk-cgroup.c                         |   12 +-
 include/linux/cgroup.h                     |   22 ++-
 include/linux/cgroup_subsys.h              |    8 +
 include/linux/res_counter.h                |   27 +++-
 init/Kconfig                               |    9 +
 kernel/Makefile                            |    1 +
 kernel/cgroup.c                            |   58 ++++++--
 kernel/cgroup_freezer.c                    |    9 +-
 kernel/cgroup_task_counter.c               |  239 ++++++++++++++++++++++++++++
 kernel/cpuset.c                            |    6 +-
 kernel/events/core.c                       |    5 +-
 kernel/exit.c                              |    2 +-
 kernel/fork.c                              |    7 +-
 kernel/res_counter.c                       |   86 +++++++++--
 kernel/sched.c                             |    6 +-
 18 files changed, 601 insertions(+), 55 deletions(-)
 create mode 100644 Documentation/cgroups/task_counter.txt
 create mode 100644 kernel/cgroup_task_counter.c

-- 
1.7.5.4


* [PATCH 01/11] cgroups: Add res_counter_write_u64() API
  2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
@ 2011-09-12 23:11 ` Frederic Weisbecker
  2011-09-12 23:11 ` [PATCH 02/11] cgroups: New resource counter inheritance API Frederic Weisbecker
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-12 23:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Containers, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

Extend the resource counter API with a mirror of
res_counter_read_u64() to make it handy to update a resource
counter value from a cgroup subsystem u64 value file.
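
For illustration, here is a minimal sketch of the kind of cgroup u64 file
handler this helper is meant to serve (the "foo" subsystem and its
cgroup_foo_res_counter() accessor are hypothetical; the task counter
patch later in this series does essentially this):

	static int foo_write_u64(struct cgroup *cgrp, struct cftype *cft,
				 u64 val)
	{
		/* cft->private carries the res_counter member, e.g. RES_LIMIT */
		res_counter_write_u64(cgroup_foo_res_counter(cgrp),
				      cft->private, val);
		return 0;
	}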

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   25 +++++++++++++++++++------
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..1b3fe05 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -82,6 +82,8 @@ int res_counter_memparse_write_strategy(const char *buf,
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buffer, write_strategy_fn write_strategy);
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 34683ef..0faafcc 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -168,12 +168,26 @@ int res_counter_memparse_write_strategy(const char *buf,
 	return 0;
 }
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	unsigned long long *target;
+	unsigned long flags;
+
+	/*
+	 * We need the lock to protect against concurrent add/dec on 32 bits.
+	 * No need to ifdef this; it's seldom used.
+	 */
+	spin_lock_irqsave(&counter->lock, flags);
+	target = res_counter_member(counter, member);
+	*target = val;
+	spin_unlock_irqrestore(&counter->lock, flags);
+}
+
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buf, write_strategy_fn write_strategy)
 {
 	char *end;
-	unsigned long flags;
-	unsigned long long tmp, *val;
+	unsigned long long tmp;
 
 	if (write_strategy) {
 		if (write_strategy(buf, &tmp))
@@ -183,9 +197,8 @@ int res_counter_write(struct res_counter *counter, int member,
 		if (*end != '\0')
 			return -EINVAL;
 	}
-	spin_lock_irqsave(&counter->lock, flags);
-	val = res_counter_member(counter, member);
-	*val = tmp;
-	spin_unlock_irqrestore(&counter->lock, flags);
+
+	res_counter_write_u64(counter, member, tmp);
+
 	return 0;
 }
-- 
1.7.5.4


* [PATCH 02/11] cgroups: New resource counter inheritance API
  2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
  2011-09-12 23:11 ` [PATCH 01/11] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
@ 2011-09-12 23:11 ` Frederic Weisbecker
  2011-09-12 23:11 ` [PATCH 03/11] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks Frederic Weisbecker
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-12 23:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Containers, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo

Provide an API to inherit a counter value from a parent.
This can be useful to implement cgroup.clone_children on
a resource counter.

The resources of the children remain limited by those of the
parent anyway, so this only provides a default setting behaviour
when clone_children is set.
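
As an illustration, a subsystem could honour clone_children from its
post_clone() callback with something like the sketch below (the "foo"
names are hypothetical; the task counter patch later in this series
does this for its RES_LIMIT member):

	static void foo_post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
	{
		/* default the new child's limit to its parent's limit */
		res_counter_inherit(cgroup_foo_res_counter(cgrp), RES_LIMIT);
	}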

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   18 ++++++++++++++++++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 1b3fe05..109d118 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -84,6 +84,8 @@ int res_counter_write(struct res_counter *counter, int member,
 
 void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
 
+void res_counter_inherit(struct res_counter *counter, int member);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 0faafcc..37abf4e 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -202,3 +202,21 @@ int res_counter_write(struct res_counter *counter, int member,
 
 	return 0;
 }
+
+/*
+ * Simple inheritance implementation to get the same value
+ * as the parent. However this doesn't enforce the child value
+ * to always stay below the one of the parent. But the child is
+ * subject to its parent's limitation anyway.
+ */
+void res_counter_inherit(struct res_counter *counter, int member)
+{
+	struct res_counter *parent;
+	unsigned long long val;
+
+	parent = counter->parent;
+	if (parent) {
+		val = res_counter_read_u64(parent, member);
+		res_counter_write_u64(counter, member, val);
+	}
+}
-- 
1.7.5.4


* [PATCH 03/11] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks
  2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
  2011-09-12 23:11 ` [PATCH 01/11] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
  2011-09-12 23:11 ` [PATCH 02/11] cgroups: New resource counter inheritance API Frederic Weisbecker
@ 2011-09-12 23:11 ` Frederic Weisbecker
  2011-09-12 23:11 ` [PATCH 04/11] cgroups: New cancel_attach_task subsystem callback Frederic Weisbecker
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-12 23:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Containers, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

This prepares for the integration of the upcoming max number of
tasks cgroup subsystem: we'll need to release some resources from
the previous cgroup.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 Documentation/cgroups/cgroups.txt |    6 ++++--
 block/blk-cgroup.c                |   12 ++++++++----
 include/linux/cgroup.h            |    6 ++++--
 kernel/cgroup.c                   |   11 +++++++----
 kernel/cgroup_freezer.c           |    3 ++-
 kernel/cpuset.c                   |    6 ++++--
 kernel/events/core.c              |    5 +++--
 kernel/sched.c                    |    6 ++++--
 8 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index cd67e90..0621e93 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -605,7 +605,8 @@ called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
 attach() or cancel_attach() will be called in future.
 
-int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+int can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+		    struct task_struct *tsk);
 (cgroup_mutex held by caller)
 
 As can_attach, but for operations that must be run once per task to be
@@ -635,7 +636,8 @@ void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 Called after the task has been attached to the cgroup, to allow any
 post-attachment activity that requires memory allocations or blocking.
 
-void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+void attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+		 struct task_struct *tsk);
 (cgroup_mutex held by caller)
 
 As attach, but for operations that must be run once per task to be attached,
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bcaf16e..6eddc5f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,8 +30,10 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
 						  struct cgroup *);
-static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
-static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
+static int blkiocg_can_attach_task(struct cgroup *, struct cgroup *,
+				   struct task_struct *);
+static void blkiocg_attach_task(struct cgroup *, struct cgroup *,
+				struct task_struct *);
 static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
 static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 
@@ -1614,7 +1616,8 @@ done:
  * of the main cic data structures.  For now we allow a task to change
  * its cgroup only if it's the only owner of its ioc.
  */
-static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static int blkiocg_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				   struct task_struct *tsk)
 {
 	struct io_context *ioc;
 	int ret = 0;
@@ -1629,7 +1632,8 @@ static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	return ret;
 }
 
-static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static void blkiocg_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				struct task_struct *tsk)
 {
 	struct io_context *ioc;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index da7e4bc..ed34eb8 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -468,11 +468,13 @@ struct cgroup_subsys {
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			  struct task_struct *tsk);
-	int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
+	int (*can_attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			       struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			      struct task_struct *tsk);
 	void (*pre_attach)(struct cgroup *cgrp);
-	void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
+	void (*attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			    struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		       struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 84bdace..fafebdb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1844,7 +1844,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 			}
 		}
 		if (ss->can_attach_task) {
-			retval = ss->can_attach_task(cgrp, tsk);
+			retval = ss->can_attach_task(cgrp, oldcgrp, tsk);
 			if (retval) {
 				failed_ss = ss;
 				goto out;
@@ -1860,7 +1860,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		if (ss->pre_attach)
 			ss->pre_attach(cgrp);
 		if (ss->attach_task)
-			ss->attach_task(cgrp, tsk);
+			ss->attach_task(cgrp, oldcgrp, tsk);
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
@@ -2075,7 +2075,10 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			/* run on each task in the threadgroup. */
 			for (i = 0; i < group_size; i++) {
 				tsk = flex_array_get_ptr(group, i);
-				retval = ss->can_attach_task(cgrp, tsk);
+				oldcgrp = task_cgroup_from_root(tsk, root);
+
+				retval = ss->can_attach_task(cgrp,
+							     oldcgrp, tsk);
 				if (retval) {
 					failed_ss = ss;
 					cancel_failed_ss = true;
@@ -2141,7 +2144,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			/* attach each task to each subsystem */
 			for_each_subsys(root, ss) {
 				if (ss->attach_task)
-					ss->attach_task(cgrp, tsk);
+					ss->attach_task(cgrp, oldcgrp, tsk);
 			}
 		} else {
 			BUG_ON(retval != -ESRCH);
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e691818..c1421a1 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -175,7 +175,8 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	return 0;
 }
 
-static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static int freezer_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				   struct task_struct *tsk)
 {
 	rcu_read_lock();
 	if (__cgroup_freezing_or_frozen(tsk)) {
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 10131fd..427be38 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1390,7 +1390,8 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	return 0;
 }
 
-static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
+static int cpuset_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				  struct task_struct *task)
 {
 	return security_task_setscheduler(task);
 }
@@ -1418,7 +1419,8 @@ static void cpuset_pre_attach(struct cgroup *cont)
 }
 
 /* Per-thread attachment work. */
-static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
+static void cpuset_attach_task(struct cgroup *cont, struct cgroup *old,
+			       struct task_struct *tsk)
 {
 	int err;
 	struct cpuset *cs = cgroup_cs(cont);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b8785e2..509464e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7001,7 +7001,8 @@ static int __perf_cgroup_move(void *info)
 }
 
 static void
-perf_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *task)
+perf_cgroup_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			struct task_struct *task)
 {
 	task_function_call(task, __perf_cgroup_move, task);
 }
@@ -7017,7 +7018,7 @@ static void perf_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	if (!(task->flags & PF_EXITING))
 		return;
 
-	perf_cgroup_attach_task(cgrp, task);
+	perf_cgroup_attach_task(cgrp, old_cgrp, task);
 }
 
 struct cgroup_subsys perf_subsys = {
diff --git a/kernel/sched.c b/kernel/sched.c
index ccacdbd..72ce1b1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8967,7 +8967,8 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 }
 
 static int
-cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			   struct task_struct *tsk)
 {
 #ifdef CONFIG_RT_GROUP_SCHED
 	if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
@@ -8981,7 +8982,8 @@ cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 }
 
 static void
-cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+cpu_cgroup_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+		       struct task_struct *tsk)
 {
 	sched_move_task(tsk);
 }
-- 
1.7.5.4


* [PATCH 04/11] cgroups: New cancel_attach_task subsystem callback
  2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2011-09-12 23:11 ` [PATCH 03/11] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks Frederic Weisbecker
@ 2011-09-12 23:11 ` Frederic Weisbecker
  2011-09-12 23:11 ` [PATCH 05/11] cgroups: Ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-12 23:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Containers, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

To cancel a process attachment on a subsystem, we only call the
cancel_attach() callback once on the leader but we have no
way to cancel the attachment individually for each member of
the process group.

This is going to be needed for the max number of tasks subsystem
that is coming.

To prepare for this integration, call a new cancel_attach_task()
callback on each task of the group until we reach the member that
failed to attach.
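
As a sketch of the intended use (hypothetical "foo" names), a subsystem
that charges a per-task resource in can_attach_task() would revert that
charge from this new callback:

	static void foo_cancel_attach_task(struct cgroup *cgrp,
					   struct task_struct *tsk)
	{
		/* undo the charge taken in foo_can_attach_task() */
		res_counter_uncharge(cgroup_foo_res_counter(cgrp), 1);
	}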

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 Documentation/cgroups/cgroups.txt |    7 +++++++
 include/linux/cgroup.h            |    2 ++
 kernel/cgroup.c                   |   24 ++++++++++++++++++++----
 3 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 0621e93..3bc1dc9 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -623,6 +623,13 @@ function, so that the subsystem can implement a rollback. If not, not necessary.
 This will be called only about subsystems whose can_attach() operation have
 succeeded.
 
+void cancel_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+(cgroup_mutex held by caller)
+
+As cancel_attach, but for operations that must be cancelled once per
+task that wanted to be attached. This typically reverts the effect of
+can_attach_task().
+
 void pre_attach(struct cgroup *cgrp);
 (cgroup_mutex held by caller)
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ed34eb8..b62cf5e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -472,6 +472,8 @@ struct cgroup_subsys {
 			       struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			      struct task_struct *tsk);
+	void (*cancel_attach_task)(struct cgroup *cgrp,
+				   struct task_struct *tsk);
 	void (*pre_attach)(struct cgroup *cgrp);
 	void (*attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp,
 			    struct task_struct *tsk);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index fafebdb..709baef 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1883,6 +1883,9 @@ out:
 				 * remaining subsystems.
 				 */
 				break;
+
+			if (ss->cancel_attach_task)
+				ss->cancel_attach_task(cgrp, tsk);
 			if (ss->cancel_attach)
 				ss->cancel_attach(ss, cgrp, tsk);
 		}
@@ -1992,7 +1995,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 {
 	int retval, i, group_size;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
-	bool cancel_failed_ss = false;
+	struct task_struct *failed_task = NULL;
 	/* guaranteed to be initialized later, but the compiler needs this */
 	struct cgroup *oldcgrp = NULL;
 	struct css_set *oldcg;
@@ -2081,7 +2084,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 							     oldcgrp, tsk);
 				if (retval) {
 					failed_ss = ss;
-					cancel_failed_ss = true;
+					failed_task = tsk;
 					goto out_cancel_attach;
 				}
 			}
@@ -2146,8 +2149,11 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 				if (ss->attach_task)
 					ss->attach_task(cgrp, oldcgrp, tsk);
 			}
+		} else if (retval == -ESRCH) {
+			if (ss->cancel_attach_task)
+				ss->cancel_attach_task(cgrp, tsk);
 		} else {
-			BUG_ON(retval != -ESRCH);
+			BUG_ON(1);
 		}
 	}
 	/* nothing is sensitive to fork() after this point. */
@@ -2179,8 +2185,18 @@ out_cancel_attach:
 	/* same deal as in cgroup_attach_task */
 	if (retval) {
 		for_each_subsys(root, ss) {
+			if (ss->cancel_attach_task && (ss != failed_ss ||
+						       failed_task)) {
+				for (i = 0; i < group_size; i++) {
+					tsk = flex_array_get_ptr(group, i);
+					if (tsk == failed_task)
+						break;
+					ss->cancel_attach_task(cgrp, tsk);
+				}
+			}
+
 			if (ss == failed_ss) {
-				if (cancel_failed_ss && ss->cancel_attach)
+				if (failed_task && ss->cancel_attach)
 					ss->cancel_attach(ss, cgrp, leader);
 				break;
 			}
-- 
1.7.5.4


* [PATCH 05/11] cgroups: Ability to stop res charge propagation on bounded ancestor
  2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2011-09-12 23:11 ` [PATCH 04/11] cgroups: New cancel_attach_task subsystem callback Frederic Weisbecker
@ 2011-09-12 23:11 ` Frederic Weisbecker
  2011-09-12 23:11 ` [PATCH 06/11] cgroups: Add res counter common ancestor searching Frederic Weisbecker
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-12 23:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Containers, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

Moving a task from one cgroup to another may require subtracting
its resource charge from the old cgroup and adding it to the new one.

For this to happen, the uncharge/charge propagation can just stop
when we reach the common ancestor of the two cgroups. Beyond the
performance benefit, we also want to avoid temporarily overloading
the common ancestors with an inaccurate resource counter usage if
we charge the new cgroup first and uncharge the old one thereafter.
This is going to be a requirement for the coming max number of
tasks subsystem.

To solve this, provide a pair of new APIs that can charge/uncharge
a resource counter until we reach a given ancestor.
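
A rough sketch of the intended migration sequence, assuming new_res and
old_res are the destination and source counters and using the common
ancestor lookup added in the next patch:

	common = res_counter_common_ancestor(new_res, old_res);

	/* charge the destination branch, stopping before the common ancestor */
	if (res_counter_charge_until(new_res, common, 1, &fail_at))
		return -EINVAL;

	/* ... later, once the attachment can't be cancelled anymore ... */
	res_counter_uncharge_until(old_res, common, 1);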

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 Documentation/cgroups/resource_counter.txt |   18 +++++++++++++++++-
 include/linux/res_counter.h                |   20 +++++++++++++++++---
 kernel/res_counter.c                       |   13 ++++++++-----
 3 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..a2cd05b 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -83,7 +83,15 @@ to work with it.
 	res_counter->lock internally (it must be called with res_counter->lock
 	held).
 
- e. void res_counter_uncharge[_locked]
+ e. int res_counter_charge_until(struct res_counter *counter,
+			     struct res_counter *limit, unsigned long val,
+			     struct res_counter **limit_fail_at)
+
+	The same as res_counter_charge(), but the charge propagation up
+	the hierarchy stops at the limit given in the "limit" parameter.
+
+
+ f. void res_counter_uncharge[_locked]
 			(struct res_counter *rc, unsigned long val)
 
 	When a resource is released (freed) it should be de-accounted
@@ -92,6 +100,14 @@ to work with it.
 
 	The _locked routines imply that the res_counter->lock is taken.
 
+
+ g. void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *limit,
+				unsigned long val)
+
+	The same as res_counter_uncharge(), but the uncharge propagation up
+	the hierarchy stops at the limit given in the "limit" parameter.
+
  2.1 Other accounting routines
 
     There are more routines that may help you with common needs, like
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 109d118..de4ba29 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -117,8 +117,16 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
 
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
-int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
+int __must_check res_counter_charge_until(struct res_counter *counter,
+					  struct res_counter *limit,
+					  unsigned long val,
+					  struct res_counter **limit_fail_at);
+static inline int __must_check
+res_counter_charge(struct res_counter *counter, unsigned long val,
+		   struct res_counter **limit_fail_at)
+{
+	return res_counter_charge_until(counter, NULL, val, limit_fail_at);
+}
 
 /*
  * uncharge - tell that some portion of the resource is released
@@ -131,7 +139,13 @@ int __must_check res_counter_charge(struct res_counter *counter,
  */
 
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val);
+void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *limit, unsigned long val);
+static inline void res_counter_uncharge(struct res_counter *counter,
+					unsigned long val)
+{
+	res_counter_uncharge_until(counter, NULL, val);
+}
 
 /**
  * res_counter_margin - calculate chargeable space of a counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 37abf4e..0cc9c7e 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -35,8 +35,9 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 	return 0;
 }
 
-int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
+int res_counter_charge_until(struct res_counter *counter,
+			     struct res_counter *limit, unsigned long val,
+			     struct res_counter **limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
@@ -44,7 +45,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
 
 	*limit_fail_at = NULL;
 	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
+	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
 		spin_unlock(&c->lock);
@@ -74,13 +75,15 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
 	counter->usage -= val;
 }
 
-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *limit,
+				unsigned long val)
 {
 	unsigned long flags;
 	struct res_counter *c;
 
 	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
+	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		res_counter_uncharge_locked(c, val);
 		spin_unlock(&c->lock);
-- 
1.7.5.4


* [PATCH 06/11] cgroups: Add res counter common ancestor searching
  2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2011-09-12 23:11 ` [PATCH 05/11] cgroups: Ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
@ 2011-09-12 23:11 ` Frederic Weisbecker
  2011-09-18 19:59   ` Kirill A. Shutemov
  2011-09-12 23:11 ` [PATCH 07/11] res_counter: Allow charge failure pointer to be null Frederic Weisbecker
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-12 23:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Containers, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

Add a new API to find the common ancestor of two resource
counters. The passed resource counters themselves are included
in the search.
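
For example, given counters parented as A <- B <- C and A <- D:

	res_counter_common_ancestor(C, D) returns A
	res_counter_common_ancestor(C, B) returns B, since a passed counter
	can itself be the common ancestor
	res_counter_common_ancestor(C, E) returns NULL if E belongs to an
	unrelated hierarchy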

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/res_counter.h |    3 +++
 kernel/res_counter.c        |   22 ++++++++++++++++++++++
 2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index de4ba29..558f39b 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -147,6 +147,9 @@ static inline void res_counter_uncharge(struct res_counter *counter,
 	res_counter_uncharge_until(counter, NULL, val);
 }
 
+struct res_counter *res_counter_common_ancestor(struct res_counter *l,
+						struct res_counter *r);
+
 /**
  * res_counter_margin - calculate chargeable space of a counter
  * @cnt: the counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 0cc9c7e..22f07cc 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -91,6 +91,28 @@ void res_counter_uncharge_until(struct res_counter *counter,
 	local_irq_restore(flags);
 }
 
+/*
+ * Walk through the ancestors of r1 and r2 (including themselves) and
+ * return the closest common one. If none is found, return NULL.
+ */
+struct res_counter *
+res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
+{
+	struct res_counter *iter;
+
+	while (r1) {
+		iter = r2;
+		while (iter) {
+			if (iter == r1)
+				return iter;
+			iter = iter->parent;
+		}
+
+		r1 = r1->parent;
+	}
+
+	return NULL;
+}
 
 static inline unsigned long long *
 res_counter_member(struct res_counter *counter, int member)
-- 
1.7.5.4


* [PATCH 07/11] res_counter: Allow charge failure pointer to be null
  2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2011-09-12 23:11 ` [PATCH 06/11] cgroups: Add res counter common ancestor searching Frederic Weisbecker
@ 2011-09-12 23:11 ` Frederic Weisbecker
  2011-09-12 23:11 ` [PATCH 08/11] cgroups: Pull up res counter charge failure interpretation to caller Frederic Weisbecker
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-12 23:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Containers, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo

So that callers of res_counter_charge() don't have to create and
pass this pointer even if they aren't interested in it.
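
For example, a caller that only cares about success or failure can now
simply do (sketch):

	if (res_counter_charge(res, 1, NULL))
		return -EINVAL;	/* or whatever error fits the caller */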

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 kernel/res_counter.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 22f07cc..f1fe9f0 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -43,14 +43,16 @@ int res_counter_charge_until(struct res_counter *counter,
 	unsigned long flags;
 	struct res_counter *c, *u;
 
-	*limit_fail_at = NULL;
+	if (limit_fail_at)
+		*limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
 		spin_unlock(&c->lock);
 		if (ret < 0) {
-			*limit_fail_at = c;
+			if (limit_fail_at)
+				*limit_fail_at = c;
 			goto undo;
 		}
 	}
-- 
1.7.5.4


* [PATCH 08/11] cgroups: Pull up res counter charge failure interpretation to caller
  2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2011-09-12 23:11 ` [PATCH 07/11] res_counter: Allow charge failure pointer to be null Frederic Weisbecker
@ 2011-09-12 23:11 ` Frederic Weisbecker
  2011-09-12 23:11 ` [PATCH 09/11] cgroups: Add a task counter subsystem Frederic Weisbecker
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-12 23:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Containers, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo

res_counter_charge() always returns -ENOMEM when the limit is reached
and the charge thus can't happen.

However it's up to the caller to interpret this failure and return
the appropriate error value. The task counter subsystem will need
to report to the user that a fork() has been cancelled because a
limit was reached, not because we are short on memory.

Fix this by returning -1 when res_counter_charge() fails.
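
For instance, the fork path of the upcoming task counter subsystem
(see the later patch in this series) is expected to translate the
failure along these lines:

	if (res_counter_charge(cgroup_task_res_counter(cgrp), 1, NULL))
		return -EAGAIN;	/* limit reached, not an out-of-memory error */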

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 Documentation/cgroups/resource_counter.txt |    2 ++
 kernel/res_counter.c                       |    2 +-
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index a2cd05b..24ec61c 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -76,6 +76,8 @@ to work with it.
 	limit_fail_at parameter is set to the particular res_counter element
 	where the charging failed.
 
+	It returns 0 on success and -1 on failure.
+
  d. int res_counter_charge_locked
 			(struct res_counter *rc, unsigned long val)
 
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index f1fe9f0..076fda4 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -26,7 +26,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 {
 	if (counter->usage + val > counter->limit) {
 		counter->failcnt++;
-		return -ENOMEM;
+		return -1;
 	}
 
 	counter->usage += val;
-- 
1.7.5.4


* [PATCH 09/11] cgroups: Add a task counter subsystem
  2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (7 preceding siblings ...)
  2011-09-12 23:11 ` [PATCH 08/11] cgroups: Pull up res counter charge failure interpretation to caller Frederic Weisbecker
@ 2011-09-12 23:11 ` Frederic Weisbecker
  2011-09-18 20:27   ` Kirill A. Shutemov
  2011-09-12 23:11 ` [PATCH 10/11] cgroups: Allow subsystems to cancel a fork Frederic Weisbecker
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-12 23:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Containers, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo

Add a new subsystem to limit the number of running tasks,
similar to the RLIMIT_NPROC rlimit but in the scope of a cgroup.

This is a step toward isolating a cgroup a bit more from the rest
of the system and limiting the global impact of a fork bomb inside
a given cgroup.
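
For a quick idea of the interface from userspace (the mount point,
cgroup name and limit value below are only examples):

	mount -t cgroup -o tasks none /sys/fs/cgroup/tasks
	mkdir /sys/fs/cgroup/tasks/jail
	echo 32 > /sys/fs/cgroup/tasks/jail/tasks.limit
	echo $$ > /sys/fs/cgroup/tasks/jail/tasks
	cat /sys/fs/cgroup/tasks/jail/tasks.usage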

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 Documentation/cgroups/task_counter.txt |  126 +++++++++++++++++
 include/linux/cgroup.h                 |    9 ++
 include/linux/cgroup_subsys.h          |    8 +
 init/Kconfig                           |    9 ++
 kernel/Makefile                        |    1 +
 kernel/cgroup_task_counter.c           |  237 ++++++++++++++++++++++++++++++++
 kernel/fork.c                          |    4 +
 7 files changed, 394 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/task_counter.txt
 create mode 100644 kernel/cgroup_task_counter.c

diff --git a/Documentation/cgroups/task_counter.txt b/Documentation/cgroups/task_counter.txt
new file mode 100644
index 0000000..e93760a
--- /dev/null
+++ b/Documentation/cgroups/task_counter.txt
@@ -0,0 +1,126 @@
+Task counter subsystem
+
+1. Description
+
+The task counter subsystem limits the number of tasks running
+inside a given cgroup. It behaves like the RLIMIT_NPROC rlimit but in
+the scope of a cgroup instead of a user.
+
+It has two typical usecases, although more can probably be found:
+
+- Protect against forkbombs that explode inside a container when
+that container is implemented using a cgroup. The RLIMIT_NPROC
+rlimit is not effective for that because if several containers
+run in parallel under the same user, one container could starve
+all the others by spawning a high number of tasks close to the
+rlimit boundary. So in this case we need this limitation to be
+applied with a per-cgroup granularity.
+
+- Kill all tasks inside a cgroup without races. By setting the limit
+of running tasks to 0, one can prevent any further fork inside a
+cgroup and then kill all of its tasks without needing to retry an
+unbounded number of times due to races between kills and forks running
+in parallel (more details in the "Kill a cgroup safely" paragraph).
+
+
+2. Interface
+
+When a hierarchy is mounted with the task counter subsystem bound, it
+adds two files into the cgroup directories, except the root one:
+
+- tasks.usage contains the number of tasks running inside a cgroup and
+its children in the hierarchy (see paragraph about Inheritance).
+
+- tasks.limit contains the maximum number of tasks that can run inside
+a cgroup. We check this limit when a task forks or when it is migrated
+to a cgroup.
+
+Note that the tasks.limit value can be forced below tasks.usage, in which
+case any new task in the cgroup will be rejected until the tasks.usage
+value goes below tasks.limit.
+
+For optimization reasons, the root directory of a hierarchy doesn't have
+a task counter.
+
+
+3. Inheritance
+
+When a task is added to a cgroup, by way of a cgroup migration or a fork,
+it increases the task counter of that cgroup and of all its ancestors.
+Hence a cgroup is also subject to the limit of its ancestors.
+
+In the following hierarchy:
+
+
+             A
+             |
+             B
+           /   \
+          C     D
+
+
+We have 1 task running in B, one running in C and none running in D.
+It means we have tasks.usage = 1 in C and tasks.usage = 2 in B because
+B counts its task and those of its children.
+
+Now let's set tasks.limit = 2 in B and tasks.limit = 1 in D.
+If we move a new task into D, it will be refused because the limit in B
+has already been reached.
+
+
+4. Kill a cgroup safely
+
+As explained in the description, this subsystem is also helpful to
+kill all tasks in a cgroup safely, after setting tasks.limit to 0,
+so that we don't race against parallel forks in an unbounded number
+of kill iterations.
+
+But there is a small detail to be aware of to use this feature that
+way.
+
+Some typical way to proceed would be:
+
+	echo 0 > tasks.limit
+	for TASK in $(cat cgroup.procs)
+	do
+		kill -KILL $TASK
+	done
+
+However there is a small race window where a task can be in the middle
+of being forked but hasn't yet completed the fork far enough for the new
+PID to appear in the cgroup.procs file.
+
+The only way to get it right is to run a loop that reads tasks.usage, kills
+all the tasks in cgroup.procs and exits the loop only if the value in
+tasks.usage was the same as the number of tasks that were in cgroup.procs,
+i.e. the number of tasks that were killed.
+
+It works because the new child appears in tasks.usage right before we check,
+in the fork path, whether the parent has a pending signal, in which case the
+fork is cancelled anyway. So relying on tasks.usage is fine and non-racy.
+
+This race window is tiny and unlikely to happen, so most of the time a single
+kill iteration should be enough. But it's worth knowing about that corner
+case spotted by Oleg Nesterov.
+
+Example of safe use would be:
+
+	echo 0 > tasks.limit
+	END=false
+
+	while [ $END == false ]
+	do
+		NR_TASKS=$(cat tasks.usage)
+		NR_KILLED=0
+
+		for TASK in $(cat cgroup.procs)
+		do
+			let NR_KILLED=NR_KILLED+1
+			kill -KILL $TASK
+		done
+
+		if [ "$NR_TASKS" = "$NR_KILLED" ]
+		then
+			END=true
+		fi
+	done
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b62cf5e..3f132f5 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -661,4 +661,13 @@ static inline int cgroup_attach_task_current_cg(struct task_struct *t)
 
 #endif /* !CONFIG_CGROUPS */
 
+#ifdef CONFIG_CGROUP_TASK_COUNTER
+int cgroup_task_counter_fork(struct task_struct *child);
+#else
+static inline int cgroup_task_counter_fork(struct task_struct *child)
+{
+	return 0;
+}
+#endif /* CONFIG_CGROUP_TASK_COUNTER */
+
 #endif /* _LINUX_CGROUP_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ac663c1..5425822 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -59,8 +59,16 @@ SUBSYS(net_cls)
 SUBSYS(blkio)
 #endif
 
+/* */
+
 #ifdef CONFIG_CGROUP_PERF
 SUBSYS(perf)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_TASK_COUNTER
+SUBSYS(tasks)
+#endif
+
+/* */
diff --git a/init/Kconfig b/init/Kconfig
index d627783..7410b05 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -690,6 +690,15 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
 	  select this option (if, for some reason, they need to disable it
 	  then swapaccount=0 does the trick).
 
+config CGROUP_TASK_COUNTER
+	bool "Control number of tasks in a cgroup"
+	depends on RESOURCE_COUNTERS
+	help
+	  Let the user set an upper bound on the number of tasks allowed to
+	  run in a cgroup. When a task forks or is migrated to a cgroup that
+	  has this subsystem bound, the limit is checked to either accept or
+	  reject the fork/migration.
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/kernel/Makefile b/kernel/Makefile
index eca595e..5598a7f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_TASK_COUNTER) += cgroup_task_counter.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_task_counter.c b/kernel/cgroup_task_counter.c
new file mode 100644
index 0000000..8882552
--- /dev/null
+++ b/kernel/cgroup_task_counter.c
@@ -0,0 +1,237 @@
+/*
+ * Limits on number of tasks subsystem for cgroups
+ *
+ * Copyright (C) 2011 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
+ *
+ * Thanks to Andrew Morton, Johannes Weiner, Li Zefan, Oleg Nesterov and
+ * Paul Menage for their suggestions.
+ *
+ */
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/res_counter.h>
+
+
+struct task_counter {
+	struct res_counter		res;
+	struct cgroup_subsys_state	css;
+};
+
+/*
+ * The root task counter doesn't exist because it's not part of the
+ * whole task counting. We want to optimize for the trivial case where
+ * only the root cgroup exists.
+ */
+static struct cgroup_subsys_state root_css;
+
+
+static inline struct task_counter *cgroup_task_counter(struct cgroup *cgrp)
+{
+	if (!cgrp->parent)
+		return NULL;
+
+	return container_of(cgroup_subsys_state(cgrp, tasks_subsys_id),
+			    struct task_counter, css);
+}
+
+static inline struct res_counter *cgroup_task_res_counter(struct cgroup *cgrp)
+{
+	struct task_counter *cnt;
+
+	cnt = cgroup_task_counter(cgrp);
+	if (!cnt)
+		return NULL;
+
+	return &cnt->res;
+}
+
+static struct cgroup_subsys_state *
+task_counter_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct task_counter *cnt;
+	struct res_counter *parent_res;
+
+	if (!cgrp->parent)
+		return &root_css;
+
+	cnt = kzalloc(sizeof(*cnt), GFP_KERNEL);
+	if (!cnt)
+		return ERR_PTR(-ENOMEM);
+
+	parent_res = cgroup_task_res_counter(cgrp->parent);
+
+	res_counter_init(&cnt->res, parent_res);
+
+	return &cnt->css;
+}
+
+/*
+ * Inherit the limit value of the parent. This is not really to enforce
+ * a limit below or equal to the one of the parent which can be changed
+ * concurrently anyway. This is just to honour the clone flag.
+ */
+static void task_counter_post_clone(struct cgroup_subsys *ss,
+				    struct cgroup *cgrp)
+{
+	/* cgrp can't be root, so cgroup_task_res_counter() can't return NULL */
+	res_counter_inherit(cgroup_task_res_counter(cgrp), RES_LIMIT);
+}
+
+static void task_counter_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct task_counter *cnt = cgroup_task_counter(cgrp);
+
+	kfree(cnt);
+}
+
+/* Uncharge the cgroup the task was attached to */
+static void task_counter_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			      struct cgroup *old_cgrp, struct task_struct *task)
+{
+	/* Optimize for the root cgroup case */
+	if (old_cgrp->parent)
+		res_counter_uncharge(cgroup_task_res_counter(old_cgrp), 1);
+}
+
+/*
+ * Protected amongst can_attach_task/attach_task/cancel_attach_task by
+ * cgroup mutex
+ */
+static struct res_counter *common_ancestor;
+
+/*
+ * This does more than just probing the ability to attach to the dest cgroup.
+ * We can not just _check_ if we can attach to the destination and do the real
+ * attachment later in task_counter_attach_task() because a task in the dest
+ * cgroup can fork before and steal the last remaining count.
+ * Thus we need to charge the dest cgroup right now.
+ */
+static int task_counter_can_attach_task(struct cgroup *cgrp,
+					struct cgroup *old_cgrp,
+					struct task_struct *tsk)
+{
+	struct res_counter *res = cgroup_task_res_counter(cgrp);
+	struct res_counter *old_res = cgroup_task_res_counter(old_cgrp);
+	int err;
+
+	/*
+	 * When moving a task from a cgroup to another, we don't want
+	 * to charge the common ancestors, even though they will be
+	 * uncharged later from attach_task(), because during that
+	 * short window between charge and uncharge, a task could fork
+	 * in the ancestor and spuriously fail due to the temporary
+	 * charge.
+	 */
+	common_ancestor = res_counter_common_ancestor(res, old_res);
+
+	/*
+	 * If cgrp is the root then res is NULL, however in this case
+	 * the common ancestor is NULL as well, making the below a NOP.
+	 */
+	err = res_counter_charge_until(res, common_ancestor, 1, NULL);
+	if (err)
+		return -EINVAL;
+
+	return 0;
+}
+
+/* Uncharge the dest cgroup that we charged in task_counter_can_attach_task() */
+static void task_counter_cancel_attach_task(struct cgroup *cgrp,
+					    struct task_struct *tsk)
+{
+	res_counter_uncharge_until(cgroup_task_res_counter(cgrp),
+				   common_ancestor, 1);
+}
+
+/*
+ * This uncharges the old cgroup. We can do that now that we are sure the
+ * attachment can't be cancelled anymore, because this uncharge operation
+ * couldn't be reverted later: a task in the old cgroup could fork after
+ * we uncharge and reach the task counter limit, making a return there
+ * impossible.
+ */
+static void task_counter_attach_task(struct cgroup *cgrp,
+				     struct cgroup *old_cgrp,
+				     struct task_struct *tsk)
+{
+	res_counter_uncharge_until(cgroup_task_res_counter(old_cgrp),
+				   common_ancestor, 1);
+}
+
+static u64 task_counter_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	int type = cft->private;
+
+	return res_counter_read_u64(cgroup_task_res_counter(cgrp), type);
+}
+
+static int task_counter_write_u64(struct cgroup *cgrp, struct cftype *cft,
+				  u64 val)
+{
+	int type = cft->private;
+
+	res_counter_write_u64(cgroup_task_res_counter(cgrp), type, val);
+
+	return 0;
+}
+
+static struct cftype files[] = {
+	{
+		.name		= "limit",
+		.read_u64	= task_counter_read_u64,
+		.write_u64	= task_counter_write_u64,
+		.private	= RES_LIMIT,
+	},
+
+	{
+		.name		= "usage",
+		.read_u64	= task_counter_read_u64,
+		.private	= RES_USAGE,
+	},
+};
+
+static int task_counter_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	if (!cgrp->parent)
+		return 0;
+
+	return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
+}
+
+/*
+ * Charge the task counter with the new child coming, or reject it if we
+ * reached the limit.
+ */
+int cgroup_task_counter_fork(struct task_struct *child)
+{
+	struct cgroup_subsys_state *css;
+	struct cgroup *cgrp;
+	int err;
+
+	css = child->cgroups->subsys[tasks_subsys_id];
+	cgrp = css->cgroup;
+
+	/* Optimize for the root cgroup case, which doesn't have a limit */
+	if (!cgrp->parent)
+		return 0;
+
+	err = res_counter_charge(cgroup_task_res_counter(cgrp), 1, NULL);
+	if (err)
+		return -EAGAIN;
+
+	return 0;
+}
+
+struct cgroup_subsys tasks_subsys = {
+	.name			= "tasks",
+	.subsys_id		= tasks_subsys_id,
+	.create			= task_counter_create,
+	.post_clone		= task_counter_post_clone,
+	.destroy		= task_counter_destroy,
+	.exit			= task_counter_exit,
+	.can_attach_task	= task_counter_can_attach_task,
+	.cancel_attach_task	= task_counter_cancel_attach_task,
+	.attach_task		= task_counter_attach_task,
+	.populate		= task_counter_populate,
+};
diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..f716436 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1309,6 +1309,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->group_leader = p;
 	INIT_LIST_HEAD(&p->thread_group);
 
+	retval = cgroup_task_counter_fork(p);
+	if (retval)
+		goto bad_fork_free_pid;
+
 	/* Now that the task is set up, run cgroup callbacks if
 	 * necessary. We need to run them before the task is visible
 	 * on the tasklist. */
-- 
1.7.5.4


* [PATCH 10/11] cgroups: Allow subsystems to cancel a fork
  2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (8 preceding siblings ...)
  2011-09-12 23:11 ` [PATCH 09/11] cgroups: Add a task counter subsystem Frederic Weisbecker
@ 2011-09-12 23:11 ` Frederic Weisbecker
  2011-09-12 23:11 ` [PATCH 11/11] cgroups: Convert task counter to use the subsys fork callback Frederic Weisbecker
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-12 23:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Containers, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo

Let the subsystems' fork callbacks return an error value so
that they can cancel a fork. This is going to be used by
the task counter subsystem to implement the limit.
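
A sketch of what a subsystem fork callback can now look like
(hypothetical "foo" names):

	static int foo_fork(struct cgroup_subsys *ss, struct task_struct *task)
	{
		if (foo_limit_reached(task))	/* hypothetical check */
			return -EAGAIN;		/* cancels the fork */
		return 0;
	}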

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h  |   14 +++++++++-----
 kernel/cgroup.c         |   23 +++++++++++++++++++----
 kernel/cgroup_freezer.c |    6 ++++--
 kernel/exit.c           |    2 +-
 kernel/fork.c           |    7 +++++--
 5 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3f132f5..62f7001 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -32,9 +32,11 @@ extern int cgroup_lock_is_held(void);
 extern bool cgroup_lock_live_group(struct cgroup *cgrp);
 extern void cgroup_unlock(void);
 extern void cgroup_fork(struct task_struct *p);
-extern void cgroup_fork_callbacks(struct task_struct *p);
+extern int cgroup_fork_callbacks(struct task_struct *p,
+				 struct cgroup_subsys **failed_ss);
 extern void cgroup_post_fork(struct task_struct *p);
-extern void cgroup_exit(struct task_struct *p, int run_callbacks);
+extern void cgroup_exit(struct task_struct *p, int run_callbacks,
+			struct cgroup_subsys *failed_ss);
 extern int cgroupstats_build(struct cgroupstats *stats,
 				struct dentry *dentry);
 extern int cgroup_load_subsys(struct cgroup_subsys *ss);
@@ -479,7 +481,7 @@ struct cgroup_subsys {
 			    struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		       struct cgroup *old_cgrp, struct task_struct *tsk);
-	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
+	int (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			struct cgroup *old_cgrp, struct task_struct *task);
 	int (*populate)(struct cgroup_subsys *ss,
@@ -636,9 +638,11 @@ struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_fork(struct task_struct *p) {}
-static inline void cgroup_fork_callbacks(struct task_struct *p) {}
+static inline int cgroup_fork_callbacks(struct task_struct *p,
+					struct cgroup_subsys **failed_ss) { return 0; }
 static inline void cgroup_post_fork(struct task_struct *p) {}
-static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
+static inline void cgroup_exit(struct task_struct *p, int callbacks,
+			       struct cgroup_subsys *failed_ss) {}
 
 static inline void cgroup_lock(void) {}
 static inline void cgroup_unlock(void) {}
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 709baef..018df9d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4540,8 +4540,11 @@ void cgroup_fork(struct task_struct *child)
  * tasklist. No need to take any locks since no-one can
  * be operating on this task.
  */
-void cgroup_fork_callbacks(struct task_struct *child)
+int cgroup_fork_callbacks(struct task_struct *child,
+			  struct cgroup_subsys **failed_ss)
 {
+	int err;
+
 	if (need_forkexit_callback) {
 		int i;
 		/*
@@ -4551,10 +4554,17 @@ void cgroup_fork_callbacks(struct task_struct *child)
 		 */
 		for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
 			struct cgroup_subsys *ss = subsys[i];
-			if (ss->fork)
-				ss->fork(ss, child);
+			if (ss->fork) {
+				err = ss->fork(ss, child);
+				if (err) {
+					*failed_ss = ss;
+					return err;
+				}
+			}
 		}
 	}
+
+	return 0;
 }
 
 /**
@@ -4612,7 +4622,8 @@ void cgroup_post_fork(struct task_struct *child)
  *    which wards off any cgroup_attach_task() attempts, or task is a failed
  *    fork, never visible to cgroup_attach_task.
  */
-void cgroup_exit(struct task_struct *tsk, int run_callbacks)
+void cgroup_exit(struct task_struct *tsk, int run_callbacks,
+		 struct cgroup_subsys *failed_ss)
 {
 	struct css_set *cg;
 	int i;
@@ -4641,6 +4652,10 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
 		 */
 		for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
 			struct cgroup_subsys *ss = subsys[i];
+
+			if (ss == failed_ss)
+				break;
+
 			if (ss->exit) {
 				struct cgroup *old_cgrp =
 					rcu_dereference_raw(cg->subsys[i])->cgroup;
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index c1421a1..30a4332 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -187,7 +187,7 @@ static int freezer_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
 	return 0;
 }
 
-static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
+static int freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 {
 	struct freezer *freezer;
 
@@ -207,7 +207,7 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 	 * following check.
 	 */
 	if (!freezer->css.cgroup->parent)
-		return;
+		return 0;
 
 	spin_lock_irq(&freezer->lock);
 	BUG_ON(freezer->state == CGROUP_FROZEN);
@@ -216,6 +216,8 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 	if (freezer->state == CGROUP_FREEZING)
 		freeze_task(task, true);
 	spin_unlock_irq(&freezer->lock);
+
+	return 0;
 }
 
 /*
diff --git a/kernel/exit.c b/kernel/exit.c
index 2913b35..abab32b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -994,7 +994,7 @@ NORET_TYPE void do_exit(long code)
 	 */
 	perf_event_exit_task(tsk);
 
-	cgroup_exit(tsk, 1);
+	cgroup_exit(tsk, 1, NULL);
 
 	if (group_dead)
 		disassociate_ctty(1);
diff --git a/kernel/fork.c b/kernel/fork.c
index f716436..989c5a0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1058,6 +1058,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	struct cgroup_subsys *cgroup_failed_ss = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1316,8 +1317,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	/* Now that the task is set up, run cgroup callbacks if
 	 * necessary. We need to run them before the task is visible
 	 * on the tasklist. */
-	cgroup_fork_callbacks(p);
+	retval = cgroup_fork_callbacks(p, &cgroup_failed_ss);
 	cgroup_callbacks_done = 1;
+	if (retval)
+		goto bad_fork_free_pid;
 
 	/* Need tasklist lock for parent etc handling! */
 	write_lock_irq(&tasklist_lock);
@@ -1423,7 +1426,7 @@ bad_fork_cleanup_cgroup:
 #endif
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_fork_read_unlock(current);
-	cgroup_exit(p, cgroup_callbacks_done);
+	cgroup_exit(p, cgroup_callbacks_done, cgroup_failed_ss);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);
 bad_fork_cleanup_count:
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 11/11] cgroups: Convert task counter to use the subsys fork callback
  2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (9 preceding siblings ...)
  2011-09-12 23:11 ` [PATCH 10/11] cgroups: Allow subsystems to cancel a fork Frederic Weisbecker
@ 2011-09-12 23:11 ` Frederic Weisbecker
  2011-09-13 14:32 ` [PATCH 00/11 v5] cgroups: Task counter subsystem Peter Zijlstra
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-12 23:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Containers, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo

Instead of using an ad hoc hook on the fork path for the task
counter, use the subsystem fork callback that we can now use
to cancel the fork.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h       |    9 ---------
 kernel/cgroup_task_counter.c |    4 +++-
 kernel/fork.c                |    4 ----
 3 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 62f7001..917cc39 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -665,13 +665,4 @@ static inline int cgroup_attach_task_current_cg(struct task_struct *t)
 
 #endif /* !CONFIG_CGROUPS */
 
-#ifdef CONFIG_CGROUP_TASK_COUNTER
-int cgroup_task_counter_fork(struct task_struct *child);
-#else
-static inline int cgroup_task_counter_fork(struct task_struct *child)
-{
-	return 0;
-}
-#endif /* CONFIG_CGROUP_TASK_COUNTER */
-
 #endif /* _LINUX_CGROUP_H */
diff --git a/kernel/cgroup_task_counter.c b/kernel/cgroup_task_counter.c
index 8882552..2374905 100644
--- a/kernel/cgroup_task_counter.c
+++ b/kernel/cgroup_task_counter.c
@@ -203,7 +203,8 @@ static int task_counter_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
  * Charge the task counter with the new child coming, or reject it if we
  * reached the limit.
  */
-int cgroup_task_counter_fork(struct task_struct *child)
+static int task_counter_fork(struct cgroup_subsys *ss,
+			     struct task_struct *child)
 {
 	struct cgroup_subsys_state *css;
 	struct cgroup *cgrp;
@@ -233,5 +234,6 @@ struct cgroup_subsys tasks_subsys = {
 	.can_attach_task	= task_counter_can_attach_task,
 	.cancel_attach_task	= task_counter_cancel_attach_task,
 	.attach_task		= task_counter_attach_task,
+	.fork			= task_counter_fork,
 	.populate		= task_counter_populate,
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 989c5a0..ee8abdb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1310,10 +1310,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->group_leader = p;
 	INIT_LIST_HEAD(&p->thread_group);
 
-	retval = cgroup_task_counter_fork(p);
-	if (retval)
-		goto bad_fork_free_pid;
-
 	/* Now that the task is set up, run cgroup callbacks if
 	 * necessary. We need to run them before the task is visible
 	 * on the tasklist. */
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 00/11 v5] cgroups: Task counter subsystem
  2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (10 preceding siblings ...)
  2011-09-12 23:11 ` [PATCH 11/11] cgroups: Convert task counter to use the subsys fork callback Frederic Weisbecker
@ 2011-09-13 14:32 ` Peter Zijlstra
  2011-09-13 14:37   ` Frederic Weisbecker
  2011-09-13 15:21 ` Frederic Weisbecker
  2011-09-13 22:23 ` Andrew Morton
  13 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2011-09-13 14:32 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Tue, 2011-09-13 at 01:11 +0200, Frederic Weisbecker wrote:
>  kernel/cgroup_task_counter.c               |  239 ++++++++++++++++++++++++++++

Horrible name, since it does more than just count, it also limits the
number of tasks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 00/11 v5] cgroups: Task counter subsystem
  2011-09-13 14:32 ` [PATCH 00/11 v5] cgroups: Task counter subsystem Peter Zijlstra
@ 2011-09-13 14:37   ` Frederic Weisbecker
  2011-09-13 14:49     ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-13 14:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Tue, Sep 13, 2011 at 04:32:51PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-09-13 at 01:11 +0200, Frederic Weisbecker wrote:
> >  kernel/cgroup_task_counter.c               |  239 ++++++++++++++++++++++++++++
> 
> Horrible name, since it does more than just count, it also limits the
> number of tasks.

cgroup_task_count_limit.c ? ;)

Because cgroup_task_limit.c sounds way too generic to be self-explanatory.

I believe a better approach would be to create kernel/cgroup/ and move things like
cgroup.c and cgroup_freezer.c there (renamed to freezer.c in that directory),
and have task_count_limit.c inside.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 00/11 v5] cgroups: Task counter subsystem
  2011-09-13 14:37   ` Frederic Weisbecker
@ 2011-09-13 14:49     ` Peter Zijlstra
  2011-09-13 15:01       ` Frederic Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2011-09-13 14:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Tue, 2011-09-13 at 16:37 +0200, Frederic Weisbecker wrote:
> Because cgroup_task_limit.c sounds way too generic to be
> self-explanatory.

Uhm, why? That's exactly what it does, no? It limits the number of
tasks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 00/11 v5] cgroups: Task counter subsystem
  2011-09-13 14:49     ` Peter Zijlstra
@ 2011-09-13 15:01       ` Frederic Weisbecker
  0 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-13 15:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Tue, Sep 13, 2011 at 04:49:20PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-09-13 at 16:37 +0200, Frederic Weisbecker wrote:
> > Because cgroup_task_limit.c sounds way too generic to be
> > self-explanatory.
> 
> Uhm, why? That's exactly what it does, no? It limits the number of
> tasks.

To me that name sounds too generic. It sounds like something that applies
some arbitrary limit to tasks. It could be memory, IO or whatever, but
not specifically a limit on the number of tasks.

Or maybe cgroup_tasks_limit.c (note the "s", which suggests tasks as a
set), or cgroup_max_tasks.c?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 00/11 v5] cgroups: Task counter subsystem
  2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (11 preceding siblings ...)
  2011-09-13 14:32 ` [PATCH 00/11 v5] cgroups: Task counter subsystem Peter Zijlstra
@ 2011-09-13 15:21 ` Frederic Weisbecker
  2011-09-13 22:23 ` Andrew Morton
  13 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-13 15:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo, Containers

On Tue, Sep 13, 2011 at 01:11:20AM +0200, Frederic Weisbecker wrote:
> No functional changes. Only documentation and comments added.
> Checkpatch.pl fixes, etc...
> 
> This mostly addresses Andrew's reviews on v4.
> 
> Changes in v5:
> 
> - commented res_counter_common_ancestor (patch 6)
> - commented res_counter_inherit (patch 2)
> - merged documentation and task counter subsystem patch (patch 9)
> - rename cgroup_task_counter_res in cgroup_task_res_counter (patch 9)
> - more comments/ fix comments in cgroup_task_counter.c (patch 9)
> - document res_counter_charge return value (patch 8)
> - document res_counter_charge_until (patch 5)
> - more comment in kconfig help (patch 9)
> - fix various checkpatch reported issues

There is a test tool that you can download here:

https://tglx.de/~fweisbec/task_counter_test.tar.gz

It tests various things to ensure the subsystem behaves
as expected:

- correct tasks.usage value when tasks are moved around
- correct rejection of forks and cgroup migrations if the limit
is reached
- correct propagation of limit to parents
- correct protection against forkbombs
- test killing of tasks in a cgroup
- ...
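
For illustration, a minimal sketch of the kind of check the tool performs
for the fork-reject case (written from scratch for this note, not taken
from the tarball): run it from inside a cgroup whose tasks.limit is set,
and fork() should start failing with EAGAIN once the limit is reached.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	int i;

	for (i = 0; i < 1000000; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			int err = errno;

			/* EAGAIN is expected once the task counter limit is hit */
			printf("fork rejected after %d children: %s\n",
			       i, strerror(err));
			return err == EAGAIN ? 0 : 1;
		}
		if (pid == 0) {
			pause();	/* child stays alive, charged to the cgroup */
			_exit(0);
		}
	}
	printf("limit never hit\n");
	return 1;
}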

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 00/11 v5] cgroups: Task counter subsystem
  2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (12 preceding siblings ...)
  2011-09-13 15:21 ` Frederic Weisbecker
@ 2011-09-13 22:23 ` Andrew Morton
  2011-09-15 13:56   ` Frederic Weisbecker
  13 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2011-09-13 22:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo, Containers,
	Andrew Morton

On Tue, 13 Sep 2011 01:11:20 +0200
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> No functional changes. Only documentation and comments added.
> Checkpatch.pl fixes, etc...
> 

What is the actual rationale for merging all of this?  For this amount
of complexity I do think we need to see significant end-user benefits. 
But all I'm seeing in this patchset is

       This is a step to be able to isolate a bit more a cgroup
       against the rest of the system and limit the global impact of a
       fork bomb inside a given cgroup.

which is really very thin.



Also, the changelogs don't appear to mention any testing results for
the fork-bomb-killer feature.

Is the fork-bomb-killer feature realistically useful?  As I understand
it, the problem with a fork-bomb is that it causes a huge swapstorm
while creating tasks very quickly.  The latency impact of the swapping
makes it very hard to regain control of the system so you can stop the
forking.  So to be effective, this feature would need to limit the
swapping?  Or something.  More substantiation, please.



Also, what is the relationship between this and RLIMIT_NPROC?  Given
that we have user namespaces, does that give us per-user,
per-namespace, per-container rlimits?  If it doesn't, should it?  Will
it?  If it does/will, how duplicative will that be?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 00/11 v5] cgroups: Task counter subsystem
  2011-09-13 22:23 ` Andrew Morton
@ 2011-09-15 13:56   ` Frederic Weisbecker
  0 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-15 13:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo, Containers,
	Andrew Morton

On Tue, Sep 13, 2011 at 03:23:02PM -0700, Andrew Morton wrote:
> On Tue, 13 Sep 2011 01:11:20 +0200
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > No functional changes. Only documentation and comments added.
> > Checkpatch.pl fixes, etc...
> > 
> 
> What is the actual rationale for merging all of this?  For this amount
> of complexity I do think we need to see significant end-user benefits. 
> But all I'm seeing in this patchset is
> 
>        This is a step to be able to isolate a bit more a cgroup
>        against the rest of the system and limit the global impact of a
>        fork bomb inside a given cgroup.
> 
> which is really very thin.

Yeah, I should have described the goal of this subsystem in more detail
in the changelog.

The thing is better described in the documentation.
Quote:

"""
It has two typical use cases, although more can probably be found:

- Protect against forkbombs that explode inside a container when
that container is implemented using a cgroup. RLIMIT_NPROC is not
effective for that because if we have several containers
running in parallel under the same user, one container could starve
all the others by spawning a high number of tasks close to the
rlimit boundary. So in this case we need this limitation to be
done at a per-cgroup granularity.

- Kill all tasks inside a cgroup without races. By setting the limit
of running tasks to 0, one can prevent any further fork inside a
cgroup and then kill all of its tasks without the need to retry an
unbounded number of times due to races between kills and forks running
in parallel (more details in the "Kill a cgroup safely" paragraph).
"""

Maybe I can refine the changelog to explain the point there?
 
> Also, the changelogs don't appear to mention any testing results for
> the fork-bomb-killer feature.

Yeah I posted a test tool to the thread: https://lkml.org/lkml/2011/9/13/193

Among other things it includes a forkbomb that gets stopped and killed.
The limit I set is 128 tasks but I tested it successfully with 4000 as
well.

Now it's actually hard to post the results of such a test because there
are no really useful numbers: either the machine hangs (without that
feature or other appropriate protection like rlimit) or we keep
control of it and kill the forkbomb.

> Is the fork-bomb-killer feature realistically useful?  As I understand
> it, the problem with a fork-bomb is that it causes a huge swapstorm
> while creating tasks very quickly.  The latency impact of the swapping
> makes it very hard to regain control of the system so you can stop the
> forking.  So to be effective, this feature would need to limit the
> swapping?  Or something.  More substantiation, please.

I don't pretend to know the internals of what happens when a forkbomb
spreads far enough that you can't control the machine anymore. But what
you describe above is not surprising.

Now the goal of this subsystem is to prevent us from ever reaching that
point of running severely out of memory.

Setting a limit of 1024 should be enough for most workloads, and if that
limit is reached, you should still be far from a swapstorm while the
forkbomb can't spread any further.

People need to find the right balance between the limit they set and
the resources their containers may actually need.

> Also, what is the relationship between this and RLIMIT_NPROC?  Given
> that we have user namespaces, does that give us per-user,
> per-namespace, per-container rlimits?  If it doesn't, should it?  Will
> it?  If it does/will, how duplicative will that be?

That too is in the doc but I can restate it in the changelog.
This subsystem is meant to provide a per-container limit, where containers
are implemented by way of cgroups. RLIMIT_NPROC doesn't work in that scope
because a single cgroup could starve all the others by using a huge number
of tasks if the limit is per user.

So it's not a duplication. They have no relation to each other. When a new
task is forked, it is refused if it would exceed either the rlimit or the
cgroup task limit; otherwise it increases both counters.

They are rather complementary, just not in the same scope.
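
Schematically (this is not the actual kernel code, every helper name below
is made up for illustration), the two checks are independent and both
counters get charged when the fork is accepted:

static int fork_limit_checks(struct user_struct *user,
			     struct task_struct *child)
{
	/* per-user limit, unchanged by this series */
	if (user_task_count(user) >= user_task_rlimit(user))
		return -EAGAIN;

	/* per-cgroup limit added by the task counter subsystem */
	if (cgroup_task_limit_reached(child))
		return -EAGAIN;

	/* accepted: both the per-user and the per-cgroup counters grow */
	charge_user(user);
	charge_cgroup_task_counter(child);

	return 0;
}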

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 06/11] cgroups: Add res counter common ancestor searching
  2011-09-12 23:11 ` [PATCH 06/11] cgroups: Add res counter common ancestor searching Frederic Weisbecker
@ 2011-09-18 19:59   ` Kirill A. Shutemov
  2011-09-30 12:42     ` Frederic Weisbecker
  2011-10-01 15:07     ` Frederic Weisbecker
  0 siblings, 2 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2011-09-18 19:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Containers, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Tue, Sep 13, 2011 at 01:11:26AM +0200, Frederic Weisbecker wrote:
> Add a new API to find the common ancestor between two resource
> counters. This includes the passed resource counter themselves.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Acked-by: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Aditya Kali <adityakali@google.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: Tim Hockin <thockin@hockin.org>
> Cc: Tejun Heo <tj@kernel.org>
> ---
>  include/linux/res_counter.h |    3 +++
>  kernel/res_counter.c        |   22 ++++++++++++++++++++++
>  2 files changed, 25 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index de4ba29..558f39b 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -147,6 +147,9 @@ static inline void res_counter_uncharge(struct res_counter *counter,
>  	res_counter_uncharge_until(counter, NULL, val);
>  }
>  
> +struct res_counter *res_counter_common_ancestor(struct res_counter *l,
> +						struct res_counter *r);
> +
>  /**
>   * res_counter_margin - calculate chargeable space of a counter
>   * @cnt: the counter
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 0cc9c7e..22f07cc 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -91,6 +91,28 @@ void res_counter_uncharge_until(struct res_counter *counter,
>  	local_irq_restore(flags);
>  }
>  
> +/*
> + * Walk through r1 and r2 parents and try to find the closest common one
> + * between both. If none is found, it returns NULL.
> + */
> +struct res_counter *
> +res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
> +{
> +	struct res_counter *iter;
> +
> +	while (r1) {
> +		iter = r2;
> +		while (iter) {
> +			if (iter == r1)
> +				return iter;
> +			iter = iter->parent;
> +		}
> +
> +		r1 = r1->parent;
> +	}
> +
> +	return NULL;
> +}

Your implementation is O(n*m).

What about:

struct res_counter *
res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
{
	struct res_counter *iter;
	int r1_depth = 0, r2_depth = 0;

	for (iter = r1; iter; iter = iter->parent)
		r1_depth++;
	for (iter = r2; iter; iter = iter->parent)
		r2_depth++;

	while (r1_depth > r2_depth) {
		r1 = r1->parent;
		r1_depth--;
	}
	while (r2_depth > r1_depth) {
		r2 = r2->parent;
		r2_depth--;
	}

	while (r1 != r2) {
		r1 = r1->parent;
		r2 = r2->parent;
	}

	return r1;
}

Untested. It's O(n+m).

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 09/11] cgroups: Add a task counter subsystem
  2011-09-12 23:11 ` [PATCH 09/11] cgroups: Add a task counter subsystem Frederic Weisbecker
@ 2011-09-18 20:27   ` Kirill A. Shutemov
  2011-09-30 12:54     ` Frederic Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2011-09-18 20:27 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Containers, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo

On Tue, Sep 13, 2011 at 01:11:29AM +0200, Frederic Weisbecker wrote:
> +static struct cftype files[] = {
> +	{
> +		.name		= "limit",
> +		.read_u64	= task_counter_read_u64,
> +		.write_u64	= task_counter_write_u64,
> +		.private	= RES_LIMIT,
> +	},
> +
> +	{
> +		.name		= "usage",
> +		.read_u64	= task_counter_read_u64,

It would be nice to implement .register_event/.unregister_event as well.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 06/11] cgroups: Add res counter common ancestor searching
  2011-09-18 19:59   ` Kirill A. Shutemov
@ 2011-09-30 12:42     ` Frederic Weisbecker
  2011-10-01 15:07     ` Frederic Weisbecker
  1 sibling, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-30 12:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, LKML, Containers, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Sun, Sep 18, 2011 at 10:59:32PM +0300, Kirill A. Shutemov wrote:
> On Tue, Sep 13, 2011 at 01:11:26AM +0200, Frederic Weisbecker wrote:
> > Add a new API to find the common ancestor between two resource
> > counters. This includes the passed resource counter themselves.
> > 
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > Acked-by: Paul Menage <paul@paulmenage.org>
> > Cc: Li Zefan <lizf@cn.fujitsu.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Aditya Kali <adityakali@google.com>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Kay Sievers <kay.sievers@vrfy.org>
> > Cc: Tim Hockin <thockin@hockin.org>
> > Cc: Tejun Heo <tj@kernel.org>
> > ---
> >  include/linux/res_counter.h |    3 +++
> >  kernel/res_counter.c        |   22 ++++++++++++++++++++++
> >  2 files changed, 25 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index de4ba29..558f39b 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -147,6 +147,9 @@ static inline void res_counter_uncharge(struct res_counter *counter,
> >  	res_counter_uncharge_until(counter, NULL, val);
> >  }
> >  
> > +struct res_counter *res_counter_common_ancestor(struct res_counter *l,
> > +						struct res_counter *r);
> > +
> >  /**
> >   * res_counter_margin - calculate chargeable space of a counter
> >   * @cnt: the counter
> > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > index 0cc9c7e..22f07cc 100644
> > --- a/kernel/res_counter.c
> > +++ b/kernel/res_counter.c
> > @@ -91,6 +91,28 @@ void res_counter_uncharge_until(struct res_counter *counter,
> >  	local_irq_restore(flags);
> >  }
> >  
> > +/*
> > + * Walk through r1 and r2 parents and try to find the closest common one
> > + * between both. If none is found, it returns NULL.
> > + */
> > +struct res_counter *
> > +res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
> > +{
> > +	struct res_counter *iter;
> > +
> > +	while (r1) {
> > +		iter = r2;
> > +		while (iter) {
> > +			if (iter == r1)
> > +				return iter;
> > +			iter = iter->parent;
> > +		}
> > +
> > +		r1 = r1->parent;
> > +	}
> > +
> > +	return NULL;
> > +}
> 
> Your implementation is O(n*m).

Paul Menage said that shouldn't hurt because in practice
cgroup hierarchies are not very deep.

But I can take your implementation below; it can't hurt
and might behave better in some corner cases.

> 
> What about:
> 
> struct res_counter *
> res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
> {
> 	struct res_counter *iter;
> 	int r1_depth = 0, r2_depth = 0;
> 
> 	for (iter = r1; iter; iter = iter->parent)
> 		r1_depth++;
> 	for (iter = r2; iter; iter = iter->parent)
> 		r2_depth++;
> 
> 	while (r1_depth > r2_depth) {
> 		r1 = r1->parent;
> 		r1_depth--;
> 	}
> 	while (r2_depth > r1_depth) {
> 		r2 = r2->parent;
> 		r2_depth--;
> 	}
> 
> 	while (r1 != r2) {
> 		r1 = r1->parent;
> 		r2 = r2->parent;
> 	}
> 
> 	return r1;
> }
> 
> Untested. It's O(n+m).
> 
> -- 
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 09/11] cgroups: Add a task counter subsystem
  2011-09-18 20:27   ` Kirill A. Shutemov
@ 2011-09-30 12:54     ` Frederic Weisbecker
  2011-09-30 21:03       ` Kirill A. Shutemov
  0 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2011-09-30 12:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, LKML, Containers, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo

On Sun, Sep 18, 2011 at 11:27:10PM +0300, Kirill A. Shutemov wrote:
> On Tue, Sep 13, 2011 at 01:11:29AM +0200, Frederic Weisbecker wrote:
> > +static struct cftype files[] = {
> > +	{
> > +		.name		= "limit",
> > +		.read_u64	= task_counter_read_u64,
> > +		.write_u64	= task_counter_write_u64,
> > +		.private	= RES_LIMIT,
> > +	},
> > +
> > +	{
> > +		.name		= "usage",
> > +		.read_u64	= task_counter_read_u64,
> 
> It would be nice to implement .register_event/.unregister_event as well.

It adds several complications. I prefer to wait for someone requesting
that feature before doing it.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 09/11] cgroups: Add a task counter subsystem
  2011-09-30 12:54     ` Frederic Weisbecker
@ 2011-09-30 21:03       ` Kirill A. Shutemov
  2011-10-01 13:03         ` Frederic Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2011-09-30 21:03 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Containers, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo

On Fri, Sep 30, 2011 at 02:54:46PM +0200, Frederic Weisbecker wrote:
> On Sun, Sep 18, 2011 at 11:27:10PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Sep 13, 2011 at 01:11:29AM +0200, Frederic Weisbecker wrote:
> > > +static struct cftype files[] = {
> > > +	{
> > > +		.name		= "limit",
> > > +		.read_u64	= task_counter_read_u64,
> > > +		.write_u64	= task_counter_write_u64,
> > > +		.private	= RES_LIMIT,
> > > +	},
> > > +
> > > +	{
> > > +		.name		= "usage",
> > > +		.read_u64	= task_counter_read_u64,
> > 
> > It would be nice to implement .register_event/.unregister_event as well.
> 
> It adds several complications. I prefer to wait for someone requesting
> that feature before doing it.

It may be a good replacement for release_agent. Or not?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 09/11] cgroups: Add a task counter subsystem
  2011-09-30 21:03       ` Kirill A. Shutemov
@ 2011-10-01 13:03         ` Frederic Weisbecker
  0 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2011-10-01 13:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, LKML, Containers, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo

On Sat, Oct 01, 2011 at 12:03:36AM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 30, 2011 at 02:54:46PM +0200, Frederic Weisbecker wrote:
> > On Sun, Sep 18, 2011 at 11:27:10PM +0300, Kirill A. Shutemov wrote:
> > > On Tue, Sep 13, 2011 at 01:11:29AM +0200, Frederic Weisbecker wrote:
> > > > +static struct cftype files[] = {
> > > > +	{
> > > > +		.name		= "limit",
> > > > +		.read_u64	= task_counter_read_u64,
> > > > +		.write_u64	= task_counter_write_u64,
> > > > +		.private	= RES_LIMIT,
> > > > +	},
> > > > +
> > > > +	{
> > > > +		.name		= "usage",
> > > > +		.read_u64	= task_counter_read_u64,
> > > 
> > > It would be nice to implement .register_event/.unregister_event as well.
> > 
> > It adds several complications. I prefer to wait for someone requesting
> > that feature before doing it.
> 
> It may be a good replacement for release_agent. Or not?

Right. I just don't think it's a good idea to bring in a new feature and add
maintenance work for it if it may never be used. If someone
requests it for their needs, I'll be happy to implement it. We can still
do it incrementally in the future. Until then I don't see the need for it.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 06/11] cgroups: Add res counter common ancestor searching
  2011-09-18 19:59   ` Kirill A. Shutemov
  2011-09-30 12:42     ` Frederic Weisbecker
@ 2011-10-01 15:07     ` Frederic Weisbecker
  2011-10-01 15:19       ` Kirill A. Shutemov
  1 sibling, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2011-10-01 15:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, LKML, Containers, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Sun, Sep 18, 2011 at 10:59:32PM +0300, Kirill A. Shutemov wrote:
> On Tue, Sep 13, 2011 at 01:11:26AM +0200, Frederic Weisbecker wrote:
> > Add a new API to find the common ancestor between two resource
> > counters. This includes the passed resource counter themselves.
> > 
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > Acked-by: Paul Menage <paul@paulmenage.org>
> > Cc: Li Zefan <lizf@cn.fujitsu.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Aditya Kali <adityakali@google.com>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Kay Sievers <kay.sievers@vrfy.org>
> > Cc: Tim Hockin <thockin@hockin.org>
> > Cc: Tejun Heo <tj@kernel.org>
> > ---
> >  include/linux/res_counter.h |    3 +++
> >  kernel/res_counter.c        |   22 ++++++++++++++++++++++
> >  2 files changed, 25 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index de4ba29..558f39b 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -147,6 +147,9 @@ static inline void res_counter_uncharge(struct res_counter *counter,
> >  	res_counter_uncharge_until(counter, NULL, val);
> >  }
> >  
> > +struct res_counter *res_counter_common_ancestor(struct res_counter *l,
> > +						struct res_counter *r);
> > +
> >  /**
> >   * res_counter_margin - calculate chargeable space of a counter
> >   * @cnt: the counter
> > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > index 0cc9c7e..22f07cc 100644
> > --- a/kernel/res_counter.c
> > +++ b/kernel/res_counter.c
> > @@ -91,6 +91,28 @@ void res_counter_uncharge_until(struct res_counter *counter,
> >  	local_irq_restore(flags);
> >  }
> >  
> > +/*
> > + * Walk through r1 and r2 parents and try to find the closest common one
> > + * between both. If none is found, it returns NULL.
> > + */
> > +struct res_counter *
> > +res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
> > +{
> > +	struct res_counter *iter;
> > +
> > +	while (r1) {
> > +		iter = r2;
> > +		while (iter) {
> > +			if (iter == r1)
> > +				return iter;
> > +			iter = iter->parent;
> > +		}
> > +
> > +		r1 = r1->parent;
> > +	}
> > +
> > +	return NULL;
> > +}
> 
> Your implementation is O(n*m).
> 
> What about:
> 
> struct res_counter *
> res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
> {
> 	struct res_counter *iter;
> 	int r1_depth = 0, r2_depth = 0;
> 
> 	for (iter = r1; iter; iter = iter->parent)
> 		r1_depth++;
> 	for (iter = r2; iter; iter = iter->parent)
> 		r2_depth++;
> 
> 	while (r1_depth > r2_depth) {
> 		r1 = r1->parent;
> 		r1_depth--;
> 	}
> 	while (r2_depth > r1_depth) {
> 		r2 = r2->parent;
> 		r2_depth--;
> 	}
> 
> 	while (r1 != r2) {
> 		r1 = r1->parent;
> 		r2 = r2->parent;
> 	}
> 
> 	return r1;
> }

While applying this (works well btw) I realized you rewrote the patch.

Can you give me your "Signed-off-by:" so that I can correctly set the
authorship and the tags that come along with it?

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 06/11] cgroups: Add res counter common ancestor searching
  2011-10-01 15:07     ` Frederic Weisbecker
@ 2011-10-01 15:19       ` Kirill A. Shutemov
  0 siblings, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2011-10-01 15:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Containers, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Sat, Oct 01, 2011 at 05:07:03PM +0200, Frederic Weisbecker wrote:
> On Sun, Sep 18, 2011 at 10:59:32PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Sep 13, 2011 at 01:11:26AM +0200, Frederic Weisbecker wrote:
> > > Add a new API to find the common ancestor between two resource
> > > counters. This includes the passed resource counter themselves.
> > > 
> > > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > > Acked-by: Paul Menage <paul@paulmenage.org>
> > > Cc: Li Zefan <lizf@cn.fujitsu.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: Aditya Kali <adityakali@google.com>
> > > Cc: Oleg Nesterov <oleg@redhat.com>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Kay Sievers <kay.sievers@vrfy.org>
> > > Cc: Tim Hockin <thockin@hockin.org>
> > > Cc: Tejun Heo <tj@kernel.org>
> > > ---
> > >  include/linux/res_counter.h |    3 +++
> > >  kernel/res_counter.c        |   22 ++++++++++++++++++++++
> > >  2 files changed, 25 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > > index de4ba29..558f39b 100644
> > > --- a/include/linux/res_counter.h
> > > +++ b/include/linux/res_counter.h
> > > @@ -147,6 +147,9 @@ static inline void res_counter_uncharge(struct res_counter *counter,
> > >  	res_counter_uncharge_until(counter, NULL, val);
> > >  }
> > >  
> > > +struct res_counter *res_counter_common_ancestor(struct res_counter *l,
> > > +						struct res_counter *r);
> > > +
> > >  /**
> > >   * res_counter_margin - calculate chargeable space of a counter
> > >   * @cnt: the counter
> > > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > > index 0cc9c7e..22f07cc 100644
> > > --- a/kernel/res_counter.c
> > > +++ b/kernel/res_counter.c
> > > @@ -91,6 +91,28 @@ void res_counter_uncharge_until(struct res_counter *counter,
> > >  	local_irq_restore(flags);
> > >  }
> > >  
> > > +/*
> > > + * Walk through r1 and r2 parents and try to find the closest common one
> > > + * between both. If none is found, it returns NULL.
> > > + */
> > > +struct res_counter *
> > > +res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
> > > +{
> > > +	struct res_counter *iter;
> > > +
> > > +	while (r1) {
> > > +		iter = r2;
> > > +		while (iter) {
> > > +			if (iter == r1)
> > > +				return iter;
> > > +			iter = iter->parent;
> > > +		}
> > > +
> > > +		r1 = r1->parent;
> > > +	}
> > > +
> > > +	return NULL;
> > > +}
> > 
> > Your implementation is O(n*m).
> > 
> > What about:
> > 
> > struct res_counter *
> > res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
> > {
> > 	struct res_counter *iter;
> > 	int r1_depth = 0, r2_depth = 0;
> > 
> > 	for (iter = r1; iter; iter = iter->parent)
> > 		r1_depth++;
> > 	for (iter = r2; iter; iter = iter->parent)
> > 		r2_depth++;
> > 
> > 	while (r1_depth > r2_depth) {
> > 		r1 = r1->parent;
> > 		r1_depth--;
> > 	}
> > 	while (r2_depth > r1_depth) {
> > 		r2 = r2->parent;
> > 		r2_depth--;
> > 	}
> > 
> > 	while (r1 != r2) {
> > 		r1 = r1->parent;
> > 		r2 = r2->parent;
> > 	}
> > 
> > 	return r1;
> > }
> 
> While applying this (works well btw) I realize you rewrote the patch.
> 
> Can you give me your "Signed-off-by:" so that I correctly set the
> authorship and the tags that come along?

Sure.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2011-10-01 15:19 UTC | newest]

Thread overview: 27+ messages
2011-09-12 23:11 [PATCH 00/11 v5] cgroups: Task counter subsystem Frederic Weisbecker
2011-09-12 23:11 ` [PATCH 01/11] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
2011-09-12 23:11 ` [PATCH 02/11] cgroups: New resource counter inheritance API Frederic Weisbecker
2011-09-12 23:11 ` [PATCH 03/11] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks Frederic Weisbecker
2011-09-12 23:11 ` [PATCH 04/11] cgroups: New cancel_attach_task subsystem callback Frederic Weisbecker
2011-09-12 23:11 ` [PATCH 05/11] cgroups: Ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
2011-09-12 23:11 ` [PATCH 06/11] cgroups: Add res counter common ancestor searching Frederic Weisbecker
2011-09-18 19:59   ` Kirill A. Shutemov
2011-09-30 12:42     ` Frederic Weisbecker
2011-10-01 15:07     ` Frederic Weisbecker
2011-10-01 15:19       ` Kirill A. Shutemov
2011-09-12 23:11 ` [PATCH 07/11] res_counter: Allow charge failure pointer to be null Frederic Weisbecker
2011-09-12 23:11 ` [PATCH 08/11] cgroups: Pull up res counter charge failure interpretation to caller Frederic Weisbecker
2011-09-12 23:11 ` [PATCH 09/11] cgroups: Add a task counter subsystem Frederic Weisbecker
2011-09-18 20:27   ` Kirill A. Shutemov
2011-09-30 12:54     ` Frederic Weisbecker
2011-09-30 21:03       ` Kirill A. Shutemov
2011-10-01 13:03         ` Frederic Weisbecker
2011-09-12 23:11 ` [PATCH 10/11] cgroups: Allow subsystems to cancel a fork Frederic Weisbecker
2011-09-12 23:11 ` [PATCH 11/11] cgroups: Convert task counter to use the subsys fork callback Frederic Weisbecker
2011-09-13 14:32 ` [PATCH 00/11 v5] cgroups: Task counter subsystem Peter Zijlstra
2011-09-13 14:37   ` Frederic Weisbecker
2011-09-13 14:49     ` Peter Zijlstra
2011-09-13 15:01       ` Frederic Weisbecker
2011-09-13 15:21 ` Frederic Weisbecker
2011-09-13 22:23 ` Andrew Morton
2011-09-15 13:56   ` Frederic Weisbecker
