* [PATCH 00/10] cgroups: Task counter subsystem v6
From: Frederic Weisbecker @ 2011-10-03 19:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Kirill A. Shutemov, Containers

Hi Andrew,

This contains minor changes: mostly documentation and changelog
updates, a build fix for the !CONFIG_CGROUPS case, and a code
optimization in res_counter_common_ancestor().

It's hard to give statistics from testing this feature
given that the result is rather binary: we launch a forkbomb and
either we manage to stop and kill it or the system becomes unresponsive.
    
Meanwhile, one can find a testsuite at this address:
https://tglx.de/~fweisbec/task_counter_test.tar.gz
    
It performs several checks to ensure the interface and the behaviour
are reliable after common events like moving tasks around between
cgroups in a hierarchy, forking inside them, etc. It also launches a
forkbomb and tries to stop and kill it. So beware: don't run it on a
system that is doing serious things. Ensure you have CGROUP_TASK_COUNTER
set beforehand, or it may compress the Ten Plagues into your MBR and
inflate the whole lot after your next reboot.

Changes in v6:

- Update res_counter_common_ancestor() to be O(n+m) instead of O(n*m),
patch from Kirill A. Shutemov (6/10)
- Refine documentation and changelog to better explain the claims and
goals of this subsystem (10/10)
- Fix !CONFIG_CGROUPS build crash (10/10)
- Drop the intermediate state that used an ad-hoc fork hook.

Frederic Weisbecker (9):
  cgroups: Add res_counter_write_u64() API
  cgroups: New resource counter inheritance API
  cgroups: Add previous cgroup in can_attach_task/attach_task callbacks
  cgroups: New cancel_attach_task subsystem callback
  cgroups: Ability to stop res charge propagation on bounded ancestor
  res_counter: Allow charge failure pointer to be null
  cgroups: Pull up res counter charge failure interpretation to caller
  cgroups: Allow subsystems to cancel a fork
  cgroups: Add a task counter subsystem

Kirill A. Shutemov (1):
  cgroups: Add res counter common ancestor searching

 Documentation/cgroups/cgroups.txt          |   13 ++-
 Documentation/cgroups/resource_counter.txt |   20 +++-
 Documentation/cgroups/task_counter.txt     |  153 ++++++++++++++++++
 block/blk-cgroup.c                         |   12 +-
 include/linux/cgroup.h                     |   28 +++-
 include/linux/cgroup_subsys.h              |    8 +
 include/linux/res_counter.h                |   27 +++-
 init/Kconfig                               |    9 +
 kernel/Makefile                            |    1 +
 kernel/cgroup.c                            |   58 ++++++--
 kernel/cgroup_freezer.c                    |    9 +-
 kernel/cgroup_task_counter.c               |  239 ++++++++++++++++++++++++++++
 kernel/cpuset.c                            |    6 +-
 kernel/events/core.c                       |    5 +-
 kernel/exit.c                              |    2 +-
 kernel/fork.c                              |    7 +-
 kernel/res_counter.c                       |   97 ++++++++++--
 kernel/sched.c                             |    6 +-
 18 files changed, 644 insertions(+), 56 deletions(-)
 create mode 100644 Documentation/cgroups/task_counter.txt
 create mode 100644 kernel/cgroup_task_counter.c

-- 
1.7.5.4



* [PATCH 01/10] cgroups: Add res_counter_write_u64() API
From: Frederic Weisbecker @ 2011-10-03 19:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Kirill A. Shutemov, Containers

Extend the resource counter API with a mirror of
res_counter_read_u64() to make it handy to update a resource
counter value from a cgroup subsystem u64 value file.
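
As an illustrative sketch of the intended use (the subsystem name and
the css-to-counter helper below are hypothetical, not part of this
patch), a cftype u64 write handler could then be as simple as:

	static int foo_limit_write(struct cgroup *cgrp, struct cftype *cft,
				   u64 val)
	{
		struct res_counter *res = cgroup_foo_res_counter(cgrp);

		/* Plain u64 store, no string parsing needed */
		res_counter_write_u64(res, RES_LIMIT, val);

		return 0;
	}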

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Containers <containers@lists.linux-foundation.org>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   25 +++++++++++++++++++------
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..1b3fe05 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -82,6 +82,8 @@ int res_counter_memparse_write_strategy(const char *buf,
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buffer, write_strategy_fn write_strategy);
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 34683ef..0faafcc 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -168,12 +168,26 @@ int res_counter_memparse_write_strategy(const char *buf,
 	return 0;
 }
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	unsigned long long *target;
+	unsigned long flags;
+
+	/*
+	 * We need the lock to protect against concurrent add/dec on 32 bits.
+	 * No need to ifdef this out: it's seldom used.
+	 */
+	spin_lock_irqsave(&counter->lock, flags);
+	target = res_counter_member(counter, member);
+	*target = val;
+	spin_unlock_irqrestore(&counter->lock, flags);
+}
+
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buf, write_strategy_fn write_strategy)
 {
 	char *end;
-	unsigned long flags;
-	unsigned long long tmp, *val;
+	unsigned long long tmp;
 
 	if (write_strategy) {
 		if (write_strategy(buf, &tmp))
@@ -183,9 +197,8 @@ int res_counter_write(struct res_counter *counter, int member,
 		if (*end != '\0')
 			return -EINVAL;
 	}
-	spin_lock_irqsave(&counter->lock, flags);
-	val = res_counter_member(counter, member);
-	*val = tmp;
-	spin_unlock_irqrestore(&counter->lock, flags);
+
+	res_counter_write_u64(counter, member, tmp);
+
 	return 0;
 }
-- 
1.7.5.4



* [PATCH 02/10] cgroups: New resource counter inheritance API
From: Frederic Weisbecker @ 2011-10-03 19:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Kirill A. Shutemov, Containers

Provide an API to inherit a counter value from a parent.
This can be useful for implementing cgroup.clone_children on
a resource counter.

The resources of the children are in any case limited by those
of the parent, so this only provides a default setting
behaviour when clone_children is set.
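
As a hedged sketch (the helper that maps a cgroup to its res_counter
is hypothetical), a subsystem could use this from its post_clone()
callback, which runs when cgroup.clone_children is set:

	static void foo_post_clone(struct cgroup_subsys *ss,
				   struct cgroup *cgrp)
	{
		/* Default the new child's limit to its parent's value */
		res_counter_inherit(cgroup_foo_res_counter(cgrp), RES_LIMIT);
	}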

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Containers <containers@lists.linux-foundation.org>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   18 ++++++++++++++++++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 1b3fe05..109d118 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -84,6 +84,8 @@ int res_counter_write(struct res_counter *counter, int member,
 
 void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
 
+void res_counter_inherit(struct res_counter *counter, int member);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 0faafcc..37abf4e 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -202,3 +202,21 @@ int res_counter_write(struct res_counter *counter, int member,
 
 	return 0;
 }
+
+/*
+ * Simple inheritance implementation to get the same value
+ * as the parent. This doesn't enforce the child value to
+ * always stay below the parent's, but the child is subject
+ * to its parent's limitation anyway.
+ */
+void res_counter_inherit(struct res_counter *counter, int member)
+{
+	struct res_counter *parent;
+	unsigned long long val;
+
+	parent = counter->parent;
+	if (parent) {
+		val = res_counter_read_u64(parent, member);
+		res_counter_write_u64(counter, member, val);
+	}
+}
-- 
1.7.5.4



* [PATCH 03/10] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks
From: Frederic Weisbecker @ 2011-10-03 19:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Kirill A. Shutemov, Containers

This prepares for the integration of the upcoming max number of
tasks cgroup subsystem, which will need to release some resources
from the previous cgroup.
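
As an illustrative sketch (hypothetical subsystem and helper names),
the extra parameter lets a per-task callback drop the charge held in
the task's previous cgroup once the attachment is committed:

	static void foo_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
				    struct task_struct *tsk)
	{
		/*
		 * The charge on the new cgroup was taken in
		 * can_attach_task(); release the one the task was
		 * holding in its previous cgroup.
		 */
		res_counter_uncharge(cgroup_foo_res_counter(old_cgrp), 1);
	}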

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Containers <containers@lists.linux-foundation.org>
---
 Documentation/cgroups/cgroups.txt |    6 ++++--
 block/blk-cgroup.c                |   12 ++++++++----
 include/linux/cgroup.h            |    6 ++++--
 kernel/cgroup.c                   |   11 +++++++----
 kernel/cgroup_freezer.c           |    3 ++-
 kernel/cpuset.c                   |    6 ++++--
 kernel/events/core.c              |    5 +++--
 kernel/sched.c                    |    6 ++++--
 8 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index cd67e90..0621e93 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -605,7 +605,8 @@ called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
 attach() or cancel_attach() will be called in future.
 
-int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+int can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+		    struct task_struct *tsk);
 (cgroup_mutex held by caller)
 
 As can_attach, but for operations that must be run once per task to be
@@ -635,7 +636,8 @@ void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 Called after the task has been attached to the cgroup, to allow any
 post-attachment activity that requires memory allocations or blocking.
 
-void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+void attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+		 struct task_struct *tsk);
 (cgroup_mutex held by caller)
 
 As attach, but for operations that must be run once per task to be attached,
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bcaf16e..6eddc5f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,8 +30,10 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
 						  struct cgroup *);
-static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
-static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
+static int blkiocg_can_attach_task(struct cgroup *, struct cgroup *,
+				   struct task_struct *);
+static void blkiocg_attach_task(struct cgroup *, struct cgroup *,
+				struct task_struct *);
 static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
 static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 
@@ -1614,7 +1616,8 @@ done:
  * of the main cic data structures.  For now we allow a task to change
  * its cgroup only if it's the only owner of its ioc.
  */
-static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static int blkiocg_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				   struct task_struct *tsk)
 {
 	struct io_context *ioc;
 	int ret = 0;
@@ -1629,7 +1632,8 @@ static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	return ret;
 }
 
-static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static void blkiocg_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				struct task_struct *tsk)
 {
 	struct io_context *ioc;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index da7e4bc..ed34eb8 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -468,11 +468,13 @@ struct cgroup_subsys {
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			  struct task_struct *tsk);
-	int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
+	int (*can_attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			       struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			      struct task_struct *tsk);
 	void (*pre_attach)(struct cgroup *cgrp);
-	void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
+	void (*attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			    struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		       struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 84bdace..fafebdb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1844,7 +1844,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 			}
 		}
 		if (ss->can_attach_task) {
-			retval = ss->can_attach_task(cgrp, tsk);
+			retval = ss->can_attach_task(cgrp, oldcgrp, tsk);
 			if (retval) {
 				failed_ss = ss;
 				goto out;
@@ -1860,7 +1860,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		if (ss->pre_attach)
 			ss->pre_attach(cgrp);
 		if (ss->attach_task)
-			ss->attach_task(cgrp, tsk);
+			ss->attach_task(cgrp, oldcgrp, tsk);
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
@@ -2075,7 +2075,10 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			/* run on each task in the threadgroup. */
 			for (i = 0; i < group_size; i++) {
 				tsk = flex_array_get_ptr(group, i);
-				retval = ss->can_attach_task(cgrp, tsk);
+				oldcgrp = task_cgroup_from_root(tsk, root);
+
+				retval = ss->can_attach_task(cgrp,
+							     oldcgrp, tsk);
 				if (retval) {
 					failed_ss = ss;
 					cancel_failed_ss = true;
@@ -2141,7 +2144,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			/* attach each task to each subsystem */
 			for_each_subsys(root, ss) {
 				if (ss->attach_task)
-					ss->attach_task(cgrp, tsk);
+					ss->attach_task(cgrp, oldcgrp, tsk);
 			}
 		} else {
 			BUG_ON(retval != -ESRCH);
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e691818..c1421a1 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -175,7 +175,8 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	return 0;
 }
 
-static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static int freezer_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				   struct task_struct *tsk)
 {
 	rcu_read_lock();
 	if (__cgroup_freezing_or_frozen(tsk)) {
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 10131fd..427be38 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1390,7 +1390,8 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	return 0;
 }
 
-static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
+static int cpuset_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				  struct task_struct *task)
 {
 	return security_task_setscheduler(task);
 }
@@ -1418,7 +1419,8 @@ static void cpuset_pre_attach(struct cgroup *cont)
 }
 
 /* Per-thread attachment work. */
-static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
+static void cpuset_attach_task(struct cgroup *cont, struct cgroup *old,
+			       struct task_struct *tsk)
 {
 	int err;
 	struct cpuset *cs = cgroup_cs(cont);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b8785e2..509464e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7001,7 +7001,8 @@ static int __perf_cgroup_move(void *info)
 }
 
 static void
-perf_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *task)
+perf_cgroup_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			struct task_struct *task)
 {
 	task_function_call(task, __perf_cgroup_move, task);
 }
@@ -7017,7 +7018,7 @@ static void perf_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	if (!(task->flags & PF_EXITING))
 		return;
 
-	perf_cgroup_attach_task(cgrp, task);
+	perf_cgroup_attach_task(cgrp, old_cgrp, task);
 }
 
 struct cgroup_subsys perf_subsys = {
diff --git a/kernel/sched.c b/kernel/sched.c
index ccacdbd..72ce1b1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8967,7 +8967,8 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 }
 
 static int
-cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			   struct task_struct *tsk)
 {
 #ifdef CONFIG_RT_GROUP_SCHED
 	if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
@@ -8981,7 +8982,8 @@ cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 }
 
 static void
-cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+cpu_cgroup_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+		       struct task_struct *tsk)
 {
 	sched_move_task(tsk);
 }
-- 
1.7.5.4



* [PATCH 04/10] cgroups: New cancel_attach_task subsystem callback
From: Frederic Weisbecker @ 2011-10-03 19:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Kirill A. Shutemov, Containers

To cancel a process attachment on a subsystem, we only call the
cancel_attach() callback once on the leader but we have no
way to cancel the attachment individually for each member of
the process group.

This is going to be needed for the max number of tasks subsystem
that is coming.

To prepare for this integration, call a new cancel_attach_task()
callback on each task of the group until we reach the member that
failed to attach.
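
A minimal hypothetical sketch of such a callback, reverting a charge
speculatively taken in can_attach_task() (helper name assumed):

	static void foo_cancel_attach_task(struct cgroup *cgrp,
					   struct task_struct *tsk)
	{
		/* Undo the per-task charge taken in can_attach_task() */
		res_counter_uncharge(cgroup_foo_res_counter(cgrp), 1);
	}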

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Containers <containers@lists.linux-foundation.org>
---
 Documentation/cgroups/cgroups.txt |    7 +++++++
 include/linux/cgroup.h            |    2 ++
 kernel/cgroup.c                   |   24 ++++++++++++++++++++----
 3 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 0621e93..3bc1dc9 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -623,6 +623,13 @@ function, so that the subsystem can implement a rollback. If not, not necessary.
 This will be called only about subsystems whose can_attach() operation have
 succeeded.
 
+void cancel_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+(cgroup_mutex held by caller)
+
+As cancel_attach, but for operations that must be cancelled once per
+task that wanted to be attached. This typically reverts the effect of
+can_attach_task().
+
 void pre_attach(struct cgroup *cgrp);
 (cgroup_mutex held by caller)
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ed34eb8..b62cf5e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -472,6 +472,8 @@ struct cgroup_subsys {
 			       struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			      struct task_struct *tsk);
+	void (*cancel_attach_task)(struct cgroup *cgrp,
+				   struct task_struct *tsk);
 	void (*pre_attach)(struct cgroup *cgrp);
 	void (*attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp,
 			    struct task_struct *tsk);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index fafebdb..709baef 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1883,6 +1883,9 @@ out:
 				 * remaining subsystems.
 				 */
 				break;
+
+			if (ss->cancel_attach_task)
+				ss->cancel_attach_task(cgrp, tsk);
 			if (ss->cancel_attach)
 				ss->cancel_attach(ss, cgrp, tsk);
 		}
@@ -1992,7 +1995,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 {
 	int retval, i, group_size;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
-	bool cancel_failed_ss = false;
+	struct task_struct *failed_task = NULL;
 	/* guaranteed to be initialized later, but the compiler needs this */
 	struct cgroup *oldcgrp = NULL;
 	struct css_set *oldcg;
@@ -2081,7 +2084,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 							     oldcgrp, tsk);
 				if (retval) {
 					failed_ss = ss;
-					cancel_failed_ss = true;
+					failed_task = tsk;
 					goto out_cancel_attach;
 				}
 			}
@@ -2146,8 +2149,11 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 				if (ss->attach_task)
 					ss->attach_task(cgrp, oldcgrp, tsk);
 			}
+		} else if (retval == -ESRCH) {
+			if (ss->cancel_attach_task)
+				ss->cancel_attach_task(cgrp, tsk);
 		} else {
-			BUG_ON(retval != -ESRCH);
+			BUG_ON(1);
 		}
 	}
 	/* nothing is sensitive to fork() after this point. */
@@ -2179,8 +2185,18 @@ out_cancel_attach:
 	/* same deal as in cgroup_attach_task */
 	if (retval) {
 		for_each_subsys(root, ss) {
+			if (ss->cancel_attach_task && (ss != failed_ss ||
+						       failed_task)) {
+				for (i = 0; i < group_size; i++) {
+					tsk = flex_array_get_ptr(group, i);
+					if (tsk == failed_task)
+						break;
+					ss->cancel_attach_task(cgrp, tsk);
+				}
+			}
+
 			if (ss == failed_ss) {
-				if (cancel_failed_ss && ss->cancel_attach)
+				if (failed_task && ss->cancel_attach)
 					ss->cancel_attach(ss, cgrp, leader);
 				break;
 			}
-- 
1.7.5.4



* [PATCH 05/10] cgroups: Ability to stop res charge propagation on bounded ancestor
From: Frederic Weisbecker @ 2011-10-03 19:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Kirill A. Shutemov, Containers

Moving a task from one cgroup to another may require subtracting
its resource charge from the old cgroup and adding it to the new one.

For this to happen, the uncharge/charge propagation can simply stop
when we reach the common ancestor of the two cgroups. Beyond the
performance benefit, we also want to avoid temporarily overloading
the common ancestors with an inaccurate resource counter usage, as
would happen if we charged the new cgroup first and uncharged the
old one thereafter. This is going to be a requirement for the coming
max number of tasks subsystem.

To solve this, provide a pair of new APIs that can charge/uncharge
a resource counter until we reach a given ancestor.
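
An illustrative sketch, assuming "res" and "ancestor" are counters in
the same hierarchy:

	struct res_counter *fail;

	/*
	 * Charge one unit on res and its ancestors, stopping before
	 * "ancestor" (which is left untouched). On failure, "fail"
	 * points to the counter whose limit was hit.
	 */
	if (res_counter_charge_until(res, ancestor, 1, &fail) < 0)
		return -EINVAL;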

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Containers <containers@lists.linux-foundation.org>
---
 Documentation/cgroups/resource_counter.txt |   18 +++++++++++++++++-
 include/linux/res_counter.h                |   20 +++++++++++++++++---
 kernel/res_counter.c                       |   13 ++++++++-----
 3 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..a2cd05b 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -83,7 +83,15 @@ to work with it.
 	res_counter->lock internally (it must be called with res_counter->lock
 	held).
 
- e. void res_counter_uncharge[_locked]
+ e. int res_counter_charge_until(struct res_counter *counter,
+			     struct res_counter *limit, unsigned long val,
+			     struct res_counter **limit_fail_at)
+
+	The same as res_counter_charge(), but the charge propagation up
+	the hierarchy stops before (excludes) the given "limit" counter.
+
+
+ f. void res_counter_uncharge[_locked]
 			(struct res_counter *rc, unsigned long val)
 
 	When a resource is released (freed) it should be de-accounted
@@ -92,6 +100,14 @@ to work with it.
 
 	The _locked routines imply that the res_counter->lock is taken.
 
+
+ g. void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *limit,
+				unsigned long val)
+
+	The same as res_counter_uncharge(), but the uncharge propagation up
+	the hierarchy stops before (excludes) the given "limit" counter.
+
  2.1 Other accounting routines
 
     There are more routines that may help you with common needs, like
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 109d118..de4ba29 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -117,8 +117,16 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
 
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
-int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
+int __must_check res_counter_charge_until(struct res_counter *counter,
+					  struct res_counter *limit,
+					  unsigned long val,
+					  struct res_counter **limit_fail_at);
+static inline int __must_check
+res_counter_charge(struct res_counter *counter, unsigned long val,
+		   struct res_counter **limit_fail_at)
+{
+	return res_counter_charge_until(counter, NULL, val, limit_fail_at);
+}
 
 /*
  * uncharge - tell that some portion of the resource is released
@@ -131,7 +139,13 @@ int __must_check res_counter_charge(struct res_counter *counter,
  */
 
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val);
+void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *limit, unsigned long val);
+static inline void res_counter_uncharge(struct res_counter *counter,
+					unsigned long val)
+{
+	res_counter_uncharge_until(counter, NULL, val);
+}
 
 /**
  * res_counter_margin - calculate chargeable space of a counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 37abf4e..0cc9c7e 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -35,8 +35,9 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 	return 0;
 }
 
-int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
+int res_counter_charge_until(struct res_counter *counter,
+			     struct res_counter *limit, unsigned long val,
+			     struct res_counter **limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
@@ -44,7 +45,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
 
 	*limit_fail_at = NULL;
 	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
+	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
 		spin_unlock(&c->lock);
@@ -74,13 +75,15 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
 	counter->usage -= val;
 }
 
-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *limit,
+				unsigned long val)
 {
 	unsigned long flags;
 	struct res_counter *c;
 
 	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
+	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		res_counter_uncharge_locked(c, val);
 		spin_unlock(&c->lock);
-- 
1.7.5.4



* [PATCH 06/10] cgroups: Add res counter common ancestor searching
From: Frederic Weisbecker @ 2011-10-03 19:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Kirill A. Shutemov, Li Zefan, Paul Menage, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers, Frederic Weisbecker

From: "Kirill A. Shutemov" <kirill@shutemov.name>

Add a new API to find the closest common ancestor of two resource
counters. The search includes the passed resource counters themselves.
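
A hedged sketch of how this pairs with the *_until() helpers from the
previous patch when moving a charge between two cgroups ("old_res" and
"new_res" are placeholder counters):

	struct res_counter *common, *fail;

	common = res_counter_common_ancestor(old_res, new_res);

	/* Charge the new branch only up to the common ancestor... */
	if (res_counter_charge_until(new_res, common, 1, &fail) < 0)
		return -EINVAL;

	/*
	 * ...and uncharge the old branch up to the same point, so the
	 * common ancestors never see a transient inaccurate usage.
	 */
	res_counter_uncharge_until(old_res, common, 1);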

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Containers <containers@lists.linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/res_counter.h |    3 +++
 kernel/res_counter.c        |   33 +++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index de4ba29..558f39b 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -147,6 +147,9 @@ static inline void res_counter_uncharge(struct res_counter *counter,
 	res_counter_uncharge_until(counter, NULL, val);
 }
 
+struct res_counter *res_counter_common_ancestor(struct res_counter *l,
+						struct res_counter *r);
+
 /**
  * res_counter_margin - calculate chargeable space of a counter
  * @cnt: the counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 0cc9c7e..8f4c728 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -91,6 +91,39 @@ void res_counter_uncharge_until(struct res_counter *counter,
 	local_irq_restore(flags);
 }
 
+/*
+ * Walk up the parent chains of r1 and r2 to find their closest
+ * common ancestor. Returns NULL if there is none.
+ */
+struct res_counter *
+res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
+{
+	struct res_counter *iter;
+	int r1_depth = 0, r2_depth = 0;
+
+	for (iter = r1; iter; iter = iter->parent)
+		r1_depth++;
+
+	for (iter = r2; iter; iter = iter->parent)
+		r2_depth++;
+
+	while (r1_depth > r2_depth) {
+		r1 = r1->parent;
+		r1_depth--;
+	}
+
+	while (r2_depth > r1_depth) {
+		r2 = r2->parent;
+		r2_depth--;
+	}
+
+	while (r1 != r2) {
+		r1 = r1->parent;
+		r2 = r2->parent;
+	}
+
+	return r1;
+}
 
 static inline unsigned long long *
 res_counter_member(struct res_counter *counter, int member)
-- 
1.7.5.4



* [PATCH 07/10] res_counter: Allow charge failure pointer to be null
From: Frederic Weisbecker @ 2011-10-03 19:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Kirill A. Shutemov, Containers

So that callers of res_counter_charge() don't have to create and
pass this pointer when they aren't interested in it.
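
A caller that doesn't care where the charge failed can then simply do
(illustrative only):

	/* No dummy limit_fail_at variable needed anymore */
	if (res_counter_charge(res, 1, NULL) < 0)
		return -ENOMEM;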

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Containers <containers@lists.linux-foundation.org>
---
 kernel/res_counter.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 8f4c728..6b36823 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -43,14 +43,16 @@ int res_counter_charge_until(struct res_counter *counter,
 	unsigned long flags;
 	struct res_counter *c, *u;
 
-	*limit_fail_at = NULL;
+	if (limit_fail_at)
+		*limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
 		spin_unlock(&c->lock);
 		if (ret < 0) {
-			*limit_fail_at = c;
+			if (limit_fail_at)
+				*limit_fail_at = c;
 			goto undo;
 		}
 	}
-- 
1.7.5.4



* [PATCH 08/10] cgroups: Pull up res counter charge failure interpretation to caller
From: Frederic Weisbecker @ 2011-10-03 19:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Kirill A. Shutemov, Containers

res_counter_charge() always returns -ENOMEM when the limit is reached
and the charge thus can't happen.

However it's up to the caller to interpret this failure and return
the appropriate error value. The task counter subsystem will need
to report to the user that a fork() has been cancelled because some
limit was reached, not because we are too short on memory.

Fix this by returning -1 when res_counter_charge() fails.
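
Each caller can then pick the error code that fits its context; a
hypothetical sketch:

	if (res_counter_charge(res, 1, NULL) < 0)
		/* Limit reached: report EAGAIN as fork() does, not ENOMEM */
		return -EAGAIN;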

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Containers <containers@lists.linux-foundation.org>
---
 Documentation/cgroups/resource_counter.txt |    2 ++
 kernel/res_counter.c                       |    2 +-
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index a2cd05b..24ec61c 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -76,6 +76,8 @@ to work with it.
 	limit_fail_at parameter is set to the particular res_counter element
 	where the charging failed.
 
+	It returns 0 on success and -1 on failure.
+
  d. int res_counter_charge_locked
 			(struct res_counter *rc, unsigned long val)
 
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 6b36823..b814d6c 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -26,7 +26,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 {
 	if (counter->usage + val > counter->limit) {
 		counter->failcnt++;
-		return -ENOMEM;
+		return -1;
 	}
 
 	counter->usage += val;
-- 
1.7.5.4



* [PATCH 09/10] cgroups: Allow subsystems to cancel a fork
From: Frederic Weisbecker @ 2011-10-03 19:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Kirill A. Shutemov, Containers

Let the subsystems' fork callbacks return an error value so
that they can cancel a fork. This is going to be used by
the task counter subsystem to implement its limit.
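
A minimal hypothetical sketch of a subsystem vetoing a fork from its
callback (the limit-check helper is assumed, not part of this patch):

	static int foo_fork(struct cgroup_subsys *ss, struct task_struct *task)
	{
		/* A non-zero return makes copy_process() abort the fork */
		if (foo_limit_reached(task_cgroup(task, foo_subsys_id)))
			return -EAGAIN;

		return 0;
	}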

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Containers <containers@lists.linux-foundation.org>
---
 include/linux/cgroup.h  |   20 ++++++++++++++------
 kernel/cgroup.c         |   23 +++++++++++++++++++----
 kernel/cgroup_freezer.c |    6 ++++--
 kernel/exit.c           |    2 +-
 kernel/fork.c           |    7 +++++--
 5 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b62cf5e..9c8151e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -17,10 +17,11 @@
 #include <linux/rwsem.h>
 #include <linux/idr.h>
 
+struct cgroup_subsys;
+
 #ifdef CONFIG_CGROUPS
 
 struct cgroupfs_root;
-struct cgroup_subsys;
 struct inode;
 struct cgroup;
 struct css_id;
@@ -32,9 +33,11 @@ extern int cgroup_lock_is_held(void);
 extern bool cgroup_lock_live_group(struct cgroup *cgrp);
 extern void cgroup_unlock(void);
 extern void cgroup_fork(struct task_struct *p);
-extern void cgroup_fork_callbacks(struct task_struct *p);
+extern int cgroup_fork_callbacks(struct task_struct *p,
+				 struct cgroup_subsys **failed_ss);
 extern void cgroup_post_fork(struct task_struct *p);
-extern void cgroup_exit(struct task_struct *p, int run_callbacks);
+extern void cgroup_exit(struct task_struct *p, int run_callbacks,
+			struct cgroup_subsys *failed_ss);
 extern int cgroupstats_build(struct cgroupstats *stats,
 				struct dentry *dentry);
 extern int cgroup_load_subsys(struct cgroup_subsys *ss);
@@ -479,7 +482,7 @@ struct cgroup_subsys {
 			    struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		       struct cgroup *old_cgrp, struct task_struct *tsk);
-	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
+	int (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			struct cgroup *old_cgrp, struct task_struct *task);
 	int (*populate)(struct cgroup_subsys *ss,
@@ -636,9 +639,14 @@ struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_fork(struct task_struct *p) {}
-static inline void cgroup_fork_callbacks(struct task_struct *p) {}
+static inline int cgroup_fork_callbacks(struct task_struct *p,
+					struct cgroup_subsys **failed_ss)
+{
+	return 0;
+}
 static inline void cgroup_post_fork(struct task_struct *p) {}
-static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
+static inline void cgroup_exit(struct task_struct *p, int callbacks,
+			       struct cgroup_subsys *failed_ss) {}
 
 static inline void cgroup_lock(void) {}
 static inline void cgroup_unlock(void) {}
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 709baef..018df9d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4540,8 +4540,11 @@ void cgroup_fork(struct task_struct *child)
  * tasklist. No need to take any locks since no-one can
  * be operating on this task.
  */
-void cgroup_fork_callbacks(struct task_struct *child)
+int cgroup_fork_callbacks(struct task_struct *child,
+			  struct cgroup_subsys **failed_ss)
 {
+	int err;
+
 	if (need_forkexit_callback) {
 		int i;
 		/*
@@ -4551,10 +4554,17 @@ void cgroup_fork_callbacks(struct task_struct *child)
 		 */
 		for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
 			struct cgroup_subsys *ss = subsys[i];
-			if (ss->fork)
-				ss->fork(ss, child);
+			if (ss->fork) {
+				err = ss->fork(ss, child);
+				if (err) {
+					*failed_ss = ss;
+					return err;
+				}
+			}
 		}
 	}
+
+	return 0;
 }
 
 /**
@@ -4612,7 +4622,8 @@ void cgroup_post_fork(struct task_struct *child)
  *    which wards off any cgroup_attach_task() attempts, or task is a failed
  *    fork, never visible to cgroup_attach_task.
  */
-void cgroup_exit(struct task_struct *tsk, int run_callbacks)
+void cgroup_exit(struct task_struct *tsk, int run_callbacks,
+		 struct cgroup_subsys *failed_ss)
 {
 	struct css_set *cg;
 	int i;
@@ -4641,6 +4652,10 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
 		 */
 		for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
 			struct cgroup_subsys *ss = subsys[i];
+
+			if (ss == failed_ss)
+				break;
+
 			if (ss->exit) {
 				struct cgroup *old_cgrp =
 					rcu_dereference_raw(cg->subsys[i])->cgroup;
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index c1421a1..30a4332 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -187,7 +187,7 @@ static int freezer_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
 	return 0;
 }
 
-static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
+static int freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 {
 	struct freezer *freezer;
 
@@ -207,7 +207,7 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 	 * following check.
 	 */
 	if (!freezer->css.cgroup->parent)
-		return;
+		return 0;
 
 	spin_lock_irq(&freezer->lock);
 	BUG_ON(freezer->state == CGROUP_FROZEN);
@@ -216,6 +216,8 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 	if (freezer->state == CGROUP_FREEZING)
 		freeze_task(task, true);
 	spin_unlock_irq(&freezer->lock);
+
+	return 0;
 }
 
 /*
diff --git a/kernel/exit.c b/kernel/exit.c
index 2913b35..abab32b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -994,7 +994,7 @@ NORET_TYPE void do_exit(long code)
 	 */
 	perf_event_exit_task(tsk);
 
-	cgroup_exit(tsk, 1);
+	cgroup_exit(tsk, 1, NULL);
 
 	if (group_dead)
 		disassociate_ctty(1);
diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..ee8abdb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1058,6 +1058,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	struct cgroup_subsys *cgroup_failed_ss = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1312,8 +1313,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	/* Now that the task is set up, run cgroup callbacks if
 	 * necessary. We need to run them before the task is visible
 	 * on the tasklist. */
-	cgroup_fork_callbacks(p);
+	retval = cgroup_fork_callbacks(p, &cgroup_failed_ss);
 	cgroup_callbacks_done = 1;
+	if (retval)
+		goto bad_fork_free_pid;
 
 	/* Need tasklist lock for parent etc handling! */
 	write_lock_irq(&tasklist_lock);
@@ -1419,7 +1422,7 @@ bad_fork_cleanup_cgroup:
 #endif
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_fork_read_unlock(current);
-	cgroup_exit(p, cgroup_callbacks_done);
+	cgroup_exit(p, cgroup_callbacks_done, cgroup_failed_ss);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);
 bad_fork_cleanup_count:
-- 
1.7.5.4



* [PATCH 10/10] cgroups: Add a task counter subsystem
From: Frederic Weisbecker @ 2011-10-03 19:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Kirill A. Shutemov, Containers

Add a new subsystem to limit the number of running tasks,
similar to the NR_PROC rlimit but in the scope of a cgroup.

The user can set an upper bound that is checked every time
a task forks in a cgroup or is moved into a cgroup with that
subsystem bound.

The primary goal is to protect against forkbombs that explode
inside a container. The traditional NR_PROC rlimit is not
effective in that case because if we run containers in parallel
under the same user, one of them could starve all the others
by spawning a high number of tasks close to the user-wide limit.

This is a prevention against forkbombs, so it's not meant to
cure the effects of a forkbomb once the system is in a state
where it's no longer responsive. It's aimed at preventing the
system from ever reaching that state by stopping the spread of
tasks early. While defining the limit on the allowed number of
tasks, it's up to the user to find the right balance between the
resources its containers may need and what it can afford to
provide.

As it's totally dissociated from the NR_PROC rlimit, both can
be complementary: the cgroup task counter can set an upper
bound per container and the rlimit can be an upper bound on the
overall set of containers.

Also this subsystem can be used to kill all the tasks in a cgroup
without racing against concurrent forks: once the limit of tasks
is set to 0, any further fork is rejected. This is a good way to
kill a forkbomb in a container, or simply to kill any container,
without the need to retry an unbounded number of times.
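
In rough outline, the fork-path check boils down to charging the task
counter and aborting the fork when that fails (simplified sketch; the
helper mapping a cgroup to its res_counter is named here for
illustration and may differ from the actual code below):

	static int task_counter_fork(struct cgroup_subsys *ss,
				     struct task_struct *child)
	{
		struct cgroup *cgrp = task_cgroup(child, tasks_subsys_id);

		/* The root cgroup has no task counter */
		if (!cgrp->parent)
			return 0;

		if (res_counter_charge(cgroup_task_res_counter(cgrp), 1, NULL) < 0)
			return -EAGAIN;

		return 0;
	}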

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Containers <containers@lists.linux-foundation.org>
---
 Documentation/cgroups/task_counter.txt |  153 ++++++++++++++++++++
 include/linux/cgroup_subsys.h          |    8 +
 init/Kconfig                           |    9 ++
 kernel/Makefile                        |    1 +
 kernel/cgroup_task_counter.c           |  239 ++++++++++++++++++++++++++++++++
 5 files changed, 410 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/task_counter.txt
 create mode 100644 kernel/cgroup_task_counter.c

diff --git a/Documentation/cgroups/task_counter.txt b/Documentation/cgroups/task_counter.txt
new file mode 100644
index 0000000..1562d88
--- /dev/null
+++ b/Documentation/cgroups/task_counter.txt
@@ -0,0 +1,153 @@
+Task counter subsystem
+
+1. Description
+
+The task counter subsystem limits the number of tasks running
+inside a given cgroup. It behaves like the NR_PROC rlimit but in
+the scope of a cgroup instead of a user.
+
+It has two typical usecases, although more can probably be found:
+
+1.1 Protection against forkbomb in a container
+
+One usecase is to protect against forkbombs that explode inside
+a container when that container is implemented using a cgroup. The
+NR_PROC rlimit is known to be a working protection against this type
+of attack but is not suitable anymore when we run containers in
+parallel under the same user. One container could starve all the
+others by spawning a high number of tasks close to the rlimit
+boundary. So in this case we need this limitation to be done at a
+per-cgroup granularity.
+
+Note this works by preventing forkbomb propagation. It doesn't cure
+the effects of a forkbomb that has already grown enough to make
+the system barely responsive.
+of tasks, it's up to the admin to find the right balance between the
+possible needs of a container and the resources the system can afford
+to provide.
+
+Also the NR_PROC rlimit and this cgroup subsystem are totally
+dissociated. But they can be complementary. The task counter limits
+the containers and the rlimit can provide an upper bound on the whole
+set of containers.
+
+
+1.2 Kill tasks inside a cgroup
+
+Another usecase comes along with the forkbomb prevention: it brings
+the ability to kill all tasks inside a cgroup without races. By
+setting the limit of running tasks to 0, one can prevent any
+further fork inside a cgroup and then kill all of its tasks without
+the need to retry an unbounded number of times due to races between
+kills and forks running in parallel (more details in "Kill a cgroup
+safely" paragraph).
+
+This is useful to kill a forkbomb for example. When its gazillions
+of forks are competing with the kills, one needs to ensure this
+operation won't run in a nearly endless loop of retries.
+
+And more generally it is useful to kill a cgroup in a bounded
+number of passes.
+
+
+2. Interface
+
+When a hierarchy is mounted with the task counter subsystem bound, it
+adds two files into the cgroups directories, except the root one:
+
+- tasks.usage contains the number of tasks running inside a cgroup and
+its children in the hierarchy (see paragraph about Inheritance).
+
+- tasks.limit contains the maximum number of tasks that can run inside
+a cgroup. We check this limit when a task forks or when it is migrated
+to a cgroup.
+
+Note that the tasks.limit value can be forced below tasks.usage, in which
+case any new task in the cgroup will be rejected until the tasks.usage
+value goes below tasks.limit.
+
+For optimization reasons, the root directory of a hierarchy doesn't have
+a task counter.
+
+
+3. Inheritance
+
+When a task is added to a cgroup, by way of a cgroup migration or a fork,
+it increases the task counter of that cgroup and of all its ancestors.
+Hence a cgroup is also subject to the limit of its ancestors.
+
+In the following hierarchy:
+
+
+             A
+             |
+             B
+           /   \
+          C     D
+
+
+We have 1 task running in B, one running in C and none running in D.
+It means we have tasks.usage = 1 in C and tasks.usage = 2 in B because
+B counts its task and those of its children.
+
+Now let's set tasks.limit = 2 in B and tasks.limit = 1 in D.
+If we move a new task in D, it will be refused because the limit in B has
+been reached already.
+
+
+4. Kill a cgroup safely
+
+As explained in the description, this subsystem is also helpful to
+kill all tasks in a cgroup safely, after setting tasks.limit to 0,
+so that we don't race against parallel forks in an unbounded number
+of kill iterations.
+
+But there is a small detail to be aware of to use this feature that
+way.
+
+A typical way to proceed would be:
+
+	echo 0 > tasks.limit
+	for TASK in $(cat cgroup.procs)
+	do
+		kill -KILL $TASK
+	done
+
+However there is a small race window where a task can be in the middle
+of being forked but hasn't completed the fork far enough for the new
+PID to appear in the cgroup.procs file.
+
+The only way to get it right is to run a loop that reads tasks.usage,
+kills all the tasks in cgroup.procs, and exits the loop only if the
+value in tasks.usage was the same as the number of tasks that were in
+cgroup.procs, i.e. the number of tasks that were killed.
+
+It works because the new child appears in tasks.usage right before we check,
+in the fork path, whether the parent has a pending signal, in which case the
+fork is cancelled anyway. So relying on tasks.usage is fine and non-racy.
+
+This race window is tiny and unlikely to happen, so most of the time a single
+kill iteration should be enough. But it's worth knowing about that corner
+case spotted by Oleg Nesterov.
+
+An example of safe use would be:
+
+	echo 0 > tasks.limit
+	END=false
+
+	while [ "$END" = false ]
+	do
+		NR_TASKS=$(cat tasks.usage)
+		NR_KILLED=0
+
+		for TASK in $(cat cgroup.procs)
+		do
+			let NR_KILLED=NR_KILLED+1
+			kill -KILL $TASK
+		done
+
+		if [ "$NR_TASKS" = "$NR_KILLED" ]
+		then
+			END=true
+		fi
+	done
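+
+Once the loop exits, no task remains in the cgroup, so its directory
+can be removed, assuming it has no child cgroups (the path below is
+only an example):
+
+	rmdir /cgroups/container1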
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ac663c1..5425822 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -59,8 +59,16 @@ SUBSYS(net_cls)
 SUBSYS(blkio)
 #endif
 
+/* */
+
 #ifdef CONFIG_CGROUP_PERF
 SUBSYS(perf)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_TASK_COUNTER
+SUBSYS(tasks)
+#endif
+
+/* */
diff --git a/init/Kconfig b/init/Kconfig
index d627783..7410b05 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -690,6 +690,15 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
 	  select this option (if, for some reason, they need to disable it
 	  then swapaccount=0 does the trick).
 
+config CGROUP_TASK_COUNTER
+	bool "Control number of tasks in a cgroup"
+	depends on RESOURCE_COUNTERS
+	help
+	  Lets the user set an upper bound on the allowed number of tasks
+	  running in a cgroup. When a task forks or is migrated to a cgroup
+	  that has this subsystem bound, the limit is checked to either
+	  accept or reject the fork/migration.
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/kernel/Makefile b/kernel/Makefile
index eca595e..5598a7f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_TASK_COUNTER) += cgroup_task_counter.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_task_counter.c b/kernel/cgroup_task_counter.c
new file mode 100644
index 0000000..2374905
--- /dev/null
+++ b/kernel/cgroup_task_counter.c
@@ -0,0 +1,239 @@
+/*
+ * Limits on number of tasks subsystem for cgroups
+ *
+ * Copyright (C) 2011 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
+ *
+ * Thanks to Andrew Morton, Johannes Weiner, Li Zefan, Oleg Nesterov and
+ * Paul Menage for their suggestions.
+ *
+ */
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/res_counter.h>
+
+
+struct task_counter {
+	struct res_counter		res;
+	struct cgroup_subsys_state	css;
+};
+
+/*
+ * The root cgroup doesn't have a task counter: it is excluded from
+ * the task counting in order to optimize the trivial case where only
+ * the root cgroup exists.
+ */
+static struct cgroup_subsys_state root_css;
+
+
+static inline struct task_counter *cgroup_task_counter(struct cgroup *cgrp)
+{
+	if (!cgrp->parent)
+		return NULL;
+
+	return container_of(cgroup_subsys_state(cgrp, tasks_subsys_id),
+			    struct task_counter, css);
+}
+
+static inline struct res_counter *cgroup_task_res_counter(struct cgroup *cgrp)
+{
+	struct task_counter *cnt;
+
+	cnt = cgroup_task_counter(cgrp);
+	if (!cnt)
+		return NULL;
+
+	return &cnt->res;
+}
+
+static struct cgroup_subsys_state *
+task_counter_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct task_counter *cnt;
+	struct res_counter *parent_res;
+
+	if (!cgrp->parent)
+		return &root_css;
+
+	cnt = kzalloc(sizeof(*cnt), GFP_KERNEL);
+	if (!cnt)
+		return ERR_PTR(-ENOMEM);
+
+	parent_res = cgroup_task_res_counter(cgrp->parent);
+
+	res_counter_init(&cnt->res, parent_res);
+
+	return &cnt->css;
+}
+
+/*
+ * Inherit the limit value of the parent. This is not really meant to
+ * enforce a limit below or equal to the parent's, which can be changed
+ * concurrently anyway. It is just to honour the clone flag.
+ */
+static void task_counter_post_clone(struct cgroup_subsys *ss,
+				    struct cgroup *cgrp)
+{
+	/* cgrp can't be root, so cgroup_task_res_counter() can't return NULL */
+	res_counter_inherit(cgroup_task_res_counter(cgrp), RES_LIMIT);
+}
+
+static void task_counter_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct task_counter *cnt = cgroup_task_counter(cgrp);
+
+	kfree(cnt);
+}
+
+/* Uncharge the cgroup the task was attached to */
+static void task_counter_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			      struct cgroup *old_cgrp, struct task_struct *task)
+{
+	/* Optimize for the root cgroup case */
+	if (old_cgrp->parent)
+		res_counter_uncharge(cgroup_task_res_counter(old_cgrp), 1);
+}
+
+/*
+ * Protected across can_attach_task/attach_task/cancel_attach_task by
+ * the cgroup mutex
+ */
+static struct res_counter *common_ancestor;
+
+/*
+ * This does more than just probing the ability to attach to the dest cgroup.
+ * We can not just _check_ if we can attach to the destination and do the real
+ * attachment later in task_counter_attach_task() because a task in the dest
+ * cgroup can fork before and steal the last remaining count.
+ * Thus we need to charge the dest cgroup right now.
+ */
+static int task_counter_can_attach_task(struct cgroup *cgrp,
+					struct cgroup *old_cgrp,
+					struct task_struct *tsk)
+{
+	struct res_counter *res = cgroup_task_res_counter(cgrp);
+	struct res_counter *old_res = cgroup_task_res_counter(old_cgrp);
+	int err;
+
+	/*
+	 * When moving a task from a cgroup to another, we don't want
+	 * to charge the common ancestors, even though they will be
+	 * uncharged later from attach_task(), because during that
+	 * short window between charge and uncharge, a task could fork
+	 * in the ancestor and spuriously fail due to the temporary
+	 * charge.
+	 */
+	common_ancestor = res_counter_common_ancestor(res, old_res);
+
+	/*
+	 * If cgrp is the root then res is NULL, however in this case
+	 * the common ancestor is NULL as well, making the below a NOP.
+	 */
+	err = res_counter_charge_until(res, common_ancestor, 1, NULL);
+	if (err)
+		return -EINVAL;
+
+	return 0;
+}
+
+/* Uncharge the dest cgroup that we charged in task_counter_can_attach_task() */
+static void task_counter_cancel_attach_task(struct cgroup *cgrp,
+					    struct task_struct *tsk)
+{
+	res_counter_uncharge_until(cgroup_task_res_counter(cgrp),
+				   common_ancestor, 1);
+}
+
+/*
+ * This uncharges the old cgroup. We can do that now that we are sure
+ * the attachment can't be cancelled anymore, because this uncharge
+ * operation couldn't be reverted later: a task in the old cgroup could
+ * fork after we uncharge and reach the task counter limit, making a
+ * return there impossible.
+ */
+static void task_counter_attach_task(struct cgroup *cgrp,
+				     struct cgroup *old_cgrp,
+				     struct task_struct *tsk)
+{
+	res_counter_uncharge_until(cgroup_task_res_counter(old_cgrp),
+				   common_ancestor, 1);
+}
+
+static u64 task_counter_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	int type = cft->private;
+
+	return res_counter_read_u64(cgroup_task_res_counter(cgrp), type);
+}
+
+static int task_counter_write_u64(struct cgroup *cgrp, struct cftype *cft,
+				  u64 val)
+{
+	int type = cft->private;
+
+	res_counter_write_u64(cgroup_task_res_counter(cgrp), type, val);
+
+	return 0;
+}
+
+static struct cftype files[] = {
+	{
+		.name		= "limit",
+		.read_u64	= task_counter_read_u64,
+		.write_u64	= task_counter_write_u64,
+		.private	= RES_LIMIT,
+	},
+
+	{
+		.name		= "usage",
+		.read_u64	= task_counter_read_u64,
+		.private	= RES_USAGE,
+	},
+};
+
+static int task_counter_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	if (!cgrp->parent)
+		return 0;
+
+	return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
+}
+
+/*
+ * Charge the task counter with the new child coming, or reject it if we
+ * reached the limit.
+ */
+static int task_counter_fork(struct cgroup_subsys *ss,
+			     struct task_struct *child)
+{
+	struct cgroup_subsys_state *css;
+	struct cgroup *cgrp;
+	int err;
+
+	css = child->cgroups->subsys[tasks_subsys_id];
+	cgrp = css->cgroup;
+
+	/* Optimize for the root cgroup case, which doesn't have a limit */
+	if (!cgrp->parent)
+		return 0;
+
+	err = res_counter_charge(cgroup_task_res_counter(cgrp), 1, NULL);
+	if (err)
+		return -EAGAIN;
+
+	return 0;
+}
+
+struct cgroup_subsys tasks_subsys = {
+	.name			= "tasks",
+	.subsys_id		= tasks_subsys_id,
+	.create			= task_counter_create,
+	.post_clone		= task_counter_post_clone,
+	.destroy		= task_counter_destroy,
+	.exit			= task_counter_exit,
+	.can_attach_task	= task_counter_can_attach_task,
+	.cancel_attach_task	= task_counter_cancel_attach_task,
+	.attach_task		= task_counter_attach_task,
+	.fork			= task_counter_fork,
+	.populate		= task_counter_populate,
+};
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 01/10] cgroups: Add res_counter_write_u64() API
  2011-10-03 19:07 ` [PATCH 01/10] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
@ 2011-10-04  0:17   ` Kirill A. Shutemov
  2011-10-11 13:44     ` Frederic Weisbecker
  0 siblings, 1 reply; 44+ messages in thread
From: Kirill A. Shutemov @ 2011-10-04  0:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Mon, Oct 03, 2011 at 09:07:03PM +0200, Frederic Weisbecker wrote:
> Extend the resource counter API with a mirror of
> res_counter_read_u64() to make it handy to update a resource
> counter value from a cgroup subsystem u64 value file.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Acked-by: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Aditya Kali <adityakali@google.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: Tim Hockin <thockin@hockin.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Containers <containers@lists.linux-foundation.org>
> ---
>  include/linux/res_counter.h |    2 ++
>  kernel/res_counter.c        |   25 +++++++++++++++++++------
>  2 files changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index c9d625c..1b3fe05 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -82,6 +82,8 @@ int res_counter_memparse_write_strategy(const char *buf,
>  int res_counter_write(struct res_counter *counter, int member,
>  		      const char *buffer, write_strategy_fn write_strategy);
>  
> +void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
> +
>  /*
>   * the field descriptors. one for each member of res_counter
>   */
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 34683ef..0faafcc 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -168,12 +168,26 @@ int res_counter_memparse_write_strategy(const char *buf,
>  	return 0;
>  }
>  
> +void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
> +{
> +	unsigned long long *target;
> +	unsigned long flags;
> +
> +	/*
> +	 * We need the lock to protect against concurrent add/dec on 32 bits.
> +	 * No need to ifdef it's seldom used.
> +	 */

Should we have two versions of res_counter_write_u64() for 32/64 bit
hosts? As with res_counter_read_u64().

> +	spin_lock_irqsave(&counter->lock, flags);
> +	target = res_counter_member(counter, member);
> +	*target = val;
> +	spin_unlock_irqrestore(&counter->lock, flags);
> +}
> +
>  int res_counter_write(struct res_counter *counter, int member,
>  		      const char *buf, write_strategy_fn write_strategy)
>  {
>  	char *end;
> -	unsigned long flags;
> -	unsigned long long tmp, *val;
> +	unsigned long long tmp;
>  
>  	if (write_strategy) {
>  		if (write_strategy(buf, &tmp))
> @@ -183,9 +197,8 @@ int res_counter_write(struct res_counter *counter, int member,
>  		if (*end != '\0')
>  			return -EINVAL;
>  	}
> -	spin_lock_irqsave(&counter->lock, flags);
> -	val = res_counter_member(counter, member);
> -	*val = tmp;
> -	spin_unlock_irqrestore(&counter->lock, flags);
> +
> +	res_counter_write_u64(counter, member, tmp);
> +
>  	return 0;
>  }
> -- 
> 1.7.5.4
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 02/10] cgroups: New resource counter inheritance API
  2011-10-03 19:07 ` [PATCH 02/10] cgroups: New resource counter inheritance API Frederic Weisbecker
@ 2011-10-04  0:20   ` Kirill A. Shutemov
  0 siblings, 0 replies; 44+ messages in thread
From: Kirill A. Shutemov @ 2011-10-04  0:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Mon, Oct 03, 2011 at 09:07:04PM +0200, Frederic Weisbecker wrote:
> Provide an API to inherit a counter value from a parent.
> This can be useful to implement cgroup.clone_children on
> a resource counter.
> 
> Still the resources of the children are limited by those
> of the parent, so this is only to provide a default setting
> behaviour when clone_children is set.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>

Acked-by: Kirill A. Shutemov <kirill@shutemov.name>

> Cc: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Aditya Kali <adityakali@google.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: Tim Hockin <thockin@hockin.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Containers <containers@lists.linux-foundation.org>
> ---
>  include/linux/res_counter.h |    2 ++
>  kernel/res_counter.c        |   18 ++++++++++++++++++
>  2 files changed, 20 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index 1b3fe05..109d118 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -84,6 +84,8 @@ int res_counter_write(struct res_counter *counter, int member,
>  
>  void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
>  
> +void res_counter_inherit(struct res_counter *counter, int member);
> +
>  /*
>   * the field descriptors. one for each member of res_counter
>   */
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 0faafcc..37abf4e 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -202,3 +202,21 @@ int res_counter_write(struct res_counter *counter, int member,
>  
>  	return 0;
>  }
> +
> +/*
> + * Simple inheritance implementation to get the same value
> + * than a parent. However this doesn't enforce the child value
> + * to be always below the one of the parent. But the child is
> + * subject to its parent limitation anyway.
> + */
> +void res_counter_inherit(struct res_counter *counter, int member)
> +{
> +	struct res_counter *parent;
> +	unsigned long long val;
> +
> +	parent = counter->parent;
> +	if (parent) {
> +		val = res_counter_read_u64(parent, member);
> +		res_counter_write_u64(counter, member, val);
> +	}
> +}
> -- 
> 1.7.5.4
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 03/10] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks
  2011-10-03 19:07 ` [PATCH 03/10] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks Frederic Weisbecker
@ 2011-10-04  0:22   ` Kirill A. Shutemov
  0 siblings, 0 replies; 44+ messages in thread
From: Kirill A. Shutemov @ 2011-10-04  0:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Mon, Oct 03, 2011 at 09:07:05PM +0200, Frederic Weisbecker wrote:
> This is to prepare the integration of a new max number of proc
> cgroup subsystem. We'll need to release some resources from the
> previous cgroup.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Acked-by: Paul Menage <paul@paulmenage.org>

Acked-by: Kirill A. Shutemov <kirill@shutemov.name>

> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Aditya Kali <adityakali@google.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: Tim Hockin <thockin@hockin.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Containers <containers@lists.linux-foundation.org>
> ---
>  Documentation/cgroups/cgroups.txt |    6 ++++--
>  block/blk-cgroup.c                |   12 ++++++++----
>  include/linux/cgroup.h            |    6 ++++--
>  kernel/cgroup.c                   |   11 +++++++----
>  kernel/cgroup_freezer.c           |    3 ++-
>  kernel/cpuset.c                   |    6 ++++--
>  kernel/events/core.c              |    5 +++--
>  kernel/sched.c                    |    6 ++++--
>  8 files changed, 36 insertions(+), 19 deletions(-)
> 
> diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> index cd67e90..0621e93 100644
> --- a/Documentation/cgroups/cgroups.txt
> +++ b/Documentation/cgroups/cgroups.txt
> @@ -605,7 +605,8 @@ called on a fork. If this method returns 0 (success) then this should
>  remain valid while the caller holds cgroup_mutex and it is ensured that either
>  attach() or cancel_attach() will be called in future.
>  
> -int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
> +int can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +		    struct task_struct *tsk);
>  (cgroup_mutex held by caller)
>  
>  As can_attach, but for operations that must be run once per task to be
> @@ -635,7 +636,8 @@ void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  Called after the task has been attached to the cgroup, to allow any
>  post-attachment activity that requires memory allocations or blocking.
>  
> -void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
> +void attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +		 struct task_struct *tsk);
>  (cgroup_mutex held by caller)
>  
>  As attach, but for operations that must be run once per task to be attached,
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index bcaf16e..6eddc5f 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -30,8 +30,10 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
>  
>  static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
>  						  struct cgroup *);
> -static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
> -static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
> +static int blkiocg_can_attach_task(struct cgroup *, struct cgroup *,
> +				   struct task_struct *);
> +static void blkiocg_attach_task(struct cgroup *, struct cgroup *,
> +				struct task_struct *);
>  static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
>  static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
>  
> @@ -1614,7 +1616,8 @@ done:
>   * of the main cic data structures.  For now we allow a task to change
>   * its cgroup only if it's the only owner of its ioc.
>   */
> -static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> +static int blkiocg_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +				   struct task_struct *tsk)
>  {
>  	struct io_context *ioc;
>  	int ret = 0;
> @@ -1629,7 +1632,8 @@ static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  	return ret;
>  }
>  
> -static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> +static void blkiocg_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +				struct task_struct *tsk)
>  {
>  	struct io_context *ioc;
>  
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index da7e4bc..ed34eb8 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -468,11 +468,13 @@ struct cgroup_subsys {
>  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  			  struct task_struct *tsk);
> -	int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
> +	int (*can_attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +			       struct task_struct *tsk);
>  	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  			      struct task_struct *tsk);
>  	void (*pre_attach)(struct cgroup *cgrp);
> -	void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
> +	void (*attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +			    struct task_struct *tsk);
>  	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  		       struct cgroup *old_cgrp, struct task_struct *tsk);
>  	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 84bdace..fafebdb 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1844,7 +1844,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  			}
>  		}
>  		if (ss->can_attach_task) {
> -			retval = ss->can_attach_task(cgrp, tsk);
> +			retval = ss->can_attach_task(cgrp, oldcgrp, tsk);
>  			if (retval) {
>  				failed_ss = ss;
>  				goto out;
> @@ -1860,7 +1860,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  		if (ss->pre_attach)
>  			ss->pre_attach(cgrp);
>  		if (ss->attach_task)
> -			ss->attach_task(cgrp, tsk);
> +			ss->attach_task(cgrp, oldcgrp, tsk);
>  		if (ss->attach)
>  			ss->attach(ss, cgrp, oldcgrp, tsk);
>  	}
> @@ -2075,7 +2075,10 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
>  			/* run on each task in the threadgroup. */
>  			for (i = 0; i < group_size; i++) {
>  				tsk = flex_array_get_ptr(group, i);
> -				retval = ss->can_attach_task(cgrp, tsk);
> +				oldcgrp = task_cgroup_from_root(tsk, root);
> +
> +				retval = ss->can_attach_task(cgrp,
> +							     oldcgrp, tsk);
>  				if (retval) {
>  					failed_ss = ss;
>  					cancel_failed_ss = true;
> @@ -2141,7 +2144,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
>  			/* attach each task to each subsystem */
>  			for_each_subsys(root, ss) {
>  				if (ss->attach_task)
> -					ss->attach_task(cgrp, tsk);
> +					ss->attach_task(cgrp, oldcgrp, tsk);
>  			}
>  		} else {
>  			BUG_ON(retval != -ESRCH);
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index e691818..c1421a1 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -175,7 +175,8 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
>  	return 0;
>  }
>  
> -static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> +static int freezer_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +				   struct task_struct *tsk)
>  {
>  	rcu_read_lock();
>  	if (__cgroup_freezing_or_frozen(tsk)) {
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 10131fd..427be38 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -1390,7 +1390,8 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
>  	return 0;
>  }
>  
> -static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
> +static int cpuset_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +				  struct task_struct *task)
>  {
>  	return security_task_setscheduler(task);
>  }
> @@ -1418,7 +1419,8 @@ static void cpuset_pre_attach(struct cgroup *cont)
>  }
>  
>  /* Per-thread attachment work. */
> -static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
> +static void cpuset_attach_task(struct cgroup *cont, struct cgroup *old,
> +			       struct task_struct *tsk)
>  {
>  	int err;
>  	struct cpuset *cs = cgroup_cs(cont);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index b8785e2..509464e 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7001,7 +7001,8 @@ static int __perf_cgroup_move(void *info)
>  }
>  
>  static void
> -perf_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *task)
> +perf_cgroup_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +			struct task_struct *task)
>  {
>  	task_function_call(task, __perf_cgroup_move, task);
>  }
> @@ -7017,7 +7018,7 @@ static void perf_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  	if (!(task->flags & PF_EXITING))
>  		return;
>  
> -	perf_cgroup_attach_task(cgrp, task);
> +	perf_cgroup_attach_task(cgrp, old_cgrp, task);
>  }
>  
>  struct cgroup_subsys perf_subsys = {
> diff --git a/kernel/sched.c b/kernel/sched.c
> index ccacdbd..72ce1b1 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -8967,7 +8967,8 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
>  }
>  
>  static int
> -cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> +cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +			   struct task_struct *tsk)
>  {
>  #ifdef CONFIG_RT_GROUP_SCHED
>  	if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
> @@ -8981,7 +8982,8 @@ cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  }
>  
>  static void
> -cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> +cpu_cgroup_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +		       struct task_struct *tsk)
>  {
>  	sched_move_task(tsk);
>  }
> -- 
> 1.7.5.4
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 04/10] cgroups: New cancel_attach_task subsystem callback
  2011-10-03 19:07 ` [PATCH 04/10] cgroups: New cancel_attach_task subsystem callback Frederic Weisbecker
@ 2011-10-04  0:27   ` Kirill A. Shutemov
  0 siblings, 0 replies; 44+ messages in thread
From: Kirill A. Shutemov @ 2011-10-04  0:27 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Mon, Oct 03, 2011 at 09:07:06PM +0200, Frederic Weisbecker wrote:
> To cancel a process attachment on a subsystem, we only call the
> cancel_attach() callback once on the leader but we have no
> way to cancel the attachment individually for each member of
> the process group.
> 
> This is going to be needed for the max number of tasks susbystem
> that is coming.
> 
> To prepare for this integration, call a new cancel_attach_task()
> callback on each task of the group until we reach the member that
> failed to attach.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Acked-by: Paul Menage <paul@paulmenage.org>

Acked-by: Kirill A. Shutemov <kirill@shutemov.name>

> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Aditya Kali <adityakali@google.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: Tim Hockin <thockin@hockin.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Containers <containers@lists.linux-foundation.org>
> ---
>  Documentation/cgroups/cgroups.txt |    7 +++++++
>  include/linux/cgroup.h            |    2 ++
>  kernel/cgroup.c                   |   24 ++++++++++++++++++++----
>  3 files changed, 29 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> index 0621e93..3bc1dc9 100644
> --- a/Documentation/cgroups/cgroups.txt
> +++ b/Documentation/cgroups/cgroups.txt
> @@ -623,6 +623,13 @@ function, so that the subsystem can implement a rollback. If not, not necessary.
>  This will be called only about subsystems whose can_attach() operation have
>  succeeded.
>  
> +void cancel_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> +(cgroup_mutex held by caller)
> +
> +As cancel_attach, but for operations that must be cancelled once per
> +task that wanted to be attached. This typically revert the effect of
> +can_attach_task().
> +
>  void pre_attach(struct cgroup *cgrp);
>  (cgroup_mutex held by caller)
>  
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index ed34eb8..b62cf5e 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -472,6 +472,8 @@ struct cgroup_subsys {
>  			       struct task_struct *tsk);
>  	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  			      struct task_struct *tsk);
> +	void (*cancel_attach_task)(struct cgroup *cgrp,
> +				   struct task_struct *tsk);
>  	void (*pre_attach)(struct cgroup *cgrp);
>  	void (*attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp,
>  			    struct task_struct *tsk);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index fafebdb..709baef 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1883,6 +1883,9 @@ out:
>  				 * remaining subsystems.
>  				 */
>  				break;
> +
> +			if (ss->cancel_attach_task)
> +				ss->cancel_attach_task(cgrp, tsk);
>  			if (ss->cancel_attach)
>  				ss->cancel_attach(ss, cgrp, tsk);
>  		}
> @@ -1992,7 +1995,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
>  {
>  	int retval, i, group_size;
>  	struct cgroup_subsys *ss, *failed_ss = NULL;
> -	bool cancel_failed_ss = false;
> +	struct task_struct *failed_task = NULL;
>  	/* guaranteed to be initialized later, but the compiler needs this */
>  	struct cgroup *oldcgrp = NULL;
>  	struct css_set *oldcg;
> @@ -2081,7 +2084,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
>  							     oldcgrp, tsk);
>  				if (retval) {
>  					failed_ss = ss;
> -					cancel_failed_ss = true;
> +					failed_task = tsk;
>  					goto out_cancel_attach;
>  				}
>  			}
> @@ -2146,8 +2149,11 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
>  				if (ss->attach_task)
>  					ss->attach_task(cgrp, oldcgrp, tsk);
>  			}
> +		} else if (retval == -ESRCH) {
> +			if (ss->cancel_attach_task)
> +				ss->cancel_attach_task(cgrp, tsk);
>  		} else {
> -			BUG_ON(retval != -ESRCH);
> +			BUG_ON(1);
>  		}
>  	}
>  	/* nothing is sensitive to fork() after this point. */
> @@ -2179,8 +2185,18 @@ out_cancel_attach:
>  	/* same deal as in cgroup_attach_task */
>  	if (retval) {
>  		for_each_subsys(root, ss) {
> +			if (ss->cancel_attach_task && (ss != failed_ss ||
> +						       failed_task)) {
> +				for (i = 0; i < group_size; i++) {
> +					tsk = flex_array_get_ptr(group, i);
> +					if (tsk == failed_task)
> +						break;
> +					ss->cancel_attach_task(cgrp, tsk);
> +				}
> +			}
> +
>  			if (ss == failed_ss) {
> -				if (cancel_failed_ss && ss->cancel_attach)
> +				if (failed_task && ss->cancel_attach)
>  					ss->cancel_attach(ss, cgrp, leader);
>  				break;
>  			}
> -- 
> 1.7.5.4
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 05/10] cgroups: Ability to stop res charge propagation on bounded ancestor
  2011-10-03 19:07 ` [PATCH 05/10] cgroups: Ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
@ 2011-10-04  0:41   ` Kirill A. Shutemov
  0 siblings, 0 replies; 44+ messages in thread
From: Kirill A. Shutemov @ 2011-10-04  0:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Mon, Oct 03, 2011 at 09:07:07PM +0200, Frederic Weisbecker wrote:
> Moving a task from a cgroup to another may require to substract
> its resource charge from the old cgroup and add it to the new one.
> 
> For this to happen, the uncharge/charge propagation can just stop
> when we reach the common ancestor for the two cgroups. Further
> the performance reasons, we also want to avoid to temporarily
> overload the common ancestors with a non-accurate resource
> counter usage if we charge first the new cgroup and uncharge the
> old one thereafter. This is going to be a requirement for the coming
> max number of task subsystem.
> 
> To solve this, provide a pair of new API that can charge/uncharge
> a resource counter until we reach a given ancestor.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Acked-by: Paul Menage <paul@paulmenage.org>

Acked-by: Kirill A. Shutemov <kirill@shutemov.name>

> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Aditya Kali <adityakali@google.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: Tim Hockin <thockin@hockin.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Containers <containers@lists.linux-foundation.org>
> ---
>  Documentation/cgroups/resource_counter.txt |   18 +++++++++++++++++-
>  include/linux/res_counter.h                |   20 +++++++++++++++++---
>  kernel/res_counter.c                       |   13 ++++++++-----
>  3 files changed, 42 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
> index 95b24d7..a2cd05b 100644
> --- a/Documentation/cgroups/resource_counter.txt
> +++ b/Documentation/cgroups/resource_counter.txt
> @@ -83,7 +83,15 @@ to work with it.
>  	res_counter->lock internally (it must be called with res_counter->lock
>  	held).
>  
> - e. void res_counter_uncharge[_locked]
> + e. int res_counter_charge_until(struct res_counter *counter,
> +			     struct res_counter *limit, unsigned long val,
> +			     struct res_counter **limit_fail_at)
> +
> +	The same as res_counter_charge(), but the charge propagation to
> +	the hierarchy stops at the limit given in the "limit" parameter.
> +
> +
> + f. void res_counter_uncharge[_locked]
>  			(struct res_counter *rc, unsigned long val)
>  
>  	When a resource is released (freed) it should be de-accounted
> @@ -92,6 +100,14 @@ to work with it.
>  
>  	The _locked routines imply that the res_counter->lock is taken.
>  
> +
> + g. void res_counter_uncharge_until(struct res_counter *counter,
> +				struct res_counter *limit,
> +				unsigned long val)
> +
> +	The same as res_counter_charge, but the uncharge propagation to
> +	the hierarchy stops at the limit given in the "limit" parameter.
> +
>   2.1 Other accounting routines
>  
>      There are more routines that may help you with common needs, like
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index 109d118..de4ba29 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -117,8 +117,16 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
>  
>  int __must_check res_counter_charge_locked(struct res_counter *counter,
>  		unsigned long val);
> -int __must_check res_counter_charge(struct res_counter *counter,
> -		unsigned long val, struct res_counter **limit_fail_at);
> +int __must_check res_counter_charge_until(struct res_counter *counter,
> +					  struct res_counter *limit,
> +					  unsigned long val,
> +					  struct res_counter **limit_fail_at);
> +static inline int __must_check
> +res_counter_charge(struct res_counter *counter, unsigned long val,
> +		   struct res_counter **limit_fail_at)
> +{
> +	return res_counter_charge_until(counter, NULL, val, limit_fail_at);
> +}
>  
>  /*
>   * uncharge - tell that some portion of the resource is released
> @@ -131,7 +139,13 @@ int __must_check res_counter_charge(struct res_counter *counter,
>   */
>  
>  void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
> -void res_counter_uncharge(struct res_counter *counter, unsigned long val);
> +void res_counter_uncharge_until(struct res_counter *counter,
> +				struct res_counter *limit, unsigned long val);
> +static inline void res_counter_uncharge(struct res_counter *counter,
> +					unsigned long val)
> +{
> +	res_counter_uncharge_until(counter, NULL, val);
> +}
>  
>  /**
>   * res_counter_margin - calculate chargeable space of a counter
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 37abf4e..0cc9c7e 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -35,8 +35,9 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
>  	return 0;
>  }
>  
> -int res_counter_charge(struct res_counter *counter, unsigned long val,
> -			struct res_counter **limit_fail_at)
> +int res_counter_charge_until(struct res_counter *counter,
> +			     struct res_counter *limit, unsigned long val,
> +			     struct res_counter **limit_fail_at)
>  {
>  	int ret;
>  	unsigned long flags;
> @@ -44,7 +45,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
>  
>  	*limit_fail_at = NULL;
>  	local_irq_save(flags);
> -	for (c = counter; c != NULL; c = c->parent) {
> +	for (c = counter; c != limit; c = c->parent) {
>  		spin_lock(&c->lock);
>  		ret = res_counter_charge_locked(c, val);
>  		spin_unlock(&c->lock);
> @@ -74,13 +75,15 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
>  	counter->usage -= val;
>  }
>  
> -void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> +void res_counter_uncharge_until(struct res_counter *counter,
> +				struct res_counter *limit,
> +				unsigned long val)
>  {
>  	unsigned long flags;
>  	struct res_counter *c;
>  
>  	local_irq_save(flags);
> -	for (c = counter; c != NULL; c = c->parent) {
> +	for (c = counter; c != limit; c = c->parent) {
>  		spin_lock(&c->lock);
>  		res_counter_uncharge_locked(c, val);
>  		spin_unlock(&c->lock);
> -- 
> 1.7.5.4
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/10] res_counter: Allow charge failure pointer to be null
  2011-10-03 19:07 ` [PATCH 07/10] res_counter: Allow charge failure pointer to be null Frederic Weisbecker
@ 2011-10-04  1:30   ` Kirill A. Shutemov
  0 siblings, 0 replies; 44+ messages in thread
From: Kirill A. Shutemov @ 2011-10-04  1:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Mon, Oct 03, 2011 at 09:07:09PM +0200, Frederic Weisbecker wrote:
> So that callers of res_counter_charge() don't have to create and
> pass this pointer even if they aren't interested in it.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>

Acked-by: Kirill A. Shutemov <kirill@shutemov.name>

> Cc: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Aditya Kali <adityakali@google.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: Tim Hockin <thockin@hockin.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Containers <containers@lists.linux-foundation.org>
> ---
>  kernel/res_counter.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 8f4c728..6b36823 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -43,14 +43,16 @@ int res_counter_charge_until(struct res_counter *counter,
>  	unsigned long flags;
>  	struct res_counter *c, *u;
>  
> -	*limit_fail_at = NULL;
> +	if (limit_fail_at)
> +		*limit_fail_at = NULL;
>  	local_irq_save(flags);
>  	for (c = counter; c != limit; c = c->parent) {
>  		spin_lock(&c->lock);
>  		ret = res_counter_charge_locked(c, val);
>  		spin_unlock(&c->lock);
>  		if (ret < 0) {
> -			*limit_fail_at = c;
> +			if (limit_fail_at)
> +				*limit_fail_at = c;
>  			goto undo;
>  		}
>  	}
> -- 
> 1.7.5.4
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 08/10] cgroups: Pull up res counter charge failure interpretation to caller
  2011-10-03 19:07 ` [PATCH 08/10] cgroups: Pull up res counter charge failure interpretation to caller Frederic Weisbecker
@ 2011-10-04  1:32   ` Kirill A. Shutemov
  0 siblings, 0 replies; 44+ messages in thread
From: Kirill A. Shutemov @ 2011-10-04  1:32 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Mon, Oct 03, 2011 at 09:07:10PM +0200, Frederic Weisbecker wrote:
> res_counter_charge() always returns -ENOMEM when the limit is reached
> and the charge thus can't happen.
> 
> However it's up to the caller to interpret this failure and return
> the appropriate error value. The task counter subsystem will need
> to report the user that a fork() has been cancelled because of some
> limit reached, not because we are too short on memory.
> 
> Fix this by returning -1 when res_counter_charge() fails.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>

Acked-by: Kirill A. Shutemov <kirill@shutemov.name>

> Cc: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Aditya Kali <adityakali@google.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: Tim Hockin <thockin@hockin.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Containers <containers@lists.linux-foundation.org>
> ---
>  Documentation/cgroups/resource_counter.txt |    2 ++
>  kernel/res_counter.c                       |    2 +-
>  2 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
> index a2cd05b..24ec61c 100644
> --- a/Documentation/cgroups/resource_counter.txt
> +++ b/Documentation/cgroups/resource_counter.txt
> @@ -76,6 +76,8 @@ to work with it.
>  	limit_fail_at parameter is set to the particular res_counter element
>  	where the charging failed.
>  
> +	It returns 0 on success and -1 on failure.
> +
>   d. int res_counter_charge_locked
>  			(struct res_counter *rc, unsigned long val)
>  
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 6b36823..b814d6c 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -26,7 +26,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
>  {
>  	if (counter->usage + val > counter->limit) {
>  		counter->failcnt++;
> -		return -ENOMEM;
> +		return -1;
>  	}
>  
>  	counter->usage += val;
> -- 
> 1.7.5.4
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 09/10] cgroups: Allow subsystems to cancel a fork
  2011-10-03 19:07 ` [PATCH 09/10] cgroups: Allow subsystems to cancel a fork Frederic Weisbecker
@ 2011-10-04  1:38   ` Kirill A. Shutemov
  0 siblings, 0 replies; 44+ messages in thread
From: Kirill A. Shutemov @ 2011-10-04  1:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Mon, Oct 03, 2011 at 09:07:11PM +0200, Frederic Weisbecker wrote:
> Let the subsystem's fork callback return an error value so
> that they can cancel a fork. This is going to be used by
> the task counter subsystem to implement the limit.
> 
> Suggested-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>

Acked-by: Kirill A. Shutemov <kirill@shutemov.name>

> Cc: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Aditya Kali <adityakali@google.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: Tim Hockin <thockin@hockin.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Containers <containers@lists.linux-foundation.org>
> ---
>  include/linux/cgroup.h  |   20 ++++++++++++++------
>  kernel/cgroup.c         |   23 +++++++++++++++++++----
>  kernel/cgroup_freezer.c |    6 ++++--
>  kernel/exit.c           |    2 +-
>  kernel/fork.c           |    7 +++++--
>  5 files changed, 43 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index b62cf5e..9c8151e 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -17,10 +17,11 @@
>  #include <linux/rwsem.h>
>  #include <linux/idr.h>
>  
> +struct cgroup_subsys;
> +
>  #ifdef CONFIG_CGROUPS
>  
>  struct cgroupfs_root;
> -struct cgroup_subsys;
>  struct inode;
>  struct cgroup;
>  struct css_id;
> @@ -32,9 +33,11 @@ extern int cgroup_lock_is_held(void);
>  extern bool cgroup_lock_live_group(struct cgroup *cgrp);
>  extern void cgroup_unlock(void);
>  extern void cgroup_fork(struct task_struct *p);
> -extern void cgroup_fork_callbacks(struct task_struct *p);
> +extern int cgroup_fork_callbacks(struct task_struct *p,
> +				 struct cgroup_subsys **failed_ss);
>  extern void cgroup_post_fork(struct task_struct *p);
> -extern void cgroup_exit(struct task_struct *p, int run_callbacks);
> +extern void cgroup_exit(struct task_struct *p, int run_callbacks,
> +			struct cgroup_subsys *failed_ss);
>  extern int cgroupstats_build(struct cgroupstats *stats,
>  				struct dentry *dentry);
>  extern int cgroup_load_subsys(struct cgroup_subsys *ss);
> @@ -479,7 +482,7 @@ struct cgroup_subsys {
>  			    struct task_struct *tsk);
>  	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  		       struct cgroup *old_cgrp, struct task_struct *tsk);
> -	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
> +	int (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
>  	void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  			struct cgroup *old_cgrp, struct task_struct *task);
>  	int (*populate)(struct cgroup_subsys *ss,
> @@ -636,9 +639,14 @@ struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
>  static inline int cgroup_init_early(void) { return 0; }
>  static inline int cgroup_init(void) { return 0; }
>  static inline void cgroup_fork(struct task_struct *p) {}
> -static inline void cgroup_fork_callbacks(struct task_struct *p) {}
> +static inline int cgroup_fork_callbacks(struct task_struct *p,
> +					struct cgroup_subsys **failed_ss)
> +{
> +	return 0;
> +}
>  static inline void cgroup_post_fork(struct task_struct *p) {}
> -static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
> +static inline void cgroup_exit(struct task_struct *p, int callbacks,
> +			       struct cgroup_subsys *failed_ss) {}
>  
>  static inline void cgroup_lock(void) {}
>  static inline void cgroup_unlock(void) {}
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 709baef..018df9d 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -4540,8 +4540,11 @@ void cgroup_fork(struct task_struct *child)
>   * tasklist. No need to take any locks since no-one can
>   * be operating on this task.
>   */
> -void cgroup_fork_callbacks(struct task_struct *child)
> +int cgroup_fork_callbacks(struct task_struct *child,
> +			  struct cgroup_subsys **failed_ss)
>  {
> +	int err;
> +
>  	if (need_forkexit_callback) {
>  		int i;
>  		/*
> @@ -4551,10 +4554,17 @@ void cgroup_fork_callbacks(struct task_struct *child)
>  		 */
>  		for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
>  			struct cgroup_subsys *ss = subsys[i];
> -			if (ss->fork)
> -				ss->fork(ss, child);
> +			if (ss->fork) {
> +				err = ss->fork(ss, child);
> +				if (err) {
> +					*failed_ss = ss;
> +					return err;
> +				}
> +			}
>  		}
>  	}
> +
> +	return 0;
>  }
>  
>  /**
> @@ -4612,7 +4622,8 @@ void cgroup_post_fork(struct task_struct *child)
>   *    which wards off any cgroup_attach_task() attempts, or task is a failed
>   *    fork, never visible to cgroup_attach_task.
>   */
> -void cgroup_exit(struct task_struct *tsk, int run_callbacks)
> +void cgroup_exit(struct task_struct *tsk, int run_callbacks,
> +		 struct cgroup_subsys *failed_ss)
>  {
>  	struct css_set *cg;
>  	int i;
> @@ -4641,6 +4652,10 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
>  		 */
>  		for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
>  			struct cgroup_subsys *ss = subsys[i];
> +
> +			if (ss == failed_ss)
> +				break;
> +
>  			if (ss->exit) {
>  				struct cgroup *old_cgrp =
>  					rcu_dereference_raw(cg->subsys[i])->cgroup;
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index c1421a1..30a4332 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -187,7 +187,7 @@ static int freezer_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
>  	return 0;
>  }
>  
> -static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
> +static int freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
>  {
>  	struct freezer *freezer;
>  
> @@ -207,7 +207,7 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
>  	 * following check.
>  	 */
>  	if (!freezer->css.cgroup->parent)
> -		return;
> +		return 0;
>  
>  	spin_lock_irq(&freezer->lock);
>  	BUG_ON(freezer->state == CGROUP_FROZEN);
> @@ -216,6 +216,8 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
>  	if (freezer->state == CGROUP_FREEZING)
>  		freeze_task(task, true);
>  	spin_unlock_irq(&freezer->lock);
> +
> +	return 0;
>  }
>  
>  /*
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 2913b35..abab32b 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -994,7 +994,7 @@ NORET_TYPE void do_exit(long code)
>  	 */
>  	perf_event_exit_task(tsk);
>  
> -	cgroup_exit(tsk, 1);
> +	cgroup_exit(tsk, 1, NULL);
>  
>  	if (group_dead)
>  		disassociate_ctty(1);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 8e6b6f4..ee8abdb 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1058,6 +1058,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	int retval;
>  	struct task_struct *p;
>  	int cgroup_callbacks_done = 0;
> +	struct cgroup_subsys *cgroup_failed_ss = NULL;
>  
>  	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
>  		return ERR_PTR(-EINVAL);
> @@ -1312,8 +1313,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	/* Now that the task is set up, run cgroup callbacks if
>  	 * necessary. We need to run them before the task is visible
>  	 * on the tasklist. */
> -	cgroup_fork_callbacks(p);
> +	retval = cgroup_fork_callbacks(p, &cgroup_failed_ss);
>  	cgroup_callbacks_done = 1;
> +	if (retval)
> +		goto bad_fork_free_pid;
>  
>  	/* Need tasklist lock for parent etc handling! */
>  	write_lock_irq(&tasklist_lock);
> @@ -1419,7 +1422,7 @@ bad_fork_cleanup_cgroup:
>  #endif
>  	if (clone_flags & CLONE_THREAD)
>  		threadgroup_fork_read_unlock(current);
> -	cgroup_exit(p, cgroup_callbacks_done);
> +	cgroup_exit(p, cgroup_callbacks_done, cgroup_failed_ss);
>  	delayacct_tsk_free(p);
>  	module_put(task_thread_info(p)->exec_domain->module);
>  bad_fork_cleanup_count:
> -- 
> 1.7.5.4
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-10-03 19:07 [PATCH 00/10] cgroups: Task counter subsystem v6 Frederic Weisbecker
                   ` (9 preceding siblings ...)
  2011-10-03 19:07 ` [PATCH 10/10] cgroups: Add a task counter subsystem Frederic Weisbecker
@ 2011-10-04 22:01 ` Andrew Morton
  2011-10-11 13:40   ` Frederic Weisbecker
  2011-10-25 20:06   ` Tim Hockin
  2011-10-06  6:51 ` Li Zefan
  2011-12-13 15:58 ` Tejun Heo
  12 siblings, 2 replies; 44+ messages in thread
From: Andrew Morton @ 2011-10-04 22:01 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Kirill A. Shutemov, Containers, Andrew Morton

On Mon,  3 Oct 2011 21:07:02 +0200
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> Hi Andrew,
> 
> This contains minor changes, mostly documentation and changelog
> updates, off-case build fix, and a code optimization in
> res_counter_common_ancestor().

I'd normally duck a patch series like this when we're at -rc8 and ask
for it to be resent late in -rc1.  But I was feeling frisky so I
grabbed this lot for a bit of testing and will sit on it until -rc1.

I'm still not convinced that the kernel has a burning need for a "task
counter subsystem".  Someone convince me that we should merge this!

> It's hard to put some statistic numbers while testing this feature
> given that the result is rather binary: we launch a forkbomb and
> either we stop and kill it or the system become unresponsive.
>     
> Meanwhile, one can find a testsuite at this address:
> https://tglx.de/~fweisbec/task_counter_test.tar.gz

I do think that we should merge tests like this into the main tree.  So
I can do "cd tests ; make ; ./run-tests".  The first step is for some hero
to propose the (simple!) framework and to drop a first test in there.

> It performs several checks to ensure the interface and the behaviour
> are reliable after common events like moving tasks around over cgroups
> in a hierarchy, forking inside, etc.. It also launches a forkbomb,
> tries to stop and kill it. So beware, don't run it on a system that
> is doing serious things.

Good stuff, that.  Then, when people propose additions or fix bugs, I can
whine at them for not updating the test suite.

> Ensure you have CGROUP_TASK_COUNTER set
> before, or it may compress the Ten Plagues in your MBR and
> inflate the whole after your next reboot.

That problem would need to be fixed.  Either probe for the feature
up-front, or don't build the test at all if CONFIG_CGROUP_TASK_COUNTER=n.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-10-03 19:07 [PATCH 00/10] cgroups: Task counter subsystem v6 Frederic Weisbecker
                   ` (10 preceding siblings ...)
  2011-10-04 22:01 ` [PATCH 00/10] cgroups: Task counter subsystem v6 Andrew Morton
@ 2011-10-06  6:51 ` Li Zefan
  2011-10-11 13:41   ` Frederic Weisbecker
  2011-12-13 15:58 ` Tejun Heo
  12 siblings, 1 reply; 44+ messages in thread
From: Li Zefan @ 2011-10-06  6:51 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Kirill A. Shutemov, Containers

> Frederic Weisbecker (9):
>   cgroups: Add res_counter_write_u64() API
>   cgroups: New resource counter inheritance API
>   cgroups: Add previous cgroup in can_attach_task/attach_task callbacks
>   cgroups: New cancel_attach_task subsystem callback
>   cgroups: Ability to stop res charge propagation on bounded ancestor
>   res_counter: Allow charge failure pointer to be null
>   cgroups: Pull up res counter charge failure interpretation to caller
>   cgroups: Allow subsystems to cancel a fork
>   cgroups: Add a task counter subsystem
> 
> Kirill A. Shutemov (1):
>   cgroups: Add res counter common ancestor searching
> 
>  Documentation/cgroups/cgroups.txt          |   13 ++-
>  Documentation/cgroups/resource_counter.txt |   20 +++-
>  Documentation/cgroups/task_counter.txt     |  153 ++++++++++++++++++
>  block/blk-cgroup.c                         |   12 +-
>  include/linux/cgroup.h                     |   28 +++-
>  include/linux/cgroup_subsys.h              |    8 +
>  include/linux/res_counter.h                |   27 +++-
>  init/Kconfig                               |    9 +
>  kernel/Makefile                            |    1 +
>  kernel/cgroup.c                            |   58 ++++++--
>  kernel/cgroup_freezer.c                    |    9 +-
>  kernel/cgroup_task_counter.c               |  239 ++++++++++++++++++++++++++++
>  kernel/cpuset.c                            |    6 +-
>  kernel/events/core.c                       |    5 +-
>  kernel/exit.c                              |    2 +-
>  kernel/fork.c                              |    7 +-
>  kernel/res_counter.c                       |   97 ++++++++++--
>  kernel/sched.c                             |    6 +-
>  18 files changed, 644 insertions(+), 56 deletions(-)
>  create mode 100644 Documentation/cgroups/task_counter.txt
>  create mode 100644 kernel/cgroup_task_counter.c
> 

Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 10/10] cgroups: Add a task counter subsystem
  2011-10-03 19:07 ` [PATCH 10/10] cgroups: Add a task counter subsystem Frederic Weisbecker
@ 2011-10-06  9:23   ` Kirill A. Shutemov
  2011-10-11 13:41     ` Frederic Weisbecker
  0 siblings, 1 reply; 44+ messages in thread
From: Kirill A. Shutemov @ 2011-10-06  9:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Mon, Oct 03, 2011 at 09:07:12PM +0200, Frederic Weisbecker wrote:
> Add a new subsystem to limit the number of running tasks,
> similar to the NR_PROC rlimit but in the scope of a cgroup.
> 
> The user can set an upper bound limit that is checked every
> time a task forks in a cgroup or is moved into a cgroup
> with that subsystem bound.
> 
> The primary goal is to protect against forkbombs that explode
> inside a container. The traditional NR_PROC rlimit is not
> effective in that case because if we run containers in parallel
> under the same user, one of them could starve all the others
> by spawning a high number of tasks close to the user-wide limit.
> 
> This is a protection against forkbombs, so it's not meant to
> cure the effects of a forkbomb once the system is already in a
> state where it's not responsive. It's aimed at preventing the
> system from ever reaching that state and at stopping the spread
> of tasks early. When defining the limit on the allowed number
> of tasks, it's up to the user to find the right balance between
> the resources its containers may need and what it can afford to
> provide.
> 
> As it's totally dissociated from the rlimit NR_PROC, both
> can be complementary: the cgroup task counter can set an upper
> bound per container and the rlimit can be an upper bound on the
> overall set of containers.
> 
> Also, this subsystem can be used to kill all the tasks in a cgroup
> without racing against concurrent forks: by setting the limit of
> tasks to 0, any further fork is rejected. This is a good way to
> kill a forkbomb in a container, or simply to kill any container,
> without the need to retry an unbounded number of times.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>

Acked-by: Kirill A. Shutemov <kirill@shutemov.name>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-10-04 22:01 ` [PATCH 00/10] cgroups: Task counter subsystem v6 Andrew Morton
@ 2011-10-11 13:40   ` Frederic Weisbecker
  2011-10-25 20:06   ` Tim Hockin
  1 sibling, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-10-11 13:40 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Rik van Riel, Daniel J. Walsh,
	Daniel P. Berrange
  Cc: LKML, Paul Menage, Li Zefan, Aditya Kali, Oleg Nesterov,
	Kay Sievers, Tim Hockin, Tejun Heo, Kirill A. Shutemov,
	Containers, Andrew Morton

On Tue, Oct 04, 2011 at 03:01:11PM -0700, Andrew Morton wrote:
> On Mon,  3 Oct 2011 21:07:02 +0200
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > Hi Andrew,
> > 
> > This contains minor changes, mostly documentation and changelog
> > updates, off-case build fix, and a code optimization in
> > res_counter_common_ancestor().
> 
> I'd normally duck a patch series like this when we're at -rc8 and ask
> for it to be resent late in -rc1.  But I was feeling frisky so I
> grabbed this lot for a bit of testing and will sit on it until -rc1.
> 
> I'm still not convinced that the kernel has a burning need for a "task
> counter subsystem".  Someone convince me that we should merge this!

(Adding more people to Cc with whom I discussed this and who have
good insights into the issues and the needs.)

In practice we need it for LXC to secure containers. Since you wrote
me that email last week I've tried to think more about another
solution to protect containers against forkbombs by reusing an
existing feature instead of pushing a new subsystem and ABI.

So I've been thinking about using user namespaces. The idea is
to create the container with a process forked with CLONE_NEWUSER
such that its NR_PROC rlimit only applies to it and its children.
This way we can have our per-container limitation, given that we
have one namespace per container.
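
To illustrate, a rough sketch of that idea (of the plain
namespace + rlimit approach, not of this patchset; the limit value
is arbitrary and error handling is minimal):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int container_init(void *arg)
{
	/* The new user namespace gets its own uid accounting, so this
	 * RLIMIT_NPROC cap would only constrain the container's tasks. */
	struct rlimit rl = { .rlim_cur = 1024, .rlim_max = 1024 };

	setrlimit(RLIMIT_NPROC, &rl);
	/* ... exec the container workload here ... */
	return 0;
}

int main(void)
{
	static char stack[64 * 1024];
	pid_t pid;

	/* The stack grows down on most architectures, so pass its top. */
	pid = clone(container_init, stack + sizeof(stack),
		    CLONE_NEWUSER | SIGCHLD, NULL);
	if (pid < 0) {
		perror("clone");
		return 1;
	}
	waitpid(pid, NULL, 0);
	return 0;
}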

But after discussing this with other people, it turns out this
doesn't work anymore as soon as we want to contain privileged
processes or multi-user applications. Privileged processes can
spawn new users at will, and with multiple users we can no longer
have a global limit over the container.

As far as I have explored the issue, discussing it with the LXC
guys, having that limit per cgroup is the only thing that seems
to work in every case.

But if somebody finds another way to solve that with existing
features or something simpler, I'll be happy to drop this
patchset.

> 
> > It's hard to put some statistic numbers while testing this feature
> > given that the result is rather binary: we launch a forkbomb and
> > either we stop and kill it or the system become unresponsive.
> >     
> > Meanwhile, one can find a testsuite at this address:
> > https://tglx.de/~fweisbec/task_counter_test.tar.gz
> 
> I do think that we should merge tests like this into the main tree.  So
> I can do "cd tests ; make ; ./run-tests".  The first step is for some hero
> to propose the (simple!) framework and to drop a first test in there.

I can do that. Some general tools/test/ directory that can host this one
and more.

> > It performs several checks to ensure the interface and the behaviour
> > are reliable after common events like moving tasks around over cgroups
> > in a hierarchy, forking inside, etc.. It also launches a forkbomb,
> > tries to stop and kill it. So beware, don't run it on a system that
> > is doing serious things.
> 
> Good stuff, that.  Then, when people propose additions or fix bugs, I can
> whine at them for not updating the test suite.
> 
> > Ensure you have CGROUP_TASK_COUNTER set
> > before, or it may compress the Ten Plagues in your MBR and
> > inflate the whole after your next reboot.
> 
> That problem would need to be fixed.  Either probe for the feature
> up-front, or don't build the test at all if CONFIG_CGROUP_TASK_COUNTER=n.
> 

Agreed. The simplest approach is to try to mount this subsystem and
just skip the test if we can't.
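
Something like this, say (a sketch: the "tasks" mount option name is
assumed, and the mount point is supposed to exist):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Probe for the task counter subsystem: if the mount fails,
	 * it's most likely not compiled in, so skip rather than fail. */
	if (mount("none", "/sys/fs/cgroup/tasks", "cgroup", 0, "tasks")) {
		puts("task counter subsystem unavailable, skipping test");
		return 0;
	}

	/* ... run the actual checks here ... */

	umount("/sys/fs/cgroup/tasks");
	return 0;
}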

Will fix.

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 10/10] cgroups: Add a task counter subsystem
  2011-10-06  9:23   ` Kirill A. Shutemov
@ 2011-10-11 13:41     ` Frederic Weisbecker
  0 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-10-11 13:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Thu, Oct 06, 2011 at 12:23:48PM +0300, Kirill A. Shutemov wrote:
> On Mon, Oct 03, 2011 at 09:07:12PM +0200, Frederic Weisbecker wrote:
> > Add a new subsystem to limit the number of running tasks,
> > similar to the NR_PROC rlimit but in the scope of a cgroup.
> > 
> > The user can set an upper bound limit that is checked every
> > time a task forks in a cgroup or is moved into a cgroup
> > with that subsystem bound.
> > 
> > The primary goal is to protect against forkbombs that explode
> > inside a container. The traditional NR_PROC rlimit is not
> > effective in that case because if we run containers in parallel
> > under the same user, one of them could starve all the others
> > by spawning a high number of tasks close to the user-wide limit.
> > 
> > This is a protection against forkbombs, so it's not meant to
> > cure the effects of a forkbomb once the system is already in a
> > state where it's not responsive. It's aimed at preventing the
> > system from ever reaching that state and at stopping the spread
> > of tasks early. When defining the limit on the allowed number
> > of tasks, it's up to the user to find the right balance between
> > the resources its containers may need and what it can afford to
> > provide.
> > 
> > As it's totally dissociated from the rlimit NR_PROC, both
> > can be complementary: the cgroup task counter can set an upper
> > bound per container and the rlimit can be an upper bound on the
> > overall set of containers.
> > 
> > Also, this subsystem can be used to kill all the tasks in a cgroup
> > without racing against concurrent forks: by setting the limit of
> > tasks to 0, any further fork is rejected. This is a good way to
> > kill a forkbomb in a container, or simply to kill any container,
> > without the need to retry an unbounded number of times.
> > 
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> 
> Acked-by: Kirill A. Shutemov <kirill@shutemov.name>

Thanks a lot for your acks Kirill!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-10-06  6:51 ` Li Zefan
@ 2011-10-11 13:41   ` Frederic Weisbecker
  0 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-10-11 13:41 UTC (permalink / raw)
  To: Li Zefan
  Cc: Andrew Morton, LKML, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Kirill A. Shutemov, Containers

On Thu, Oct 06, 2011 at 02:51:55PM +0800, Li Zefan wrote:
> > Frederic Weisbecker (9):
> >   cgroups: Add res_counter_write_u64() API
> >   cgroups: New resource counter inheritance API
> >   cgroups: Add previous cgroup in can_attach_task/attach_task callbacks
> >   cgroups: New cancel_attach_task subsystem callback
> >   cgroups: Ability to stop res charge propagation on bounded ancestor
> >   res_counter: Allow charge failure pointer to be null
> >   cgroups: Pull up res counter charge failure interpretation to caller
> >   cgroups: Allow subsystems to cancel a fork
> >   cgroups: Add a task counter subsystem
> > 
> > Kirill A. Shutemov (1):
> >   cgroups: Add res counter common ancestor searching
> > 
> >  Documentation/cgroups/cgroups.txt          |   13 ++-
> >  Documentation/cgroups/resource_counter.txt |   20 +++-
> >  Documentation/cgroups/task_counter.txt     |  153 ++++++++++++++++++
> >  block/blk-cgroup.c                         |   12 +-
> >  include/linux/cgroup.h                     |   28 +++-
> >  include/linux/cgroup_subsys.h              |    8 +
> >  include/linux/res_counter.h                |   27 +++-
> >  init/Kconfig                               |    9 +
> >  kernel/Makefile                            |    1 +
> >  kernel/cgroup.c                            |   58 ++++++--
> >  kernel/cgroup_freezer.c                    |    9 +-
> >  kernel/cgroup_task_counter.c               |  239 ++++++++++++++++++++++++++++
> >  kernel/cpuset.c                            |    6 +-
> >  kernel/events/core.c                       |    5 +-
> >  kernel/exit.c                              |    2 +-
> >  kernel/fork.c                              |    7 +-
> >  kernel/res_counter.c                       |   97 ++++++++++--
> >  kernel/sched.c                             |    6 +-
> >  18 files changed, 644 insertions(+), 56 deletions(-)
> >  create mode 100644 Documentation/cgroups/task_counter.txt
> >  create mode 100644 kernel/cgroup_task_counter.c
> > 
> 
> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>

Thanks a lot!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 01/10] cgroups: Add res_counter_write_u64() API
  2011-10-04  0:17   ` Kirill A. Shutemov
@ 2011-10-11 13:44     ` Frederic Weisbecker
  0 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-10-11 13:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Tue, Oct 04, 2011 at 03:17:51AM +0300, Kirill A. Shutemov wrote:
> On Mon, Oct 03, 2011 at 09:07:03PM +0200, Frederic Weisbecker wrote:
> > Extend the resource counter API with a mirror of
> > res_counter_read_u64() to make it handy to update a resource
> > counter value from a cgroup subsystem u64 value file.
> > 
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > Acked-by: Paul Menage <paul@paulmenage.org>
> > Cc: Li Zefan <lizf@cn.fujitsu.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Aditya Kali <adityakali@google.com>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Kay Sievers <kay.sievers@vrfy.org>
> > Cc: Tim Hockin <thockin@hockin.org>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Kirill A. Shutemov <kirill@shutemov.name>
> > Cc: Containers <containers@lists.linux-foundation.org>
> > ---
> >  include/linux/res_counter.h |    2 ++
> >  kernel/res_counter.c        |   25 +++++++++++++++++++------
> >  2 files changed, 21 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index c9d625c..1b3fe05 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -82,6 +82,8 @@ int res_counter_memparse_write_strategy(const char *buf,
> >  int res_counter_write(struct res_counter *counter, int member,
> >  		      const char *buffer, write_strategy_fn write_strategy);
> >  
> > +void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
> > +
> >  /*
> >   * the field descriptors. one for each member of res_counter
> >   */
> > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > index 34683ef..0faafcc 100644
> > --- a/kernel/res_counter.c
> > +++ b/kernel/res_counter.c
> > @@ -168,12 +168,26 @@ int res_counter_memparse_write_strategy(const char *buf,
> >  	return 0;
> >  }
> >  
> > +void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
> > +{
> > +	unsigned long long *target;
> > +	unsigned long flags;
> > +
> > +	/*
> > +	 * We need the lock to protect against concurrent add/dec on 32 bits.
> > +	 * No need to ifdef; it's seldom used.
> > +	 */
> 
> Should we have two versions of res_counter_write_u64() for 32/64-bit hosts?
> As with res_counter_read_u64().

I thought about it, yeah.

But this is only used in the rare case where the user writes the value
from the cgroup filesystem. In practice the overhead won't be noticed.
I fear that the split would rather uglify the code with more ifdeffery.
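
For reference, the split would look roughly like this (a sketch
mirroring the existing res_counter_read_u64() ifdef):

#if BITS_PER_LONG == 32
void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
{
	unsigned long flags;

	/* A u64 store isn't atomic on 32-bit, hence the lock. */
	spin_lock_irqsave(&counter->lock, flags);
	*res_counter_member(counter, member) = val;
	spin_unlock_irqrestore(&counter->lock, flags);
}
#else
void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
{
	*res_counter_member(counter, member) = val;
}
#endif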

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-10-04 22:01 ` [PATCH 00/10] cgroups: Task counter subsystem v6 Andrew Morton
  2011-10-11 13:40   ` Frederic Weisbecker
@ 2011-10-25 20:06   ` Tim Hockin
  2011-10-28 23:30     ` Andrew Morton
  1 sibling, 1 reply; 44+ messages in thread
From: Tim Hockin @ 2011-10-25 20:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, LKML, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tejun Heo, Kirill A. Shutemov, Containers, Andrew Morton

On Tue, Oct 4, 2011 at 3:01 PM, Andrew Morton <akpm00@gmail.com> wrote:
> On Mon,  3 Oct 2011 21:07:02 +0200
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
>> Hi Andrew,
>>
>> This contains minor changes, mostly documentation and changelog
>> updates, off-case build fix, and a code optimization in
>> res_counter_common_ancestor().
>
> I'd normally duck a patch series like this when we're at -rc8 and ask
> for it to be resent late in -rc1.  But I was feeling frisky so I
> grabbed this lot for a bit of testing and will sit on it until -rc1.
>
> I'm still not convinced that the kernel has a burning need for a "task
> counter subsystem".  Someone convince me that we should merge this!

We have real (accidental) DoS situations which happen because we don't
have this.  It usually takes the form of some library not re-joining
threads.  We end up deploying a few apps linked against this library,
and suddenly we're in trouble on a machine.  Except, this being
Google, we're in trouble on a lot of machines.

There may be other ways to cobble this sort of safety together, but
they are less appealing for various reasons.  cgroups are how we
control groups of related pids.

I'd really love to be able to use this.


>> It's hard to put some statistic numbers while testing this feature
>> given that the result is rather binary: we launch a forkbomb and
>> either we stop and kill it or the system become unresponsive.
>>
>> Meanwhile, one can find a testsuite at this address:
>> https://tglx.de/~fweisbec/task_counter_test.tar.gz
>
> I do think that we should merge tests like this into the main tree.  So
> I can do "cd tests ; make ; ./run-tests".  The first step is for some hero
> to propose the (simple!) framework and to drop a first test in there.
>
>> It performs several checks to ensure the interface and the behaviour
>> are reliable after common events like moving tasks around over cgroups
>> in a hierarchy, forking inside, etc.. It also launches a forkbomb,
>> tries to stop and kill it. So beware, don't run it on a system that
>> is doing serious things.
>
> Good stuff, that.  Then, when people propose additions or fix bugs, I can
> whine at them for not updating the test suite.
>
>> Ensure you have CGROUP_TASK_COUNTER set
>> before, or it may compress the Ten Plagues in your MBR and
>> inflate the whole after your next reboot.
>
> That problem would need to be fixed.  Either probe for the feature
> up-front, or don't build the test at all if CONFIG_CGROUP_TASK_COUNTER=n.
>
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-10-25 20:06   ` Tim Hockin
@ 2011-10-28 23:30     ` Andrew Morton
  2011-10-29  9:38       ` Glauber Costa
  2011-11-03 17:00       ` Frederic Weisbecker
  0 siblings, 2 replies; 44+ messages in thread
From: Andrew Morton @ 2011-10-28 23:30 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Frederic Weisbecker, LKML, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tejun Heo, Kirill A. Shutemov, Containers

On Tue, 25 Oct 2011 13:06:35 -0700
Tim Hockin <thockin@hockin.org> wrote:

> On Tue, Oct 4, 2011 at 3:01 PM, Andrew Morton <akpm00@gmail.com> wrote:
> > On Mon,  3 Oct 2011 21:07:02 +0200
> > Frederic Weisbecker <fweisbec@gmail.com> wrote:
> >
> >> Hi Andrew,
> >>
> >> This contains minor changes, mostly documentation and changelog
> >> updates, off-case build fix, and a code optimization in
> >> res_counter_common_ancestor().
> >
> > I'd normally duck a patch series like this when we're at -rc8 and ask
> > for it to be resent late in -rc1.  But I was feeling frisky so I
> > grabbed this lot for a bit of testing and will sit on it until -rc1.
> >
> > I'm still not convinced that the kernel has a burning need for a "task
> > counter subsystem". __Someone convince me that we should merge this!
> 
> We have real (accidental) DoS situations which happen because we don't
> > have this.  It usually takes the form of some library not re-joining
> threads.  We end up deploying a few apps linked against this library,
> and suddenly we're in trouble on a machine.  Except, this being
> Google, we're in trouble on a lot of machines.

This is a bit foggy.  I think you mean that machines are experiencing
accidental forkbombs?

> There may be other ways to cobble this sort of safety together, but
> they are less appealing for various reasons.  cgroups are how we
> control groups of related pids.
> 
> I'd really love to be able to use this.

Has it been confirmed that this implementation actually solves the
problem?  ie: tested a bit?

btw, Frederic told me that this version of the patchset had a
serious problem so it's on hold pending an upgrade, regardless of other
matters.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-10-28 23:30     ` Andrew Morton
@ 2011-10-29  9:38       ` Glauber Costa
  2011-11-03 16:49         ` Frederic Weisbecker
  2011-11-03 17:00       ` Frederic Weisbecker
  1 sibling, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-10-29  9:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Hockin, Frederic Weisbecker, LKML, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tejun Heo, Kirill A. Shutemov, Containers, Glauber Costa

On Sat, Oct 29, 2011 at 1:30 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Tue, 25 Oct 2011 13:06:35 -0700
> Tim Hockin <thockin@hockin.org> wrote:
>
>> On Tue, Oct 4, 2011 at 3:01 PM, Andrew Morton <akpm00@gmail.com> wrote:
>> > On Mon,  3 Oct 2011 21:07:02 +0200
>> > Frederic Weisbecker <fweisbec@gmail.com> wrote:
>> >
>> >> Hi Andrew,
>> >>
>> >> This contains minor changes, mostly documentation and changelog
>> >> updates, off-case build fix, and a code optimization in
>> >> res_counter_common_ancestor().
>> >
>> > I'd normally duck a patch series like this when we're at -rc8 and ask
>> > for it to be resent late in -rc1.  But I was feeling frisky so I
>> > grabbed this lot for a bit of testing and will sit on it until -rc1.
>> >
>> > I'm still not convinced that the kernel has a burning need for a "task
>> > counter subsystem". __Someone convince me that we should merge this!
>>
>> We have real (accidental) DoS situations which happen because we don't
>> have this.  It usually takes the form of some library not re-joining
>> threads.  We end up deploying a few apps linked against this library,
>> and suddenly we're in trouble on a machine.  Except, this being
>> Google, we're in trouble on a lot of machines.
>
> This is a bit foggy.  I think you mean that machines are experiencing
> accidental forkbombs?
>
>> There may be other ways to cobble this sort of safety together, but
>> they are less appealing for various reasons.  cgroups are how we
>> control groups of related pids.
>>

At the end of the day, all cgroups are just groups of tasks. So I don't
really get the need for a dedicated cgroup subsystem to control the
number of tasks in the system.

Why don't we just allow all cgroups to have a limit on the number of
tasks they can hold?




-- 
Sent from my Atari.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-10-29  9:38       ` Glauber Costa
@ 2011-11-03 16:49         ` Frederic Weisbecker
  2011-11-03 16:58           ` Glauber Costa
  0 siblings, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2011-11-03 16:49 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Andrew Morton, Tim Hockin, LKML, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tejun Heo, Kirill A. Shutemov, Containers, Glauber Costa

On Sat, Oct 29, 2011 at 11:38:25AM +0200, Glauber Costa wrote:
> On Sat, Oct 29, 2011 at 1:30 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Tue, 25 Oct 2011 13:06:35 -0700
> > Tim Hockin <thockin@hockin.org> wrote:
> >
> >> On Tue, Oct 4, 2011 at 3:01 PM, Andrew Morton <akpm00@gmail.com> wrote:
> >> > On Mon,  3 Oct 2011 21:07:02 +0200
> >> > Frederic Weisbecker <fweisbec@gmail.com> wrote:
> >> >
> >> >> Hi Andrew,
> >> >>
> >> >> This contains minor changes, mostly documentation and changelog
> >> >> updates, off-case build fix, and a code optimization in
> >> >> res_counter_common_ancestor().
> >> >
> >> > I'd normally duck a patch series like this when we're at -rc8 and ask
> >> > for it to be resent late in -rc1.  But I was feeling frisky so I
> >> > grabbed this lot for a bit of testing and will sit on it until -rc1.
> >> >
> >> > I'm still not convinced that the kernel has a burning need for a "task
> >> > counter subsystem". __Someone convince me that we should merge this!
> >>
> >> We have real (accidental) DoS situations which happen because we don't
> >> have this.  It usually takes the form of some library not re-joining
> >> threads.  We end up deploying a few apps linked against this library,
> >> and suddenly we're in trouble on a machine.  Except, this being
> >> Google, we're in trouble on a lot of machines.
> >
> > This is a bit foggy.  I think you mean that machines are experiencing
> > accidental forkbombs?
> >
> >> There may be other ways to cobble this sort of safety together, but
> >> they are less appealing for various reasons.  cgroups are how we
> >> control groups of related pids.
> >>
> 
> At the end of the day, all cgroups are just groups of tasks. So I don't
> really get the need for a dedicated cgroup subsystem to control the
> number of tasks in the system.
> 
> Why don't we just allow all cgroups to have a limit on the number of
> tasks they can hold?

Not sure what you mean. You would prefer to have this as a core feature in
cgroups rather than a subsystem?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-11-03 16:49         ` Frederic Weisbecker
@ 2011-11-03 16:58           ` Glauber Costa
  2011-11-03 17:02             ` Paul Menage
  0 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-11-03 16:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Glauber Costa, Andrew Morton, Tim Hockin, LKML, Paul Menage,
	Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov,
	Kay Sievers, Tejun Heo, Kirill A. Shutemov, Containers,
	Paul Turner

On 11/03/2011 02:49 PM, Frederic Weisbecker wrote:
> On Sat, Oct 29, 2011 at 11:38:25AM +0200, Glauber Costa wrote:
>> On Sat, Oct 29, 2011 at 1:30 AM, Andrew Morton
>> <akpm@linux-foundation.org>  wrote:
>>> On Tue, 25 Oct 2011 13:06:35 -0700
>>> Tim Hockin<thockin@hockin.org>  wrote:
>>>
>>>> On Tue, Oct 4, 2011 at 3:01 PM, Andrew Morton<akpm00@gmail.com>  wrote:
>>>>> On Mon,  3 Oct 2011 21:07:02 +0200
>>>>> Frederic Weisbecker<fweisbec@gmail.com>  wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> This contains minor changes, mostly documentation and changelog
>>>>>> updates, off-case build fix, and a code optimization in
>>>>>> res_counter_common_ancestor().
>>>>>
>>>>> I'd normally duck a patch series like this when we're at -rc8 and ask
>>>>> for it to be resent late in -rc1.  But I was feeling frisky so I
>>>>> grabbed this lot for a bit of testing and will sit on it until -rc1.
>>>>>
>>>>> I'm still not convinced that the kernel has a burning need for a "task
>>>>> counter subsystem". __Someone convince me that we should merge this!
>>>>
>>>> We have real (accidental) DoS situations which happen because we don't
>>>> have this.  It usually takes the form of some library not re-joining
>>>> threads.  We end up deploying a few apps linked against this library,
>>>> and suddenly we're in trouble on a machine.  Except, this being
>>>> Google, we're in trouble on a lot of machines.
>>>
>>> This is a bit foggy.  I think you mean that machines are experiencing
>>> accidental forkbombs?
>>>
>>>> There may be other ways to cobble this sort of safety together, but
>>>> they are less appealing for various reasons.  cgroups are how we
>>>> control groups of related pids.
>>>>
>>
>> At the end of the day, all cgroups are just groups of tasks. So I don't
>> really get the need for a dedicated cgroup subsystem to control the
>> number of tasks in the system.
>>
>> Why don't we just allow all cgroups to have a limit on the number of
>> tasks they can hold?
>
> Not sure what you mean. You would prefer to have this as a core feature in
> cgroups rather than a subsystem?
Well, ideally, I think we should put some effort into trying to reduce
the number of different possible cgroup subsystems.

I do see how keeping a separate cgroup subsystem here adds flexibility.
However, this flexibility very easily translates into performance
losses. The reason is that when more than one cgroup subsystem needs to
control and update some piece of data, because we can't assume anything
about the set of processes they have, we have to walk hierarchies
upwards multiple times - they are potentially different.

See for instance what happens with cpu vs cpuacct, which I am trying
to get rid of.

Because you are controlling tasks, and tasks are the main building
block of all cgroups, I think you should at least consider either using
a cgroup property, or bundling this into some other cgroup, like cpu -
where there is already some need, albeit minor, to keep track of the
number of processes in a group.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-10-28 23:30     ` Andrew Morton
  2011-10-29  9:38       ` Glauber Costa
@ 2011-11-03 17:00       ` Frederic Weisbecker
  2011-11-04  2:57         ` Li Zefan
  1 sibling, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2011-11-03 17:00 UTC (permalink / raw)
  To: Andrew Morton, Tim Hockin
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tejun Heo, Kirill A. Shutemov,
	Containers

On Fri, Oct 28, 2011 at 04:30:21PM -0700, Andrew Morton wrote:
> On Tue, 25 Oct 2011 13:06:35 -0700
> Tim Hockin <thockin@hockin.org> wrote:
> 
> > On Tue, Oct 4, 2011 at 3:01 PM, Andrew Morton <akpm00@gmail.com> wrote:
> > > On Mon,  3 Oct 2011 21:07:02 +0200
> > > Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > >
> > >> Hi Andrew,
> > >>
> > >> This contains minor changes, mostly documentation and changelog
> > >> updates, off-case build fix, and a code optimization in
> > >> res_counter_common_ancestor().
> > >
> > > I'd normally duck a patch series like this when we're at -rc8 and ask
> > > for it to be resent late in -rc1.  But I was feeling frisky so I
> > > grabbed this lot for a bit of testing and will sit on it until -rc1.
> > >
> > > I'm still not convinced that the kernel has a burning need for a "task
> > > counter subsystem". __Someone convince me that we should merge this!
> > 
> > We have real (accidental) DoS situations which happen because we don't
> > have this.  It usually takes the form of some library not re-joining
> > threads.  We end up deploying a few apps linked against this library,
> > and suddenly we're in trouble on a machine.  Except, this being
> > Google, we're in trouble on a lot of machines.
> 
> This is a bit foggy.  I think you mean that machines are experiencing
> accidental forkbombs?

I'd like to hear more details as well.

> 
> > There may be other ways to cobble this sort of safety together, but
> > they are less appealing for various reasons.  cgroups are how we
> > control groups of related pids.
> > 
> > I'd really love to be able to use this.
> 
> Has it been confirmed that this implementation actually solves the
> problem?  ie: tested a bit?
> 
> btw, Frederic told me that this version of the patchset had some
> serious problem so it's on hold pending an upgrade, regardless of other
> matters.

Yep. The particular issue is https://lkml.org/lkml/2011/10/13/532

Li Zefan proposed a fix (https://lkml.org/lkml/2011/10/17/26) which I'm
currently reworking.

But then I'd love it if you can test this subsystem to see if it really matches
your needs, Tim.

Thanks!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-11-03 16:58           ` Glauber Costa
@ 2011-11-03 17:02             ` Paul Menage
  2011-11-03 17:06               ` Glauber Costa
  0 siblings, 1 reply; 44+ messages in thread
From: Paul Menage @ 2011-11-03 17:02 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Frederic Weisbecker, Glauber Costa, Andrew Morton, Tim Hockin,
	LKML, Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov,
	Kay Sievers, Tejun Heo, Kirill A. Shutemov, Containers,
	Paul Turner

On Thu, Nov 3, 2011 at 9:58 AM, Glauber Costa <glommer@parallels.com> wrote:
>
> Because you are controlling tasks, and tasks are the main building block of
> all cgroups, I think you should at least consider either using
> a cgroup property,

I don't see how making it a core cgroup property would remove the need
to walk the hierarchy.

Paul

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-11-03 17:02             ` Paul Menage
@ 2011-11-03 17:06               ` Glauber Costa
  2011-11-03 17:28                 ` Paul Menage
  0 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-11-03 17:06 UTC (permalink / raw)
  To: Paul Menage
  Cc: Frederic Weisbecker, Glauber Costa, Andrew Morton, Tim Hockin,
	LKML, Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov,
	Kay Sievers, Tejun Heo, Kirill A. Shutemov, Containers,
	Paul Turner

On 11/03/2011 03:02 PM, Paul Menage wrote:
> On Thu, Nov 3, 2011 at 9:58 AM, Glauber Costa<glommer@parallels.com>  wrote:
>>
>> Because you are controlling tasks, and tasks are the main building block of
>> all cgroups, I think you should at least consider either using
>> a cgroup property,
>
> I don't see how making it a core cgroup property would remove the need
> to walk the hierarchy.
>
Sorry if I wasn't clear: It removes the need to walk multiple 
independent hierarchies. The walk is done only once.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-11-03 17:06               ` Glauber Costa
@ 2011-11-03 17:28                 ` Paul Menage
  2011-11-03 17:35                   ` Glauber Costa
  0 siblings, 1 reply; 44+ messages in thread
From: Paul Menage @ 2011-11-03 17:28 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Frederic Weisbecker, Glauber Costa, Andrew Morton, Tim Hockin,
	LKML, Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov,
	Kay Sievers, Tejun Heo, Kirill A. Shutemov, Containers,
	Paul Turner

On Thu, Nov 3, 2011 at 10:06 AM, Glauber Costa <glommer@parallels.com> wrote:
> Sorry if I wasn't clear: It removes the need to walk multiple independent
> hierarchies. The walk is done only once.

You're talking about at fork time, and the concern is the cache
footprint involved in walking up the parent pointer chain?

Isn't that an argument against multiple hierarchies (which is a
decision for the admin), rather than against more subsystem
flexibility? If multiple subsystems on the same hierarchy each need to
walk up the pointer chain on the same event, then after the first
subsystem has done so the chain will be in cache for any subsequent
walks from other subsystems.

Paul

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-11-03 17:28                 ` Paul Menage
@ 2011-11-03 17:35                   ` Glauber Costa
  2011-11-03 17:56                     ` Paul Menage
  0 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-11-03 17:35 UTC (permalink / raw)
  To: Paul Menage
  Cc: Frederic Weisbecker, Glauber Costa, Andrew Morton, Tim Hockin,
	LKML, Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov,
	Kay Sievers, Tejun Heo, Kirill A. Shutemov, Containers,
	Paul Turner

On 11/03/2011 03:28 PM, Paul Menage wrote:
> On Thu, Nov 3, 2011 at 10:06 AM, Glauber Costa<glommer@parallels.com>  wrote:
>> Sorry if I wasn't clear: It removes the need to walk multiple independent
>> hierarchies. The walk is done only once.
>
> You're talking about at fork time, and the concern is the cache
> footprint involved in walking up the parent pointer chain?

Yes, we can say this is my main concern.

> Isn't that an argument against multiple hierarchies (which is a
> decision for the admin), rather than against more subsystem
> flexibility?

It is not always a decision for the admin. In most cases, it is a
constraint of the problem. For containers - take lxc as an example -
the most reasonable thing to do is to grab all available cgroup
subsystems and contain them.

> If multiple subsystems on the same hierarchy each need to
> walk up the pointer chain on the same event, then after the first
> subsystem has done so the chain will be in cache for any subsequent
> walks from other subsystems.
No, it won't. Precisely because different subsystems have completely
independent pointer chains.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-11-03 17:35                   ` Glauber Costa
@ 2011-11-03 17:56                     ` Paul Menage
  2011-11-04 13:17                       ` Glauber Costa
  0 siblings, 1 reply; 44+ messages in thread
From: Paul Menage @ 2011-11-03 17:56 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Frederic Weisbecker, Glauber Costa, Andrew Morton, Tim Hockin,
	LKML, Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov,
	Kay Sievers, Tejun Heo, Kirill A. Shutemov, Containers,
	Paul Turner

On Thu, Nov 3, 2011 at 10:35 AM, Glauber Costa <glommer@parallels.com> wrote:
>
>> If multiple subsystems on the same hierarchy each need to
>> walk up the pointer chain on the same event, then after the first
>> subsystem has done so the chain will be in cache for any subsequent
>> walks from other subsystems.
>
> No, it won't. Precisely because different subsystems have completely
> independent pointer chains.

Because they're following res_counter parent pointers, etc, rather
than using the single cgroups parent pointer chain?

So if that's the problem, rather than artificially constrain
flexibility in order to improve micro-benchmarks, why not come up with
approaches that keep both the flexibility and the performance?

- make res_counter hierarchies be explicitly defined via the cgroup
parent pointers, rather than a parent pointer hidden inside the
res_counter. So the cgroup parent chain traversal would all be along
the common parent pointers (and res_counter would be one pointer
smaller).

- allow subsystems to specify that they need a small amount of data
that can be accessed efficiently up the cgroup chain. (Many subsystems
wouldn't need this, and those that do would likely only need it for a
subset of their per-cgroup data). Pack this data into as few
cachelines as possible, allocated as a single lump of memory per
cgroup. Each subsystem would know where in that allocation its private
data lay (it would be the same offset for every cgroup, although
dynamically determined at runtime based on the number of subsystems
mounted on that hierarchy)
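
In rough shape, that second option could look something like this (all
names invented for illustration):

/* One contiguous "hot data" lump per cgroup, with a single shared
 * parent chain for upward walks. */
struct cgroup_hot {
	struct cgroup_hot *parent;	/* the one chain everyone walks */
	char slots[];			/* per-subsystem private areas */
};

/* Each subsystem learns its slot offset once per hierarchy, then hot
 * paths like fork() walk a single cache-friendly chain. */
static inline void *cgroup_hot_priv(struct cgroup_hot *hot, size_t offset)
{
	return hot->slots + offset;
}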

Paul

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-11-03 17:00       ` Frederic Weisbecker
@ 2011-11-04  2:57         ` Li Zefan
  2011-11-04 12:37           ` Frederic Weisbecker
  0 siblings, 1 reply; 44+ messages in thread
From: Li Zefan @ 2011-11-04  2:57 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, Tim Hockin, LKML, Paul Menage, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tejun Heo,
	Kirill A. Shutemov, Containers

>>> There may be other ways to cobble this sort of safety together, but
>>> they are less appealing for various reasons.  cgroups are how we
>>> control groups of related pids.
>>>
>>> I'd really love to be able to use this.
>>
>> Has it been confirmed that this implementation actually solves the
>> problem?  ie: tested a bit?
>>
>> btw, Frederic told me that this version of the patchset had a
>> serious problem so it's on hold pending an upgrade, regardless of other
>> matters.
> 
> Yep. The particular issue is https://lkml.org/lkml/2011/10/13/532
> 
> Li Zefan proposed a fix (https://lkml.org/lkml/2011/10/17/26) which I'm
> currently reworking.
> 

We really need to coordinate cgroup patches. I mean, the patchset+fix
conflicts with Tejun's work, and the conflict is not trivial.

> But then I'd love it if you can test this subsystem to see if it really matches
> your needs, Tim.
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-11-04  2:57         ` Li Zefan
@ 2011-11-04 12:37           ` Frederic Weisbecker
  0 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-11-04 12:37 UTC (permalink / raw)
  To: Li Zefan
  Cc: Andrew Morton, Tim Hockin, LKML, Paul Menage, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tejun Heo,
	Kirill A. Shutemov, Containers

On Fri, Nov 04, 2011 at 10:57:33AM +0800, Li Zefan wrote:
> >>> There may be other ways to cobble this sort of safety together, but
> >>> they are less appealing for various reasons.  cgroups are how we
> >>> control groups of related pids.
> >>>
> >>> I'd really love to be able to use this.
> >>
> >> Has it been confirmed that this implementation actually solves the
> >> problem?  ie: tested a bit?
> >>
> >> btw, Frederic told me that this version of the patchset had a
> >> serious problem so it's on hold pending an upgrade, regardless of other
> >> matters.
> > 
> > Yep. The particular issue is https://lkml.org/lkml/2011/10/13/532
> > 
> > Li Zefan proposed a fix (https://lkml.org/lkml/2011/10/17/26) which I'm
> > currently reworking.
> > 
> 
> We really need to coordinate cgroup patches. I mean, the patchset+fix conflict
> with Tejun's work, and the conflict is not trivial.

Either Tejun targets -mm, or I try to get my patches into the pm
tree where Tejun's patches are aimed. I would just like to keep Andrew
in the loop on my patches somehow.

Also, it might be time for you and/or Paul Menage to run a cgroup git
tree, what do you think? :)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-11-03 17:56                     ` Paul Menage
@ 2011-11-04 13:17                       ` Glauber Costa
  0 siblings, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-11-04 13:17 UTC (permalink / raw)
  To: Paul Menage
  Cc: Frederic Weisbecker, Glauber Costa, Andrew Morton, Tim Hockin,
	LKML, Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov,
	Kay Sievers, Tejun Heo, Kirill A. Shutemov, Containers,
	Paul Turner, luksow, cgroups

On 11/03/2011 03:56 PM, Paul Menage wrote:
> On Thu, Nov 3, 2011 at 10:35 AM, Glauber Costa<glommer@parallels.com>  wrote:
>>
>>> If multiple subsystems on the same hierarchy each need to
>>> walk up the pointer chain on the same event, then after the first
>>> subsystem has done so the chain will be in cache for any subsequent
>>> walks from other subsystems.
>>
>> No, it won't. Precisely because different subsystems have completely
>> independent pointer chains.
>
> Because they're following res_counter parent pointers, etc, rather
> than using the single cgroups parent pointer chain?

No. Because:

/sys/fs/cgroup/my_subsys/
/sys/fs/cgroup/my_subsys/foo1
/sys/fs/cgroup/my_subsys/foo2
/sys/fs/cgroup/my_subsys/foo1/bar1

and:

/sys/fs/cgroup/my_subsys2/
/sys/fs/cgroup/my_subsys2/foo1
/sys/fs/cgroup/my_subsys2/foo1/bar1
/sys/fs/cgroup/my_subsys2/foo1/bar2

are completely independent pointer chains. The only thing they share is
the pointer to the root, and that's irrelevant in the pointer dance.
Also note that I used cpu and cpuacct as an example, and they don't use 
res_counters.

> So if that's the problem, rather than artificially constrain
> flexibility in order to improve micro-benchmarks, why not come up with
> approaches that keep both the flexibility and the performance?

Well, I am not opposed to that even if you happen to agree with what I
said above. But at the end of the day, with many cgroup subsystems
appearing, it may not be just about micro-benchmarks.

It is hard to draw the line, but I believe that avoiding the creation
of new cgroup subsystems when possible plays in our favor.

Specifically for this one, my arguments are:

* cgroups are a task-grouping entity
* therefore, all cgroups already do some task manipulation in attach/detach
* all cgroup subsystems can already register a fork handler

Adding a fork limit as a cgroup property seems a logical step to me 
based on that.

If, however, we are really creating this, I think we'd be better off
referring to it as a "Task Controller" rather than a "Task Counter".

Then, at least in the near future when people start trying to limit
other task-related resources, this can serve as a natural placeholder
for them. (See the syscall limiting that Lukasz is trying to achieve.)

>
> - make res_counter hierarchies be explicitly defined via the cgroup
> parent pointers, rather than an parent pointer hidden inside the
> res_counter. So the cgroup parent chain traversal would all be along
> the common parent pointers (and res_counter would be one pointer
> smaller).
 >
>
> - allow subsystems to specify that they need a small amount of data
> that can be accessed efficiently up the cgroup chain. (Many subsystems
> wouldn't need this, and those that do would likely only need it for a
> subset of their per-cgroup data). Pack this data into as few
> cachelines as possible, allocated as a single lump of memory per
> cgroup. Each subsystem would know where in that allocation its private
> data lay (it would be the same offset for every cgroup, although
> dynamically determined at runtime based on the number of subsystems
> mounted on that hierarchy)
I thought about this second one myself.
I am not yet convinced it would be a win, but I believe it has a chance.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-10-03 19:07 [PATCH 00/10] cgroups: Task counter subsystem v6 Frederic Weisbecker
                   ` (11 preceding siblings ...)
  2011-10-06  6:51 ` Li Zefan
@ 2011-12-13 15:58 ` Tejun Heo
  2011-12-13 19:06   ` Frederic Weisbecker
  12 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2011-12-13 15:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin,
	Kirill A. Shutemov, Containers

Hello, Frederic.

Can you please rebase the patchset on top of cgroup/for-3.3?  I
primarily like the idea of being able to track process usage w/ cgroup
and enforce limits on it but hope that it could somehow integrate w/
cgroup freezer.  ie. trigger freezer if it goes over limit and let the
userland tool / administrator deal with the frozen cgroup.  I'm
planning on extending cgroup freezer such that it supports recursive
freezing and killing of frozen tasks.  If we can fit task counters
into that, we'll have a general method of handling problematic cgroups -
freeze, notify userland and let it deal with it.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-12-13 15:58 ` Tejun Heo
@ 2011-12-13 19:06   ` Frederic Weisbecker
  2011-12-13 20:49     ` Tejun Heo
  0 siblings, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2011-12-13 19:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin,
	Kirill A. Shutemov, Containers

On Tue, Dec 13, 2011 at 07:58:48AM -0800, Tejun Heo wrote:
> Hello, Frederic.
> 
> Can you please rebase the patchset on top of cgroup/for-3.3?

Sure. But please note that its fate is still under discussion; whether
we want it upstream is still an open debate. But I certainly need
to rebase against your tree.

> I primarily like the idea of being able to track process usage w/ cgroup
> and enforce limits on it but hope that it could somehow integrate w/
> cgroup freezer.  ie. trigger freezer if it goes over limit and let the
> userland tool / administrator deal with the frozen cgroup.  I'm
> planning on extending cgroup freezer such that it supports recursive
> freezing and killing of frozen tasks.  If we can fit task counters
> into that, we'll have a general method of handling problematic cgroups -
> freeze, notify userland and let it deal with it.

Hmm, so you suggest a kernel trigger that freezes the cgroup when the
task limit is reached?

What about instead implementing register_event() for tasks.usage, such
that the user can be notified through an eventfd when the limit is
reached? Then it would be up to the user to decide whether to freeze or
do anything else. Sounds like a more generic solution.
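
Roughly like memcg's threshold notifications, i.e. something like this
from userspace (a sketch: event_control support for tasks.usage would
be new here, the path and threshold are examples, and error handling
is omitted for brevity):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	int efd = eventfd(0, 0);
	int ufd = open("/sys/fs/cgroup/tasks/lxc0/tasks.usage", O_RDONLY);
	int cfd = open("/sys/fs/cgroup/tasks/lxc0/cgroup.event_control",
		       O_WRONLY);
	char buf[64];
	uint64_t hits;

	/* Register "<event_fd> <target_fd> <threshold>", as memcg does. */
	snprintf(buf, sizeof(buf), "%d %d %d", efd, ufd, 900);
	write(cfd, buf, strlen(buf));

	/* Blocks until the usage threshold is crossed. */
	read(efd, &hits, sizeof(hits));
	puts("tasks.usage threshold crossed, freeze or kill the cgroup");
	return 0;
}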

Hm?

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-12-13 19:06   ` Frederic Weisbecker
@ 2011-12-13 20:49     ` Tejun Heo
  2011-12-14 15:07       ` Frederic Weisbecker
  0 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2011-12-13 20:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin,
	Kirill A. Shutemov, Containers

Hello,

On Tue, Dec 13, 2011 at 08:06:46PM +0100, Frederic Weisbecker wrote:
> On Tue, Dec 13, 2011 at 07:58:48AM -0800, Tejun Heo wrote:
> > Can you please rebase the patchset on top of cgroup/for-3.3?
> 
> Sure. But please note its fate is still under discussion. Whether
> we want it upstream is still a running debate. But I certainly
> need to rebase against your tree.

I see.

> > I primarily like the idea of being able to track process usage w/ cgroup
> > and enforce limits on it but hope that it could somehow integrate w/
> > cgroup freezer.  ie. trigger freezer if it goes over limit and let the
> > userland tool / administrator deal with the frozen cgroup.  I'm
> > planning on extending cgroup freezer such that it supports recursive
> > freezing and killing of frozen tasks.  If we can fit task counters
> > into that, we'll have a general method of handling problematic cgroups -
> > freeze, notify userland and let it deal with it.
> 
> Hmm, so you suggest a kernel trigger that freezes the cgroup when the
> task limit is reached?

Yeah, something like that.  I'm not really sure about how it would
actually work tho.

> What about instead implementing register_event() for tasks.usage, such
> that the user can be notified through an eventfd when the limit is
> reached? Then it would be up to the user to decide whether to freeze or
> do anything else. Sounds like a more generic solution.

Maybe, the problem would be how to ensure that the userland manager
can respond fast enough (whatever that means...).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v6
  2011-12-13 20:49     ` Tejun Heo
@ 2011-12-14 15:07       ` Frederic Weisbecker
  0 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2011-12-14 15:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin,
	Kirill A. Shutemov, Containers

On Tue, Dec 13, 2011 at 12:49:18PM -0800, Tejun Heo wrote:
> Hello,
> 
> On Tue, Dec 13, 2011 at 08:06:46PM +0100, Frederic Weisbecker wrote:
> > On Tue, Dec 13, 2011 at 07:58:48AM -0800, Tejun Heo wrote:
> > > Can you please rebase the patchset on top of cgroup/for-3.3?
> > 
> > Sure. But please note its fate is still under discussion. Whether
> > we want it upstream is still a running debate. But I certainly
> > need to rebase against your tree.
> 
> I see.
> 
> > > I primarily like the idea of being able to track process usage w/ cgroup
> > > and enforce limits on it but hope that it could somehow integrate w/
> > > cgroup freezer.  ie. trigger freezer if it goes over limit and let the
> > > userland tool / administrator deal with the frozen cgroup.  I'm
> > > planning on extending cgroup freezer such that it supports recursive
> > > freezing and killing of frozen tasks.  If we can fit task counters
> > > into that, we'll have a general method of handling problematic cgroups -
> > > freeze, notify userland and let it deal with it.
> > 
> > Hmm, so you suggest a kernel trigger that freezes the cgroup when the
> > task limit is reached?
> 
> Yeah, something like that.  I'm not really sure about how it would
> actually work tho.
> 
> > What about instead implementing register_event() for tasks.usage, such
> > that the user can be notified through an eventfd when the limit is
> > reached? Then it would be up to the user to decide whether to freeze or
> > do anything else. Sounds like a more generic solution.
> 
> Maybe, the problem would be how to ensure that the userland manager
> can respond fast enough (whatever that means...).

Yeah, that's part of the goal of the task counter: limit the spread
of the forkbomb early enough that the machine stays responsive and
the admin can react accordingly.
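
i.e. once notified, the admin-side reaction can stay as simple as this
(a sketch: I'm assuming a tasks.limit knob matching the tasks.usage
naming, the path is an example, and error handling is omitted):

#include <signal.h>
#include <stdio.h>

int main(void)
{
	const char *cg = "/sys/fs/cgroup/tasks/lxc0";
	char path[128];
	FILE *f;
	int pid;

	/* Clamp the limit to 0: every further fork in the cgroup fails. */
	snprintf(path, sizeof(path), "%s/tasks.limit", cg);
	f = fopen(path, "w");
	fputs("0\n", f);
	fclose(f);

	/* Now the task list can't grow behind our back: kill it all,
	 * without having to retry an unbounded number of times. */
	snprintf(path, sizeof(path), "%s/cgroup.procs", cg);
	f = fopen(path, "r");
	while (fscanf(f, "%d", &pid) == 1)
		kill(pid, SIGKILL);
	fclose(f);
	return 0;
}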

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2011-12-14 15:08 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-03 19:07 [PATCH 00/10] cgroups: Task counter subsystem v6 Frederic Weisbecker
2011-10-03 19:07 ` [PATCH 01/10] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
2011-10-04  0:17   ` Kirill A. Shutemov
2011-10-11 13:44     ` Frederic Weisbecker
2011-10-03 19:07 ` [PATCH 02/10] cgroups: New resource counter inheritance API Frederic Weisbecker
2011-10-04  0:20   ` Kirill A. Shutemov
2011-10-03 19:07 ` [PATCH 03/10] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks Frederic Weisbecker
2011-10-04  0:22   ` Kirill A. Shutemov
2011-10-03 19:07 ` [PATCH 04/10] cgroups: New cancel_attach_task subsystem callback Frederic Weisbecker
2011-10-04  0:27   ` Kirill A. Shutemov
2011-10-03 19:07 ` [PATCH 05/10] cgroups: Ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
2011-10-04  0:41   ` Kirill A. Shutemov
2011-10-03 19:07 ` [PATCH 06/10] cgroups: Add res counter common ancestor searching Frederic Weisbecker
2011-10-03 19:07 ` [PATCH 07/10] res_counter: Allow charge failure pointer to be null Frederic Weisbecker
2011-10-04  1:30   ` Kirill A. Shutemov
2011-10-03 19:07 ` [PATCH 08/10] cgroups: Pull up res counter charge failure interpretation to caller Frederic Weisbecker
2011-10-04  1:32   ` Kirill A. Shutemov
2011-10-03 19:07 ` [PATCH 09/10] cgroups: Allow subsystems to cancel a fork Frederic Weisbecker
2011-10-04  1:38   ` Kirill A. Shutemov
2011-10-03 19:07 ` [PATCH 10/10] cgroups: Add a task counter subsystem Frederic Weisbecker
2011-10-06  9:23   ` Kirill A. Shutemov
2011-10-11 13:41     ` Frederic Weisbecker
2011-10-04 22:01 ` [PATCH 00/10] cgroups: Task counter subsystem v6 Andrew Morton
2011-10-11 13:40   ` Frederic Weisbecker
2011-10-25 20:06   ` Tim Hockin
2011-10-28 23:30     ` Andrew Morton
2011-10-29  9:38       ` Glauber Costa
2011-11-03 16:49         ` Frederic Weisbecker
2011-11-03 16:58           ` Glauber Costa
2011-11-03 17:02             ` Paul Menage
2011-11-03 17:06               ` Glauber Costa
2011-11-03 17:28                 ` Paul Menage
2011-11-03 17:35                   ` Glauber Costa
2011-11-03 17:56                     ` Paul Menage
2011-11-04 13:17                       ` Glauber Costa
2011-11-03 17:00       ` Frederic Weisbecker
2011-11-04  2:57         ` Li Zefan
2011-11-04 12:37           ` Frederic Weisbecker
2011-10-06  6:51 ` Li Zefan
2011-10-11 13:41   ` Frederic Weisbecker
2011-12-13 15:58 ` Tejun Heo
2011-12-13 19:06   ` Frederic Weisbecker
2011-12-13 20:49     ` Tejun Heo
2011-12-14 15:07       ` Frederic Weisbecker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).