* [PATCH 00/12 v4][RESEND] cgroups: Task counter subsystem
@ 2011-09-06  0:12 Frederic Weisbecker
  2011-09-06  0:12 ` [PATCH 01/12] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
                   ` (11 more replies)
  0 siblings, 12 replies; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-06  0:12 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo

(Sorry, I had unbalanced "<" and ">" around email addresses, so I need
to resend; otherwise LKML won't take it.)


Reminder:

The task counter is a cgroup subsystem that enforces an upper bound
on the number of tasks running inside a cgroup.

It has two intended use cases, but possibly more can be found:

- Protect against fork bombs in the scope of a cgroup. This can be
especially useful in the case of containers, where the traditional
rlimit is too limited.

- Kill all tasks inside a cgroup without worrying about races against
concurrent forks.

See documentation in patch 10/12 for more details.

Maybe I should rebase the whole series against Tejun's patches that
use an iterator on tasks.

Changes in v4:

- Rebase on top of "cgroups: Don't attach task to subsystem if
  migration failed" (https://lkml.org/lkml/2011/8/26/262), applied
  in -mm.
- Add comment about the use of spinlock in res_counter_write_u64() (patch 1)
- Update documentation after [can_]attach_task() arguments changes (patch 3)
- Update documentation for cancel_attach_task() new callback (patch 4)
- Cancel task attachment on migration failure in cgroup_attach_proc() (patch 4)
- Some function inlining (patch 5)
- Move "Allow charge failure pointer to be null" earlier in the set for
  sanity (patch 7)
- Fix error return value (patch 8 and 9)
- Whitespace fixes, comments added, improve config help section (patch 9)
- Drop subsystem early init (patch 9)
- Add subsystem documentation (patch 10)
- Remove ad-hoc fork hook and reuse existing one for cgroup (patch 11 and 12)

This can be pulled from:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
        cgroup/nr_proc-v4

Frederic Weisbecker (12):
  cgroups: Add res_counter_write_u64() API
  cgroups: New resource counter inheritance API
  cgroups: Add previous cgroup in can_attach_task/attach_task callbacks
  cgroups: New cancel_attach_task subsystem callback
  cgroups: Ability to stop res charge propagation on bounded ancestor
  cgroups: Add res counter common ancestor searching
  res_counter: Allow charge failure pointer to be null
  cgroups: Pull up res counter charge failure interpretation to caller
  cgroups: Add a task counter subsystem
  cgroups: Add documentation for task counter subsystem
  cgroups: Allow subsystems to cancel a fork
  cgroups: Convert task counter to use the subsys fork callback

 Documentation/cgroups/cgroups.txt      |   13 ++-
 Documentation/cgroups/task_counter.txt |  126 ++++++++++++++++++++
 block/blk-cgroup.c                     |   10 +-
 include/linux/cgroup.h                 |   20 ++-
 include/linux/cgroup_subsys.h          |    8 ++
 include/linux/res_counter.h            |   25 ++++-
 init/Kconfig                           |    7 +
 kernel/Makefile                        |    1 +
 kernel/cgroup.c                        |   56 +++++++--
 kernel/cgroup_freezer.c                |    9 +-
 kernel/cgroup_task_counter.c           |  200 ++++++++++++++++++++++++++++++++
 kernel/cpuset.c                        |    6 +-
 kernel/events/core.c                   |    5 +-
 kernel/exit.c                          |    2 +-
 kernel/fork.c                          |    7 +-
 kernel/res_counter.c                   |   77 ++++++++++---
 kernel/sched.c                         |    6 +-
 17 files changed, 524 insertions(+), 54 deletions(-)
 create mode 100644 Documentation/cgroups/task_counter.txt
 create mode 100644 kernel/cgroup_task_counter.c

-- 
1.7.5.4



* [PATCH 01/12] cgroups: Add res_counter_write_u64() API
  2011-09-06  0:12 [PATCH 00/12 v4][RESEND] cgroups: Task counter subsystem Frederic Weisbecker
@ 2011-09-06  0:12 ` Frederic Weisbecker
  2011-09-06  0:12 ` [PATCH 02/12] cgroups: New resource counter inheritance API Frederic Weisbecker
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-06  0:12 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo

Extend the resource counter API with a mirror of
res_counter_read_u64() to make it handy to update a resource
counter value from a cgroup subsystem u64 value file.
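
For illustration, a subsystem could wire this up to a cgroup u64 file
handler roughly as below (the foo_* names and the cgroup_to_foo_res()
helper are only illustrative; the real user shows up in the task
counter patch later in this series):

	static int foo_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
	{
		/* cft->private selects the res_counter member, e.g. RES_LIMIT */
		res_counter_write_u64(cgroup_to_foo_res(cgrp), cft->private, val);
		return 0;
	}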

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   25 +++++++++++++++++++------
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..1b3fe05 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -82,6 +82,8 @@ int res_counter_memparse_write_strategy(const char *buf,
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buffer, write_strategy_fn write_strategy);
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 34683ef..0faafcc 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -168,12 +168,26 @@ int res_counter_memparse_write_strategy(const char *buf,
 	return 0;
 }
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	unsigned long long *target;
+	unsigned long flags;
+
+	/*
+	 * We need the lock to protect against concurrent add/dec on 32 bits.
+	 * No need to ifdef this, as it's seldom used.
+	 */
+	spin_lock_irqsave(&counter->lock, flags);
+	target = res_counter_member(counter, member);
+	*target = val;
+	spin_unlock_irqrestore(&counter->lock, flags);
+}
+
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buf, write_strategy_fn write_strategy)
 {
 	char *end;
-	unsigned long flags;
-	unsigned long long tmp, *val;
+	unsigned long long tmp;
 
 	if (write_strategy) {
 		if (write_strategy(buf, &tmp))
@@ -183,9 +197,8 @@ int res_counter_write(struct res_counter *counter, int member,
 		if (*end != '\0')
 			return -EINVAL;
 	}
-	spin_lock_irqsave(&counter->lock, flags);
-	val = res_counter_member(counter, member);
-	*val = tmp;
-	spin_unlock_irqrestore(&counter->lock, flags);
+
+	res_counter_write_u64(counter, member, tmp);
+
 	return 0;
 }
-- 
1.7.5.4



* [PATCH 02/12] cgroups: New resource counter inheritance API
  2011-09-06  0:12 [PATCH 00/12 v4][RESEND] cgroups: Task counter subsystem Frederic Weisbecker
  2011-09-06  0:12 ` [PATCH 01/12] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
@ 2011-09-06  0:12 ` Frederic Weisbecker
  2011-09-06 22:17   ` Andrew Morton
  2011-09-06  0:12 ` [PATCH 03/12] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks Frederic Weisbecker
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-06  0:12 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo

Provide an API to inherit a counter value from a parent.
This can be useful to implement cgroup.clone_children on
a resource counter.

Note that the resources of the children remain limited by those
of the parent anyway; this only provides a default setting
behaviour when clone_children is set.

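For example, a subsystem could propagate the parent's limit from its
post_clone() callback along these lines (sketch only, with made-up
foo_* names and a hypothetical cgroup_to_foo_res() helper; the task
counter patch later in this series does just that):

	static void foo_post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
	{
		/* cgrp is never the root here, so its counter always exists */
		res_counter_inherit(cgroup_to_foo_res(cgrp), RES_LIMIT);
	}
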
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   12 ++++++++++++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 1b3fe05..109d118 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -84,6 +84,8 @@ int res_counter_write(struct res_counter *counter, int member,
 
 void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
 
+void res_counter_inherit(struct res_counter *counter, int member);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 0faafcc..1349115 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -202,3 +202,15 @@ int res_counter_write(struct res_counter *counter, int member,
 
 	return 0;
 }
+
+void res_counter_inherit(struct res_counter *counter, int member)
+{
+	struct res_counter *parent;
+	unsigned long long val;
+
+	parent = counter->parent;
+	if (parent) {
+		val = res_counter_read_u64(parent, member);
+		res_counter_write_u64(counter, member, val);
+	}
+}
-- 
1.7.5.4



* [PATCH 03/12] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks
  2011-09-06  0:12 [PATCH 00/12 v4][RESEND] cgroups: Task counter subsystem Frederic Weisbecker
  2011-09-06  0:12 ` [PATCH 01/12] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
  2011-09-06  0:12 ` [PATCH 02/12] cgroups: New resource counter inheritance API Frederic Weisbecker
@ 2011-09-06  0:12 ` Frederic Weisbecker
  2011-09-06  0:12 ` [PATCH 04/12] cgroups: New cancel_attach_task subsystem callback Frederic Weisbecker
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-06  0:12 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo

This is to prepare the integration of the new cgroup subsystem that
limits the max number of tasks per cgroup. We'll need to release
some resources from the previous cgroup.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 Documentation/cgroups/cgroups.txt |    6 ++++--
 block/blk-cgroup.c                |   10 ++++++----
 include/linux/cgroup.h            |    5 +++--
 kernel/cgroup.c                   |   10 ++++++----
 kernel/cgroup_freezer.c           |    3 ++-
 kernel/cpuset.c                   |    6 ++++--
 kernel/events/core.c              |    5 +++--
 kernel/sched.c                    |    6 ++++--
 8 files changed, 32 insertions(+), 19 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index cd67e90..0621e93 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -605,7 +605,8 @@ called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
 attach() or cancel_attach() will be called in future.
 
-int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+int can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+		    struct task_struct *tsk);
 (cgroup_mutex held by caller)
 
 As can_attach, but for operations that must be run once per task to be
@@ -635,7 +636,8 @@ void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 Called after the task has been attached to the cgroup, to allow any
 post-attachment activity that requires memory allocations or blocking.
 
-void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+void attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+		 struct task_struct *tsk);
 (cgroup_mutex held by caller)
 
 As attach, but for operations that must be run once per task to be attached,
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bcaf16e..d1bfe88 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,8 +30,8 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
 						  struct cgroup *);
-static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
-static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
+static int blkiocg_can_attach_task(struct cgroup *, struct cgroup *, struct task_struct *);
+static void blkiocg_attach_task(struct cgroup *, struct cgroup *, struct task_struct *);
 static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
 static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 
@@ -1614,7 +1614,8 @@ done:
  * of the main cic data structures.  For now we allow a task to change
  * its cgroup only if it's the only owner of its ioc.
  */
-static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static int blkiocg_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				   struct task_struct *tsk)
 {
 	struct io_context *ioc;
 	int ret = 0;
@@ -1629,7 +1630,8 @@ static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	return ret;
 }
 
-static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static void blkiocg_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				struct task_struct *tsk)
 {
 	struct io_context *ioc;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index da7e4bc..ecda4a0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -468,11 +468,12 @@ struct cgroup_subsys {
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			  struct task_struct *tsk);
-	int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
+	int (*can_attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			       struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			      struct task_struct *tsk);
 	void (*pre_attach)(struct cgroup *cgrp);
-	void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
+	void (*attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		       struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 84bdace..7a775c9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1844,7 +1844,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 			}
 		}
 		if (ss->can_attach_task) {
-			retval = ss->can_attach_task(cgrp, tsk);
+			retval = ss->can_attach_task(cgrp, oldcgrp, tsk);
 			if (retval) {
 				failed_ss = ss;
 				goto out;
@@ -1860,7 +1860,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		if (ss->pre_attach)
 			ss->pre_attach(cgrp);
 		if (ss->attach_task)
-			ss->attach_task(cgrp, tsk);
+			ss->attach_task(cgrp, oldcgrp, tsk);
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
@@ -2075,7 +2075,9 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			/* run on each task in the threadgroup. */
 			for (i = 0; i < group_size; i++) {
 				tsk = flex_array_get_ptr(group, i);
-				retval = ss->can_attach_task(cgrp, tsk);
+				oldcgrp = task_cgroup_from_root(tsk, root);
+
+				retval = ss->can_attach_task(cgrp, oldcgrp, tsk);
 				if (retval) {
 					failed_ss = ss;
 					cancel_failed_ss = true;
@@ -2141,7 +2143,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			/* attach each task to each subsystem */
 			for_each_subsys(root, ss) {
 				if (ss->attach_task)
-					ss->attach_task(cgrp, tsk);
+					ss->attach_task(cgrp, oldcgrp, tsk);
 			}
 		} else {
 			BUG_ON(retval != -ESRCH);
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e691818..c1421a1 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -175,7 +175,8 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	return 0;
 }
 
-static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static int freezer_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				   struct task_struct *tsk)
 {
 	rcu_read_lock();
 	if (__cgroup_freezing_or_frozen(tsk)) {
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 10131fd..427be38 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1390,7 +1390,8 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	return 0;
 }
 
-static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
+static int cpuset_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				  struct task_struct *task)
 {
 	return security_task_setscheduler(task);
 }
@@ -1418,7 +1419,8 @@ static void cpuset_pre_attach(struct cgroup *cont)
 }
 
 /* Per-thread attachment work. */
-static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
+static void cpuset_attach_task(struct cgroup *cont, struct cgroup *old,
+			       struct task_struct *tsk)
 {
 	int err;
 	struct cpuset *cs = cgroup_cs(cont);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b8785e2..509464e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7001,7 +7001,8 @@ static int __perf_cgroup_move(void *info)
 }
 
 static void
-perf_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *task)
+perf_cgroup_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			struct task_struct *task)
 {
 	task_function_call(task, __perf_cgroup_move, task);
 }
@@ -7017,7 +7018,7 @@ static void perf_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	if (!(task->flags & PF_EXITING))
 		return;
 
-	perf_cgroup_attach_task(cgrp, task);
+	perf_cgroup_attach_task(cgrp, old_cgrp, task);
 }
 
 struct cgroup_subsys perf_subsys = {
diff --git a/kernel/sched.c b/kernel/sched.c
index ccacdbd..72ce1b1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8967,7 +8967,8 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 }
 
 static int
-cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			   struct task_struct *tsk)
 {
 #ifdef CONFIG_RT_GROUP_SCHED
 	if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
@@ -8981,7 +8982,8 @@ cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 }
 
 static void
-cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+cpu_cgroup_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+		       struct task_struct *tsk)
 {
 	sched_move_task(tsk);
 }
-- 
1.7.5.4



* [PATCH 04/12] cgroups: New cancel_attach_task subsystem callback
  2011-09-06  0:12 [PATCH 00/12 v4][RESEND] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2011-09-06  0:12 ` [PATCH 03/12] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks Frederic Weisbecker
@ 2011-09-06  0:12 ` Frederic Weisbecker
  2011-09-06  0:12 ` [PATCH 05/12] cgroups: Ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-06  0:12 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo

To cancel a process attachment on a subsystem, we only call the
cancel_attach() callback once on the leader but we have no
way to cancel the attachment individually for each member of
the process group.

This is going to be needed for the max number of tasks subsystem
that is coming.

To prepare for this integration, call a new cancel_attach_task()
callback on each task of the group until we reach the member that
failed to attach.
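
The expected pairing is that whatever can_attach_task() charged for a
given task gets rolled back by cancel_attach_task() for that same
task. A rough sketch of such a pair, with made-up foo_* names and a
hypothetical cgroup_to_foo_res() helper (close to what the task
counter subsystem will do):

	static int foo_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
				       struct task_struct *tsk)
	{
		struct res_counter *fail_at;

		/* Reserve one unit in the destination cgroup */
		return res_counter_charge(cgroup_to_foo_res(cgrp), 1, &fail_at);
	}

	static void foo_cancel_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
	{
		/* Attachment aborted: give the reserved unit back */
		res_counter_uncharge(cgroup_to_foo_res(cgrp), 1);
	}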

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 Documentation/cgroups/cgroups.txt |    7 +++++++
 include/linux/cgroup.h            |    1 +
 kernel/cgroup.c                   |   23 +++++++++++++++++++----
 3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 0621e93..3bc1dc9 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -623,6 +623,13 @@ function, so that the subsystem can implement a rollback. If not, not necessary.
 This will be called only about subsystems whose can_attach() operation have
 succeeded.
 
+void cancel_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+(cgroup_mutex held by caller)
+
+As cancel_attach, but for operations that must be cancelled once per
+task that wanted to be attached. This typically reverts the effect of
+can_attach_task().
+
 void pre_attach(struct cgroup *cgrp);
 (cgroup_mutex held by caller)
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ecda4a0..be66470 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -472,6 +472,7 @@ struct cgroup_subsys {
 			       struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			      struct task_struct *tsk);
+	void (*cancel_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*pre_attach)(struct cgroup *cgrp);
 	void (*attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7a775c9..73ee2fd 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1883,6 +1883,9 @@ out:
 				 * remaining subsystems.
 				 */
 				break;
+
+			if (ss->cancel_attach_task)
+				ss->cancel_attach_task(cgrp, tsk);
 			if (ss->cancel_attach)
 				ss->cancel_attach(ss, cgrp, tsk);
 		}
@@ -1992,7 +1995,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 {
 	int retval, i, group_size;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
-	bool cancel_failed_ss = false;
+	struct task_struct *failed_task = NULL;
 	/* guaranteed to be initialized later, but the compiler needs this */
 	struct cgroup *oldcgrp = NULL;
 	struct css_set *oldcg;
@@ -2080,7 +2083,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 				retval = ss->can_attach_task(cgrp, oldcgrp, tsk);
 				if (retval) {
 					failed_ss = ss;
-					cancel_failed_ss = true;
+					failed_task = tsk;
 					goto out_cancel_attach;
 				}
 			}
@@ -2145,8 +2148,11 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 				if (ss->attach_task)
 					ss->attach_task(cgrp, oldcgrp, tsk);
 			}
+		} else if (retval == -ESRCH) {
+			if (ss->cancel_attach_task)
+				ss->cancel_attach_task(cgrp, tsk);
 		} else {
-			BUG_ON(retval != -ESRCH);
+			BUG_ON(1);
 		}
 	}
 	/* nothing is sensitive to fork() after this point. */
@@ -2178,8 +2184,17 @@ out_cancel_attach:
 	/* same deal as in cgroup_attach_task */
 	if (retval) {
 		for_each_subsys(root, ss) {
+			if (ss->cancel_attach_task && (ss != failed_ss || failed_task)) {
+				for (i = 0; i < group_size; i++) {
+					tsk = flex_array_get_ptr(group, i);
+					if (tsk == failed_task)
+						break;
+					ss->cancel_attach_task(cgrp, tsk);
+				}
+			}
+
 			if (ss == failed_ss) {
-				if (cancel_failed_ss && ss->cancel_attach)
+				if (failed_task && ss->cancel_attach)
 					ss->cancel_attach(ss, cgrp, leader);
 				break;
 			}
-- 
1.7.5.4



* [PATCH 05/12] cgroups: Ability to stop res charge propagation on bounded ancestor
  2011-09-06  0:12 [PATCH 00/12 v4][RESEND] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2011-09-06  0:12 ` [PATCH 04/12] cgroups: New cancel_attach_task subsystem callback Frederic Weisbecker
@ 2011-09-06  0:12 ` Frederic Weisbecker
  2011-09-06  0:13 ` [PATCH 06/12] cgroups: Add res counter common ancestor searching Frederic Weisbecker
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-06  0:12 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo

Moving a task from one cgroup to another may require subtracting
its resource charge from the old cgroup and adding it to the new one.

For this to happen, the uncharge/charge propagation can simply stop
when we reach the common ancestor of the two cgroups. Beyond the
performance reasons, we also want to avoid temporarily overloading
the common ancestors with an inaccurate resource counter usage if we
charge the new cgroup first and uncharge the old one thereafter. This
is going to be a requirement for the coming max number of tasks
subsystem.

To solve this, provide a pair of new APIs that can charge/uncharge
a resource counter until we reach a given ancestor.
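
As a rough illustration of the intended use, with made-up names and
assuming "common" is the common ancestor of the two counters (the
lookup helper for it is only introduced in the next patch):

	static int foo_move_charge(struct res_counter *res, struct res_counter *old_res,
				   struct res_counter *common)
	{
		struct res_counter *fail_at;
		int err;

		/* Charge the destination branch, stopping below the common ancestor */
		err = res_counter_charge_until(res, common, 1, &fail_at);
		if (err)
			return err;

		/* Uncharge the old branch up to the same point: ancestors are untouched */
		res_counter_uncharge_until(old_res, common, 1);
		return 0;
	}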

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/res_counter.h |   19 ++++++++++++++++---
 kernel/res_counter.c        |   13 ++++++++-----
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 109d118..95e7756 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -117,8 +117,16 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
 
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
-int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
+int __must_check res_counter_charge_until(struct res_counter *counter,
+					  struct res_counter *limit,
+					  unsigned long val,
+					  struct res_counter **limit_fail_at);
+static inline int __must_check
+res_counter_charge(struct res_counter *counter, unsigned long val,
+		   struct res_counter **limit_fail_at)
+{
+	return res_counter_charge_until(counter, NULL, val, limit_fail_at);
+}
 
 /*
  * uncharge - tell that some portion of the resource is released
@@ -131,7 +139,12 @@ int __must_check res_counter_charge(struct res_counter *counter,
  */
 
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val);
+void res_counter_uncharge_until(struct res_counter *counter, struct res_counter *limit,
+				unsigned long val);
+static inline void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+{
+	res_counter_uncharge_until(counter, NULL, val);
+}
 
 /**
  * res_counter_margin - calculate chargeable space of a counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 1349115..1185c86 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -35,8 +35,9 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 	return 0;
 }
 
-int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
+int res_counter_charge_until(struct res_counter *counter,
+			     struct res_counter *limit, unsigned long val,
+			     struct res_counter **limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
@@ -44,7 +45,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
 
 	*limit_fail_at = NULL;
 	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
+	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
 		spin_unlock(&c->lock);
@@ -74,13 +75,15 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
 	counter->usage -= val;
 }
 
-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *limit,
+				unsigned long val)
 {
 	unsigned long flags;
 	struct res_counter *c;
 
 	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
+	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		res_counter_uncharge_locked(c, val);
 		spin_unlock(&c->lock);
-- 
1.7.5.4



* [PATCH 06/12] cgroups: Add res counter common ancestor searching
  2011-09-06  0:12 [PATCH 00/12 v4][RESEND] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2011-09-06  0:12 ` [PATCH 05/12] cgroups: Ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
@ 2011-09-06  0:13 ` Frederic Weisbecker
  2011-09-06 22:21   ` Andrew Morton
  2011-09-06  0:13 ` [PATCH 07/12] res_counter: Allow charge failure pointer to be null Frederic Weisbecker
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-06  0:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo

Add a new API to find the common ancestor between two resource
counters. This includes the passed resource counters themselves.

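For example, in a hierarchy where B is a child of A and both C and D
are children of B, res_counter_common_ancestor(&C, &D) returns &B, and
res_counter_common_ancestor(&C, &B) also returns &B since the passed
counters are themselves candidates. Counters from unrelated hierarchies
have no common ancestor, so NULL is returned.
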
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   19 +++++++++++++++++++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 95e7756..f545548 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -146,6 +146,8 @@ static inline void res_counter_uncharge(struct res_counter *counter, unsigned lo
 	res_counter_uncharge_until(counter, NULL, val);
 }
 
+struct res_counter *res_counter_common_ancestor(struct res_counter *l, struct res_counter *r);
+
 /**
  * res_counter_margin - calculate chargeable space of a counter
  * @cnt: the counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 1185c86..67a9ae4 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -92,6 +92,25 @@ void res_counter_uncharge_until(struct res_counter *counter,
 }
 
 
+struct res_counter *
+res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
+{
+	struct res_counter *iter;
+
+	while (r1) {
+		iter = r2;
+		while (iter) {
+			if (iter == r1)
+				return iter;
+			iter = iter->parent;
+		}
+
+		r1 = r1->parent;
+	}
+
+	return NULL;
+}
+
 static inline unsigned long long *
 res_counter_member(struct res_counter *counter, int member)
 {
-- 
1.7.5.4



* [PATCH 07/12] res_counter: Allow charge failure pointer to be null
  2011-09-06  0:12 [PATCH 00/12 v4][RESEND] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2011-09-06  0:13 ` [PATCH 06/12] cgroups: Add res counter common ancestor searching Frederic Weisbecker
@ 2011-09-06  0:13 ` Frederic Weisbecker
  2011-09-06  0:13 ` [PATCH 08/12] cgroups: Pull up res counter charge failure interpretation to caller Frederic Weisbecker
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-06  0:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo

Allow the limit_fail_at pointer to be NULL so that callers of
res_counter_charge() don't have to create and pass it when they
aren't interested in it.

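Callers that don't care where the charge failed can then simply pass
NULL, e.g. (with "res" standing for whatever counter is being charged):

	err = res_counter_charge(res, 1, NULL);
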
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 kernel/res_counter.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 67a9ae4..4aaa790 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -43,14 +43,16 @@ int res_counter_charge_until(struct res_counter *counter,
 	unsigned long flags;
 	struct res_counter *c, *u;
 
-	*limit_fail_at = NULL;
+	if (limit_fail_at)
+		*limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
 		spin_unlock(&c->lock);
 		if (ret < 0) {
-			*limit_fail_at = c;
+			if (limit_fail_at)
+				*limit_fail_at = c;
 			goto undo;
 		}
 	}
-- 
1.7.5.4



* [PATCH 08/12] cgroups: Pull up res counter charge failure interpretation to caller
  2011-09-06  0:12 [PATCH 00/12 v4][RESEND] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2011-09-06  0:13 ` [PATCH 07/12] res_counter: Allow charge failure pointer to be null Frederic Weisbecker
@ 2011-09-06  0:13 ` Frederic Weisbecker
  2011-09-06 22:26   ` Andrew Morton
  2011-09-06  0:13 ` [PATCH 09/12] cgroups: Add a task counter subsystem Frederic Weisbecker
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-06  0:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo

res_counter_charge() always returns -ENOMEM when the limit is reached
and the charge thus can't happen.

However it's up to the caller to interpret this failure and return
the appropriate error value. The task counter subsystem will need
to report to the user that a fork() has been cancelled because some
limit has been reached, not because we are short on memory.

Fix this by returning -1 when res_counter_charge() fails.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 kernel/res_counter.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 4aaa790..45fa6fb 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -26,7 +26,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 {
 	if (counter->usage + val > counter->limit) {
 		counter->failcnt++;
-		return -ENOMEM;
+		return -1;
 	}
 
 	counter->usage += val;
-- 
1.7.5.4



* [PATCH 09/12] cgroups: Add a task counter subsystem
  2011-09-06  0:12 [PATCH 00/12 v4][RESEND] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (7 preceding siblings ...)
  2011-09-06  0:13 ` [PATCH 08/12] cgroups: Pull up res counter charge failure interpretation to caller Frederic Weisbecker
@ 2011-09-06  0:13 ` Frederic Weisbecker
  2011-09-06 22:40   ` Andrew Morton
  2011-09-06  0:13 ` [PATCH 10/12] cgroups: Add documentation for " Frederic Weisbecker
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-06  0:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo

Add a new subsystem to limit the number of running tasks,
similar to the RLIMIT_NPROC rlimit but in the scope of a cgroup.

This is a step toward isolating a cgroup a bit more from the rest
of the system and limiting the global impact of a fork bomb
inside a given cgroup.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h        |    9 ++
 include/linux/cgroup_subsys.h |    8 ++
 init/Kconfig                  |    7 ++
 kernel/Makefile               |    1 +
 kernel/cgroup_task_counter.c  |  199 +++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                 |    4 +
 6 files changed, 228 insertions(+), 0 deletions(-)
 create mode 100644 kernel/cgroup_task_counter.c

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index be66470..5e39341 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -659,4 +659,13 @@ static inline int cgroup_attach_task_current_cg(struct task_struct *t)
 
 #endif /* !CONFIG_CGROUPS */
 
+#ifdef CONFIG_CGROUP_TASK_COUNTER
+int cgroup_task_counter_fork(struct task_struct *child);
+#else
+static inline int cgroup_task_counter_fork(struct task_struct *child)
+{
+	return 0;
+}
+#endif /* CONFIG_CGROUP_TASK_COUNTER */
+
 #endif /* _LINUX_CGROUP_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ac663c1..5425822 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -59,8 +59,16 @@ SUBSYS(net_cls)
 SUBSYS(blkio)
 #endif
 
+/* */
+
 #ifdef CONFIG_CGROUP_PERF
 SUBSYS(perf)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_TASK_COUNTER
+SUBSYS(tasks)
+#endif
+
+/* */
diff --git a/init/Kconfig b/init/Kconfig
index d627783..c337ebd 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -690,6 +690,13 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
 	  select this option (if, for some reason, they need to disable it
 	  then swapaccount=0 does the trick).
 
+config CGROUP_TASK_COUNTER
+	bool "Control number of tasks in a cgroup"
+	depends on RESOURCE_COUNTERS
+	help
+	  Let the user set an upper bound on the number of tasks running
+	  in a cgroup.
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/kernel/Makefile b/kernel/Makefile
index eca595e..5598a7f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_TASK_COUNTER) += cgroup_task_counter.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_task_counter.c b/kernel/cgroup_task_counter.c
new file mode 100644
index 0000000..2ca7f41
--- /dev/null
+++ b/kernel/cgroup_task_counter.c
@@ -0,0 +1,199 @@
+/*
+ * Limits on number of tasks subsystem for cgroups
+ *
+ * Copyright (C) 2011 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
+ *
+ * Thanks to Andrew Morton, Johannes Weiner, Li Zefan, Oleg Nesterov and Paul Menage
+ * for their suggestions.
+ *
+ */
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/res_counter.h>
+
+
+struct task_counter {
+	struct res_counter		res;
+	struct cgroup_subsys_state	css;
+};
+
+/*
+ * The root cgroup doesn't use a task counter: it is left out of
+ * the whole task counting in order to optimize the trivial case
+ * where only the root cgroup exists.
+ */
+static struct cgroup_subsys_state root_css;
+
+
+static inline struct task_counter *cgroup_task_counter(struct cgroup *cgrp)
+{
+	if (!cgrp->parent)
+		return NULL;
+
+	return container_of(cgroup_subsys_state(cgrp, tasks_subsys_id),
+			    struct task_counter, css);
+}
+
+static inline struct res_counter *cgroup_task_counter_res(struct cgroup *cgrp)
+{
+	struct task_counter *cnt;
+
+	cnt = cgroup_task_counter(cgrp);
+	if (!cnt)
+		return NULL;
+
+	return &cnt->res;
+}
+
+static struct cgroup_subsys_state *
+task_counter_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct task_counter *cnt;
+	struct res_counter *parent_res;
+
+	if (!cgrp->parent)
+		return &root_css;
+
+	cnt = kzalloc(sizeof(*cnt), GFP_KERNEL);
+	if (!cnt)
+		return ERR_PTR(-ENOMEM);
+
+	parent_res = cgroup_task_counter_res(cgrp->parent);
+
+	res_counter_init(&cnt->res, parent_res);
+
+	return &cnt->css;
+}
+
+static void task_counter_post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	/* cgrp can't be root, so cgroup_task_counter_res() can't return NULL */
+	res_counter_inherit(cgroup_task_counter_res(cgrp), RES_LIMIT);
+}
+
+static void task_counter_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct task_counter *cnt = cgroup_task_counter(cgrp);
+
+	kfree(cnt);
+}
+
+static void task_counter_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			      struct cgroup *old_cgrp, struct task_struct *task)
+{
+	/* Optimize for the root cgroup case */
+	if (old_cgrp->parent)
+		res_counter_uncharge(cgroup_task_counter_res(old_cgrp), 1);
+}
+
+/* Shared between can_attach_task/attach_task/cancel_attach_task, protected by cgroup mutex */
+static struct res_counter *common_ancestor;
+
+static int task_counter_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+					struct task_struct *tsk)
+{
+	struct res_counter *res = cgroup_task_counter_res(cgrp);
+	struct res_counter *old_res = cgroup_task_counter_res(old_cgrp);
+	int err;
+
+	/*
+	 * When moving a task from a cgroup to another, we don't want
+	 * to charge the common ancestors, even though they will be
+	 * uncharged later from attach_task(), because during that
+	 * short window between charge and uncharge, a task could fork
+	 * in the ancestor and spuriously fail due to the temporary
+	 * charge.
+	 */
+	common_ancestor = res_counter_common_ancestor(res, old_res);
+
+	/*
+	 * If cgrp is the root then res is NULL, however in this case
+	 * the common ancestor is NULL as well, making the below a NOP.
+	 */
+	err = res_counter_charge_until(res, common_ancestor, 1, NULL);
+	if (err)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void task_counter_cancel_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+{
+	res_counter_uncharge_until(cgroup_task_counter_res(cgrp), common_ancestor, 1);
+}
+
+static void task_counter_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
+				     struct task_struct *tsk)
+{
+	res_counter_uncharge_until(cgroup_task_counter_res(old_cgrp), common_ancestor, 1);
+}
+
+static u64 task_counter_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	int type = cft->private;
+
+	return res_counter_read_u64(cgroup_task_counter_res(cgrp), type);
+}
+
+static int task_counter_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	int type = cft->private;
+
+	res_counter_write_u64(cgroup_task_counter_res(cgrp), type, val);
+
+	return 0;
+}
+
+static struct cftype files[] = {
+	{
+		.name		= "limit",
+		.read_u64	= task_counter_read_u64,
+		.write_u64	= task_counter_write_u64,
+		.private	= RES_LIMIT,
+	},
+
+	{
+		.name		= "usage",
+		.read_u64	= task_counter_read_u64,
+		.private	= RES_USAGE,
+	},
+};
+
+static int task_counter_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	if (!cgrp->parent)
+		return 0;
+
+	return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
+}
+
+int cgroup_task_counter_fork(struct task_struct *child)
+{
+	struct cgroup_subsys_state *css = child->cgroups->subsys[tasks_subsys_id];
+	struct cgroup *cgrp = css->cgroup;
+	int err;
+
+	/* Optimize for the root cgroup case, which doesn't have a limit */
+	if (!cgrp->parent)
+		return 0;
+
+	err = res_counter_charge(cgroup_task_counter_res(cgrp), 1, NULL);
+	if (err)
+		return -EAGAIN;
+
+	return 0;
+}
+
+struct cgroup_subsys tasks_subsys = {
+	.name			= "tasks",
+	.subsys_id		= tasks_subsys_id,
+	.create			= task_counter_create,
+	.post_clone		= task_counter_post_clone,
+	.destroy		= task_counter_destroy,
+	.exit			= task_counter_exit,
+	.can_attach_task	= task_counter_can_attach_task,
+	.cancel_attach_task	= task_counter_cancel_attach_task,
+	.attach_task		= task_counter_attach_task,
+	.populate		= task_counter_populate,
+};
diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..f716436 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1309,6 +1309,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->group_leader = p;
 	INIT_LIST_HEAD(&p->thread_group);
 
+	retval = cgroup_task_counter_fork(p);
+	if (retval)
+		goto bad_fork_free_pid;
+
 	/* Now that the task is set up, run cgroup callbacks if
 	 * necessary. We need to run them before the task is visible
 	 * on the tasklist. */
-- 
1.7.5.4



* [PATCH 10/12] cgroups: Add documentation for task counter subsystem
  2011-09-06  0:12 [PATCH 00/12 v4][RESEND] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (8 preceding siblings ...)
  2011-09-06  0:13 ` [PATCH 09/12] cgroups: Add a task counter subsystem Frederic Weisbecker
@ 2011-09-06  0:13 ` Frederic Weisbecker
  2011-09-06 22:41   ` Andrew Morton
  2011-09-06  0:13 ` [RFC PATCH 11/12] cgroups: Allow subsystems to cancel a fork Frederic Weisbecker
  2011-09-06  0:13 ` [RFC PATCH 12/12] cgroups: Convert task counter to use the subsys fork callback Frederic Weisbecker
  11 siblings, 1 reply; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-06  0:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 Documentation/cgroups/task_counter.txt |  126 ++++++++++++++++++++++++++++++++
 1 files changed, 126 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/task_counter.txt

diff --git a/Documentation/cgroups/task_counter.txt b/Documentation/cgroups/task_counter.txt
new file mode 100644
index 0000000..e93760a
--- /dev/null
+++ b/Documentation/cgroups/task_counter.txt
@@ -0,0 +1,126 @@
+Task counter subsystem
+
+1. Description
+
+The task counter subsystem limits the number of tasks running
+inside a given cgroup. It behaves like the RLIMIT_NPROC rlimit but in
+the scope of a cgroup instead of a user.
+
+It has two typical use cases, although more can probably be found:
+
+- Protect against fork bombs that explode inside a container when
+that container is implemented using a cgroup. The RLIMIT_NPROC rlimit
+is not effective for that because if we have several containers
+running in parallel under the same user, one container could starve
+all the others by spawning a high number of tasks close to the
+rlimit boundary. So in this case we need this limitation to be
+done with per-cgroup granularity.
+
+- Kill all tasks inside a cgroup without races. By setting the limit
+of running tasks to 0, one can prevent any further fork inside a
+cgroup and then kill all of its tasks without the need to retry an
+unbounded number of times due to races between kills and forks running
+in parallel (more details in the "Kill a cgroup safely" paragraph).
+
+
+2. Interface
+
+When a hierarchy is mounted with the task counter subsystem bound, it
+adds two files into the cgroup directories, except the root one:
+
+- tasks.usage contains the number of tasks running inside a cgroup and
+its children in the hierarchy (see paragraph about Inheritance).
+
+- tasks.limit contains the maximum number of tasks that can run inside
+a cgroup. We check this limit when a task forks or when it is migrated
+to a cgroup.
+
+Note that the tasks.limit value can be forced below tasks.usage, in which
+case any new task in the cgroup will be rejected until the tasks.usage
+value goes below tasks.limit.
+
+For optimization reasons, the root directory of a hierarchy doesn't have
+a task counter.
+
+
+3. Inheritance
+
+When a task is added to a cgroup, by way of a cgroup migration or a fork,
+it increases the task counter of that cgroup and of all its ancestors.
+Hence a cgroup is also subject to the limit of its ancestors.
+
+In the following hierarchy:
+
+
+             A
+             |
+             B
+           /   \
+          C     D
+
+
+We have 1 task running in B, one running in C and none running in D.
+It means we have tasks.usage = 1 in C and tasks.usage = 2 in B because
+B counts its task and those of its children.
+
+Now let's set tasks.limit = 2 in B and tasks.limit = 1 in D.
+If we move a new task into D, it will be refused because the limit in B
+has been reached already.
+
+
+4. Kill a cgroup safely
+
+As explained in the description, this subsystem is also helpful to
+kill all tasks in a cgroup safely, after setting tasks.limit to 0,
+so that we don't race against parallel forks in an unbounded number
+of kill iterations.
+
+But there is a small detail to be aware of when using this feature
+that way.
+
+A typical way to proceed would be:
+
+	echo 0 > tasks.limit
+	for TASK in $(cat cgroup.procs)
+	do
+		kill -KILL $TASK
+	done
+
+However there is a small race window where a task can be in the middle
+of being forked but hasn't completed the fork enough for its PID to
+appear in the cgroup.procs file.
+
+The only way to get it right is to run a loop that reads tasks.usage, kills
+all the tasks in cgroup.procs and exits the loop only if the value in
+tasks.usage was the same as the number of tasks that were in cgroup.procs,
+i.e. the number of tasks that were killed.
+
+It works because the new child appears in tasks.usage right before we check,
+in the fork path, whether the parent has a pending signal, in which case the
+fork is cancelled anyway. So relying on tasks.usage is fine and non-racy.
+
+This race window is tiny and unlikely to happen, so most of the time a single
+kill iteration should be enough. But it's worth knowing about that corner
+case spotted by Oleg Nesterov.
+
+Example of safe use would be:
+
+	echo 0 > tasks.limit
+	END=false
+
+	while [ $END == false ]
+	do
+		NR_TASKS=$(cat tasks.usage)
+		NR_KILLED=0
+
+		for TASK in $(cat cgroup.procs)
+		do
+			let NR_KILLED=NR_KILLED+1
+			kill -KILL $TASK
+		done
+
+		if [ "$NR_TASKS" = "$NR_KILLED" ]
+		then
+			END=true
+		fi
+	done
-- 
1.7.5.4



* [RFC PATCH 11/12] cgroups: Allow subsystems to cancel a fork
  2011-09-06  0:12 [PATCH 00/12 v4][RESEND] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (9 preceding siblings ...)
  2011-09-06  0:13 ` [PATCH 10/12] cgroups: Add documentation for " Frederic Weisbecker
@ 2011-09-06  0:13 ` Frederic Weisbecker
  2011-09-15 21:09   ` Andrew Morton
  2011-09-06  0:13 ` [RFC PATCH 12/12] cgroups: Convert task counter to use the subsys fork callback Frederic Weisbecker
  11 siblings, 1 reply; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-06  0:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo

Let a subsystem's fork callback return an error value so
that it can cancel a fork. This is going to be used by
the task counter subsystem to implement the limit.
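
A subsystem can then refuse the fork from its callback, roughly along
these lines (made-up foo_* names; the actual conversion of the task
counter is done in the next patch):

	static int foo_fork(struct cgroup_subsys *ss, struct task_struct *task)
	{
		if (foo_limit_reached(task))
			return -EAGAIN;	/* the fork is then aborted in copy_process() */

		return 0;
	}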

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h  |   14 +++++++++-----
 kernel/cgroup.c         |   23 +++++++++++++++++++----
 kernel/cgroup_freezer.c |    6 ++++--
 kernel/exit.c           |    2 +-
 kernel/fork.c           |    7 +++++--
 5 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5e39341..253543b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -32,9 +32,11 @@ extern int cgroup_lock_is_held(void);
 extern bool cgroup_lock_live_group(struct cgroup *cgrp);
 extern void cgroup_unlock(void);
 extern void cgroup_fork(struct task_struct *p);
-extern void cgroup_fork_callbacks(struct task_struct *p);
+extern int cgroup_fork_callbacks(struct task_struct *p,
+				 struct cgroup_subsys **failed_ss);
 extern void cgroup_post_fork(struct task_struct *p);
-extern void cgroup_exit(struct task_struct *p, int run_callbacks);
+extern void cgroup_exit(struct task_struct *p, int run_callbacks,
+			struct cgroup_subsys *failed_ss);
 extern int cgroupstats_build(struct cgroupstats *stats,
 				struct dentry *dentry);
 extern int cgroup_load_subsys(struct cgroup_subsys *ss);
@@ -477,7 +479,7 @@ struct cgroup_subsys {
 	void (*attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		       struct cgroup *old_cgrp, struct task_struct *tsk);
-	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
+	int (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			struct cgroup *old_cgrp, struct task_struct *task);
 	int (*populate)(struct cgroup_subsys *ss,
@@ -634,9 +636,11 @@ struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_fork(struct task_struct *p) {}
-static inline void cgroup_fork_callbacks(struct task_struct *p) {}
+static inline int cgroup_fork_callbacks(struct task_struct *p,
+					struct cgroup_subsys **failed_ss) {}
 static inline void cgroup_post_fork(struct task_struct *p) {}
-static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
+static inline void cgroup_exit(struct task_struct *p, int callbacks,
+			       struct cgroup_subsys **failed_ss) {}
 
 static inline void cgroup_lock(void) {}
 static inline void cgroup_unlock(void) {}
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 73ee2fd..c99e9cc 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4538,8 +4538,11 @@ void cgroup_fork(struct task_struct *child)
  * tasklist. No need to take any locks since no-one can
  * be operating on this task.
  */
-void cgroup_fork_callbacks(struct task_struct *child)
+int cgroup_fork_callbacks(struct task_struct *child,
+			  struct cgroup_subsys **failed_ss)
 {
+	int err;
+
 	if (need_forkexit_callback) {
 		int i;
 		/*
@@ -4549,10 +4552,17 @@ void cgroup_fork_callbacks(struct task_struct *child)
 		 */
 		for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
 			struct cgroup_subsys *ss = subsys[i];
-			if (ss->fork)
-				ss->fork(ss, child);
+			if (ss->fork) {
+				err = ss->fork(ss, child);
+				if (err) {
+					*failed_ss = ss;
+					return err;
+				}
+			}
 		}
 	}
+
+	return 0;
 }
 
 /**
@@ -4610,7 +4620,8 @@ void cgroup_post_fork(struct task_struct *child)
  *    which wards off any cgroup_attach_task() attempts, or task is a failed
  *    fork, never visible to cgroup_attach_task.
  */
-void cgroup_exit(struct task_struct *tsk, int run_callbacks)
+void cgroup_exit(struct task_struct *tsk, int run_callbacks,
+		 struct cgroup_subsys *failed_ss)
 {
 	struct css_set *cg;
 	int i;
@@ -4639,6 +4650,10 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
 		 */
 		for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
 			struct cgroup_subsys *ss = subsys[i];
+
+			if (ss == failed_ss)
+				break;
+
 			if (ss->exit) {
 				struct cgroup *old_cgrp =
 					rcu_dereference_raw(cg->subsys[i])->cgroup;
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index c1421a1..30a4332 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -187,7 +187,7 @@ static int freezer_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
 	return 0;
 }
 
-static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
+static int freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 {
 	struct freezer *freezer;
 
@@ -207,7 +207,7 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 	 * following check.
 	 */
 	if (!freezer->css.cgroup->parent)
-		return;
+		return 0;
 
 	spin_lock_irq(&freezer->lock);
 	BUG_ON(freezer->state == CGROUP_FROZEN);
@@ -216,6 +216,8 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 	if (freezer->state == CGROUP_FREEZING)
 		freeze_task(task, true);
 	spin_unlock_irq(&freezer->lock);
+
+	return 0;
 }
 
 /*
diff --git a/kernel/exit.c b/kernel/exit.c
index 2913b35..abab32b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -994,7 +994,7 @@ NORET_TYPE void do_exit(long code)
 	 */
 	perf_event_exit_task(tsk);
 
-	cgroup_exit(tsk, 1);
+	cgroup_exit(tsk, 1, NULL);
 
 	if (group_dead)
 		disassociate_ctty(1);
diff --git a/kernel/fork.c b/kernel/fork.c
index f716436..989c5a0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1058,6 +1058,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	struct cgroup_subsys *cgroup_failed_ss = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1316,8 +1317,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	/* Now that the task is set up, run cgroup callbacks if
 	 * necessary. We need to run them before the task is visible
 	 * on the tasklist. */
-	cgroup_fork_callbacks(p);
+	retval = cgroup_fork_callbacks(p, &cgroup_failed_ss);
 	cgroup_callbacks_done = 1;
+	if (retval)
+		goto bad_fork_free_pid;
 
 	/* Need tasklist lock for parent etc handling! */
 	write_lock_irq(&tasklist_lock);
@@ -1423,7 +1426,7 @@ bad_fork_cleanup_cgroup:
 #endif
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_fork_read_unlock(current);
-	cgroup_exit(p, cgroup_callbacks_done);
+	cgroup_exit(p, cgroup_callbacks_done, cgroup_failed_ss);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);
 bad_fork_cleanup_count:
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 12/12] cgroups: Convert task counter to use the subsys fork callback
  2011-09-06  0:12 [PATCH 00/12 v4][RESEND] cgroups: Task counter subsystem Frederic Weisbecker
                   ` (10 preceding siblings ...)
  2011-09-06  0:13 ` [RFC PATCH 11/12] cgroups: Allow subsystems to cancel a fork Frederic Weisbecker
@ 2011-09-06  0:13 ` Frederic Weisbecker
  11 siblings, 0 replies; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-06  0:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo

Instead of using an ad hoc hook on the fork path for the task
counter, use the subsystem fork callback, which can now
cancel the fork.
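
For reference, the fork callback is expected to do roughly the following
(a sketch reconstructed from the series description, not copied from
patch 09/12; the helper names and the exact errno may differ):

	static int task_counter_fork(struct cgroup_subsys *ss,
				     struct task_struct *child)
	{
		struct cgroup_subsys_state *css;
		struct cgroup *cgrp;

		css = child->cgroups->subsys[tasks_subsys_id];
		cgrp = css->cgroup;

		/* The root cgroup is not accounted */
		if (!cgrp->parent)
			return 0;

		/* Charge one task; a failure makes copy_process() abort */
		if (res_counter_charge(cgroup_task_counter_res(cgrp), 1, NULL))
			return -EAGAIN;

		return 0;
	}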

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h       |    9 ---------
 kernel/cgroup_task_counter.c |    3 ++-
 kernel/fork.c                |    4 ----
 3 files changed, 2 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 253543b..7ba9a38 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -663,13 +663,4 @@ static inline int cgroup_attach_task_current_cg(struct task_struct *t)
 
 #endif /* !CONFIG_CGROUPS */
 
-#ifdef CONFIG_CGROUP_TASK_COUNTER
-int cgroup_task_counter_fork(struct task_struct *child);
-#else
-static inline int cgroup_task_counter_fork(struct task_struct *child)
-{
-	return 0;
-}
-#endif /* CONFIG_CGROUP_TASK_COUNTER */
-
 #endif /* _LINUX_CGROUP_H */
diff --git a/kernel/cgroup_task_counter.c b/kernel/cgroup_task_counter.c
index 2ca7f41..dbdb625 100644
--- a/kernel/cgroup_task_counter.c
+++ b/kernel/cgroup_task_counter.c
@@ -168,7 +168,7 @@ static int task_counter_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
 	return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
 }
 
-int cgroup_task_counter_fork(struct task_struct *child)
+static int task_counter_fork(struct cgroup_subsys *ss, struct task_struct *child)
 {
 	struct cgroup_subsys_state *css = child->cgroups->subsys[tasks_subsys_id];
 	struct cgroup *cgrp = css->cgroup;
@@ -195,5 +195,6 @@ struct cgroup_subsys tasks_subsys = {
 	.can_attach_task	= task_counter_can_attach_task,
 	.cancel_attach_task	= task_counter_cancel_attach_task,
 	.attach_task		= task_counter_attach_task,
+	.fork			= task_counter_fork,
 	.populate		= task_counter_populate,
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 989c5a0..ee8abdb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1310,10 +1310,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->group_leader = p;
 	INIT_LIST_HEAD(&p->thread_group);
 
-	retval = cgroup_task_counter_fork(p);
-	if (retval)
-		goto bad_fork_free_pid;
-
 	/* Now that the task is set up, run cgroup callbacks if
 	 * necessary. We need to run them before the task is visible
 	 * on the tasklist. */
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/12] cgroups: New resource counter inheritance API
  2011-09-06  0:12 ` [PATCH 02/12] cgroups: New resource counter inheritance API Frederic Weisbecker
@ 2011-09-06 22:17   ` Andrew Morton
  2011-09-08 13:25     ` Frederic Weisbecker
  0 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2011-09-06 22:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Tue,  6 Sep 2011 02:12:56 +0200
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> +void res_counter_inherit(struct res_counter *counter, int member)
> +{
> +	struct res_counter *parent;
> +	unsigned long long val;
> +
> +	parent = counter->parent;
> +	if (parent) {
> +		val = res_counter_read_u64(parent, member);
> +		res_counter_write_u64(counter, member, val);
> +	}
> +}

What locking protects the read-modify-write?  If none, please fix,
otherwise please document it.

All of kernel/res_counter.c is charmingly undocumented.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 06/12] cgroups: Add res counter common ancestor searching
  2011-09-06  0:13 ` [PATCH 06/12] cgroups: Add res counter common ancestor searching Frederic Weisbecker
@ 2011-09-06 22:21   ` Andrew Morton
  2011-09-09 12:31     ` Frederic Weisbecker
  0 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2011-09-06 22:21 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Tue,  6 Sep 2011 02:13:00 +0200
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> +struct res_counter *
> +res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
> +{
> +	struct res_counter *iter;
> +
> +	while (r1) {
> +		iter = r2;
> +		while (iter) {
> +			if (iter == r1)
> +				return iter;
> +			iter = iter->parent;
> +		}
> +
> +		r1 = r1->parent;
> +	}
> +
> +	return NULL;
> +}

cgroup_mutex, one assumes?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 08/12] cgroups: Pull up res counter charge failure interpretation to caller
  2011-09-06  0:13 ` [PATCH 08/12] cgroups: Pull up res counter charge failure interpretation to caller Frederic Weisbecker
@ 2011-09-06 22:26   ` Andrew Morton
  2011-09-09 13:33     ` Frederic Weisbecker
  0 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2011-09-06 22:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Tue,  6 Sep 2011 02:13:02 +0200
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> res_counter_charge() always returns -ENOMEM when the limit is reached
> and the charge thus can't happen.
> 
> However it's up to the caller to interpret this failure and return
> the appropriate error value. The task counter subsystem will need
> to report the user that a fork() has been cancelled because of some
> limit reached, not because we are too short on memory.
> 
> Fix this by returning -1 when res_counter_charge() fails.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Aditya Kali <adityakali@google.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: Tim Hockin <thockin@hockin.org>
> Cc: Tejun Heo <tj@kernel.org>
> ---
>  kernel/res_counter.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 4aaa790..45fa6fb 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -26,7 +26,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
>  {
>  	if (counter->usage + val > counter->limit) {
>  		counter->failcnt++;
> -		return -ENOMEM;
> +		return -1;
>  	}
>  
>  	counter->usage += val;

This also affects the return value of your new and undocumented
res_counter_charge_until().

That's a bit of a hand-grenade which could lead to system calls
returning -1 (ie: EPERM) to userspace.
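
For instance, if a caller blindly forwarded that value out of a system
call, userspace would see something like this (sketch of the failure
mode, assuming the -1 escapes as-is):

	pid_t pid = fork();
	/* pid == -1 with errno == 1 (EPERM) rather than a meaningful code */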


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 09/12] cgroups: Add a task counter subsystem
  2011-09-06  0:13 ` [PATCH 09/12] cgroups: Add a task counter subsystem Frederic Weisbecker
@ 2011-09-06 22:40   ` Andrew Morton
  2011-09-13 15:13     ` Frederic Weisbecker
  0 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2011-09-06 22:40 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Tue,  6 Sep 2011 02:13:03 +0200
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> Add a new subsystem to limit the number of running tasks,
> similar to the NR_PROC rlimit but in the scope of a cgroup.
> 
> This is a step to be able to isolate a bit more a cgroup against
> the rest of the system and limit the global impact of a fork bomb
> inside a given cgroup.

It would be nice to show some testing results for the putative
forkbomb-control feature.

>
> ...
>
> +config CGROUP_TASK_COUNTER
> +	bool "Control number of tasks in a cgroup"
> +	depends on RESOURCE_COUNTERS
> +	help
> +	  Let the user set up an upper bound allowed number of tasks running
> +	  in a cgroup.

"of the allowed"?

Perhaps this help section could be fleshed out somewhat.

>
> ...
>
> @@ -0,0 +1,199 @@
> +/*
> + * Limits on number of tasks subsystem for cgroups
> + *
> + * Copyright (C) 2011 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
> + *
> + * Thanks to Andrew Morton, Johannes Weiner, Li Zefan, Oleg Nesterov and Paul Menage
> + * for their suggestions.

80 cols, please.  (checkpatch!)

> + *
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/slab.h>
> +#include <linux/res_counter.h>
> +
> +
> +struct task_counter {
> +	struct res_counter		res;
> +	struct cgroup_subsys_state	css;
> +};
> +
> +/*
> + * The root task counter doesn't exist as it's not part of the
> + * whole task counting in order to optimize the trivial case
> + * of only one root cgroup living.

That sentence is rather hard to follow.

> + */
> +static struct cgroup_subsys_state root_css;
> +
> +
> +static inline struct task_counter *cgroup_task_counter(struct cgroup *cgrp)
> +{
> +	if (!cgrp->parent)
> +		return NULL;
> +
> +	return container_of(cgroup_subsys_state(cgrp, tasks_subsys_id),
> +			    struct task_counter, css);
> +}
> +
> +static inline struct res_counter *cgroup_task_counter_res(struct cgroup *cgrp)

"cgroup_res_counter" would be a more symmetrical name.  Or perhaps
cgroup_task_res_counter.  Swapping the "counter" and "res" seems odd.

> +{
> +	struct task_counter *cnt;
> +
> +	cnt = cgroup_task_counter(cgrp);
> +	if (!cnt)
> +		return NULL;
> +
> +	return &cnt->res;
> +}
> +
>
> ...
>
> +/* Protected amongst can_attach_task/attach_task/cancel_attach_task by cgroup mutex */

/*
 * Protected amongst can_attach_task/attach_task/cancel_attach_task by cgroup
 * mutex
 */

(checkpatch)

>
> ...
>
> +static int task_counter_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> +					struct task_struct *tsk)
> +{
> +	struct res_counter *res = cgroup_task_counter_res(cgrp);
> +	struct res_counter *old_res = cgroup_task_counter_res(old_cgrp);
> +	int err;
> +
> +	/*
> +	 * When moving a task from a cgroup to another, we don't want
> +	 * to charge the common ancestors, even though they will be
> +	 * uncharged later from attach_task(), because during that
> +	 * short window between charge and uncharge, a task could fork
> +	 * in the ancestor and spuriously fail due to the temporary
> +	 * charge.
> +	 */
> +	common_ancestor = res_counter_common_ancestor(res, old_res);
> +
> +	/*
> +	 * If cgrp is the root then res is NULL, however in this case
> +	 * the common ancestor is NULL as well, making the below a NOP.
> +	 */
> +	err = res_counter_charge_until(res, common_ancestor, 1, NULL);
> +	if (err)
> +		return -EINVAL;
> +
> +	return 0;
> +}

One would expect a "can"-named function to return a boolean.  This one
returns an errno which is OK, I guess.  But the function is misnamed
because if successful it actually alters charges.  A less misleading
name would be simply task_counter_attach_task(), but that's already
taken.  Or perhaps task_counter_try_attach_task(), but that seems
unnecessary to me - many many kernel functions "try" something and back
out with an errno if it failed.

I really do dislike the fact that the documentation is over in another
file and another patch.  For a start, it makes review harder and
slower.

>
> ...
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 10/12] cgroups: Add documentation for task counter subsystem
  2011-09-06  0:13 ` [PATCH 10/12] cgroups: Add documentation for " Frederic Weisbecker
@ 2011-09-06 22:41   ` Andrew Morton
  2011-09-13 17:35     ` Frederic Weisbecker
  0 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2011-09-06 22:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

Oh.  And I now see that the documentation file provides no documentation
for the code in cgroup_task_counter.c anyway.  Really?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/12] cgroups: New resource counter inheritance API
  2011-09-06 22:17   ` Andrew Morton
@ 2011-09-08 13:25     ` Frederic Weisbecker
  0 siblings, 0 replies; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-08 13:25 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin

On Tue, Sep 06, 2011 at 03:17:47PM -0700, Andrew Morton wrote:
> On Tue,  6 Sep 2011 02:12:56 +0200
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > +void res_counter_inherit(struct res_counter *counter, int member)
> > +{
> > +	struct res_counter *parent;
> > +	unsigned long long val;
> > +
> > +	parent = counter->parent;
> > +	if (parent) {
> > +		val = res_counter_read_u64(parent, member);
> > +		res_counter_write_u64(counter, member, val);
> > +	}
> > +}
> 
> What locking protects the read-modify-write?  If none, please fix,
> otherwise please document it.
> 
> All of kernel/res_counter.c is charmingly undocumented.

How should I address your reviews now that you've applied these
patches? Would you prefer me to send new patches on top of -mm?

Also Tejun has patches that conflict with mine. But I believe
-mm is on top of -next, so perhaps I should rebase those patches
when his branch reaches -next? Or something else?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 06/12] cgroups: Add res counter common ancestor searching
  2011-09-06 22:21   ` Andrew Morton
@ 2011-09-09 12:31     ` Frederic Weisbecker
  0 siblings, 0 replies; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-09 12:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Tue, Sep 06, 2011 at 03:21:29PM -0700, Andrew Morton wrote:
> On Tue,  6 Sep 2011 02:13:00 +0200
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > +struct res_counter *
> > +res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
> > +{
> > +	struct res_counter *iter;
> > +
> > +	while (r1) {
> > +		iter = r2;
> > +		while (iter) {
> > +			if (iter == r1)
> > +				return iter;
> > +			iter = iter->parent;
> > +		}
> > +
> > +		r1 = r1->parent;
> > +	}
> > +
> > +	return NULL;
> > +}
> 
> cgroup_mutex, one assumes?

No, cgroups can't move to other parents; they remain stable
as long as they have children (if I understood correctly).
The same stability carries over to their res_counters.

In fact res_counters are just a library, pretty much like
lists, except that they are only used by cgroups for now, so it's up
to the caller to guarantee the stability of its objects.

But yeah the documentation for this function is missing. I'm
adding some comments in the file.
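
For illustration, the calling convention looks like this (mirroring the
existing task counter attach path quoted earlier in this thread, not new
code): both counters belong to cgroups that the caller keeps pinned, so
the ->parent chains are stable and no lock is taken around the walk:

	common_ancestor = res_counter_common_ancestor(res, old_res);
	err = res_counter_charge_until(res, common_ancestor, 1, NULL);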

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 08/12] cgroups: Pull up res counter charge failure interpretation to caller
  2011-09-06 22:26   ` Andrew Morton
@ 2011-09-09 13:33     ` Frederic Weisbecker
  2011-09-09 15:17       ` Andrew Morton
  0 siblings, 1 reply; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-09 13:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Tue, Sep 06, 2011 at 03:26:50PM -0700, Andrew Morton wrote:
> On Tue,  6 Sep 2011 02:13:02 +0200
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > res_counter_charge() always returns -ENOMEM when the limit is reached
> > and the charge thus can't happen.
> > 
> > However it's up to the caller to interpret this failure and return
> > the appropriate error value. The task counter subsystem will need
> > to report the user that a fork() has been cancelled because of some
> > limit reached, not because we are too short on memory.
> > 
> > Fix this by returning -1 when res_counter_charge() fails.
> > 
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Paul Menage <paul@paulmenage.org>
> > Cc: Li Zefan <lizf@cn.fujitsu.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Aditya Kali <adityakali@google.com>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Kay Sievers <kay.sievers@vrfy.org>
> > Cc: Tim Hockin <thockin@hockin.org>
> > Cc: Tejun Heo <tj@kernel.org>
> > ---
> >  kernel/res_counter.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > index 4aaa790..45fa6fb 100644
> > --- a/kernel/res_counter.c
> > +++ b/kernel/res_counter.c
> > @@ -26,7 +26,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> >  {
> >  	if (counter->usage + val > counter->limit) {
> >  		counter->failcnt++;
> > -		return -ENOMEM;
> > +		return -1;
> >  	}
> >  
> >  	counter->usage += val;
> 
> This also affects the return value of your new and undocumented
> res_counter_charge_until().
> 
> That's a bit of a hand-grenade which could lead to system calls
> returning -1 (ie: EPERM) to userspace.

Right. What about making it a boolean?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 08/12] cgroups: Pull up res counter charge failure interpretation to caller
  2011-09-09 13:33     ` Frederic Weisbecker
@ 2011-09-09 15:17       ` Andrew Morton
  0 siblings, 0 replies; 26+ messages in thread
From: Andrew Morton @ 2011-09-09 15:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Fri, 9 Sep 2011 15:33:20 +0200 Frederic Weisbecker <fweisbec@gmail.com> wrote:

> On Tue, Sep 06, 2011 at 03:26:50PM -0700, Andrew Morton wrote:
> > On Tue,  6 Sep 2011 02:13:02 +0200
> > Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > 
> > > res_counter_charge() always returns -ENOMEM when the limit is reached
> > > and the charge thus can't happen.
> > > 
> > > However it's up to the caller to interpret this failure and return
> > > the appropriate error value. The task counter subsystem will need
> > > to report the user that a fork() has been cancelled because of some
> > > limit reached, not because we are too short on memory.
> > > 
> > > Fix this by returning -1 when res_counter_charge() fails.
> > > 
> > > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > > Cc: Paul Menage <paul@paulmenage.org>
> > > Cc: Li Zefan <lizf@cn.fujitsu.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: Aditya Kali <adityakali@google.com>
> > > Cc: Oleg Nesterov <oleg@redhat.com>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Kay Sievers <kay.sievers@vrfy.org>
> > > Cc: Tim Hockin <thockin@hockin.org>
> > > Cc: Tejun Heo <tj@kernel.org>
> > > ---
> > >  kernel/res_counter.c |    2 +-
> > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > > index 4aaa790..45fa6fb 100644
> > > --- a/kernel/res_counter.c
> > > +++ b/kernel/res_counter.c
> > > @@ -26,7 +26,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> > >  {
> > >  	if (counter->usage + val > counter->limit) {
> > >  		counter->failcnt++;
> > > -		return -ENOMEM;
> > > +		return -1;
> > >  	}
> > >  
> > >  	counter->usage += val;
> > 
> > This also affects the return value of your new and undocumented
> > res_counter_charge_until().
> > 
> > That's a bit of a hand-grenade which could lead to system calls
> > returning -1 (ie: EPERM) to userspace.
> 
> Right. What about making it a boolean?

mmm, not sure.  0/-1 is a reasonable return value for a function which
either did or didn't succeed.  Adding appropriate interface
documentation is a way of reducing the opportunity for making this mistake.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 09/12] cgroups: Add a task counter subsystem
  2011-09-06 22:40   ` Andrew Morton
@ 2011-09-13 15:13     ` Frederic Weisbecker
  0 siblings, 0 replies; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-13 15:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Tue, Sep 06, 2011 at 03:40:09PM -0700, Andrew Morton wrote:
> On Tue,  6 Sep 2011 02:13:03 +0200
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > Add a new subsystem to limit the number of running tasks,
> > similar to the NR_PROC rlimit but in the scope of a cgroup.
> > 
> > This is a step to be able to isolate a bit more a cgroup against
> > the rest of the system and limit the global impact of a fork bomb
> > inside a given cgroup.
> 
> It would be nice to show some testing results for the putative
> forkbomb-control feature.

I'm uploading a selftest tool that I've been using to ensure it behaves
as expected. A forkbomb test is included.
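
Roughly, the forkbomb part of it boils down to something like this
(sketch only, not the actual tool; it assumes the program is started
from a cgroup whose limit file, presumably tasks.limit alongside the
tasks.usage file shown in the documentation, has been set beforehand):

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		for (;;) {
			pid_t pid = fork();

			if (pid < 0) {
				/* the task counter limit stopped the bomb */
				perror("fork");
				return 0;
			}
			if (pid == 0)
				pause();	/* children just stick around */
		}
	}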
 
> >
> > ...
> >
> > +config CGROUP_TASK_COUNTER
> > +	bool "Control number of tasks in a cgroup"
> > +	depends on RESOURCE_COUNTERS
> > +	help
> > +	  Let the user set up an upper bound allowed number of tasks running
> > +	  in a cgroup.
> 
> "of the allowed"?
> 
> Perhaps this help section could be fleshed out somewhat.

I've fixed and expanded it a bit for the v5.

> 
> >
> > ...
> >
> > @@ -0,0 +1,199 @@
> > +/*
> > + * Limits on number of tasks subsystem for cgroups
> > + *
> > + * Copyright (C) 2011 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
> > + *
> > + * Thanks to Andrew Morton, Johannes Weiner, Li Zefan, Oleg Nesterov and Paul Menage
> > + * for their suggestions.
> 
> 80 cols, please.  (checkpatch!)

Fixed in v5.

> 
> > + *
> > + */
> > +
> > +#include <linux/cgroup.h>
> > +#include <linux/slab.h>
> > +#include <linux/res_counter.h>
> > +
> > +
> > +struct task_counter {
> > +	struct res_counter		res;
> > +	struct cgroup_subsys_state	css;
> > +};
> > +
> > +/*
> > + * The root task counter doesn't exist as it's not part of the
> > + * whole task counting in order to optimize the trivial case
> > + * of only one root cgroup living.
> 
> That sentence is rather hard to follow.

I've fixed it too. I mean I tried something...

> > + */
> > +static struct cgroup_subsys_state root_css;
> > +
> > +
> > +static inline struct task_counter *cgroup_task_counter(struct cgroup *cgrp)
> > +{
> > +	if (!cgrp->parent)
> > +		return NULL;
> > +
> > +	return container_of(cgroup_subsys_state(cgrp, tasks_subsys_id),
> > +			    struct task_counter, css);
> > +}
> > +
> > +static inline struct res_counter *cgroup_task_counter_res(struct cgroup *cgrp)
> 
> "cgroup_res_counter" would be a more symmetrical name.  Or perhaps
> cgroup_task_res_counter.  Swapping the "counter" and "res" seems odd.

Indeed; fixed.

> > +{
> > +	struct task_counter *cnt;
> > +
> > +	cnt = cgroup_task_counter(cgrp);
> > +	if (!cnt)
> > +		return NULL;
> > +
> > +	return &cnt->res;
> > +}
> > +
> >
> > ...
> >
> > +/* Protected amongst can_attach_task/attach_task/cancel_attach_task by cgroup mutex */
> 
> /*
>  * Protected amongst can_attach_task/attach_task/cancel_attach_task by cgroup
>  * mutex
>  */
> 
> (checkpatch)

Fixed

> >
> > ...
> >
> > +static int task_counter_can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
> > +					struct task_struct *tsk)
> > +{
> > +	struct res_counter *res = cgroup_task_counter_res(cgrp);
> > +	struct res_counter *old_res = cgroup_task_counter_res(old_cgrp);
> > +	int err;
> > +
> > +	/*
> > +	 * When moving a task from a cgroup to another, we don't want
> > +	 * to charge the common ancestors, even though they will be
> > +	 * uncharged later from attach_task(), because during that
> > +	 * short window between charge and uncharge, a task could fork
> > +	 * in the ancestor and spuriously fail due to the temporary
> > +	 * charge.
> > +	 */
> > +	common_ancestor = res_counter_common_ancestor(res, old_res);
> > +
> > +	/*
> > +	 * If cgrp is the root then res is NULL, however in this case
> > +	 * the common ancestor is NULL as well, making the below a NOP.
> > +	 */
> > +	err = res_counter_charge_until(res, common_ancestor, 1, NULL);
> > +	if (err)
> > +		return -EINVAL;
> > +
> > +	return 0;
> > +}
> 
> One would expect a "can"-named function to return a boolean.  This one
> returns an errno which is OK, I guess.  But the function is misnamed
> because if successful it actually alters charges.  A less misleading
> name would be simply task_counter_attach_task(), but that's already
> taken.  Or perhaps task_counter_try_attach_task(), but that seems
> unnecessary to me - many many kernel functions "try" something and back
> out with an errno if it failed.

Yeah, the ->can_attach_task() cgroup subsystem callbacks are more than
just things that passively report whether something is possible or not.
They are allowed to have side effects, which can be rolled back from the
->cancel_attach() callbacks.

Maybe they could be renamed to pre_attach() one day. ->pre_attach()
callbacks already exist but are targeted for removal in Tejun's patches.
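
In the task counter case the pairing looks roughly like this (sketch;
res_counter_uncharge_until() is assumed here to be the uncharge
counterpart of res_counter_charge_until(), and the callback bodies are
abbreviated):

	/* ->can_attach_task(): side effect, charge the destination */
	err = res_counter_charge_until(res, common_ancestor, 1, NULL);

	/* ->cancel_attach_task(): roll the charge back if attachment aborts */
	res_counter_uncharge_until(res, common_ancestor, 1);

	/* ->attach_task(): on success, uncharge the old hierarchy instead */
	res_counter_uncharge_until(old_res, common_ancestor, 1);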

> 
> I really do dislike the fact that the documentation is over in another
> file and another patch.  For a start, it makes review harder and
> slower.

Ok, I've merged the doc in that patch.

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 10/12] cgroups: Add documentation for task counter subsystem
  2011-09-06 22:41   ` Andrew Morton
@ 2011-09-13 17:35     ` Frederic Weisbecker
  0 siblings, 0 replies; 26+ messages in thread
From: Frederic Weisbecker @ 2011-09-13 17:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Tue, Sep 06, 2011 at 03:41:31PM -0700, Andrew Morton wrote:
> Oh.  And I now see that the documentation file provides no documentation
> for the code in cgroup_task_counter.c anyway.  Really?

I've added more comments in that file for the v5, especially for
the can_attach - attach - cancel_attach sequence, which is a bit
tricky.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 11/12] cgroups: Allow subsystems to cancel a fork
  2011-09-06  0:13 ` [RFC PATCH 11/12] cgroups: Allow subsystems to cancel a fork Frederic Weisbecker
@ 2011-09-15 21:09   ` Andrew Morton
  2011-10-01 15:29     ` Frederic Weisbecker
  0 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2011-09-15 21:09 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Tue,  6 Sep 2011 02:13:05 +0200
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> Let the subsystem's fork callback return an error value so
> that it can cancel a fork. This is going to be used by
> the task counter subsystem to implement the limit.

This breaks the build quite gruesomely with CONFIG_CGROUPS=n.  I looked
at doing a quick fix but that would just have piled ugly on ugly.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 11/12] cgroups: Allow subsystems to cancel a fork
  2011-09-15 21:09   ` Andrew Morton
@ 2011-10-01 15:29     ` Frederic Weisbecker
  0 siblings, 0 replies; 26+ messages in thread
From: Frederic Weisbecker @ 2011-10-01 15:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo

On Thu, Sep 15, 2011 at 02:09:41PM -0700, Andrew Morton wrote:
> On Tue,  6 Sep 2011 02:13:05 +0200
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > Let the subsystem's fork callback return an error value so
> > that it can cancel a fork. This is going to be used by
> > the task counter subsystem to implement the limit.
> 
> This breaks the build quite gruesomely with CONFIG_CGROUPS=n.  I looked
> at doing a quick fix but that would just have piled ugly on ugly.
> 

Sorry about that. I'm fixing the whole series so that it builds with or
without cgroups.
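
For the record, the CONFIG_CGROUPS=n stubs in include/linux/cgroup.h
need to return a value and to match the real prototypes, roughly like
this (sketch of the direction for v5, not the final patch):

	static inline int cgroup_fork_callbacks(struct task_struct *p,
						struct cgroup_subsys **failed_ss)
	{
		return 0;
	}

	static inline void cgroup_exit(struct task_struct *p, int callbacks,
				       struct cgroup_subsys *failed_ss) {}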

I'm also improving the documentation and changelog along the way.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2011-10-01 15:29 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-06  0:12 [PATCH 00/12 v4][RESEND] cgroups: Task counter subsystem Frederic Weisbecker
2011-09-06  0:12 ` [PATCH 01/12] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
2011-09-06  0:12 ` [PATCH 02/12] cgroups: New resource counter inheritance API Frederic Weisbecker
2011-09-06 22:17   ` Andrew Morton
2011-09-08 13:25     ` Frederic Weisbecker
2011-09-06  0:12 ` [PATCH 03/12] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks Frederic Weisbecker
2011-09-06  0:12 ` [PATCH 04/12] cgroups: New cancel_attach_task subsystem callback Frederic Weisbecker
2011-09-06  0:12 ` [PATCH 05/12] cgroups: Ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
2011-09-06  0:13 ` [PATCH 06/12] cgroups: Add res counter common ancestor searching Frederic Weisbecker
2011-09-06 22:21   ` Andrew Morton
2011-09-09 12:31     ` Frederic Weisbecker
2011-09-06  0:13 ` [PATCH 07/12] res_counter: Allow charge failure pointer to be null Frederic Weisbecker
2011-09-06  0:13 ` [PATCH 08/12] cgroups: Pull up res counter charge failure interpretation to caller Frederic Weisbecker
2011-09-06 22:26   ` Andrew Morton
2011-09-09 13:33     ` Frederic Weisbecker
2011-09-09 15:17       ` Andrew Morton
2011-09-06  0:13 ` [PATCH 09/12] cgroups: Add a task counter subsystem Frederic Weisbecker
2011-09-06 22:40   ` Andrew Morton
2011-09-13 15:13     ` Frederic Weisbecker
2011-09-06  0:13 ` [PATCH 10/12] cgroups: Add documentation for " Frederic Weisbecker
2011-09-06 22:41   ` Andrew Morton
2011-09-13 17:35     ` Frederic Weisbecker
2011-09-06  0:13 ` [RFC PATCH 11/12] cgroups: Allow subsystems to cancel a fork Frederic Weisbecker
2011-09-15 21:09   ` Andrew Morton
2011-10-01 15:29     ` Frederic Weisbecker
2011-09-06  0:13 ` [RFC PATCH 12/12] cgroups: Convert task counter to use the subsys fork callback Frederic Weisbecker
