linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/10] cgroups: Task counter subsystem v8
@ 2012-02-01  3:37 Frederic Weisbecker
  2012-02-01  3:37 ` [PATCH 01/10] cgroups: add res_counter_write_u64() API Frederic Weisbecker
                   ` (11 more replies)
  0 siblings, 12 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-01  3:37 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo, Li Zefan
  Cc: LKML, Frederic Weisbecker, Kirill A. Shutemov, Paul Menage,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Tim Hockin,
	Containers, Glauber Costa, Cgroups, Daniel J Walsh,
	Daniel P. Berrange, KAMEZAWA Hiroyuki, Max Kellermann,
	Mandeep Singh Baines

Hi,

Changes in this version:

- Split 32/64 bits version of res_counter_write_u64() [1/10]
  Courtesy of Kirill A. Shutemov

- Added Kirill's ack [8/10]

- Added selftests [9/10], [10/10]

Please consider for merging. At least two users want this feature:
https://lkml.org/lkml/2011/12/13/309
https://lkml.org/lkml/2011/12/13/364

More general details are provided in the previous version's posting:
https://lkml.org/lkml/2012/1/13/230

Thanks!


Frederic Weisbecker (9):
  cgroups: add res_counter_write_u64() API
  cgroups: new resource counter inheritance API
  cgroups: ability to stop res charge propagation on bounded ancestor
  res_counter: allow charge failure pointer to be null
  cgroups: pull up res counter charge failure interpretation to caller
  cgroups: allow subsystems to cancel a fork
  cgroups: Add a task counter subsystem
  selftests: Enter each directory before executing selftests
  selftests: Add a new task counter selftest

Kirill A. Shutemov (1):
  cgroups: add res counter common ancestor searching

 Documentation/cgroups/resource_counter.txt         |   20 ++-
 Documentation/cgroups/task_counter.txt             |  153 +++++++++++
 include/linux/cgroup.h                             |   20 +-
 include/linux/cgroup_subsys.h                      |    5 +
 include/linux/res_counter.h                        |   27 ++-
 init/Kconfig                                       |    9 +
 kernel/Makefile                                    |    1 +
 kernel/cgroup.c                                    |   23 ++-
 kernel/cgroup_freezer.c                            |    6 +-
 kernel/cgroup_task_counter.c                       |  272 ++++++++++++++++++++
 kernel/exit.c                                      |    2 +-
 kernel/fork.c                                      |    7 +-
 kernel/res_counter.c                               |  103 +++++++-
 tools/testing/selftests/Makefile                   |    2 +-
 tools/testing/selftests/run_tests                  |    6 +-
 tools/testing/selftests/task_counter/Makefile      |    8 +
 tools/testing/selftests/task_counter/fork.c        |   40 +++
 tools/testing/selftests/task_counter/forkbomb.c    |   40 +++
 tools/testing/selftests/task_counter/multithread.c |   68 +++++
 tools/testing/selftests/task_counter/run_test      |  198 ++++++++++++++
 .../selftests/task_counter/spread_thread_group.c   |   82 ++++++
 21 files changed, 1056 insertions(+), 36 deletions(-)
 create mode 100644 Documentation/cgroups/task_counter.txt
 create mode 100644 kernel/cgroup_task_counter.c
 create mode 100644 tools/testing/selftests/task_counter/Makefile
 create mode 100644 tools/testing/selftests/task_counter/fork.c
 create mode 100644 tools/testing/selftests/task_counter/forkbomb.c
 create mode 100644 tools/testing/selftests/task_counter/multithread.c
 create mode 100755 tools/testing/selftests/task_counter/run_test
 create mode 100644 tools/testing/selftests/task_counter/spread_thread_group.c

-- 
1.7.5.4


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 01/10] cgroups: add res_counter_write_u64() API
  2012-02-01  3:37 [PATCH 00/10] cgroups: Task counter subsystem v8 Frederic Weisbecker
@ 2012-02-01  3:37 ` Frederic Weisbecker
  2012-02-02 12:33   ` Kirill A. Shutemov
  2012-02-01  3:37 ` [PATCH 02/10] cgroups: new resource counter inheritance API Frederic Weisbecker
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-01  3:37 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo, Li Zefan
  Cc: LKML, Frederic Weisbecker, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Tim Hockin, Tejun Heo, Containers, Glauber Costa,
	Cgroups, Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines, Kirill A. Shutemov

Extend the resource counter API with a mirror of res_counter_read_u64() to
make it easy to update a resource counter value from a cgroup subsystem's
u64 value file.

[kirill@shutemov.name: Separate 32 bits and 64 bits versions]
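As an illustration of the intended semantics, here is a userspace sketch (a
mock, not the kernel code; the member ids follow the kernel's RES_* naming).
On 32-bit, a 64-bit store can tear, which is why the real implementation
takes counter->lock there; the mock below leaves locking out.

```c
#include <assert.h>
#include <stddef.h>

/* Mock of the res_counter members (illustration only). */
enum { RES_USAGE, RES_LIMIT, RES_FAILCNT };

struct res_counter {
	unsigned long long usage;
	unsigned long long limit;
	unsigned long long failcnt;
};

/* Mirrors res_counter_member(): map a member id to the matching field. */
static unsigned long long *
res_counter_member(struct res_counter *counter, int member)
{
	switch (member) {
	case RES_USAGE:
		return &counter->usage;
	case RES_LIMIT:
		return &counter->limit;
	default:
		return &counter->failcnt;
	}
}

/* Lockless sketch of res_counter_write_u64(); the real 32-bit kernel
 * version wraps the store in spin_lock_irqsave(&counter->lock, flags). */
void res_counter_write_u64(struct res_counter *counter, int member,
			   unsigned long long val)
{
	*res_counter_member(counter, member) = val;
}
```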

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Containers <containers@lists.linux-foundation.org>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Cgroups <cgroups@vger.kernel.org>
Cc: Daniel J Walsh <dwalsh@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Max Kellermann <mk@cm4all.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   31 +++++++++++++++++++++++++------
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..1b3fe05 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -82,6 +82,8 @@ int res_counter_memparse_write_strategy(const char *buf,
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buffer, write_strategy_fn write_strategy);
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 6d269cc..c46465c 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -167,12 +167,32 @@ int res_counter_memparse_write_strategy(const char *buf,
 	return 0;
 }
 
+#if BITS_PER_LONG == 32
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	unsigned long long *target;
+	unsigned long flags;
+
+	spin_lock_irqsave(&counter->lock, flags);
+	target = res_counter_member(counter, member);
+	*target = val;
+	spin_unlock_irqrestore(&counter->lock, flags);
+}
+#else
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	unsigned long long *target;
+
+	target = res_counter_member(counter, member);
+	*target = val;
+}
+#endif
+
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buf, write_strategy_fn write_strategy)
 {
 	char *end;
-	unsigned long flags;
-	unsigned long long tmp, *val;
+	unsigned long long tmp;
 
 	if (write_strategy) {
 		if (write_strategy(buf, &tmp))
@@ -182,9 +202,8 @@ int res_counter_write(struct res_counter *counter, int member,
 		if (*end != '\0')
 			return -EINVAL;
 	}
-	spin_lock_irqsave(&counter->lock, flags);
-	val = res_counter_member(counter, member);
-	*val = tmp;
-	spin_unlock_irqrestore(&counter->lock, flags);
+
+	res_counter_write_u64(counter, member, tmp);
+
 	return 0;
 }
-- 
1.7.5.4



* [PATCH 02/10] cgroups: new resource counter inheritance API
  2012-02-01  3:37 [PATCH 00/10] cgroups: Task counter subsystem v8 Frederic Weisbecker
  2012-02-01  3:37 ` [PATCH 01/10] cgroups: add res_counter_write_u64() API Frederic Weisbecker
@ 2012-02-01  3:37 ` Frederic Weisbecker
  2012-02-01  3:37 ` [PATCH 03/10] cgroups: ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-01  3:37 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo, Li Zefan
  Cc: LKML, Frederic Weisbecker, Paul Menage, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Tim Hockin, Tejun Heo, Containers,
	Glauber Costa, Cgroups, Daniel J Walsh, Daniel P. Berrange,
	KAMEZAWA Hiroyuki, Max Kellermann, Mandeep Singh Baines

Provide an API to inherit a counter value from a parent.  This can be
useful for implementing cgroup.clone_children on a resource counter.

The children's resources remain limited by those of the parent, so this
only provides a default setting behaviour when clone_children is set.
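A userspace sketch of that default-setting behaviour (a mock: the real API
takes a member id as in the rest of the res_counter API, while the mock
hardcodes the limit field):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal mock of a parented res_counter (illustration only). */
struct res_counter {
	unsigned long long limit;
	struct res_counter *parent;
};

/* Sketch of the proposed res_counter_inherit(): a child created with
 * clone_children set starts out with its parent's value. It is only a
 * default -- the child stays constrained by the parent's limit anyway. */
void res_counter_inherit(struct res_counter *counter)
{
	if (counter->parent)
		counter->limit = counter->parent->limit;
}
```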

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Containers <containers@lists.linux-foundation.org>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Cgroups <cgroups@vger.kernel.org>
Cc: Daniel J Walsh <dwalsh@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Max Kellermann <mk@cm4all.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   18 ++++++++++++++++++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 1b3fe05..109d118 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -84,6 +84,8 @@ int res_counter_write(struct res_counter *counter, int member,
 
 void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
 
+void res_counter_inherit(struct res_counter *counter, int member);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index c46465c..3a93a82 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -207,3 +207,21 @@ int res_counter_write(struct res_counter *counter, int member,
 
 	return 0;
 }
+
+/*
+ * Simple inheritance implementation to get the same value
+ * as the parent. This doesn't enforce the child value to
+ * always stay below the parent's, but the child is subject
+ * to its parent's limitation anyway.
+ */
+void res_counter_inherit(struct res_counter *counter, int member)
+{
+	struct res_counter *parent;
+	unsigned long long val;
+
+	parent = counter->parent;
+	if (parent) {
+		val = res_counter_read_u64(parent, member);
+		res_counter_write_u64(counter, member, val);
+	}
+}
-- 
1.7.5.4



* [PATCH 03/10] cgroups: ability to stop res charge propagation on bounded ancestor
  2012-02-01  3:37 [PATCH 00/10] cgroups: Task counter subsystem v8 Frederic Weisbecker
  2012-02-01  3:37 ` [PATCH 01/10] cgroups: add res_counter_write_u64() API Frederic Weisbecker
  2012-02-01  3:37 ` [PATCH 02/10] cgroups: new resource counter inheritance API Frederic Weisbecker
@ 2012-02-01  3:37 ` Frederic Weisbecker
  2012-02-01  3:37 ` [PATCH 04/10] cgroups: add res counter common ancestor searching Frederic Weisbecker
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-01  3:37 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo, Li Zefan
  Cc: LKML, Frederic Weisbecker, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Tim Hockin, Tejun Heo, Containers, Glauber Costa,
	Cgroups, Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines

Moving a task from one cgroup to another may require subtracting its
resource charge from the old cgroup and adding it to the new one.

For this to happen, the uncharge/charge propagation can simply stop when we
reach the common ancestor of the two cgroups.  Beyond the performance
benefit, we also want to avoid temporarily overloading the common
ancestors with an inaccurate resource counter usage if we charge the new
cgroup first and uncharge the old one afterwards.  This is going to be a
requirement for the upcoming task counter subsystem.

To solve this, provide a pair of new APIs that can charge/uncharge a
resource counter until a given ancestor is reached.
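The stop-at-ancestor semantics can be sketched in userspace C (a lockless
mock of the hierarchy walk; the kernel version additionally takes each
counter's spinlock and disables irqs):

```c
#include <assert.h>
#include <stddef.h>

/* Mock res_counter hierarchy (illustration only, no locking). */
struct res_counter {
	unsigned long long usage;
	unsigned long long limit;
	struct res_counter *parent;
};

/* Sketch of res_counter_charge_until(): charge every level from
 * counter up to, but not including, the "limit" ancestor. Passing
 * limit == NULL charges the whole hierarchy, i.e. the old
 * res_counter_charge() behaviour. On failure, roll back the levels
 * already charged. */
int res_counter_charge_until(struct res_counter *counter,
			     struct res_counter *limit, unsigned long val)
{
	struct res_counter *c, *u;

	for (c = counter; c != limit; c = c->parent) {
		if (c->usage + val > c->limit)
			goto undo;
		c->usage += val;
	}
	return 0;
undo:
	for (u = counter; u != c; u = u->parent)
		u->usage -= val;
	return -1;
}
```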

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Containers <containers@lists.linux-foundation.org>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Cgroups <cgroups@vger.kernel.org>
Cc: Daniel J Walsh <dwalsh@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Max Kellermann <mk@cm4all.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 Documentation/cgroups/resource_counter.txt |   18 +++++++++++++++++-
 include/linux/res_counter.h                |   20 +++++++++++++++++---
 kernel/res_counter.c                       |   13 ++++++++-----
 3 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..a2cd05b 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -83,7 +83,15 @@ to work with it.
 	res_counter->lock internally (it must be called with res_counter->lock
 	held).
 
- e. void res_counter_uncharge[_locked]
+ e. int res_counter_charge_until(struct res_counter *counter,
+			     struct res_counter *limit, unsigned long val,
+			     struct res_counter **limit_fail_at)
+
+	The same as res_counter_charge(), but the charge propagation to
+	the hierarchy stops at the limit given in the "limit" parameter.
+
+
+ f. void res_counter_uncharge[_locked]
 			(struct res_counter *rc, unsigned long val)
 
 	When a resource is released (freed) it should be de-accounted
@@ -92,6 +100,14 @@ to work with it.
 
 	The _locked routines imply that the res_counter->lock is taken.
 
+
+ g. void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *limit,
+				unsigned long val)
+
+	The same as res_counter_uncharge(), but the uncharge propagation
+	to the hierarchy stops at the limit given in the "limit" parameter.
+
  2.1 Other accounting routines
 
     There are more routines that may help you with common needs, like
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 109d118..de4ba29 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -117,8 +117,16 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
 
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
-int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
+int __must_check res_counter_charge_until(struct res_counter *counter,
+					  struct res_counter *limit,
+					  unsigned long val,
+					  struct res_counter **limit_fail_at);
+static inline int __must_check
+res_counter_charge(struct res_counter *counter, unsigned long val,
+		   struct res_counter **limit_fail_at)
+{
+	return res_counter_charge_until(counter, NULL, val, limit_fail_at);
+}
 
 /*
  * uncharge - tell that some portion of the resource is released
@@ -131,7 +139,13 @@ int __must_check res_counter_charge(struct res_counter *counter,
  */
 
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val);
+void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *limit, unsigned long val);
+static inline void res_counter_uncharge(struct res_counter *counter,
+					unsigned long val)
+{
+	res_counter_uncharge_until(counter, NULL, val);
+}
 
 /**
  * res_counter_margin - calculate chargeable space of a counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 3a93a82..40f15aa 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -35,8 +35,9 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 	return 0;
 }
 
-int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
+int res_counter_charge_until(struct res_counter *counter,
+			     struct res_counter *limit, unsigned long val,
+			     struct res_counter **limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
@@ -44,7 +45,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
 
 	*limit_fail_at = NULL;
 	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
+	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
 		spin_unlock(&c->lock);
@@ -74,13 +75,15 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
 	counter->usage -= val;
 }
 
-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *limit,
+				unsigned long val)
 {
 	unsigned long flags;
 	struct res_counter *c;
 
 	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
+	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		res_counter_uncharge_locked(c, val);
 		spin_unlock(&c->lock);
-- 
1.7.5.4



* [PATCH 04/10] cgroups: add res counter common ancestor searching
  2012-02-01  3:37 [PATCH 00/10] cgroups: Task counter subsystem v8 Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2012-02-01  3:37 ` [PATCH 03/10] cgroups: ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
@ 2012-02-01  3:37 ` Frederic Weisbecker
  2012-02-01  3:37 ` [PATCH 05/10] res_counter: allow charge failure pointer to be null Frederic Weisbecker
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-01  3:37 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo, Li Zefan
  Cc: LKML, Kirill A. Shutemov, Frederic Weisbecker, Paul Menage,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Tim Hockin,
	Tejun Heo, Containers, Glauber Costa, Cgroups, Daniel J Walsh,
	Daniel P. Berrange, KAMEZAWA Hiroyuki, Max Kellermann,
	Mandeep Singh Baines

From: "Kirill A. Shutemov" <kirill@shutemov.name>

Add a new API to find the common ancestor of two resource counters.
This includes the passed resource counters themselves.
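Since the walk is pure pointer chasing, the algorithm from the patch can be
exercised verbatim in userspace (struct reduced to its parent link):

```c
#include <assert.h>
#include <stddef.h>

struct res_counter {
	struct res_counter *parent;
};

/* Same depth-equalization walk as the patch: measure both depths, walk
 * the deeper side up until the depths match, then climb both sides in
 * lockstep until they meet (or both reach NULL for disjoint trees). */
struct res_counter *
res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
{
	struct res_counter *iter;
	int r1_depth = 0, r2_depth = 0;

	for (iter = r1; iter; iter = iter->parent)
		r1_depth++;
	for (iter = r2; iter; iter = iter->parent)
		r2_depth++;

	while (r1_depth > r2_depth) {
		r1 = r1->parent;
		r1_depth--;
	}
	while (r2_depth > r1_depth) {
		r2 = r2->parent;
		r2_depth--;
	}
	while (r1 != r2) {
		r1 = r1->parent;
		r2 = r2->parent;
	}
	return r1;
}
```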

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Containers <containers@lists.linux-foundation.org>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Cgroups <cgroups@vger.kernel.org>
Cc: Daniel J Walsh <dwalsh@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Max Kellermann <mk@cm4all.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/res_counter.h |    3 +++
 kernel/res_counter.c        |   33 +++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index de4ba29..558f39b 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -147,6 +147,9 @@ static inline void res_counter_uncharge(struct res_counter *counter,
 	res_counter_uncharge_until(counter, NULL, val);
 }
 
+struct res_counter *res_counter_common_ancestor(struct res_counter *l,
+						struct res_counter *r);
+
 /**
  * res_counter_margin - calculate chargeable space of a counter
  * @cnt: the counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 40f15aa..6dc6164 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -91,6 +91,39 @@ void res_counter_uncharge_until(struct res_counter *counter,
 	local_irq_restore(flags);
 }
 
+/*
+ * Walk up the parents of r1 and r2 to find the closest common
+ * ancestor of the two. If none is found, return NULL.
+ */
+struct res_counter *
+res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
+{
+	struct res_counter *iter;
+	int r1_depth = 0, r2_depth = 0;
+
+	for (iter = r1; iter; iter = iter->parent)
+		r1_depth++;
+
+	for (iter = r2; iter; iter = iter->parent)
+		r2_depth++;
+
+	while (r1_depth > r2_depth) {
+		r1 = r1->parent;
+		r1_depth--;
+	}
+
+	while (r2_depth > r1_depth) {
+		r2 = r2->parent;
+		r2_depth--;
+	}
+
+	while (r1 != r2) {
+		r1 = r1->parent;
+		r2 = r2->parent;
+	}
+
+	return r1;
+}
 
 static inline unsigned long long *
 res_counter_member(struct res_counter *counter, int member)
-- 
1.7.5.4



* [PATCH 05/10] res_counter: allow charge failure pointer to be null
  2012-02-01  3:37 [PATCH 00/10] cgroups: Task counter subsystem v8 Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2012-02-01  3:37 ` [PATCH 04/10] cgroups: add res counter common ancestor searching Frederic Weisbecker
@ 2012-02-01  3:37 ` Frederic Weisbecker
  2012-02-01  3:37 ` [PATCH 06/10] cgroups: pull up res counter charge failure interpretation to caller Frederic Weisbecker
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-01  3:37 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo, Li Zefan
  Cc: LKML, Frederic Weisbecker, Paul Menage, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Tim Hockin, Tejun Heo, Containers,
	Glauber Costa, Cgroups, Daniel J Walsh, Daniel P. Berrange,
	KAMEZAWA Hiroyuki, Max Kellermann, Mandeep Singh Baines

Allow the limit_fail_at pointer to be NULL so that callers of
res_counter_charge() don't have to create and pass it when they aren't
interested in the failing counter.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Containers <containers@lists.linux-foundation.org>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Cgroups <cgroups@vger.kernel.org>
Cc: Daniel J Walsh <dwalsh@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Max Kellermann <mk@cm4all.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 kernel/res_counter.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 6dc6164..d2fc4cd 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -43,14 +43,16 @@ int res_counter_charge_until(struct res_counter *counter,
 	unsigned long flags;
 	struct res_counter *c, *u;
 
-	*limit_fail_at = NULL;
+	if (limit_fail_at)
+		*limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
 		spin_unlock(&c->lock);
 		if (ret < 0) {
-			*limit_fail_at = c;
+			if (limit_fail_at)
+				*limit_fail_at = c;
 			goto undo;
 		}
 	}
-- 
1.7.5.4



* [PATCH 06/10] cgroups: pull up res counter charge failure interpretation to caller
  2012-02-01  3:37 [PATCH 00/10] cgroups: Task counter subsystem v8 Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2012-02-01  3:37 ` [PATCH 05/10] res_counter: allow charge failure pointer to be null Frederic Weisbecker
@ 2012-02-01  3:37 ` Frederic Weisbecker
  2012-02-01  3:37 ` [PATCH 07/10] cgroups: allow subsystems to cancel a fork Frederic Weisbecker
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-01  3:37 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo, Li Zefan
  Cc: LKML, Frederic Weisbecker, Paul Menage, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Tim Hockin, Tejun Heo, Containers,
	Glauber Costa, Cgroups, Daniel J Walsh, Daniel P. Berrange,
	KAMEZAWA Hiroyuki, Max Kellermann, Mandeep Singh Baines

res_counter_charge() always returns -ENOMEM when the limit is reached and
the charge thus can't happen.

However it's up to the caller to interpret this failure and return the
appropriate error value.  The task counter subsystem will need to report
to the user that a fork() has been cancelled because some limit was
reached, not because we are short on memory.

Fix this by returning -1 when res_counter_charge() fails.
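A sketch of the resulting division of labour (userspace mock;
task_counter_fork() is a hypothetical caller illustrating the changelog's
rationale, not code from this series):

```c
#include <assert.h>
#include <errno.h>

/* Mock counter (illustration only). */
struct res_counter {
	unsigned long long usage, limit, failcnt;
};

/* After this patch the charge path reports a bare -1 ... */
int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
{
	if (counter->usage + val > counter->limit) {
		counter->failcnt++;
		return -1;
	}
	counter->usage += val;
	return 0;
}

/* ... so a caller like a fork hook can pick the errno that makes sense
 * for it: a refused fork is EAGAIN, not ENOMEM. */
int task_counter_fork(struct res_counter *counter)
{
	if (res_counter_charge_locked(counter, 1) < 0)
		return -EAGAIN;
	return 0;
}
```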

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Containers <containers@lists.linux-foundation.org>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Cgroups <cgroups@vger.kernel.org>
Cc: Daniel J Walsh <dwalsh@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Max Kellermann <mk@cm4all.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 Documentation/cgroups/resource_counter.txt |    2 ++
 kernel/res_counter.c                       |    2 +-
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index a2cd05b..24ec61c 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -76,6 +76,8 @@ to work with it.
 	limit_fail_at parameter is set to the particular res_counter element
 	where the charging failed.
 
+	It returns 0 on success and -1 on failure.
+
  d. int res_counter_charge_locked
 			(struct res_counter *rc, unsigned long val)
 
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index d2fc4cd..78cc444 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -26,7 +26,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 {
 	if (counter->usage + val > counter->limit) {
 		counter->failcnt++;
-		return -ENOMEM;
+		return -1;
 	}
 
 	counter->usage += val;
-- 
1.7.5.4



* [PATCH 07/10] cgroups: allow subsystems to cancel a fork
  2012-02-01  3:37 [PATCH 00/10] cgroups: Task counter subsystem v8 Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2012-02-01  3:37 ` [PATCH 06/10] cgroups: pull up res counter charge failure interpretation to caller Frederic Weisbecker
@ 2012-02-01  3:37 ` Frederic Weisbecker
  2012-02-01  3:37 ` [PATCH 08/10] cgroups: Add a task counter subsystem Frederic Weisbecker
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-01  3:37 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo, Li Zefan
  Cc: LKML, Frederic Weisbecker, Paul Menage, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Tim Hockin, Tejun Heo, Containers,
	Glauber Costa, Cgroups, Daniel J Walsh, Daniel P. Berrange,
	KAMEZAWA Hiroyuki, Max Kellermann, Mandeep Singh Baines

Let the subsystems' fork callbacks return an error value so that they can
cancel a fork.  This is going to be used by the task counter subsystem to
implement the limit.
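The intended control flow can be sketched in userspace (a mock: the real
callback takes the subsystem and task, and the real loop walks the global
subsys[] array up to CGROUP_BUILTIN_SUBSYS_COUNT):

```c
#include <assert.h>
#include <stddef.h>

/* Mock subsystem with an optional fork callback (illustration only). */
struct cgroup_subsys {
	int (*fork)(void);
};

/* Sketch of the reworked cgroup_fork_callbacks(): run each subsystem's
 * fork callback, and on the first failure record which subsystem refused
 * in *failed_ss, so the error path can unwind only the subsystems that
 * already ran. */
int cgroup_fork_callbacks(struct cgroup_subsys **subsys, int count,
			  struct cgroup_subsys **failed_ss)
{
	int i, err;

	for (i = 0; i < count; i++) {
		struct cgroup_subsys *ss = subsys[i];

		if (ss->fork) {
			err = ss->fork();
			if (err) {
				*failed_ss = ss;
				return err;
			}
		}
	}
	return 0;
}

/* Two sample callbacks for the usage sketch below. */
static int fork_ok(void)   { return 0; }
static int fork_fail(void) { return -1; }
```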

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Containers <containers@lists.linux-foundation.org>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Cgroups <cgroups@vger.kernel.org>
Cc: Daniel J Walsh <dwalsh@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Max Kellermann <mk@cm4all.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/cgroup.h  |   20 ++++++++++++++------
 kernel/cgroup.c         |   23 +++++++++++++++++++----
 kernel/cgroup_freezer.c |    6 ++++--
 kernel/exit.c           |    2 +-
 kernel/fork.c           |    7 +++++--
 5 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7da3e74..e7d3f0d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -17,10 +17,11 @@
 #include <linux/rwsem.h>
 #include <linux/idr.h>
 
+struct cgroup_subsys;
+
 #ifdef CONFIG_CGROUPS
 
 struct cgroupfs_root;
-struct cgroup_subsys;
 struct inode;
 struct cgroup;
 struct css_id;
@@ -32,9 +33,11 @@ extern int cgroup_lock_is_held(void);
 extern bool cgroup_lock_live_group(struct cgroup *cgrp);
 extern void cgroup_unlock(void);
 extern void cgroup_fork(struct task_struct *p);
-extern void cgroup_fork_callbacks(struct task_struct *p);
+extern int cgroup_fork_callbacks(struct task_struct *p,
+				 struct cgroup_subsys **failed_ss);
 extern void cgroup_post_fork(struct task_struct *p);
-extern void cgroup_exit(struct task_struct *p, int run_callbacks);
+extern void cgroup_exit(struct task_struct *p, int run_callbacks,
+			struct cgroup_subsys *failed_ss);
 extern int cgroupstats_build(struct cgroupstats *stats,
 				struct dentry *dentry);
 extern int cgroup_load_subsys(struct cgroup_subsys *ss);
@@ -462,7 +465,7 @@ struct cgroup_subsys {
 			      struct cgroup_taskset *tset);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		       struct cgroup_taskset *tset);
-	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
+	int (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			struct cgroup *old_cgrp, struct task_struct *task);
 	int (*populate)(struct cgroup_subsys *ss,
@@ -614,9 +617,14 @@ struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_fork(struct task_struct *p) {}
-static inline void cgroup_fork_callbacks(struct task_struct *p) {}
+static inline int cgroup_fork_callbacks(struct task_struct *p,
+					struct cgroup_subsys **failed_ss)
+{
+	return 0;
+}
 static inline void cgroup_post_fork(struct task_struct *p) {}
-static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
+static inline void cgroup_exit(struct task_struct *p, int callbacks,
+			       struct cgroup_subsys *failed_ss) {}
 
 static inline void cgroup_lock(void) {}
 static inline void cgroup_unlock(void) {}
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1626152..af38004 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4583,8 +4583,11 @@ void cgroup_fork(struct task_struct *child)
  * tasklist. No need to take any locks since no-one can
  * be operating on this task.
  */
-void cgroup_fork_callbacks(struct task_struct *child)
+int cgroup_fork_callbacks(struct task_struct *child,
+			  struct cgroup_subsys **failed_ss)
 {
+	int err;
+
 	if (need_forkexit_callback) {
 		int i;
 		/*
@@ -4594,10 +4597,17 @@ void cgroup_fork_callbacks(struct task_struct *child)
 		 */
 		for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
 			struct cgroup_subsys *ss = subsys[i];
-			if (ss->fork)
-				ss->fork(ss, child);
+			if (ss->fork) {
+				err = ss->fork(ss, child);
+				if (err) {
+					*failed_ss = ss;
+					return err;
+				}
+			}
 		}
 	}
+
+	return 0;
 }
 
 /**
@@ -4664,7 +4674,8 @@ void cgroup_post_fork(struct task_struct *child)
  *    which wards off any cgroup_attach_task() attempts, or task is a failed
  *    fork, never visible to cgroup_attach_task.
  */
-void cgroup_exit(struct task_struct *tsk, int run_callbacks)
+void cgroup_exit(struct task_struct *tsk, int run_callbacks,
+		 struct cgroup_subsys *failed_ss)
 {
 	struct css_set *cg;
 	int i;
@@ -4693,6 +4704,10 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
 		 */
 		for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
 			struct cgroup_subsys *ss = subsys[i];
+
+			if (ss == failed_ss)
+				break;
+
 			if (ss->exit) {
 				struct cgroup *old_cgrp =
 					rcu_dereference_raw(cg->subsys[i])->cgroup;
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index fc0646b..ec86eb7 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -185,7 +185,7 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	return 0;
 }
 
-static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
+static int freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 {
 	struct freezer *freezer;
 
@@ -205,7 +205,7 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 	 * following check.
 	 */
 	if (!freezer->css.cgroup->parent)
-		return;
+		return 0;
 
 	spin_lock_irq(&freezer->lock);
 	BUG_ON(freezer->state == CGROUP_FROZEN);
@@ -214,6 +214,8 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 	if (freezer->state == CGROUP_FREEZING)
 		freeze_task(task);
 	spin_unlock_irq(&freezer->lock);
+
+	return 0;
 }
 
 /*
diff --git a/kernel/exit.c b/kernel/exit.c
index 294b170..e2ee0bb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -990,7 +990,7 @@ void do_exit(long code)
 	 */
 	perf_event_exit_task(tsk);
 
-	cgroup_exit(tsk, 1);
+	cgroup_exit(tsk, 1, NULL);
 
 	if (group_dead)
 		disassociate_ctty(1);
diff --git a/kernel/fork.c b/kernel/fork.c
index 051f090..617ca93 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1054,6 +1054,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	struct cgroup_subsys *cgroup_failed_ss = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1308,8 +1309,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	/* Now that the task is set up, run cgroup callbacks if
 	 * necessary. We need to run them before the task is visible
 	 * on the tasklist. */
-	cgroup_fork_callbacks(p);
+	retval = cgroup_fork_callbacks(p, &cgroup_failed_ss);
 	cgroup_callbacks_done = 1;
+	if (retval)
+		goto bad_fork_free_pid;
 
 	/* Need tasklist lock for parent etc handling! */
 	write_lock_irq(&tasklist_lock);
@@ -1413,7 +1416,7 @@ bad_fork_cleanup_cgroup:
 #endif
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_change_end(current);
-	cgroup_exit(p, cgroup_callbacks_done);
+	cgroup_exit(p, cgroup_callbacks_done, cgroup_failed_ss);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);
 bad_fork_cleanup_count:
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 08/10] cgroups: Add a task counter subsystem
  2012-02-01  3:37 [PATCH 00/10] cgroups: Task counter subsystem v8 Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2012-02-01  3:37 ` [PATCH 07/10] cgroups: allow subsystems to cancel a fork Frederic Weisbecker
@ 2012-02-01  3:37 ` Frederic Weisbecker
  2012-02-01  3:37 ` [PATCH 09/10] selftests: Enter each directories before executing selftests Frederic Weisbecker
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-01  3:37 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo, Li Zefan
  Cc: LKML, Frederic Weisbecker, Paul Menage, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Tim Hockin, Containers,
	Glauber Costa, Cgroups, Daniel J Walsh, Daniel P. Berrange,
	KAMEZAWA Hiroyuki, Max Kellermann, Mandeep Singh Baines

Add a new subsystem to limit the number of running tasks,
similar to the NR_PROC rlimit but in the scope of a cgroup.

The user can set an upper bound limit that is checked every
time a task forks in a cgroup or is moved into a cgroup
to which that subsystem is bound.

The primary goal is to protect against forkbombs that explode
inside a container. The traditional NR_PROC rlimit is not
efficient in that case because, when containers run in parallel
under the same user, one of them could starve all the others
by spawning a high number of tasks close to the user-wide limit.

This is a prevention against forkbombs, so it is not meant to
cure the effects of a forkbomb once the system is already
unresponsive. It aims at preventing the system from ever
reaching that state by stopping the spread of tasks early.
When defining the limit on the allowed number of tasks, it's
up to the user to find the right balance between the resources
their containers may need and what the system can afford to provide.

As it's totally dissociated from the NR_PROC rlimit, both
can be complementary: the cgroup task counter can set an upper
bound per container and the rlimit can be an upper bound on the
overall set of containers.

Also this subsystem can be used to kill all the tasks in a cgroup
without racing against concurrent forks: by setting the task
limit to 0, any further fork is rejected. This is a good
way to kill a forkbomb in a container, or simply to kill any
container, without the need to retry an unbounded number of times.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Containers <containers@lists.linux-foundation.org>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Cgroups <cgroups@vger.kernel.org>
Cc: Daniel J Walsh <dwalsh@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Max Kellermann <mk@cm4all.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
---
 Documentation/cgroups/task_counter.txt |  153 ++++++++++++++++++
 include/linux/cgroup_subsys.h          |    5 +
 init/Kconfig                           |    9 +
 kernel/Makefile                        |    1 +
 kernel/cgroup_task_counter.c           |  272 ++++++++++++++++++++++++++++++++
 5 files changed, 440 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/task_counter.txt
 create mode 100644 kernel/cgroup_task_counter.c

diff --git a/Documentation/cgroups/task_counter.txt b/Documentation/cgroups/task_counter.txt
new file mode 100644
index 0000000..1562d88
--- /dev/null
+++ b/Documentation/cgroups/task_counter.txt
@@ -0,0 +1,153 @@
+Task counter subsystem
+
+1. Description
+
+The task counter subsystem limits the number of tasks running
+inside a given cgroup. It behaves like the NR_PROC rlimit but in
+the scope of a cgroup instead of a user.
+
+It has two typical use cases, although more can probably be found:
+
+1.1 Protection against forkbomb in a container
+
+One use case is to protect against forkbombs that explode inside
+a container when that container is implemented using a cgroup. The
+NR_PROC rlimit is known to be a working protection against this type
+of attack but is no longer suitable when we run containers in
+parallel under the same user. One container could starve all the
+others by spawning a high number of tasks close to the rlimit
+boundary. So in this case we need the limitation to be enforced
+at per-cgroup granularity.
+
+Note this works by preventing forkbomb propagation. It doesn't cure
+the effects of a forkbomb that has already grown enough to make
+the system barely responsive. When defining the limit on the number
+of tasks, it's up to the admin to find the right balance between the
+possible needs of a container and the resources the system can afford
+to provide.
+
+Also the NR_PROC rlimit and this cgroup subsystem are totally
+dissociated. But they can be complementary. The task counter limits
+the containers and the rlimit can provide an upper bound on the whole
+set of containers.
+
+
+1.2 Kill tasks inside a cgroup
+
+Another use case comes along with forkbomb prevention: it brings
+the ability to kill all tasks inside a cgroup without races. By
+setting the limit of running tasks to 0, one can prevent any
+further fork inside a cgroup and then kill all of its tasks without
+the need to retry an unbounded number of times due to races between
+kills and forks running in parallel (more details in the "Kill a
+cgroup safely" paragraph).
+
+This is useful to kill a forkbomb for example. When its gazillions
+of forks are competing with the kills, one needs to ensure this
+operation won't run in a nearly endless loop of retries.
+
+And more generally it is useful to kill a cgroup in a bounded
+number of passes.
+
+
+2. Interface
+
+When a hierarchy is mounted with the task counter subsystem bound, it
+adds two files into each cgroup directory, except the root one:
+
+- tasks.usage contains the number of tasks running inside a cgroup and
+its children in the hierarchy (see paragraph about Inheritance).
+
+- tasks.limit contains the maximum number of tasks that can run inside
+a cgroup. We check this limit when a task forks or when it is migrated
+to a cgroup.
+
+Note that the tasks.limit value can be forced below tasks.usage, in which
+case any new task in the cgroup will be rejected until the tasks.usage
+value goes below tasks.limit.
+
+For optimization reasons, the root directory of a hierarchy doesn't have
+a task counter.
+
+
+3. Inheritance
+
+When a task is added to a cgroup, by way of a cgroup migration or a fork,
+it increases the task counter of that cgroup and of all its ancestors.
+Hence a cgroup is also subject to the limit of its ancestors.
+
+In the following hierarchy:
+
+
+             A
+             |
+             B
+           /   \
+          C     D
+
+
+We have 1 task running in B, one running in C and none running in D.
+It means we have tasks.usage = 1 in C and tasks.usage = 2 in B because
+B counts its task and those of its children.
+
+Now let's set tasks.limit = 2 in B and tasks.limit = 1 in D.
+If we move a new task in D, it will be refused because the limit in B has
+been reached already.
+
+
+4. Kill a cgroup safely
+
+As explained in the description, this subsystem is also helpful to
+kill all tasks in a cgroup safely, after setting tasks.limit to 0,
+so that we don't race against parallel forks in an unbounded number
+of kill iterations.
+
+But there is a small detail to be aware of to use this feature that
+way.
+
+A typical way to proceed would be:
+
+	echo 0 > tasks.limit
+	for TASK in $(cat cgroup.procs)
+	do
+		kill -KILL $TASK
+	done
+
+However there is a small race window where a task can be in the middle
+of being forked but hasn't completed the fork enough for the new PID
+to appear in the cgroup.procs file.
+
+The only way to get it right is to run a loop that reads tasks.usage, kills
+all the tasks in cgroup.procs and exits the loop only if the value in
+tasks.usage was the same as the number of tasks that were in cgroup.procs,
+i.e. the number of tasks that were killed.
+
+It works because the new child appears in tasks.usage right before we check,
+in the fork path, whether the parent has a pending signal, in which case the
+fork is cancelled anyway. So relying on tasks.usage is fine and non-racy.
+
+This race window is tiny and unlikely to happen, so most of the time a single
+kill iteration should be enough. But it's worth knowing about that corner
+case spotted by Oleg Nesterov.
+
+Example of safe use would be:
+
+	echo 0 > tasks.limit
+	END=false
+
+	while [ $END == false ]
+	do
+		NR_TASKS=$(cat tasks.usage)
+		NR_KILLED=0
+
+		for TASK in $(cat cgroup.procs)
+		do
+			let NR_KILLED=NR_KILLED+1
+			kill -KILL $TASK
+		done
+
+		if [ "$NR_TASKS" = "$NR_KILLED" ]
+		then
+			END=true
+		fi
+	done
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0bd390c..a95eb27 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -72,3 +72,8 @@ SUBSYS(net_prio)
 #endif
 
 /* */
+#ifdef CONFIG_CGROUP_TASK_COUNTER
+SUBSYS(tasks)
+#endif
+
+/* */
diff --git a/init/Kconfig b/init/Kconfig
index 3f42cd6..488916e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -715,6 +715,15 @@ config CGROUP_MEM_RES_CTLR_KMEM
 	  the kmem extension can use it to guarantee that no group of processes
 	  will ever exhaust kernel resources alone.
 
+config CGROUP_TASK_COUNTER
+	bool "Control number of tasks in a cgroup"
+	depends on RESOURCE_COUNTERS
+	help
+	  Let the user set an upper bound on the allowed number of tasks
+	  running in a cgroup. When a task forks or is migrated to a cgroup that
+	  has this subsystem bound, the limit is checked to either accept or
+	  reject the fork/migration.
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/kernel/Makefile b/kernel/Makefile
index 2d9de86..7bc4ce7 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -59,6 +59,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_TASK_COUNTER) += cgroup_task_counter.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_task_counter.c b/kernel/cgroup_task_counter.c
new file mode 100644
index 0000000..a4d87ac
--- /dev/null
+++ b/kernel/cgroup_task_counter.c
@@ -0,0 +1,272 @@
+/*
+ * Limits on number of tasks subsystem for cgroups
+ *
+ * Copyright (C) 2011-2012 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
+ *
+ * Thanks to Andrew Morton, Johannes Weiner, Li Zefan, Oleg Nesterov and
+ * Paul Menage for their suggestions.
+ *
+ */
+
+#include <linux/err.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/res_counter.h>
+
+
+struct task_counter {
+	struct res_counter		res;
+	struct cgroup_subsys_state	css;
+};
+
+/*
+ * The root task counter doesn't exist because it's not part of the
+ * whole task counting. We want to optimize the trivial case where
+ * only the root cgroup exists.
+ */
+static struct cgroup_subsys_state root_css;
+
+
+static inline struct task_counter *cgroup_task_counter(struct cgroup *cgrp)
+{
+	if (!cgrp->parent)
+		return NULL;
+
+	return container_of(cgroup_subsys_state(cgrp, tasks_subsys_id),
+			    struct task_counter, css);
+}
+
+static inline struct res_counter *cgroup_task_res_counter(struct cgroup *cgrp)
+{
+	struct task_counter *cnt;
+
+	cnt = cgroup_task_counter(cgrp);
+	if (!cnt)
+		return NULL;
+
+	return &cnt->res;
+}
+
+static struct cgroup_subsys_state *
+task_counter_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct task_counter *cnt;
+	struct res_counter *parent_res;
+
+	if (!cgrp->parent)
+		return &root_css;
+
+	cnt = kzalloc(sizeof(*cnt), GFP_KERNEL);
+	if (!cnt)
+		return ERR_PTR(-ENOMEM);
+
+	parent_res = cgroup_task_res_counter(cgrp->parent);
+
+	res_counter_init(&cnt->res, parent_res);
+
+	return &cnt->css;
+}
+
+/*
+ * Inherit the limit value of the parent. This is not really to enforce
+ * a limit below or equal to the one of the parent which can be changed
+ * concurrently anyway. This is just to honour the clone flag.
+ */
+static void task_counter_post_clone(struct cgroup_subsys *ss,
+				    struct cgroup *cgrp)
+{
+	/* cgrp can't be root, so cgroup_task_res_counter() can't return NULL */
+	res_counter_inherit(cgroup_task_res_counter(cgrp), RES_LIMIT);
+}
+
+static void task_counter_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct task_counter *cnt = cgroup_task_counter(cgrp);
+
+	kfree(cnt);
+}
+
+/* Uncharge the cgroup the task was attached to */
+static void task_counter_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			      struct cgroup *old_cgrp, struct task_struct *task)
+{
+	/* Optimize for the root cgroup case */
+	if (old_cgrp->parent)
+		res_counter_uncharge(cgroup_task_res_counter(old_cgrp), 1);
+}
+
+static void task_counter_cancel_attach_until(struct res_counter *res,
+					     struct cgroup_taskset *tset,
+					     struct task_struct *until)
+{
+	struct task_struct *tsk;
+	struct res_counter *old_res;
+	struct cgroup *old_cgrp;
+	struct res_counter *common_ancestor;
+
+	cgroup_taskset_for_each(tsk, NULL, tset) {
+		if (tsk == until)
+			break;
+		old_cgrp = cgroup_taskset_cur_cgroup(tset);
+		old_res = cgroup_task_res_counter(old_cgrp);
+		common_ancestor = res_counter_common_ancestor(res, old_res);
+		res_counter_uncharge_until(res, common_ancestor, 1);
+	}
+}
+
+/*
+ * This does more than just probe the ability to attach to the dest cgroup.
+ * We cannot just _check_ if we can attach to the destination and do the real
+ * attachment later in task_counter_attach() because a task in the dest
+ * cgroup could fork in the meantime and steal the last remaining count.
+ * Thus we need to charge the dest cgroup right now.
+ */
+static int task_counter_can_attach(struct cgroup_subsys *ss,
+				   struct cgroup *cgrp,
+				   struct cgroup_taskset *tset)
+{
+	struct res_counter *res = cgroup_task_res_counter(cgrp);
+	struct res_counter *old_res;
+	struct cgroup *old_cgrp;
+	struct res_counter *common_ancestor;
+	struct task_struct *tsk;
+	int err = 0;
+
+	cgroup_taskset_for_each(tsk, NULL, tset) {
+		old_cgrp = cgroup_taskset_cur_cgroup(tset);
+		old_res = cgroup_task_res_counter(old_cgrp);
+		/*
+		 * When moving a task from a cgroup to another, we don't want
+		 * to charge the common ancestors, even though they will be
+		 * uncharged later from attach_task(), because during that
+		 * short window between charge and uncharge, a task could fork
+		 * in the ancestor and spuriously fail due to the temporary
+		 * charge.
+		 */
+		common_ancestor = res_counter_common_ancestor(res, old_res);
+
+		/*
+		 * If cgrp is the root then res is NULL, however in this case
+		 * the common ancestor is NULL as well, making the below a NOP.
+		 */
+		err = res_counter_charge_until(res, common_ancestor, 1, NULL);
+		if (err) {
+			task_counter_cancel_attach_until(res, tset, tsk);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+/* Uncharge the dest cgroup that we charged in task_counter_can_attach() */
+static void task_counter_cancel_attach(struct cgroup_subsys *ss,
+				       struct cgroup *cgrp,
+				       struct cgroup_taskset *tset)
+{
+	task_counter_cancel_attach_until(cgroup_task_res_counter(cgrp),
+					 tset, NULL);
+}
+
+/*
+ * This uncharges the old cgroups. We can do that now that we are sure the
+ * attachment can't be cancelled anymore, because this uncharge operation
+ * couldn't be reverted later: a task in the old cgroup could fork after
+ * we uncharge and reach the task counter limit, making a return there
+ * impossible.
+ */
+static void task_counter_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+				struct cgroup_taskset *tset)
+{
+	struct res_counter *res = cgroup_task_res_counter(cgrp);
+	struct task_struct *tsk;
+	struct res_counter *old_res;
+	struct cgroup *old_cgrp;
+	struct res_counter *common_ancestor;
+
+	cgroup_taskset_for_each(tsk, NULL, tset) {
+		old_cgrp = cgroup_taskset_cur_cgroup(tset);
+		old_res = cgroup_task_res_counter(old_cgrp);
+		common_ancestor = res_counter_common_ancestor(res, old_res);
+		res_counter_uncharge_until(old_res, common_ancestor, 1);
+	}
+}
+
+static u64 task_counter_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	int type = cft->private;
+
+	return res_counter_read_u64(cgroup_task_res_counter(cgrp), type);
+}
+
+static int task_counter_write_u64(struct cgroup *cgrp, struct cftype *cft,
+				  u64 val)
+{
+	int type = cft->private;
+
+	res_counter_write_u64(cgroup_task_res_counter(cgrp), type, val);
+
+	return 0;
+}
+
+static struct cftype files[] = {
+	{
+		.name		= "limit",
+		.read_u64	= task_counter_read_u64,
+		.write_u64	= task_counter_write_u64,
+		.private	= RES_LIMIT,
+	},
+
+	{
+		.name		= "usage",
+		.read_u64	= task_counter_read_u64,
+		.private	= RES_USAGE,
+	},
+};
+
+static int task_counter_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	if (!cgrp->parent)
+		return 0;
+
+	return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
+}
+
+/*
+ * Charge the task counter with the new child coming, or reject it if we
+ * reached the limit.
+ */
+static int task_counter_fork(struct cgroup_subsys *ss,
+			     struct task_struct *child)
+{
+	struct cgroup_subsys_state *css;
+	struct cgroup *cgrp;
+	int err;
+
+	css = child->cgroups->subsys[tasks_subsys_id];
+	cgrp = css->cgroup;
+
+	/* Optimize for the root cgroup case, which doesn't have a limit */
+	if (!cgrp->parent)
+		return 0;
+
+	err = res_counter_charge(cgroup_task_res_counter(cgrp), 1, NULL);
+	if (err)
+		return -EAGAIN;
+
+	return 0;
+}
+
+struct cgroup_subsys tasks_subsys = {
+	.name			= "tasks",
+	.subsys_id		= tasks_subsys_id,
+	.create			= task_counter_create,
+	.post_clone		= task_counter_post_clone,
+	.destroy		= task_counter_destroy,
+	.exit			= task_counter_exit,
+	.can_attach		= task_counter_can_attach,
+	.cancel_attach		= task_counter_cancel_attach,
+	.attach			= task_counter_attach,
+	.fork			= task_counter_fork,
+	.populate		= task_counter_populate,
+};
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 09/10] selftests: Enter each directories before executing selftests
  2012-02-01  3:37 [PATCH 00/10] cgroups: Task counter subsystem v8 Frederic Weisbecker
                   ` (7 preceding siblings ...)
  2012-02-01  3:37 ` [PATCH 08/10] cgroups: Add a task counter subsystem Frederic Weisbecker
@ 2012-02-01  3:37 ` Frederic Weisbecker
  2012-02-01  3:37 ` [PATCH 10/10] selftests: Add a new task counter selftest Frederic Weisbecker
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-01  3:37 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo, Li Zefan
  Cc: LKML, Frederic Weisbecker, Kirill A. Shutemov, Paul Menage,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Tim Hockin,
	Containers, Glauber Costa, Cgroups, Daniel J Walsh,
	Daniel P. Berrange, KAMEZAWA Hiroyuki, Max Kellermann,
	Mandeep Singh Baines

This way the selftests can launch local programs from their
directory without trying to guess their absolute path.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Containers <containers@lists.linux-foundation.org>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Cgroups <cgroups@vger.kernel.org>
Cc: Daniel J Walsh <dwalsh@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Max Kellermann <mk@cm4all.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
---
 tools/testing/selftests/run_tests |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/tools/testing/selftests/run_tests b/tools/testing/selftests/run_tests
index 320718a..19043d3 100644
--- a/tools/testing/selftests/run_tests
+++ b/tools/testing/selftests/run_tests
@@ -4,5 +4,7 @@ TARGETS=breakpoints
 
 for TARGET in $TARGETS
 do
-	$TARGET/run_test
+	cd $TARGET
+	./run_test
+	cd ..
 done
-- 
1.7.5.4
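[Note: the following is an illustration, not part of the patch. The plain `cd $TARGET; ./run_test; cd ..` sequence in this patch depends on each `cd` succeeding and on the test leaving the working directory alone. A more defensive variant, sketched here against a throwaway directory tree rather than the real selftests, confines each target to a subshell.]

```shell
#!/bin/bash
# Sketch of a defensive run_tests loop: each target runs in a subshell,
# so a missing directory or a test that changes the working directory
# cannot affect the next iteration. The temp tree below only exists to
# demonstrate the loop.

TMP=$(mktemp -d)
mkdir -p "$TMP/breakpoints" "$TMP/task_counter"
printf '#!/bin/sh\necho breakpoints ok\n' > "$TMP/breakpoints/run_test"
printf '#!/bin/sh\necho task_counter ok\n' > "$TMP/task_counter/run_test"
chmod +x "$TMP"/*/run_test

TARGETS="breakpoints missing_dir task_counter"

(
	cd "$TMP"
	for TARGET in $TARGETS
	do
		# The subshell confines the cd; a target that can't be
		# entered is reported instead of breaking the loop.
		(cd "$TARGET" 2>/dev/null && ./run_test) \
			|| echo "skipped $TARGET"
	done
)
# prints:
#   breakpoints ok
#   skipped missing_dir
#   task_counter ok

rm -rf "$TMP"
```

The outer subshell also keeps the caller's working directory untouched, which the `cd ..` in the patch only guarantees when every target directory exists.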


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 10/10] selftests: Add a new task counter selftest
  2012-02-01  3:37 [PATCH 00/10] cgroups: Task counter subsystem v8 Frederic Weisbecker
                   ` (8 preceding siblings ...)
  2012-02-01  3:37 ` [PATCH 09/10] selftests: Enter each directories before executing selftests Frederic Weisbecker
@ 2012-02-01  3:37 ` Frederic Weisbecker
  2012-02-01 16:31 ` [PATCH 00/10] cgroups: Task counter subsystem v8 Tejun Heo
  2013-04-01 18:43 ` Tim Hockin
  11 siblings, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-01  3:37 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo, Li Zefan
  Cc: LKML, Frederic Weisbecker, Kirill A. Shutemov, Paul Menage,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Tim Hockin,
	Containers, Glauber Costa, Cgroups, Daniel J Walsh,
	Daniel P. Berrange, KAMEZAWA Hiroyuki, Max Kellermann,
	Mandeep Singh Baines

This is a batch of tests running against the cgroup task counter
subsystem in order to validate the expected behaviour from this
cgroup feature.

This checks the reliability of the value found in the tasks.usage file
after events like forks, exits or cgroup migration of whole processes
or individual threads.

This also checks that forks or cgroup migrations are either accepted or
rejected according to the value set in the tasks.limit file.

A forkbomb is also launched to check that the subsystem properly
stops its propagation and that it can be killed as expected.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Containers <containers@lists.linux-foundation.org>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Cgroups <cgroups@vger.kernel.org>
Cc: Daniel J Walsh <dwalsh@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Max Kellermann <mk@cm4all.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
---
 tools/testing/selftests/Makefile                   |    2 +-
 tools/testing/selftests/run_tests                  |    2 +-
 tools/testing/selftests/task_counter/Makefile      |    8 +
 tools/testing/selftests/task_counter/fork.c        |   40 ++++
 tools/testing/selftests/task_counter/forkbomb.c    |   40 ++++
 tools/testing/selftests/task_counter/multithread.c |   68 +++++++
 tools/testing/selftests/task_counter/run_test      |  198 ++++++++++++++++++++
 .../selftests/task_counter/spread_thread_group.c   |   82 ++++++++
 8 files changed, 438 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/task_counter/Makefile
 create mode 100644 tools/testing/selftests/task_counter/fork.c
 create mode 100644 tools/testing/selftests/task_counter/forkbomb.c
 create mode 100644 tools/testing/selftests/task_counter/multithread.c
 create mode 100755 tools/testing/selftests/task_counter/run_test
 create mode 100644 tools/testing/selftests/task_counter/spread_thread_group.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 4ec8401..c697742 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -1,4 +1,4 @@
-TARGETS = breakpoints
+TARGETS = breakpoints task_counter
 
 all:
 	for TARGET in $(TARGETS); do \
diff --git a/tools/testing/selftests/run_tests b/tools/testing/selftests/run_tests
index 19043d3..11d76f7 100644
--- a/tools/testing/selftests/run_tests
+++ b/tools/testing/selftests/run_tests
@@ -1,6 +1,6 @@
 #!/bin/bash
 
-TARGETS=breakpoints
+TARGETS="breakpoints task_counter"
 
 for TARGET in $TARGETS
 do
diff --git a/tools/testing/selftests/task_counter/Makefile b/tools/testing/selftests/task_counter/Makefile
new file mode 100644
index 0000000..e314ce4
--- /dev/null
+++ b/tools/testing/selftests/task_counter/Makefile
@@ -0,0 +1,8 @@
+all:
+	gcc fork.c -o fork
+	gcc forkbomb.c -o forkbomb
+	gcc multithread.c -o multithread -lpthread
+	gcc spread_thread_group.c -o spread_thread_group -lpthread
+
+clean:
+	rm -f fork forkbomb multithread spread_thread_group *~
diff --git a/tools/testing/selftests/task_counter/fork.c b/tools/testing/selftests/task_counter/fork.c
new file mode 100644
index 0000000..20b48a9
--- /dev/null
+++ b/tools/testing/selftests/task_counter/fork.c
@@ -0,0 +1,40 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Move a task to a cgroup and try to fork on it.
+ */
+
+#include <stdio.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <string.h>
+
+int main(int argc, char **argv)
+{
+	FILE *fp;
+	int pid;
+	char cpid[50];
+
+	if (argc < 2)
+		return -1;
+
+	pid = getpid();
+	fp = fopen(argv[1], "w");
+	if (!fp)
+		return -2;
+
+	sprintf(cpid, "%d\n", pid);
+
+	if (fwrite(cpid, strlen(cpid), 1, fp) != 1) {
+		perror("can't write pid\n");
+		return -3;
+	}
+	fclose(fp);
+
+	if (fork() == -1)
+		return 0;
+
+	return -4;
+}
diff --git a/tools/testing/selftests/task_counter/forkbomb.c b/tools/testing/selftests/task_counter/forkbomb.c
new file mode 100644
index 0000000..221fefb
--- /dev/null
+++ b/tools/testing/selftests/task_counter/forkbomb.c
@@ -0,0 +1,40 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Move a task to a cgroup and forkbomb there.
+ */
+
+#include <stdio.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <string.h>
+
+int main(int argc, char **argv)
+{
+	FILE *fp;
+	int pid;
+	char cpid[50];
+
+	if (argc < 2)
+		return -1;
+
+	pid = getpid();
+	fp = fopen(argv[1], "w");
+	if (!fp)
+		return -2;
+
+	sprintf(cpid, "%d\n", pid);
+
+	if (fwrite(cpid, strlen(cpid), 1, fp) != 1) {
+		perror("can't write pid\n");
+		return -3;
+	}
+	fclose(fp);
+
+	for (;;)
+		fork();
+
+	return 0;
+}
diff --git a/tools/testing/selftests/task_counter/multithread.c b/tools/testing/selftests/task_counter/multithread.c
new file mode 100644
index 0000000..e566a02
--- /dev/null
+++ b/tools/testing/selftests/task_counter/multithread.c
@@ -0,0 +1,68 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Try to move a multithread proc to a cgroup.
+ */
+
+
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <pthread.h>
+
+volatile static int thread_end;
+pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
+pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
+
+
+static void *thread_start(void *unused)
+{
+	pthread_mutex_lock(&mutex);
+
+	while (!thread_end)
+		pthread_cond_wait(&cond, &mutex);
+
+	pthread_mutex_unlock(&mutex);
+
+	return NULL;
+}
+
+int main(int argc, char **argv)
+{
+	int fd;
+	int pid;
+	char cpid[50];
+	pthread_t thread;
+
+	if (argc < 2)
+		return -1;
+
+	if (pthread_create(&thread, NULL, thread_start, NULL) != 0)
+		return -2;
+
+	pid = getpid();
+	fd = open(argv[1], O_WRONLY);
+	if (fd < 0)
+		return -3;
+
+	sprintf(cpid, "%d\n", pid);
+
+	if (write(fd, cpid, strlen(cpid)) < 0)
+		return -4;
+
+	close(fd);
+
+	pthread_mutex_lock(&mutex);
+	thread_end = 1;
+	pthread_cond_signal(&cond);
+	pthread_mutex_unlock(&mutex);
+
+	pthread_join(thread, NULL);
+
+	return 0;
+}
diff --git a/tools/testing/selftests/task_counter/run_test b/tools/testing/selftests/task_counter/run_test
new file mode 100755
index 0000000..ee063aa
--- /dev/null
+++ b/tools/testing/selftests/task_counter/run_test
@@ -0,0 +1,198 @@
+#!/bin/bash
+# Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
+#
+# Licensed under the terms of the GNU GPL License version 2
+#
+# Validation tests for the cgroup task counter subsystem
+
+BASE=/dev/cgroup
+
+if [ $UID != 0 ]
+then
+	echo "Must be root to run task counter selftest"
+	exit 0
+fi
+
+if [ -e $BASE ]
+then
+	echo "Directory $BASE already exists"
+	echo "Can't run task counter selftest"
+	exit 0
+fi
+
+mkdir $BASE
+mount -t cgroup -o tasks cgroup $BASE
+if [ $? != 0 ]
+then
+    echo "Unable to mount cgroup filesystem"
+    echo "Can't run task counter selftest."
+    rmdir $BASE
+    exit 0
+fi
+
+sleep 1d &
+PID1=$!
+
+sleep 1d &
+PID2=$!
+
+echo 1 > $BASE/cgroup.clone_children
+mkdir $BASE/cgroup0
+
+function test_result
+{
+    # Result of the test
+    res=$1
+    # Expected result
+    expected=$2
+    # Invert test against expected result
+    # 0 = equal
+    # 1 = non equal
+    inv=$3
+    # String message
+    test_str=$4
+    passed=0
+
+    echo -n "$test_str"
+
+    if [ "$res" = "$expected" ]
+    then
+	passed=1
+    fi
+
+    passed=$((passed ^ inv))
+
+    if [ $passed = 1 ]
+    then
+	echo " [OK]"
+    else
+	echo " [FAILED]"
+    fi
+
+}
+
+# simple test limit
+echo 1 > $BASE/cgroup0/tasks.limit
+echo $PID1 >  $BASE/cgroup0/cgroup.procs
+test_result $? 0 0 "Allow 1 task on limit 1"
+
+echo $PID2 >  $BASE/cgroup0/cgroup.procs
+test_result $? 0 1 "Don't allow 2 tasks on limit 1"
+
+# simple test usage
+USAGE=$(cat $BASE/cgroup0/tasks.usage)
+test_result $USAGE 1 0 "Correct usage "
+
+# simple test exit
+kill $PID1
+USAGE=$(cat $BASE/cgroup0/tasks.usage)
+test_result $USAGE 0 0 "Correct usage after exit "
+
+
+sleep 1d &
+PID1=$!
+
+echo 1 > $BASE/cgroup0/tasks.limit
+echo $PID1 >  $BASE/cgroup0/cgroup.procs
+test_result $? 0 0 "Correct reuse after exit "
+
+# simple move to root
+
+echo $PID1 > $BASE/cgroup.procs
+test_result $? 0 0 "Correct move to root "
+
+# propagation tests
+
+mkdir $BASE/cgroup0/cgroup1
+mkdir $BASE/cgroup0/cgroup2
+
+echo 1 > $BASE/cgroup0/cgroup1/tasks.limit
+echo $PID1 > $BASE/cgroup0/cgroup1/cgroup.procs
+USAGE=$(cat $BASE/cgroup0/tasks.usage)
+test_result $USAGE 1 0 "Correct propagated usage "
+
+echo $PID1 > $BASE/cgroup0/cgroup.procs
+test_result $? 0 0 "Correct move on parent "
+
+
+# move
+
+echo $PID1 > $BASE/cgroup0/cgroup1/cgroup.procs
+test_result $? 0 0 "Correct move on child "
+
+echo $PID1 > $BASE/cgroup0/cgroup2/cgroup.procs
+test_result $? 0 0 "Correct move on sibling "
+
+echo $PID2 > $BASE/cgroup0/cgroup1/cgroup.procs
+test_result $? 0 1 "Correct propagation limit "
+
+kill $PID1
+kill $PID2
+
+# test limit on thread group
+echo 1024 > $BASE/cgroup0/tasks.limit
+echo 0 > $BASE/cgroup0/cgroup1/tasks.limit
+
+./multithread $BASE/cgroup0/cgroup1/cgroup.procs
+test_result $? 0 1 "Correct limit on multithreaded"
+
+# test move of a thread group
+echo 2 > $BASE/cgroup0/cgroup1/tasks.limit
+
+./multithread $BASE/cgroup0/cgroup1/cgroup.procs
+test_result $? 0 0 "Correct move of multithreaded"
+
+rmdir $BASE/cgroup0/cgroup1
+rmdir $BASE/cgroup0/cgroup2
+
+# test bug on common ancestor logic
+# as described in https://lkml.org/lkml/2011/11/8/218
+
+./spread_thread_group $BASE/cgroup0/cgroup.procs $BASE/tasks
+test_result $? 0 0 "Test bug on common ancestor logic"
+
+# test fork
+
+echo 1 > $BASE/cgroup0/tasks.limit
+./fork $BASE/cgroup0/cgroup.procs
+test_result $? 0 0 "Correct fork limit "
+
+# test forkbomb
+
+echo 128 > $BASE/cgroup0/tasks.limit
+echo -n "Trying to stop forkbomb propagation..."
+./forkbomb $BASE/cgroup0/cgroup.procs &
+sleep 1
+RES=$(cat $BASE/cgroup0/tasks.usage)
+test_result $RES 128 0 ""
+
+# kill forkbomb
+
+echo -n "Trying to kill forkbomb "
+echo 0 > $BASE/cgroup0/tasks.limit
+END=false
+
+while [ $END = false ]
+do
+	NR_TASKS=$(cat $BASE/cgroup0/tasks.usage)
+	NR_KILLED=0
+
+	for TASK in $(cat $BASE/cgroup0/cgroup.procs)
+	do
+		NR_KILLED=$(($NR_KILLED+1))
+		kill -KILL $TASK
+	done
+
+	if [ "$NR_TASKS" = "$NR_KILLED" ]
+	then
+		END=true
+	fi
+done
+
+echo "[OK]"
+
+# Wait a bit for killed tasks to exit the cgroup
+sleep 1
+rmdir $BASE/cgroup0
+umount $BASE
+rmdir $BASE
\ No newline at end of file
diff --git a/tools/testing/selftests/task_counter/spread_thread_group.c b/tools/testing/selftests/task_counter/spread_thread_group.c
new file mode 100644
index 0000000..e088deb
--- /dev/null
+++ b/tools/testing/selftests/task_counter/spread_thread_group.c
@@ -0,0 +1,82 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Try to reproduce bug on common ancestor logic as described
+ * at https://lkml.org/lkml/2011/11/8/218
+ */
+
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <pthread.h>
+
+static volatile int thread_end;
+pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
+pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
+
+
+static void *thread_start(void *unused)
+{
+	pthread_mutex_lock(&mutex);
+
+	while (!thread_end)
+		pthread_cond_wait(&cond, &mutex);
+
+	pthread_mutex_unlock(&mutex);
+
+	return NULL;
+}
+
+int main(int argc, char **argv)
+{
+	int fd_root, fd;
+	int pid;
+	char cpid[50];
+	pthread_t thread;
+
+	if (argc < 3)
+		return -1;
+
+	if (pthread_create(&thread, NULL, thread_start, NULL) != 0)
+		return -2;
+
+	pid = getpid();
+	fd = open(argv[1], O_WRONLY);
+	if (fd < 0)
+		return -3;
+
+	sprintf(cpid, "%d\n", pid);
+
+	/* Move group to /dev/cgroup/cgroup0 */
+	if (write(fd, cpid, strlen(cpid)) < 0)
+		return -4;
+
+	fd_root = open(argv[2], O_WRONLY);
+	if (fd_root < 0)
+		return -5;
+
+	/* Move group leader to /dev/cgroup/ root */
+	if (write(fd_root, cpid, strlen(cpid)) < 0)
+		return -6;
+
+	/* Move back group to /dev/cgroup/cgroup0 */
+	if (write(fd, cpid, strlen(cpid)) < 0)
+		return -7;
+
+	close(fd_root);
+	close(fd);
+
+	pthread_mutex_lock(&mutex);
+	thread_end = 1;
+	pthread_cond_signal(&cond);
+	pthread_mutex_unlock(&mutex);
+
+	pthread_join(thread, NULL);
+
+	return 0;
+}
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2012-02-01  3:37 [PATCH 00/10] cgroups: Task counter subsystem v8 Frederic Weisbecker
                   ` (9 preceding siblings ...)
  2012-02-01  3:37 ` [PATCH 10/10] selftests: Add a new task counter selftest Frederic Weisbecker
@ 2012-02-01 16:31 ` Tejun Heo
  2012-02-01 18:50   ` Frederic Weisbecker
  2013-04-01 18:43 ` Tim Hockin
  11 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2012-02-01 16:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, Li Zefan, LKML, Kirill A. Shutemov, Paul Menage,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Tim Hockin,
	Containers, Glauber Costa, Cgroups, Daniel J Walsh,
	Daniel P. Berrange, KAMEZAWA Hiroyuki, Max Kellermann,
	Mandeep Singh Baines

On Wed, Feb 01, 2012 at 04:37:40AM +0100, Frederic Weisbecker wrote:
> Changes In this version:
> 
> - Split 32/64 bits version of res_counter_write_u64() [1/10]
>   Courtesy of Kirill A. Shutemov
> 
> - Added Kirill's ack [8/10]
> 
> - Added selftests [9/10], [10/10]
> 
> Please consider for merging. At least two users want this feature:

Has there been further discussion about this approach?  IIRC, we
weren't sure whether this should be merged.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2012-02-01 16:31 ` [PATCH 00/10] cgroups: Task counter subsystem v8 Tejun Heo
@ 2012-02-01 18:50   ` Frederic Weisbecker
  2012-02-01 19:51     ` Andrew Morton
  0 siblings, 1 reply; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-01 18:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Li Zefan, LKML, Kirill A. Shutemov, Paul Menage,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Tim Hockin,
	Containers, Glauber Costa, Cgroups, Daniel J Walsh,
	Daniel P. Berrange, KAMEZAWA Hiroyuki, Max Kellermann,
	Mandeep Singh Baines

On Wed, Feb 01, 2012 at 08:31:26AM -0800, Tejun Heo wrote:
> On Wed, Feb 01, 2012 at 04:37:40AM +0100, Frederic Weisbecker wrote:
> > Changes In this version:
> > 
> > - Split 32/64 bits version of res_counter_write_u64() [1/10]
> >   Courtesy of Kirill A. Shutemov
> > 
> > - Added Kirill's ack [8/10]
> > 
> > - Added selftests [9/10], [10/10]
> > 
> > Please consider for merging. At least two users want this feature:
> 
> Has there been further discussion about this approach?  IIRC, we
> weren't sure whether this should be merged.

The doubts I have noticed were:

Q: Can't we rather focus on a global solution to fight forkbombs?

If we can find a reliable solution that works in every case and that
prevents any forkbomb from impacting the rest of the system, then it
may be an acceptable solution. But I'm not aware of such a feature.

Besides, another point in having this task counter is that we
have a per-container limit. Assuming all containers are running under
the same user, we can protect against a container starving all others
with a massive number of processes close to the NR_PROC rlimit.

Q: Can/should we implement a limitation on the number of "fork" as well?
   (as in https://lkml.org/lkml/2011/11/3/233 )

I'm still not sure about why such a thing is needed. Is it really something we
want? Why can't the task counter be used instead?

I need more details from the author of this patch. But I doubt we can merge
both subsystems, they have pretty different semantics.
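
For illustration, a per-container cap with the interface this series adds
would look roughly like this (the mount option and the tasks.limit,
tasks.usage and cgroup.procs files follow the selftest in patch 10/10;
the mount point and $CONTAINER_INIT_PID are placeholders):

```shell
# Cap one container at 512 tasks with the task counter subsystem
mkdir /dev/cgroup
mount -t cgroup -o tasks cgroup /dev/cgroup

mkdir /dev/cgroup/container0
echo 512 > /dev/cgroup/container0/tasks.limit

# Attach the container's init; once tasks.usage reaches the limit,
# further forks inside the container fail instead of eating into
# the global NR_PROC budget
echo $CONTAINER_INIT_PID > /dev/cgroup/container0/cgroup.procs
cat /dev/cgroup/container0/tasks.usage
```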

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2012-02-01 18:50   ` Frederic Weisbecker
@ 2012-02-01 19:51     ` Andrew Morton
  2012-02-02 14:50       ` Frederic Weisbecker
  2012-03-05 16:44       ` Rik van Riel
  0 siblings, 2 replies; 38+ messages in thread
From: Andrew Morton @ 2012-02-01 19:51 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Tejun Heo, Li Zefan, LKML, Kirill A. Shutemov, Paul Menage,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Tim Hockin,
	Containers, Glauber Costa, Cgroups, Daniel J Walsh,
	Daniel P. Berrange, KAMEZAWA Hiroyuki, Max Kellermann,
	Mandeep Singh Baines

On Wed, 1 Feb 2012 19:50:01 +0100
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> On Wed, Feb 01, 2012 at 08:31:26AM -0800, Tejun Heo wrote:
> > On Wed, Feb 01, 2012 at 04:37:40AM +0100, Frederic Weisbecker wrote:
> > > Changes In this version:
> > > 
> > > - Split 32/64 bits version of res_counter_write_u64() [1/10]
> > >   Courtesy of Kirill A. Shutemov
> > > 
> > > - Added Kirill's ack [8/10]
> > > 
> > > - Added selftests [9/10], [10/10]
> > > 
> > > Please consider for merging. At least two users want this feature:
> > 
> > Has there been further discussion about this approach?  IIRC, we
> > weren't sure whether this should be merged.
> 
> The doubts I have noticed were:
> 
> Q: Can't we rather focus on a global solution to fight forkbombs?
> 
> If we can find a reliable solution that works in any case and that
> prevent from any forkbomb to impact the rest of the system then it
> may be an acceptable solution. But I'm not aware of such feature.
> 
> Besides, another point in having this task counter is that we
> have a per container limit. Assuming all containers are running under
> the same user, we can protect against a container starving all others
> with a massive amount of processes close to the NR_PROC rlimit.
> 
> Q: Can/should we implement a limitation on the number of "fork" as well?
>    (as in https://lkml.org/lkml/2011/11/3/233 )
> 
> I'm still not sure about why such a thing is needed. Is it really something we
> want? Why can't the task counter be used instead?
> 
> I need more details from the author of this patch. But I doubt we can merge
> both subsystems, they have pretty different semantics.

What I struggle with is "is this feature useful enough to warrant
merging it"?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/10] cgroups: add res_counter_write_u64() API
  2012-02-01  3:37 ` [PATCH 01/10] cgroups: add res_counter_write_u64() API Frederic Weisbecker
@ 2012-02-02 12:33   ` Kirill A. Shutemov
  2012-02-02 13:56     ` Frederic Weisbecker
  0 siblings, 1 reply; 38+ messages in thread
From: Kirill A. Shutemov @ 2012-02-02 12:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, Tejun Heo, Li Zefan, LKML, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Tim Hockin, Tejun Heo, Containers,
	Glauber Costa, Cgroups, Daniel J Walsh, Daniel P. Berrange,
	KAMEZAWA Hiroyuki, Max Kellermann, Mandeep Singh Baines

On Wed, Feb 01, 2012 at 04:37:41AM +0100, Frederic Weisbecker wrote:
> +#if BITS_PER_LONG == 32
> +void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
> +{
> +	unsigned long long *target;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&counter->lock, flags);
> +	target = res_counter_member(counter, member);
> +	*target = val;

Nitpick: What's the point to have temporary variable here?

> +	spin_unlock_irqrestore(&counter->lock, flags);
> +}
> +#else
> +void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
> +{
> +	unsigned long long *target;
> +
> +	target = res_counter_member(counter, member);
> +	*target = val;

Ditto.

> +}
> +#endif

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/10] cgroups: add res_counter_write_u64() API
  2012-02-02 12:33   ` Kirill A. Shutemov
@ 2012-02-02 13:56     ` Frederic Weisbecker
  0 siblings, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-02 13:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Tejun Heo, Li Zefan, LKML, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Tim Hockin, Tejun Heo, Containers,
	Glauber Costa, Cgroups, Daniel J Walsh, Daniel P. Berrange,
	KAMEZAWA Hiroyuki, Max Kellermann, Mandeep Singh Baines

On Thu, Feb 02, 2012 at 02:33:22PM +0200, Kirill A. Shutemov wrote:
> On Wed, Feb 01, 2012 at 04:37:41AM +0100, Frederic Weisbecker wrote:
> > +#if BITS_PER_LONG == 32
> > +void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
> > +{
> > +	unsigned long long *target;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&counter->lock, flags);
> > +	target = res_counter_member(counter, member);
> > +	*target = val;
> 
> Nitpick: What's the point to have temporary variable here?

Dunno, just a matter of habit; I tend to avoid expressions
like *func(foo) = bar. It looks less readable to me, but
perhaps that's because I'm not used to it.

> 
> > +	spin_unlock_irqrestore(&counter->lock, flags);
> > +}
> > +#else
> > +void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
> > +{
> > +	unsigned long long *target;
> > +
> > +	target = res_counter_member(counter, member);
> > +	*target = val;
> 
> Ditto.
> 
> > +}
> > +#endif
> 
> -- 
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2012-02-01 19:51     ` Andrew Morton
@ 2012-02-02 14:50       ` Frederic Weisbecker
  2012-02-16 15:31         ` Frederic Weisbecker
  2012-03-01 22:53         ` Daniel Lezcano
  2012-03-05 16:44       ` Rik van Riel
  1 sibling, 2 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-02 14:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Li Zefan, LKML, Kirill A. Shutemov, Paul Menage,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Tim Hockin,
	Containers, Glauber Costa, Cgroups, Daniel J Walsh,
	Daniel P. Berrange, KAMEZAWA Hiroyuki, Max Kellermann,
	Mandeep Singh Baines

On Wed, Feb 01, 2012 at 11:51:07AM -0800, Andrew Morton wrote:
> On Wed, 1 Feb 2012 19:50:01 +0100
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > On Wed, Feb 01, 2012 at 08:31:26AM -0800, Tejun Heo wrote:
> > > On Wed, Feb 01, 2012 at 04:37:40AM +0100, Frederic Weisbecker wrote:
> > > > Changes In this version:
> > > > 
> > > > - Split 32/64 bits version of res_counter_write_u64() [1/10]
> > > >   Courtesy of Kirill A. Shutemov
> > > > 
> > > > - Added Kirill's ack [8/10]
> > > > 
> > > > - Added selftests [9/10], [10/10]
> > > > 
> > > > Please consider for merging. At least two users want this feature:
> > > 
> > > Has there been further discussion about this approach?  IIRC, we
> > > weren't sure whether this should be merged.
> > 
> > The doubts I have noticed were:
> > 
> > Q: Can't we rather focus on a global solution to fight forkbombs?
> > 
> > If we can find a reliable solution that works in any case and that
> > prevent from any forkbomb to impact the rest of the system then it
> > may be an acceptable solution. But I'm not aware of such feature.
> > 
> > Besides, another point in having this task counter is that we
> > have a per container limit. Assuming all containers are running under
> > the same user, we can protect against a container starving all others
> > with a massive amount of processes close to the NR_PROC rlimit.
> > 
> > Q: Can/should we implement a limitation on the number of "fork" as well?
> >    (as in https://lkml.org/lkml/2011/11/3/233 )
> > 
> > I'm still not sure about why such a thing is needed. Is it really something we
> > want? Why can't the task counter be used instead?
> > 
> > I need more details from the author of this patch. But I doubt we can merge
> > both subsystems, they have pretty different semantics.
> 
> What I struggle with is "is this feature useful enough to warrant
> merging it"?

The reason why I've been working on it is because we need this feature
(at least) for LXC.

Two people from our teams have jumped onto the discussion to express
that they want this feature and why:

https://lkml.org/lkml/2011/12/13/309
https://lkml.org/lkml/2011/12/13/364

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2012-02-02 14:50       ` Frederic Weisbecker
@ 2012-02-16 15:31         ` Frederic Weisbecker
  2012-03-01 22:53         ` Daniel Lezcano
  1 sibling, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2012-02-16 15:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Li Zefan, LKML, Kirill A. Shutemov, Paul Menage,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Tim Hockin,
	Containers, Glauber Costa, Cgroups, Daniel J Walsh,
	Daniel P. Berrange, KAMEZAWA Hiroyuki, Max Kellermann,
	Mandeep Singh Baines

2012/2/2 Frederic Weisbecker <fweisbec@gmail.com>:
> On Wed, Feb 01, 2012 at 11:51:07AM -0800, Andrew Morton wrote:
>> On Wed, 1 Feb 2012 19:50:01 +0100
>> Frederic Weisbecker <fweisbec@gmail.com> wrote:
>>
>> > On Wed, Feb 01, 2012 at 08:31:26AM -0800, Tejun Heo wrote:
>> > > On Wed, Feb 01, 2012 at 04:37:40AM +0100, Frederic Weisbecker wrote:
>> > > > Changes In this version:
>> > > >
>> > > > - Split 32/64 bits version of res_counter_write_u64() [1/10]
>> > > >   Courtesy of Kirill A. Shutemov
>> > > >
>> > > > - Added Kirill's ack [8/10]
>> > > >
>> > > > - Added selftests [9/10], [10/10]
>> > > >
>> > > > Please consider for merging. At least two users want this feature:
>> > >
>> > > Has there been further discussion about this approach?  IIRC, we
>> > > weren't sure whether this should be merged.
>> >
>> > The doubts I have noticed were:
>> >
>> > Q: Can't we rather focus on a global solution to fight forkbombs?
>> >
>> > If we can find a reliable solution that works in any case and that
>> > prevent from any forkbomb to impact the rest of the system then it
>> > may be an acceptable solution. But I'm not aware of such feature.
>> >
>> > Besides, another point in having this task counter is that we
>> > have a per container limit. Assuming all containers are running under
>> > the same user, we can protect against a container starving all others
>> > with a massive amount of processes close to the NR_PROC rlimit.
>> >
>> > Q: Can/should we implement a limitation on the number of "fork" as well?
>> >    (as in https://lkml.org/lkml/2011/11/3/233 )
>> >
>> > I'm still not sure about why such a thing is needed. Is it really something we
>> > want? Why can't the task counter be used instead?
>> >
>> > I need more details from the author of this patch. But I doubt we can merge
>> > both subsystems, they have pretty different semantics.
>>
>> What I struggle with is "is this feature useful enough to warrant
>> merging it"?
>
> The reason why I've been working on it is because we need this feature
> (at least) for LXC.
>
> Two people from our teams have jumped onto the discussion to express
> that they want this feature and why:
>
> https://lkml.org/lkml/2011/12/13/309
> https://lkml.org/lkml/2011/12/13/364

Ping?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2012-02-02 14:50       ` Frederic Weisbecker
  2012-02-16 15:31         ` Frederic Weisbecker
@ 2012-03-01 22:53         ` Daniel Lezcano
  2012-03-05  3:21           ` Frederic Weisbecker
  1 sibling, 1 reply; 38+ messages in thread
From: Daniel Lezcano @ 2012-03-01 22:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, Aditya Kali, Daniel P. Berrange, Max Kellermann,
	Tim Hockin, Glauber Costa, Paul Menage, Daniel J Walsh, LKML,
	Oleg Nesterov, Mandeep Singh Baines, Cgroups, Johannes Weiner,
	Tejun Heo, Containers, Papp Tamás, Ulli Horlacher

On 02/02/2012 03:50 PM, Frederic Weisbecker wrote:
> On Wed, Feb 01, 2012 at 11:51:07AM -0800, Andrew Morton wrote:
>> On Wed, 1 Feb 2012 19:50:01 +0100
>> Frederic Weisbecker<fweisbec@gmail.com>  wrote:
>>
>>> On Wed, Feb 01, 2012 at 08:31:26AM -0800, Tejun Heo wrote:
>>>> On Wed, Feb 01, 2012 at 04:37:40AM +0100, Frederic Weisbecker wrote:
>>>>> Changes In this version:
>>>>>
>>>>> - Split 32/64 bits version of res_counter_write_u64() [1/10]
>>>>>    Courtesy of Kirill A. Shutemov
>>>>>
>>>>> - Added Kirill's ack [8/10]
>>>>>
>>>>> - Added selftests [9/10], [10/10]
>>>>>
>>>>> Please consider for merging. At least two users want this feature:
>>>> Has there been further discussion about this approach?  IIRC, we
>>>> weren't sure whether this should be merged.
>>> The doubts I have noticed were:
>>>
>>> Q: Can't we rather focus on a global solution to fight forkbombs?
>>>
>>> If we can find a reliable solution that works in any case and that
>>> prevent from any forkbomb to impact the rest of the system then it
>>> may be an acceptable solution. But I'm not aware of such feature.
>>>
>>> Besides, another point in having this task counter is that we
>>> have a per container limit. Assuming all containers are running under
>>> the same user, we can protect against a container starving all others
>>> with a massive amount of processes close to the NR_PROC rlimit.
>>>
>>> Q: Can/should we implement a limitation on the number of "fork" as well?
>>>     (as in https://lkml.org/lkml/2011/11/3/233 )
>>>
>>> I'm still not sure about why such a thing is needed. Is it really something we
>>> want? Why can't the task counter be used instead?
>>>
>>> I need more details from the author of this patch. But I doubt we can merge
>>> both subsystems, they have pretty different semantics.
>> What I struggle with is "is this feature useful enough to warrant
>> merging it"?
> The reason why I've been working on it is because we need this feature
> (at least) for LXC.

This feature is a recurrent request from the users of LXC. Recently, a
container administrator complained that a user was able to crash the
entire host from inside a container.

http://sourceforge.net/mailarchive/message.php?msg_id=28915923

This feature is really useful to make the containers secure.

   -- Daniel

>
> Two people from our teams have jumped onto the discussion to express
> that they want this feature and why:
>
> https://lkml.org/lkml/2011/12/13/309
> https://lkml.org/lkml/2011/12/13/364
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2012-03-01 22:53         ` Daniel Lezcano
@ 2012-03-05  3:21           ` Frederic Weisbecker
  2012-03-05 16:26             ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Frederic Weisbecker @ 2012-03-05  3:21 UTC (permalink / raw)
  To: Daniel Lezcano, Andrew Morton, Tejun Heo, Li Zefan
  Cc: Aditya Kali, Daniel P. Berrange, Max Kellermann, Tim Hockin,
	Glauber Costa, Paul Menage, Daniel J Walsh, LKML, Oleg Nesterov,
	Mandeep Singh Baines, Cgroups, Johannes Weiner, Containers,
	Papp Tamás, Ulli Horlacher

On Thu, Mar 01, 2012 at 11:53:24PM +0100, Daniel Lezcano wrote:
> On 02/02/2012 03:50 PM, Frederic Weisbecker wrote:
> >On Wed, Feb 01, 2012 at 11:51:07AM -0800, Andrew Morton wrote:
> >>On Wed, 1 Feb 2012 19:50:01 +0100
> >>Frederic Weisbecker<fweisbec@gmail.com>  wrote:
> >>
> >>>On Wed, Feb 01, 2012 at 08:31:26AM -0800, Tejun Heo wrote:
> >>>>On Wed, Feb 01, 2012 at 04:37:40AM +0100, Frederic Weisbecker wrote:
> >>>>>Changes In this version:
> >>>>>
> >>>>>- Split 32/64 bits version of res_counter_write_u64() [1/10]
> >>>>>   Courtesy of Kirill A. Shutemov
> >>>>>
> >>>>>- Added Kirill's ack [8/10]
> >>>>>
> >>>>>- Added selftests [9/10], [10/10]
> >>>>>
> >>>>>Please consider for merging. At least two users want this feature:
> >>>>Has there been further discussion about this approach?  IIRC, we
> >>>>weren't sure whether this should be merged.
> >>>The doubts I have noticed were:
> >>>
> >>>Q: Can't we rather focus on a global solution to fight forkbombs?
> >>>
> >>>If we can find a reliable solution that works in any case and that
> >>>prevent from any forkbomb to impact the rest of the system then it
> >>>may be an acceptable solution. But I'm not aware of such feature.
> >>>
> >>>Besides, another point in having this task counter is that we
> >>>have a per container limit. Assuming all containers are running under
> >>>the same user, we can protect against a container starving all others
> >>>with a massive amount of processes close to the NR_PROC rlimit.
> >>>
> >>>Q: Can/should we implement a limitation on the number of "fork" as well?
> >>>    (as in https://lkml.org/lkml/2011/11/3/233 )
> >>>
> >>>I'm still not sure about why such a thing is needed. Is it really something we
> >>>want? Why can't the task counter be used instead?
> >>>
> >>>I need more details from the author of this patch. But I doubt we can merge
> >>>both subsystems, they have pretty different semantics.
> >>What I struggle with is "is this feature useful enough to warrant
> >>merging it"?
> >The reason why I've been working on it is because we need this feature
> >(at least) for LXC.
> 
> This feature is a recurrent request from the users of LXC. Recently,
> a container administrator complained an user was able to crash the
> entire host from a container.
> 
> http://sourceforge.net/mailarchive/message.php?msg_id=28915923
> 
> This feature is really useful to make the containers secure.

Time for me to try to wake up again this discussion.
Andrew, Tejun, Li, as you can see we don't lack the users
for this feature.

If you think we should solve our security problems on containers
by following another direction, please tell us so we know where to
go and we can move forward.

Otherwise please consider the task counter for merging.

Thanks.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2012-03-05  3:21           ` Frederic Weisbecker
@ 2012-03-05 16:26             ` Tejun Heo
  2012-03-05 16:27               ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2012-03-05 16:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Daniel Lezcano, Andrew Morton, Li Zefan, Aditya Kali,
	Daniel P. Berrange, Max Kellermann, Tim Hockin, Glauber Costa,
	Paul Menage, Daniel J Walsh, LKML, Oleg Nesterov,
	Mandeep Singh Baines, Cgroups, Johannes Weiner, Containers,
	Papp Tamás, Ulli Horlacher

Hello, Frederic.

On Mon, Mar 05, 2012 at 04:21:33AM +0100, Frederic Weisbecker wrote:
> Time for me to try to wake up again this discussion.
> Andrew, Tejun, Li, as you can see we don't lack the users
> for this feature.

Yes, I agree this can be a useful feature for certain use cases.  I'm
just mostly feeling reluctant about adding another utility type
controller without clear direction on how to proceed from here on
regarding multiple hierarchies (and I was on vacation :).  That said,
dealing with multiple hierarchies is a much longer term issue and
we'll probably have to continue on our current path for the time
being.

We're already into -rc6, so unless someone objects I'll try to merge
it after the coming merge window.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2012-03-05 16:26             ` Tejun Heo
@ 2012-03-05 16:27               ` Tejun Heo
  2012-03-05 16:48                 ` Frederic Weisbecker
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2012-03-05 16:27 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Daniel Lezcano, Andrew Morton, Li Zefan, Aditya Kali,
	Daniel P. Berrange, Max Kellermann, Tim Hockin, Glauber Costa,
	Paul Menage, Daniel J Walsh, LKML, Oleg Nesterov,
	Mandeep Singh Baines, Cgroups, Johannes Weiner, Containers,
	Papp Tamás, Ulli Horlacher

On Mon, Mar 05, 2012 at 08:26:52AM -0800, Tejun Heo wrote:
> Hello, Frederic.
> 
> On Mon, Mar 05, 2012 at 04:21:33AM +0100, Frederic Weisbecker wrote:
> > Time for me to try to wake up again this discussion.
> > Andrew, Tejun, Li, as you can see we don't lack the users
> > for this feature.
> 
> Yes, I agree this can be a useful feature for certain use cases.  I'm
> just mostly feeling reluctant about adding another utility type
> controller without clear direction on how to proceed from here on
> regarding multiple hierarchies (and I was on vacation :).  That said,
> dealing with multiple hierarchies is a much longer term issue and
> we'll probably have to continue on our current path for the time
> being.
> 
> We're already into -rc6, so unless someone objects I'll try to merge
> it after the coming merge window.

If I don't take action after the coming -rc1, please ping me.  Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2012-02-01 19:51     ` Andrew Morton
  2012-02-02 14:50       ` Frederic Weisbecker
@ 2012-03-05 16:44       ` Rik van Riel
  1 sibling, 0 replies; 38+ messages in thread
From: Rik van Riel @ 2012-03-05 16:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, Tejun Heo, Li Zefan, LKML,
	Kirill A. Shutemov, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Tim Hockin, Containers, Glauber Costa, Cgroups,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines

On 02/01/2012 02:51 PM, Andrew Morton wrote:
> On Wed, 1 Feb 2012 19:50:01 +0100
> Frederic Weisbecker<fweisbec@gmail.com>  wrote:
>
>> On Wed, Feb 01, 2012 at 08:31:26AM -0800, Tejun Heo wrote:
>>> On Wed, Feb 01, 2012 at 04:37:40AM +0100, Frederic Weisbecker wrote:
>>>> Changes In this version:
>>>>
>>>> - Split 32/64 bits version of res_counter_write_u64() [1/10]
>>>>    Courtesy of Kirill A. Shutemov
>>>>
>>>> - Added Kirill's ack [8/10]
>>>>
>>>> - Added selftests [9/10], [10/10]
>>>>
>>>> Please consider for merging. At least two users want this feature:
>>>
>>> Has there been further discussion about this approach?  IIRC, we
>>> weren't sure whether this should be merged.
>>
>> The doubts I have noticed were:
>>
>> Q: Can't we rather focus on a global solution to fight forkbombs?
>>
>> If we can find a reliable solution that works in any case and that
>> prevent from any forkbomb to impact the rest of the system then it
>> may be an acceptable solution. But I'm not aware of such feature.
>>
>> Besides, another point in having this task counter is that we
>> have a per container limit. Assuming all containers are running under
>> the same user, we can protect against a container starving all others
>> with a massive amount of processes close to the NR_PROC rlimit.

> What I struggle with is "is this feature useful enough to warrant
> merging it"?

I have seen Thunderbird create as many child processes
as it could (until I hit my rlimit NR_PROC), and have
seen web servers go wrong under a combination of load
and buggy scripts, forking as many processes as they
could.

Since we know rlimit NR_PROC is useful, having the
equivalent per cgroup will be useful, too.

What we need to lose is the focus on malicious
forkbombs - buggy programs are a real issue, and
protecting against them is useful.


* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2012-03-05 16:27               ` Tejun Heo
@ 2012-03-05 16:48                 ` Frederic Weisbecker
  0 siblings, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2012-03-05 16:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Daniel Lezcano, Andrew Morton, Li Zefan, Aditya Kali,
	Daniel P. Berrange, Max Kellermann, Tim Hockin, Glauber Costa,
	Paul Menage, Daniel J Walsh, LKML, Oleg Nesterov,
	Mandeep Singh Baines, Cgroups, Johannes Weiner, Containers,
	Papp Tamás, Ulli Horlacher

On Mon, Mar 05, 2012 at 08:27:39AM -0800, Tejun Heo wrote:
> On Mon, Mar 05, 2012 at 08:26:52AM -0800, Tejun Heo wrote:
> > Hello, Frederic.
> > 
> > On Mon, Mar 05, 2012 at 04:21:33AM +0100, Frederic Weisbecker wrote:
> > > Time for me to try to wake up again this discussion.
> > > Andrew, Tejun, Li, as you can see we don't lack the users
> > > for this feature.
> > 
> > Yes, I agree this can be a useful feature for certain use cases.  I'm
> > just mostly feeling reluctant about adding another utility type
> > controller without clear direction on how to proceed from here on
> > regarding multiple hierarchies (and I was on vacation :).  That said,
> > dealing with multiple hierarchies is a much longer term issue and
> > we'll probably have to continue on our current path for the time
> > being.
> > 
> > We're already into -rc6, so unless someone objects I'll try to merge
> > it after the coming merge window.

Thanks a lot!

> 
> If I don't take action after the coming -rc1, please ping me.  Thanks.

Sure I will :)


* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2012-02-01  3:37 [PATCH 00/10] cgroups: Task counter subsystem v8 Frederic Weisbecker
                   ` (10 preceding siblings ...)
  2012-02-01 16:31 ` [PATCH 00/10] cgroups: Task counter subsystem v8 Tejun Heo
@ 2013-04-01 18:43 ` Tim Hockin
  2013-04-01 18:46   ` Tejun Heo
  11 siblings, 1 reply; 38+ messages in thread
From: Tim Hockin @ 2013-04-01 18:43 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, Tejun Heo, Li Zefan, LKML, Kirill A. Shutemov,
	Paul Menage, Johannes Weiner, Aditya Kali, Oleg Nesterov,
	Containers, Glauber Costa, Cgroups, Daniel J Walsh,
	Daniel P. Berrange, KAMEZAWA Hiroyuki, Max Kellermann,
	Mandeep Singh Baines

A year later - whatever happened with this?  I want it more than ever
for Google's use.

On Tue, Jan 31, 2012 at 7:37 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> Hi,
>
> Changes In this version:
>
> - Split 32/64 bits version of res_counter_write_u64() [1/10]
>   Courtesy of Kirill A. Shutemov
>
> - Added Kirill's ack [8/10]
>
> - Added selftests [9/10], [10/10]
>
> Please consider for merging. At least two users want this feature:
> https://lkml.org/lkml/2011/12/13/309
> https://lkml.org/lkml/2011/12/13/364
>
> More general details provided in the last version posting:
> https://lkml.org/lkml/2012/1/13/230
>
> Thanks!
>
>
> Frederic Weisbecker (9):
>   cgroups: add res_counter_write_u64() API
>   cgroups: new resource counter inheritance API
>   cgroups: ability to stop res charge propagation on bounded ancestor
>   res_counter: allow charge failure pointer to be null
>   cgroups: pull up res counter charge failure interpretation to caller
>   cgroups: allow subsystems to cancel a fork
>   cgroups: Add a task counter subsystem
>   selftests: Enter each directories before executing selftests
>   selftests: Add a new task counter selftest
>
> Kirill A. Shutemov (1):
>   cgroups: add res counter common ancestor searching
>
>  Documentation/cgroups/resource_counter.txt         |   20 ++-
>  Documentation/cgroups/task_counter.txt             |  153 +++++++++++
>  include/linux/cgroup.h                             |   20 +-
>  include/linux/cgroup_subsys.h                      |    5 +
>  include/linux/res_counter.h                        |   27 ++-
>  init/Kconfig                                       |    9 +
>  kernel/Makefile                                    |    1 +
>  kernel/cgroup.c                                    |   23 ++-
>  kernel/cgroup_freezer.c                            |    6 +-
>  kernel/cgroup_task_counter.c                       |  272 ++++++++++++++++++++
>  kernel/exit.c                                      |    2 +-
>  kernel/fork.c                                      |    7 +-
>  kernel/res_counter.c                               |  103 +++++++-
>  tools/testing/selftests/Makefile                   |    2 +-
>  tools/testing/selftests/run_tests                  |    6 +-
>  tools/testing/selftests/task_counter/Makefile      |    8 +
>  tools/testing/selftests/task_counter/fork.c        |   40 +++
>  tools/testing/selftests/task_counter/forkbomb.c    |   40 +++
>  tools/testing/selftests/task_counter/multithread.c |   68 +++++
>  tools/testing/selftests/task_counter/run_test      |  198 ++++++++++++++
>  .../selftests/task_counter/spread_thread_group.c   |   82 ++++++
>  21 files changed, 1056 insertions(+), 36 deletions(-)
>  create mode 100644 Documentation/cgroups/task_counter.txt
>  create mode 100644 kernel/cgroup_task_counter.c
>  create mode 100644 tools/testing/selftests/task_counter/Makefile
>  create mode 100644 tools/testing/selftests/task_counter/fork.c
>  create mode 100644 tools/testing/selftests/task_counter/forkbomb.c
>  create mode 100644 tools/testing/selftests/task_counter/multithread.c
>  create mode 100755 tools/testing/selftests/task_counter/run_test
>  create mode 100644 tools/testing/selftests/task_counter/spread_thread_group.c
>
> --
> 1.7.5.4
>


* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2013-04-01 18:43 ` Tim Hockin
@ 2013-04-01 18:46   ` Tejun Heo
  2013-04-01 20:09     ` Tim Hockin
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2013-04-01 18:46 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, LKML,
	Kirill A. Shutemov, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Containers, Glauber Costa, Cgroups,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines

On Mon, Apr 01, 2013 at 11:43:03AM -0700, Tim Hockin wrote:
> A year later - what ever happened with this?  I want it more than ever
> for Google's use.

I think the conclusion was "use kmemcg instead".

Thanks.

-- 
tejun


* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2013-04-01 18:46   ` Tejun Heo
@ 2013-04-01 20:09     ` Tim Hockin
  2013-04-01 20:29       ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Tim Hockin @ 2013-04-01 20:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, LKML,
	Kirill A. Shutemov, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Containers, Glauber Costa, Cgroups,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines

On Mon, Apr 1, 2013 at 11:46 AM, Tejun Heo <tj@kernel.org> wrote:
> On Mon, Apr 01, 2013 at 11:43:03AM -0700, Tim Hockin wrote:
>> A year later - what ever happened with this?  I want it more than ever
>> for Google's use.
>
> I think the conclusion was "use kmemcg instead".

Pardon my ignorance, but... what?  Use kernel memory limits as a proxy
for process/thread counts?  That sounds terrible - I hope I am
misunderstanding?  This task counter patch had several properties that
mapped very well to what we want.

Is it dead in the water?


* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2013-04-01 20:09     ` Tim Hockin
@ 2013-04-01 20:29       ` Tejun Heo
  2013-04-01 21:02         ` Tim Hockin
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2013-04-01 20:29 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, LKML,
	Kirill A. Shutemov, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Containers, Glauber Costa, Cgroups,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines

On Mon, Apr 01, 2013 at 01:09:09PM -0700, Tim Hockin wrote:
> Pardon my ignorance, but... what?  Use kernel memory limits as a proxy
> for process/thread counts?  That sounds terrible - I hope I am

Well, the argument was that process / thread counts were a poor and
unnecessary proxy for kernel memory consumption limit.  IIRC, Johannes
put it as (I'm paraphrasing) "you can't go to Fry's and buy 4k thread
worth of component".

> misunderstanding?  This task counter patch had several properties that
> mapped very well to what we want.
> 
> Is it dead in the water?

After some discussion, Frederic agreed that at least his use case can
be served well by kmemcg, maybe even better - IIRC it was container
fork bomb scenario, so you'll have to argue your way in why kmemcg
isn't a suitable solution for your use case if you wanna revive this.

Thanks.

-- 
tejun


* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2013-04-01 20:29       ` Tejun Heo
@ 2013-04-01 21:02         ` Tim Hockin
  2013-04-01 22:03           ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Tim Hockin @ 2013-04-01 21:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, LKML,
	Kirill A. Shutemov, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Containers, Glauber Costa, Cgroups,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines

On Mon, Apr 1, 2013 at 1:29 PM, Tejun Heo <tj@kernel.org> wrote:
> On Mon, Apr 01, 2013 at 01:09:09PM -0700, Tim Hockin wrote:
>> Pardon my ignorance, but... what?  Use kernel memory limits as a proxy
>> for process/thread counts?  That sounds terrible - I hope I am
>
> Well, the argument was that process / thread counts were a poor and
> unnecessary proxy for kernel memory consumption limit.  IIRC, Johannes
> put it as (I'm paraphrasing) "you can't go to Fry's and buy 4k thread
> worth of component".
>
>> misunderstanding?  This task counter patch had several properties that
>> mapped very well to what we want.
>>
>> Is it dead in the water?
>
> After some discussion, Frederic agreed that at least his use case can
> be served well by kmemcg, maybe even better - IIRC it was container
> fork bomb scenario, so you'll have to argue your way in why kmemcg
> isn't a suitable solution for your use case if you wanna revive this.

We run dozens of jobs from dozens users on a single machine.  We
regularly experience users who leak threads, running into the tens of
thousands.  We are unable to raise the PID_MAX significantly due to
some bad, but really thoroughly baked-in decisions that were made a
long time ago.  What we experience on a daily basis is users
complaining about getting a "pthread_create(): resource unavailable"
error because someone on the machine has leaked.

Today we use RLIMIT_NPROC to lock most users down to a smaller max.
But this is a per-user setting, not a per-container setting, and users
do not control where their jobs land.  Scheduling decisions often put
multiple thread-heavy but non-leaking jobs from one user onto the same
machine, which again causes problems.  Further, it does not help for
some of our use cases where a logical job can run as multiple UIDs for
different processes within.

From the end-user point of view this is an isolation leak which is
totally non-deterministic for them.  They can not know what to plan
for.  Getting cgroup-level control of this limit is important for a
saner SLA for our users.

In addition, the behavior around locking-out new tasks seems like a
nice way to simplify and clean up end-life work for the administrative
system.  Admittedly, we can mostly work around this with freezer
instead.

What I really don't understand is why so much push back?  We have this
nicely structured cgroup system.  Each cgroup controller's code is
pretty well partitioned - why would we not want more complete
functionality built around it?  We accept device drivers for the most
random, useless crap on the assertion that "if you don't need it,
don't compile it in".  I can think of a half dozen more really useful,
cool things we can do with cgroups, but I know the pushback will be
tremendous, and I just don't grok why.

Tim


* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2013-04-01 21:02         ` Tim Hockin
@ 2013-04-01 22:03           ` Tejun Heo
  2013-04-01 22:20             ` Tim Hockin
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2013-04-01 22:03 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, LKML,
	Kirill A. Shutemov, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Containers, Glauber Costa, Cgroups,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines

Hello, Tim.

On Mon, Apr 01, 2013 at 02:02:06PM -0700, Tim Hockin wrote:
> We run dozens of jobs from dozens users on a single machine.  We
> regularly experience users who leak threads, running into the tens of
> thousands.  We are unable to raise the PID_MAX significantly due to
> some bad, but really thoroughly baked-in decisions that were made a
> long time ago.  What we experience on a daily basis is users

Ummmm.... so that's why you guys can't use kernel memory limit? :(

> complaining about getting a "pthread_create(): resource unavailable"
> error because someone on the machine has leaked.
...
> What I really don't understand is why so much push back?  We have this
> nicely structured cgroup system.  Each cgroup controller's code is
> pretty well partitioned - why would we not want more complete
> functionality built around it?  We accept device drivers for the most
> random, useless crap on the assertion that "if you don't need it,
> don't compile it in".  I can think of a half dozen more really useful,
> cool things we can do with cgroups, but I know the pushback will be
> tremendous, and I just don't grok why.

In general, because it adds to maintenance overhead.  e.g. We've been
trying to make all cgroups follow consistent nesting rules.  We're now
almost there with a couple controllers left.  This one would have been
another thing to do, which is fine if it's necessary but if it isn't
we're just adding up work for no good reason.

More importantly, because cgroup is already plagued with so many bad
design decisions - some from core design decisions - e.g. not being
able to actually identify a resource outside of a context of a task.
Others are added on by each controller going out doing whatever it
wants without thinking about how the whole thing would come together
afterwards - e.g. double accounting between cpu and cpuacct,
completely illogical and unusable hierarchy implementations in
anything other than cpu controllers (they're getting better), and so
on.  Right now it's in a state where there's not many things coherent
about it.  Sure, every controller and feature supports the ones their
makers intended them to but when collected together it's just a mess,
which is one of the common complaints against cgroup.

So, no free-for-all, please.

Thanks.

-- 
tejun


* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2013-04-01 22:03           ` Tejun Heo
@ 2013-04-01 22:20             ` Tim Hockin
  2013-04-01 22:35               ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Tim Hockin @ 2013-04-01 22:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, LKML,
	Kirill A. Shutemov, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Containers, Glauber Costa, Cgroups,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines

On Mon, Apr 1, 2013 at 3:03 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Tim.
>
> On Mon, Apr 01, 2013 at 02:02:06PM -0700, Tim Hockin wrote:
>> We run dozens of jobs from dozens users on a single machine.  We
>> regularly experience users who leak threads, running into the tens of
>> thousands.  We are unable to raise the PID_MAX significantly due to
>> some bad, but really thoroughly baked-in decisions that were made a
>> long time ago.  What we experience on a daily basis is users
>
> Ummmm.... so that's why you guys can't use kernel memory limit? :(

Because it is completely non-obvious how to map between the two in a
way that is safe across kernel versions and not likely to blow up in
our faces.  It's a hack, in other words.

>> complaining about getting a "pthread_create(): resource unavailable"
>> error because someone on the machine has leaked.
> ...
>> What I really don't understand is why so much push back?  We have this
>> nicely structured cgroup system.  Each cgroup controller's code is
>> pretty well partitioned - why would we not want more complete
>> functionality built around it?  We accept device drivers for the most
>> random, useless crap on the assertion that "if you don't need it,
>> don't compile it in".  I can think of a half dozen more really useful,
>> cool things we can do with cgroups, but I know the pushback will be
>> tremendous, and I just don't grok why.
>
> In general, because it adds to maintenance overhead.  e.g. We've been
> trying to make all cgroups follow consistent nesting rules.  We're now
> almost there with a couple controllers left.  This one would have been
> another thing to do, which is fine if it's necessary but if it isn't
> we're just adding up work for no good reason.
>
> More importantly, because cgroup is already plagued with so many bad
> design decisions - some from core design decisions - e.g. not being
> able to actually identify a resource outside of a context of a task.
> Others are added on by each controller going out doing whatever it
> wants without thinking about how the whole thing would come together
> afterwards - e.g. double accounting between cpu and cpuacct,
> completely illogical and unusable hierarchy implementations in
> anything other than cpu controllers (they're getting better), and so
> on.  Right now it's in a state where there's not many things coherent
> about it.  Sure, every controller and feature supports the ones their
> makers intended them to but when collected together it's just a mess,
> which is one of the common complaints against cgroup.
>
> So, no free-for-all, please.
>
> Thanks.
>
> --
> tejun


* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2013-04-01 22:20             ` Tim Hockin
@ 2013-04-01 22:35               ` Tejun Heo
  2013-04-01 22:57                 ` Tim Hockin
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2013-04-01 22:35 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, LKML,
	Kirill A. Shutemov, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Containers, Glauber Costa, Cgroups,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines

Hey,

On Mon, Apr 01, 2013 at 03:20:47PM -0700, Tim Hockin wrote:
> > Ummmm.... so that's why you guys can't use kernel memory limit? :(
> 
> Because it is completely non-obvious how to map between the two in a
> way that is safe across kernel versions and not likely to blow up in
> our faces.  It's a hack, in other words.

Now we're repeating the argument Frederic and Johannes had, so I'd
suggest going back through the thread and reading the discussion; if
you still think using kmemcg is a bad idea, please explain why that
is so.  For the specific point that you just raised, the scale tilted
toward thread/process count being a hackier and more unreliable
representation of kernel memory resource than the other way around,
at least back then.
If you think you can tilt it the other way, please feel free to try.

Thanks.

-- 
tejun


* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2013-04-01 22:35               ` Tejun Heo
@ 2013-04-01 22:57                 ` Tim Hockin
  2013-04-01 23:18                   ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Tim Hockin @ 2013-04-01 22:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, LKML,
	Kirill A. Shutemov, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Containers, Glauber Costa, Cgroups,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines

On Mon, Apr 1, 2013 at 3:35 PM, Tejun Heo <tj@kernel.org> wrote:
> Hey,
>
> On Mon, Apr 01, 2013 at 03:20:47PM -0700, Tim Hockin wrote:
>> > Ummmm.... so that's why you guys can't use kernel memory limit? :(
>>
>> Because it is completely non-obvious how to map between the two in a
>> way that is safe across kernel versions and not likely to blow up in
>> our faces.  It's a hack, in other words.
>
> Now we're repeating the argument Frederic and Johannes had, so I'd
> suggest going back through the thread and reading the discussion; if
> you still think using kmemcg is a bad idea, please explain why that
> is so.  For the specific point that you just raised, the scale tilted
> toward thread/process count being a hackier and more unreliable
> representation of kernel memory resource than the other way around,
> at least back then.

I am not limited by kernel memory, I am limited by PIDs, and I need to
be able to manage them.  memory.kmem.usage_in_bytes seems to be far
too noisy to be useful for this purpose.  It may work fine for "just
stop a fork bomb" but not for any sort of finer-grained control.

> If you think you can tilt it the other way, please feel free to try.

Just because others caved, doesn't make it less of a hack.  And I will
cave, too, because I don't have time to bang my head against a wall,
especially when I can see the remnants of other people who have tried.

We'll work around it, or we'll hack around it, or we'll carry this
patch in our own tree and just grumble about ridiculous hacks every
time we have to forward port it.

I was just hoping that things had worked themselves out in the last year.


* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2013-04-01 22:57                 ` Tim Hockin
@ 2013-04-01 23:18                   ` Tejun Heo
  2013-04-02  0:07                     ` Tim Hockin
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2013-04-01 23:18 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, LKML,
	Kirill A. Shutemov, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Containers, Glauber Costa, Cgroups,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines

Hello,

On Mon, Apr 01, 2013 at 03:57:46PM -0700, Tim Hockin wrote:
> I am not limited by kernel memory, I am limited by PIDs, and I need to
> be able to manage them.  memory.kmem.usage_in_bytes seems to be far
> too noisy to be useful for this purpose.  It may work fine for "just
> stop a fork bomb" but not for any sort of finer-grained control.

So, why are you limited by PIDs other than the arcane / weird
limitation that you have wherever that limitation is?

> > If you think you can tilt it the other way, please feel free to try.
> 
> Just because others caved, doesn't make it less of a hack.  And I will
> cave, too, because I don't have time to bang my head against a wall,
> especially when I can see the remnants of other people who have tried.
> 
> We'll work around it, or we'll hack around it, or we'll carry this
> patch in our own tree and just grumble about ridiculous hacks every
> time we have to forward port it.
> 
> I was just hoping that things had worked themselves out in the last year.

It's kinda weird getting this response, as I don't think it has been
particularly walley.  The arguments were pretty sound from what I
recall and Frederic's use case was actually better covered by kmemcg,
so where's the said wall?  And I asked you why your use case is
different and the only reason you gave me is some arbitrary PID
limitation on whatever thing you're using, which you gotta agree is a
pretty hard sell.  So, if you think you have a valid case, please just
explain it.  Why go passive-aggressive on it?  If you don't have a
valid case for pushing it, yes, you'll have to hack around it - carry
the patches in your tree, whatever, or better, fix the weird PID
problem.

-- 
tejun


* Re: [PATCH 00/10] cgroups: Task counter subsystem v8
  2013-04-01 23:18                   ` Tejun Heo
@ 2013-04-02  0:07                     ` Tim Hockin
  0 siblings, 0 replies; 38+ messages in thread
From: Tim Hockin @ 2013-04-02  0:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Andrew Morton, Li Zefan, LKML,
	Kirill A. Shutemov, Paul Menage, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Containers, Glauber Costa, Cgroups,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines

On Mon, Apr 1, 2013 at 4:18 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Mon, Apr 01, 2013 at 03:57:46PM -0700, Tim Hockin wrote:
>> I am not limited by kernel memory, I am limited by PIDs, and I need to
>> be able to manage them.  memory.kmem.usage_in_bytes seems to be far
>> too noisy to be useful for this purpose.  It may work fine for "just
>> stop a fork bomb" but not for any sort of finer-grained control.
>
> So, why are you limited by PIDs other than the arcane / weird
> limitation that you have wherever that limitation is?

Does anyone anywhere actually set PID_MAX > 64K?  As far as I can
tell, distros default it to 32K or 64K because there's a lot of stuff
out there that assumes this to be true.  This is the problem we have -
deep down in the bowels of code that is taking literally years to
overhaul, we have identified a bad assumption that PIDs are always 5
characters long.  I can't fix it any faster.  That said, we also
identified other software that make similar assumptions, though they
are less critical to us.

>> > If you think you can tilt it the other way, please feel free to try.
>>
>> Just because others caved, doesn't make it less of a hack.  And I will
>> cave, too, because I don't have time to bang my head against a wall,
>> especially when I can see the remnants of other people who have tried.
>>
>> We'll work around it, or we'll hack around it, or we'll carry this
>> patch in our own tree and just grumble about ridiculous hacks every
>> time we have to forward port it.
>>
>> I was just hoping that things had worked themselves out in the last year.
>
> It's kinda weird getting this response, as I don't think it has been
> particularly walley.  The arguments were pretty sound from what I
> recall and Frederic's use case was actually better covered by kmemcg,
> so where's the said wall?  And I asked you why your use case is
> different and the only reason you gave me is some arbitrary PID
> limitation on whatever thing you're using, which you gotta agree is a
> pretty hard sell.  So, if you think you have a valid case, please just
> explain it.  Why go passive-aggressive on it?  If you don't have a
> valid case for pushing it, yes, you'll have to hack around it - carry
> the patches in your tree, whatever, or better, fix the weird PID
> problem.

Sorry Tejun, you're being very reasonable, I was not.  The history of
this patch is what makes me frustrated.  It seems like such an obvious
thing to support that it blows my mind that people argue it.

You know our environment.  Users can use their memory budgets however
they like - kernel or userspace.  We have good accounting, but we are
PID limited.  We've even implemented some hacks of our own to make
that hurt less because the previously-mentioned assumptions are just
NOT going away any time soon.  I literally have user bugs every week
on this.  Hopefully the hacks we have put in place will make the users
stop hurting.  But we're left with some residual problems, some of
which are because the only limits we can apply are per-user rather
than per-container.

From our POV building the cluster, cgroups are strictly superior to
most other control interfaces because they work at the same
granularity that we do.  I want more things to support cgroup control.
This particular one was double-tasty because the ability to set the
limit to 0 would actually solve a different problem we have in
teardown.  But as I said, we can mostly work around that.

So I am frustrated because I don't think my use case will convince you
(at the root of it, it is a problem of our own making, but it LONG
predates me), despite my belief that it is obviously a good feature.
I find myself hoping that someone else comes along and says "me too"
rather than using a totally different hack for this.

Oh well.  Thanks for the update.  Off to do our own thing again.


> --
> tejun


* Re: [PATCH 01/10] cgroups: Add res_counter_write_u64() API
  2011-10-04  0:17   ` Kirill A. Shutemov
@ 2011-10-11 13:44     ` Frederic Weisbecker
  0 siblings, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2011-10-11 13:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Tue, Oct 04, 2011 at 03:17:51AM +0300, Kirill A. Shutemov wrote:
> On Mon, Oct 03, 2011 at 09:07:03PM +0200, Frederic Weisbecker wrote:
> > Extend the resource counter API with a mirror of
> > res_counter_read_u64() to make it handy to update a resource
> > counter value from a cgroup subsystem u64 value file.
> > 
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > Acked-by: Paul Menage <paul@paulmenage.org>
> > Cc: Li Zefan <lizf@cn.fujitsu.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Aditya Kali <adityakali@google.com>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Kay Sievers <kay.sievers@vrfy.org>
> > Cc: Tim Hockin <thockin@hockin.org>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Kirill A. Shutemov <kirill@shutemov.name>
> > Cc: Containers <containers@lists.linux-foundation.org>
> > ---
> >  include/linux/res_counter.h |    2 ++
> >  kernel/res_counter.c        |   25 +++++++++++++++++++------
> >  2 files changed, 21 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index c9d625c..1b3fe05 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -82,6 +82,8 @@ int res_counter_memparse_write_strategy(const char *buf,
> >  int res_counter_write(struct res_counter *counter, int member,
> >  		      const char *buffer, write_strategy_fn write_strategy);
> >  
> > +void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
> > +
> >  /*
> >   * the field descriptors. one for each member of res_counter
> >   */
> > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > index 34683ef..0faafcc 100644
> > --- a/kernel/res_counter.c
> > +++ b/kernel/res_counter.c
> > @@ -168,12 +168,26 @@ int res_counter_memparse_write_strategy(const char *buf,
> >  	return 0;
> >  }
> >  
> > +void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
> > +{
> > +	unsigned long long *target;
> > +	unsigned long flags;
> > +
> > +	/*
> > +	 * We need the lock to protect against concurrent add/dec on 32 bits.
> > +	 * No need to ifdef, it's seldom used.
> > +	 */
> 
> Should we have two versions of res_counter_write_u64() for 32-bit and
> 64-bit hosts? As with res_counter_read_u64().

I thought about it, yeah.

But this is only used in the rare case where the user writes the value from
the cgroup filesystem. In practice the overhead won't be noticed.
I fear that two versions would rather uglify the code with more ifdeffery.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/10] cgroups: Add res_counter_write_u64() API
  2011-10-03 19:07 ` [PATCH 01/10] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
@ 2011-10-04  0:17   ` Kirill A. Shutemov
  2011-10-11 13:44     ` Frederic Weisbecker
  0 siblings, 1 reply; 38+ messages in thread
From: Kirill A. Shutemov @ 2011-10-04  0:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, LKML, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Mon, Oct 03, 2011 at 09:07:03PM +0200, Frederic Weisbecker wrote:
> Extend the resource counter API with a mirror of
> res_counter_read_u64() to make it handy to update a resource
> counter value from a cgroup subsystem u64 value file.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Acked-by: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Aditya Kali <adityakali@google.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: Tim Hockin <thockin@hockin.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Containers <containers@lists.linux-foundation.org>
> ---
>  include/linux/res_counter.h |    2 ++
>  kernel/res_counter.c        |   25 +++++++++++++++++++------
>  2 files changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index c9d625c..1b3fe05 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -82,6 +82,8 @@ int res_counter_memparse_write_strategy(const char *buf,
>  int res_counter_write(struct res_counter *counter, int member,
>  		      const char *buffer, write_strategy_fn write_strategy);
>  
> +void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
> +
>  /*
>   * the field descriptors. one for each member of res_counter
>   */
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 34683ef..0faafcc 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -168,12 +168,26 @@ int res_counter_memparse_write_strategy(const char *buf,
>  	return 0;
>  }
>  
> +void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
> +{
> +	unsigned long long *target;
> +	unsigned long flags;
> +
> +	/*
> +	 * We need the lock to protect against concurrent add/dec on 32 bits.
> > +	 * No need to ifdef, it's seldom used.
> +	 */

Should we have two versions of res_counter_write_u64() for 32-bit and
64-bit hosts? As with res_counter_read_u64().

> +	spin_lock_irqsave(&counter->lock, flags);
> +	target = res_counter_member(counter, member);
> +	*target = val;
> +	spin_unlock_irqrestore(&counter->lock, flags);
> +}
> +
>  int res_counter_write(struct res_counter *counter, int member,
>  		      const char *buf, write_strategy_fn write_strategy)
>  {
>  	char *end;
> -	unsigned long flags;
> -	unsigned long long tmp, *val;
> +	unsigned long long tmp;
>  
>  	if (write_strategy) {
>  		if (write_strategy(buf, &tmp))
> @@ -183,9 +197,8 @@ int res_counter_write(struct res_counter *counter, int member,
>  		if (*end != '\0')
>  			return -EINVAL;
>  	}
> -	spin_lock_irqsave(&counter->lock, flags);
> -	val = res_counter_member(counter, member);
> -	*val = tmp;
> -	spin_unlock_irqrestore(&counter->lock, flags);
> +
> +	res_counter_write_u64(counter, member, tmp);
> +
>  	return 0;
>  }
> -- 
> 1.7.5.4
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 01/10] cgroups: Add res_counter_write_u64() API
  2011-10-03 19:07 [PATCH 00/10] cgroups: Task counter subsystem v6 Frederic Weisbecker
@ 2011-10-03 19:07 ` Frederic Weisbecker
  2011-10-04  0:17   ` Kirill A. Shutemov
  0 siblings, 1 reply; 38+ messages in thread
From: Frederic Weisbecker @ 2011-10-03 19:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Frederic Weisbecker, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Kirill A. Shutemov, Containers

Extend the resource counter API with a mirror of
res_counter_read_u64() to make it handy to update a resource
counter value from a cgroup subsystem u64 value file.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Containers <containers@lists.linux-foundation.org>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   25 +++++++++++++++++++------
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..1b3fe05 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -82,6 +82,8 @@ int res_counter_memparse_write_strategy(const char *buf,
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buffer, write_strategy_fn write_strategy);
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 34683ef..0faafcc 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -168,12 +168,26 @@ int res_counter_memparse_write_strategy(const char *buf,
 	return 0;
 }
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	unsigned long long *target;
+	unsigned long flags;
+
+	/*
+	 * We need the lock to protect against concurrent add/dec on 32 bits.
+	 * No need to ifdef, it's seldom used.
+	 */
+	spin_lock_irqsave(&counter->lock, flags);
+	target = res_counter_member(counter, member);
+	*target = val;
+	spin_unlock_irqrestore(&counter->lock, flags);
+}
+
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buf, write_strategy_fn write_strategy)
 {
 	char *end;
-	unsigned long flags;
-	unsigned long long tmp, *val;
+	unsigned long long tmp;
 
 	if (write_strategy) {
 		if (write_strategy(buf, &tmp))
@@ -183,9 +197,8 @@ int res_counter_write(struct res_counter *counter, int member,
 		if (*end != '\0')
 			return -EINVAL;
 	}
-	spin_lock_irqsave(&counter->lock, flags);
-	val = res_counter_member(counter, member);
-	*val = tmp;
-	spin_unlock_irqrestore(&counter->lock, flags);
+
+	res_counter_write_u64(counter, member, tmp);
+
 	return 0;
 }
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2013-04-02  0:08 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-01  3:37 [PATCH 00/10] cgroups: Task counter subsystem v8 Frederic Weisbecker
2012-02-01  3:37 ` [PATCH 01/10] cgroups: add res_counter_write_u64() API Frederic Weisbecker
2012-02-02 12:33   ` Kirill A. Shutemov
2012-02-02 13:56     ` Frederic Weisbecker
2012-02-01  3:37 ` [PATCH 02/10] cgroups: new resource counter inheritance API Frederic Weisbecker
2012-02-01  3:37 ` [PATCH 03/10] cgroups: ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
2012-02-01  3:37 ` [PATCH 04/10] cgroups: add res counter common ancestor searching Frederic Weisbecker
2012-02-01  3:37 ` [PATCH 05/10] res_counter: allow charge failure pointer to be null Frederic Weisbecker
2012-02-01  3:37 ` [PATCH 06/10] cgroups: pull up res counter charge failure interpretation to caller Frederic Weisbecker
2012-02-01  3:37 ` [PATCH 07/10] cgroups: allow subsystems to cancel a fork Frederic Weisbecker
2012-02-01  3:37 ` [PATCH 08/10] cgroups: Add a task counter subsystem Frederic Weisbecker
2012-02-01  3:37 ` [PATCH 09/10] selftests: Enter each directories before executing selftests Frederic Weisbecker
2012-02-01  3:37 ` [PATCH 10/10] selftests: Add a new task counter selftest Frederic Weisbecker
2012-02-01 16:31 ` [PATCH 00/10] cgroups: Task counter subsystem v8 Tejun Heo
2012-02-01 18:50   ` Frederic Weisbecker
2012-02-01 19:51     ` Andrew Morton
2012-02-02 14:50       ` Frederic Weisbecker
2012-02-16 15:31         ` Frederic Weisbecker
2012-03-01 22:53         ` Daniel Lezcano
2012-03-05  3:21           ` Frederic Weisbecker
2012-03-05 16:26             ` Tejun Heo
2012-03-05 16:27               ` Tejun Heo
2012-03-05 16:48                 ` Frederic Weisbecker
2012-03-05 16:44       ` Rik van Riel
2013-04-01 18:43 ` Tim Hockin
2013-04-01 18:46   ` Tejun Heo
2013-04-01 20:09     ` Tim Hockin
2013-04-01 20:29       ` Tejun Heo
2013-04-01 21:02         ` Tim Hockin
2013-04-01 22:03           ` Tejun Heo
2013-04-01 22:20             ` Tim Hockin
2013-04-01 22:35               ` Tejun Heo
2013-04-01 22:57                 ` Tim Hockin
2013-04-01 23:18                   ` Tejun Heo
2013-04-02  0:07                     ` Tim Hockin
  -- strict thread matches above, loose matches on Subject: below --
2011-10-03 19:07 [PATCH 00/10] cgroups: Task counter subsystem v6 Frederic Weisbecker
2011-10-03 19:07 ` [PATCH 01/10] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
2011-10-04  0:17   ` Kirill A. Shutemov
2011-10-11 13:44     ` Frederic Weisbecker
