* [PATCH 0/8] cgroups: Task counter subsystem v7
@ 2012-01-13 18:13 Frederic Weisbecker
  2012-01-13 18:13 ` Frederic Weisbecker
                   ` (16 more replies)
  0 siblings, 17 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2012-01-13 18:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo, Kirill A. Shutemov, Containers,
	Glauber Costa, Cgroups, Daniel J Walsh, Daniel P. Berrange,
	KAMEZAWA Hiroyuki, Max Kellermann, Mandeep Singh Baines

Hi,

This is the task counter limitation patchset rebased on top
of Tejun's latest cgroup tree (cgroup/for-3.3). In a later
iteration, I also intend to include its selftests once the
selftest subsystem is merged after -rc1.

In fact, the rebase mostly affects the last patch. The others haven't
changed apart from some unnoticeable dusting. Some patches have also been
dropped because either the latest cgroup patches already cover what they
were doing, or they were tiny changes I folded into the last patch (like
a missing include of err.h fixed by Stephen Rothwell).

Please note that Andrew Morton had doubts about whether we want to merge
this upstream at all, so don't merge it too eagerly before we sort out
that debate.


= What is this? =

The task counter subsystem counts the tasks inside a cgroup and
rejects forks and cgroup migrations when they would push the number
of tasks above the user-tunable limit.
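
To give a rough idea of the mechanism, here is a sketch only (the helper
name is made up, this is not code from the series): each cgroup owns a
res_counter chained to its parent's, and the fork path charges one unit
against the whole chain, so a limit set on any ancestor caps its entire
subtree.

/* Sketch only -- hypothetical helper built on <linux/res_counter.h>. */
static int sketch_account_new_task(struct res_counter *cgroup_counter)
{
	struct res_counter *fail_at;

	/* Walks the counter and all its ancestors; fails if any of them
	 * is already at its limit. */
	if (res_counter_charge(cgroup_counter, 1, &fail_at))
		return -EAGAIN;

	return 0;
}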

= Why is this needed? =

We want to be able to run untrusted programs in sandboxes and
secure containers while protecting against forkbombs.

This patchset allows us to:

1) Protect against forkbombs by setting an upper bound on the number of
tasks in a cgroup, which stops a forkbomb from spreading. This is essentially
the RLIMIT_NPROC rlimit but scoped to a cgroup. The traditional RLIMIT_NPROC
doesn't help us here because we don't want one container to starve all the
others by spawning a high number of tasks when all these containers
are running under the same user.

2) Safely kill a cgroup. We want a reliable, non-racy way to kill all
tasks in a cgroup without racing against concurrent forks.

Some practical use cases from people requesting this can be found here:

     https://lkml.org/lkml/2011/12/13/309
     https://lkml.org/lkml/2011/12/13/364

More details are in the last patch, which provides the documentation.


= Can that be used by Systemd? =

Systemd uses cgroups to keep track of services and the processes they
create. A feature has been requested to reliably kill all the processes
in a cgroup, so that systemd can kill services without races.

(Note I'm not debating here whether systemd is doing the right thing by
using cgroups; I'm just focusing on this particular feature request.)

The task counter subsystem could be used to solve this problem. However,
that would involve the whole task counting machinery, which is too much
overhead for system services that tend to fork often.

A simple core latch that rejects forks in a cgroup would be much more efficient
for this precise purpose.


= How does it interact with the RLIMIT_NPROC rlimit? =

Both can be used at the same time. They don't conflict; they are
complementary.


= Why not rather focus on a generic solution to protect against forkbombs? =

If you know of a more generic solution to protect against forkbombs, one
that works not only in containers but in more cases, I'll be happy to drop
this patchset and focus on that instead.

Note we need a solution that meets our requirements for untrusted programs
running in containers, something that also prevents a forkbomb from doing any
damage, even a temporary DoS. We don't want sandboxes and containers to
severely impact the rest of the system.

Thanks.

---
Frederic Weisbecker (7):
  cgroups: add res_counter_write_u64() API
  cgroups: new resource counter inheritance API
  cgroups: ability to stop res charge propagation on bounded ancestor
  res_counter: allow charge failure pointer to be null
  cgroups: pull up res counter charge failure interpretation to caller
  cgroups: allow subsystems to cancel a fork
  cgroups: Add a task counter subsystem

Kirill A. Shutemov (1):
  cgroups: add res counter common ancestor searching

 Documentation/cgroups/resource_counter.txt |   20 ++-
 Documentation/cgroups/task_counter.txt     |  153 ++++++++++++++++
 include/linux/cgroup.h                     |   20 ++-
 include/linux/cgroup_subsys.h              |    8 +
 include/linux/res_counter.h                |   27 +++-
 init/Kconfig                               |    9 +
 kernel/Makefile                            |    1 +
 kernel/cgroup.c                            |   23 ++-
 kernel/cgroup_freezer.c                    |    6 +-
 kernel/cgroup_task_counter.c               |  272 ++++++++++++++++++++++++++++
 kernel/exit.c                              |    2 +-
 kernel/fork.c                              |    7 +-
 kernel/res_counter.c                       |   97 +++++++++--
 13 files changed, 612 insertions(+), 33 deletions(-)
 create mode 100644 Documentation/cgroups/task_counter.txt
 create mode 100644 kernel/cgroup_task_counter.c

-- 
1.7.5.4


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/8] cgroups: add res_counter_write_u64() API
  2012-01-13 18:13 [PATCH 0/8] cgroups: Task counter subsystem v7 Frederic Weisbecker
  2012-01-13 18:13 ` Frederic Weisbecker
@ 2012-01-13 18:13 ` Frederic Weisbecker
  2012-01-13 18:13 ` Frederic Weisbecker
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2012-01-13 18:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Glauber Costa, Cgroups, Kirill A. Shutemov,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Andrew Morton

Extend the resource counter API with a mirror of res_counter_read_u64() to
make it handy to update a resource counter value from a cgroup subsystem
u64 value file.
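
For illustration, a subsystem's .write_u64 handler can then reduce to
something like this (sketch only; cgroup_to_counter() is a hypothetical
accessor, not part of this series):

static int foo_limit_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
{
	/* cft->private carries the res_counter member being written,
	 * e.g. RES_LIMIT; cgroup_to_counter() is made up for the example. */
	res_counter_write_u64(cgroup_to_counter(cgrp), cft->private, val);

	return 0;
}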

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   25 +++++++++++++++++++------
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..1b3fe05 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -82,6 +82,8 @@ int res_counter_memparse_write_strategy(const char *buf,
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buffer, write_strategy_fn write_strategy);
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 6d269cc..58ae85e 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -167,12 +167,26 @@ int res_counter_memparse_write_strategy(const char *buf,
 	return 0;
 }
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	unsigned long long *target;
+	unsigned long flags;
+
+	/*
+	 * We need the lock to protect against concurrent add/dec on 32 bits.
+	 * No need to ifdef it's seldom used.
+	 */
+	spin_lock_irqsave(&counter->lock, flags);
+	target = res_counter_member(counter, member);
+	*target = val;
+	spin_unlock_irqrestore(&counter->lock, flags);
+}
+
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buf, write_strategy_fn write_strategy)
 {
 	char *end;
-	unsigned long flags;
-	unsigned long long tmp, *val;
+	unsigned long long tmp;
 
 	if (write_strategy) {
 		if (write_strategy(buf, &tmp))
@@ -182,9 +196,8 @@ int res_counter_write(struct res_counter *counter, int member,
 		if (*end != '\0')
 			return -EINVAL;
 	}
-	spin_lock_irqsave(&counter->lock, flags);
-	val = res_counter_member(counter, member);
-	*val = tmp;
-	spin_unlock_irqrestore(&counter->lock, flags);
+
+	res_counter_write_u64(counter, member, tmp);
+
 	return 0;
 }
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/8] cgroups: new resource counter inheritance API
  2012-01-13 18:13 [PATCH 0/8] cgroups: Task counter subsystem v7 Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2012-01-13 18:13 ` Frederic Weisbecker
@ 2012-01-13 18:13 ` Frederic Weisbecker
  2012-01-13 18:13 ` Frederic Weisbecker
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2012-01-13 18:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Glauber Costa, Cgroups, Kirill A. Shutemov,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Andrew Morton

Provide an API to inherit a counter value from a parent.  This can be
useful to implement cgroup.clone_children on a resource counter.

Still, the resources of the children remain limited by those of the parent,
so this only provides a default setting behaviour when clone_children is
set.
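
As an illustration (sketch only; foo_counter_from_cgroup() is a
hypothetical accessor), a subsystem honouring cgroup.clone_children
could copy the parent's limit from a clone-time hook such as post_clone:

static void foo_post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
{
	/* Give the child the parent's RES_LIMIT as its default value; the
	 * child remains constrained by its ancestors' limits anyway. */
	res_counter_inherit(foo_counter_from_cgroup(cgrp), RES_LIMIT);
}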

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/res_counter.h |    2 ++
 kernel/res_counter.c        |   18 ++++++++++++++++++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 1b3fe05..109d118 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -84,6 +84,8 @@ int res_counter_write(struct res_counter *counter, int member,
 
 void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
 
+void res_counter_inherit(struct res_counter *counter, int member);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 58ae85e..016b4d4 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -201,3 +201,21 @@ int res_counter_write(struct res_counter *counter, int member,
 
 	return 0;
 }
+
+/*
+ * Simple inheritance implementation to get the same value
+ * than a parent. However this doesn't enforce the child value
+ * to be always below the one of the parent. But the child is
+ * subject to its parent limitation anyway.
+ */
+void res_counter_inherit(struct res_counter *counter, int member)
+{
+	struct res_counter *parent;
+	unsigned long long val;
+
+	parent = counter->parent;
+	if (parent) {
+		val = res_counter_read_u64(parent, member);
+		res_counter_write_u64(counter, member, val);
+	}
+}
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/8] cgroups: ability to stop res charge propagation on bounded ancestor
  2012-01-13 18:13 [PATCH 0/8] cgroups: Task counter subsystem v7 Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2012-01-13 18:13 ` Frederic Weisbecker
@ 2012-01-13 18:13 ` Frederic Weisbecker
  2012-01-13 18:13 ` Frederic Weisbecker
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2012-01-13 18:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Glauber Costa, Cgroups, Kirill A. Shutemov,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Andrew Morton

Moving a task from one cgroup to another may require subtracting its
resource charge from the old cgroup and adding it to the new one.

For this to happen, the uncharge/charge propagation can simply stop when
we reach the common ancestor of the two cgroups.  Beyond the performance
benefit, we also want to avoid temporarily overloading the common
ancestors with an inaccurate resource counter usage when we charge the
new cgroup first and uncharge the old one afterwards.  This is going to
be a requirement for the coming task counter subsystem.

To solve this, provide a pair of new APIs that charge/uncharge a
resource counter only until a given ancestor is reached.
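
For instance, assuming the common ancestor of the two counters is
already known, moving one unit of charge between two cgroups can leave
the shared part of the hierarchy completely untouched (sketch only):

static int sketch_move_charge(struct res_counter *new_res,
			      struct res_counter *old_res,
			      struct res_counter *common_ancestor)
{
	struct res_counter *fail_at;

	/* Stop at the common ancestor: the shared ancestors keep their
	 * existing charge and are never transiently double-charged even
	 * though the destination is charged before the source is
	 * uncharged. */
	if (res_counter_charge_until(new_res, common_ancestor, 1, &fail_at))
		return -EAGAIN;

	res_counter_uncharge_until(old_res, common_ancestor, 1);

	return 0;
}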

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 Documentation/cgroups/resource_counter.txt |   18 +++++++++++++++++-
 include/linux/res_counter.h                |   20 +++++++++++++++++---
 kernel/res_counter.c                       |   13 ++++++++-----
 3 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..a2cd05b 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -83,7 +83,15 @@ to work with it.
 	res_counter->lock internally (it must be called with res_counter->lock
 	held).
 
- e. void res_counter_uncharge[_locked]
+ e. int res_counter_charge_until(struct res_counter *counter,
+			     struct res_counter *limit, unsigned long val,
+			     struct res_counter **limit_fail_at)
+
+	The same as res_counter_charge(), but the charge propagation to
+	the hierarchy stops at the limit given in the "limit" parameter.
+
+
+ f. void res_counter_uncharge[_locked]
 			(struct res_counter *rc, unsigned long val)
 
 	When a resource is released (freed) it should be de-accounted
@@ -92,6 +100,14 @@ to work with it.
 
 	The _locked routines imply that the res_counter->lock is taken.
 
+
+ g. void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *limit,
+				unsigned long val)
+
+	The same as res_counter_charge, but the uncharge propagation to
+	the hierarchy stops at the limit given in the "limit" parameter.
+
  2.1 Other accounting routines
 
     There are more routines that may help you with common needs, like
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 109d118..de4ba29 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -117,8 +117,16 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
 
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
-int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
+int __must_check res_counter_charge_until(struct res_counter *counter,
+					  struct res_counter *limit,
+					  unsigned long val,
+					  struct res_counter **limit_fail_at);
+static inline int __must_check
+res_counter_charge(struct res_counter *counter, unsigned long val,
+		   struct res_counter **limit_fail_at)
+{
+	return res_counter_charge_until(counter, NULL, val, limit_fail_at);
+}
 
 /*
  * uncharge - tell that some portion of the resource is released
@@ -131,7 +139,13 @@ int __must_check res_counter_charge(struct res_counter *counter,
  */
 
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val);
+void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *limit, unsigned long val);
+static inline void res_counter_uncharge(struct res_counter *counter,
+					unsigned long val)
+{
+	res_counter_uncharge_until(counter, NULL, val);
+}
 
 /**
  * res_counter_margin - calculate chargeable space of a counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 016b4d4..8ef09a4 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -35,8 +35,9 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 	return 0;
 }
 
-int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
+int res_counter_charge_until(struct res_counter *counter,
+			     struct res_counter *limit, unsigned long val,
+			     struct res_counter **limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
@@ -44,7 +45,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
 
 	*limit_fail_at = NULL;
 	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
+	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
 		spin_unlock(&c->lock);
@@ -74,13 +75,15 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
 	counter->usage -= val;
 }
 
-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *limit,
+				unsigned long val)
 {
 	unsigned long flags;
 	struct res_counter *c;
 
 	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
+	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		res_counter_uncharge_locked(c, val);
 		spin_unlock(&c->lock);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/8] cgroups: add res counter common ancestor searching
  2012-01-13 18:13 [PATCH 0/8] cgroups: Task counter subsystem v7 Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2012-01-13 18:13 ` Frederic Weisbecker
@ 2012-01-13 18:13 ` Frederic Weisbecker
  2012-01-13 18:13 ` Frederic Weisbecker
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2012-01-13 18:13 UTC (permalink / raw)
  To: LKML
  Cc: Kirill A. Shutemov, Glauber Costa, Cgroups, Daniel J Walsh,
	Daniel P. Berrange, KAMEZAWA Hiroyuki, Max Kellermann,
	Mandeep Singh Baines, Frederic Weisbecker, Li Zefan, Paul Menage,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Andrew Morton

From: "Kirill A. Shutemov" <kirill@shutemov.name>

Add a new API to find the common ancestor between two resource counters.
This includes the passed resource counters themselves.
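
For illustration, this provides the boundary for the *_until() walks
introduced in the previous patch (sketch only):

static struct res_counter *sketch_migration_boundary(struct res_counter *old_res,
						     struct res_counter *new_res)
{
	/*
	 * The result may be old_res or new_res themselves (when one is an
	 * ancestor of the other), or NULL when the two counters share no
	 * hierarchy -- in which case the charge/uncharge walks the whole
	 * chain, as before.
	 */
	return res_counter_common_ancestor(old_res, new_res);
}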

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/res_counter.h |    3 +++
 kernel/res_counter.c        |   33 +++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index de4ba29..558f39b 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -147,6 +147,9 @@ static inline void res_counter_uncharge(struct res_counter *counter,
 	res_counter_uncharge_until(counter, NULL, val);
 }
 
+struct res_counter *res_counter_common_ancestor(struct res_counter *l,
+						struct res_counter *r);
+
 /**
  * res_counter_margin - calculate chargeable space of a counter
  * @cnt: the counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 8ef09a4..86d72b8 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -91,6 +91,39 @@ void res_counter_uncharge_until(struct res_counter *counter,
 	local_irq_restore(flags);
 }
 
+/*
+ * Walk through r1 and r2 parents and try to find the closest common one
+ * between both. If none is found, it returns NULL.
+ */
+struct res_counter *
+res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
+{
+	struct res_counter *iter;
+	int r1_depth = 0, r2_depth = 0;
+
+	for (iter = r1; iter; iter = iter->parent)
+		r1_depth++;
+
+	for (iter = r2; iter; iter = iter->parent)
+		r2_depth++;
+
+	while (r1_depth > r2_depth) {
+		r1 = r1->parent;
+		r1_depth--;
+	}
+
+	while (r2_depth > r1_depth) {
+		r2 = r2->parent;
+		r2_depth--;
+	}
+
+	while (r1 != r2) {
+		r1 = r1->parent;
+		r2 = r2->parent;
+	}
+
+	return r1;
+}
 
 static inline unsigned long long *
 res_counter_member(struct res_counter *counter, int member)
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/8] res_counter: allow charge failure pointer to be null
  2012-01-13 18:13 [PATCH 0/8] cgroups: Task counter subsystem v7 Frederic Weisbecker
                   ` (8 preceding siblings ...)
  2012-01-13 18:13 ` Frederic Weisbecker
@ 2012-01-13 18:13 ` Frederic Weisbecker
  2012-01-13 18:13 ` Frederic Weisbecker
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2012-01-13 18:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Glauber Costa, Cgroups, Kirill A. Shutemov,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Andrew Morton

So that callers of res_counter_charge() don't have to create and pass this
pointer even if they aren't interested in it.
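
A minimal sketch of a caller that doesn't care which ancestor hit its
limit:

static int sketch_charge_one(struct res_counter *res)
{
	/* Before this patch, a dummy &fail_at pointer was mandatory here. */
	return res_counter_charge(res, 1, NULL);
}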

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 kernel/res_counter.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 86d72b8..644898a 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -43,14 +43,16 @@ int res_counter_charge_until(struct res_counter *counter,
 	unsigned long flags;
 	struct res_counter *c, *u;
 
-	*limit_fail_at = NULL;
+	if (limit_fail_at)
+		*limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != limit; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
 		spin_unlock(&c->lock);
 		if (ret < 0) {
-			*limit_fail_at = c;
+			if (limit_fail_at)
+				*limit_fail_at = c;
 			goto undo;
 		}
 	}
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 6/8] cgroups: pull up res counter charge failure interpretation to caller
  2012-01-13 18:13 [PATCH 0/8] cgroups: Task counter subsystem v7 Frederic Weisbecker
                   ` (10 preceding siblings ...)
  2012-01-13 18:13 ` Frederic Weisbecker
@ 2012-01-13 18:13 ` Frederic Weisbecker
  2012-01-13 18:13 ` Frederic Weisbecker
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2012-01-13 18:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Glauber Costa, Cgroups, Kirill A. Shutemov,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Andrew Morton

res_counter_charge() always returns -ENOMEM when the limit is reached and
the charge thus can't happen.

However it's up to the caller to interpret this failure and return the
appropriate error value.  The task counter subsystem will need to report
to the user that a fork() has been cancelled because a limit was reached,
not because we are short on memory.

Fix this by returning -1 when res_counter_charge() fails.
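
In sketch form (the function name is made up), each caller now
translates the bare failure into whatever error fits its operation:

static int foo_try_charge(struct res_counter *res)
{
	/* res_counter only reports that a limit was hit (-1); the caller
	 * decides how that surfaces to userspace. */
	if (res_counter_charge(res, 1, NULL) < 0)
		return -EAGAIN;	/* a rejected fork, not an OOM condition */

	return 0;
}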

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 Documentation/cgroups/resource_counter.txt |    2 ++
 kernel/res_counter.c                       |    2 +-
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index a2cd05b..24ec61c 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -76,6 +76,8 @@ to work with it.
 	limit_fail_at parameter is set to the particular res_counter element
 	where the charging failed.
 
+	It returns 0 on success and -1 on failure.
+
  d. int res_counter_charge_locked
 			(struct res_counter *rc, unsigned long val)
 
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 644898a..8f47ac6 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -26,7 +26,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 {
 	if (counter->usage + val > counter->limit) {
 		counter->failcnt++;
-		return -ENOMEM;
+		return -1;
 	}
 
 	counter->usage += val;
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 7/8] cgroups: allow subsystems to cancel a fork
  2012-01-13 18:13 [PATCH 0/8] cgroups: Task counter subsystem v7 Frederic Weisbecker
                   ` (12 preceding siblings ...)
  2012-01-13 18:13 ` Frederic Weisbecker
@ 2012-01-13 18:13 ` Frederic Weisbecker
  2012-01-13 18:13 ` Frederic Weisbecker
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2012-01-13 18:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Glauber Costa, Cgroups, Kirill A. Shutemov,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Kay Sievers,
	Tim Hockin, Tejun Heo, Andrew Morton

Let the subsystems' fork callbacks return an error value so that they can
cancel a fork.  This is going to be used by the task counter subsystem to
implement the limit.
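
For illustration, a fork callback under the new int-returning prototype
could look like this (sketch only; foo_res_of() is a hypothetical
accessor):

static int foo_fork(struct cgroup_subsys *ss, struct task_struct *task)
{
	/* A non-zero return value makes copy_process() cancel the fork. */
	if (res_counter_charge(foo_res_of(task), 1, NULL))
		return -EAGAIN;

	return 0;
}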

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/cgroup.h  |   20 ++++++++++++++------
 kernel/cgroup.c         |   23 +++++++++++++++++++----
 kernel/cgroup_freezer.c |    6 ++++--
 kernel/exit.c           |    2 +-
 kernel/fork.c           |    7 +++++--
 5 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7ad5e40..095e66d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -17,10 +17,11 @@
 #include <linux/rwsem.h>
 #include <linux/idr.h>
 
+struct cgroup_subsys;
+
 #ifdef CONFIG_CGROUPS
 
 struct cgroupfs_root;
-struct cgroup_subsys;
 struct inode;
 struct cgroup;
 struct css_id;
@@ -32,9 +33,11 @@ extern int cgroup_lock_is_held(void);
 extern bool cgroup_lock_live_group(struct cgroup *cgrp);
 extern void cgroup_unlock(void);
 extern void cgroup_fork(struct task_struct *p);
-extern void cgroup_fork_callbacks(struct task_struct *p);
+extern int cgroup_fork_callbacks(struct task_struct *p,
+				 struct cgroup_subsys **failed_ss);
 extern void cgroup_post_fork(struct task_struct *p);
-extern void cgroup_exit(struct task_struct *p, int run_callbacks);
+extern void cgroup_exit(struct task_struct *p, int run_callbacks,
+			struct cgroup_subsys *failed_ss);
 extern int cgroupstats_build(struct cgroupstats *stats,
 				struct dentry *dentry);
 extern int cgroup_load_subsys(struct cgroup_subsys *ss);
@@ -494,7 +497,7 @@ struct cgroup_subsys {
 			      struct cgroup_taskset *tset);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		       struct cgroup_taskset *tset);
-	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
+	int (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			struct cgroup *old_cgrp, struct task_struct *task);
 	int (*populate)(struct cgroup_subsys *ss,
@@ -651,9 +654,14 @@ struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_fork(struct task_struct *p) {}
-static inline void cgroup_fork_callbacks(struct task_struct *p) {}
+static inline int cgroup_fork_callbacks(struct task_struct *p,
+					struct cgroup_subsys **failed_ss)
+{
+	return 0;
+}
 static inline void cgroup_post_fork(struct task_struct *p) {}
-static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
+static inline void cgroup_exit(struct task_struct *p, int callbacks,
+			       struct cgroup_subsys *failed_ss) {}
 
 static inline void cgroup_lock(void) {}
 static inline void cgroup_unlock(void) {}
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 39c7cae..109530a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4568,8 +4568,11 @@ void cgroup_fork(struct task_struct *child)
  * tasklist. No need to take any locks since no-one can
  * be operating on this task.
  */
-void cgroup_fork_callbacks(struct task_struct *child)
+int cgroup_fork_callbacks(struct task_struct *child,
+			  struct cgroup_subsys **failed_ss)
 {
+	int err;
+
 	if (need_forkexit_callback) {
 		int i;
 		/*
@@ -4579,10 +4582,17 @@ void cgroup_fork_callbacks(struct task_struct *child)
 		 */
 		for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
 			struct cgroup_subsys *ss = subsys[i];
-			if (ss->fork)
-				ss->fork(ss, child);
+			if (ss->fork) {
+				err = ss->fork(ss, child);
+				if (err) {
+					*failed_ss = ss;
+					return err;
+				}
+			}
 		}
 	}
+
+	return 0;
 }
 
 /**
@@ -4649,7 +4659,8 @@ void cgroup_post_fork(struct task_struct *child)
  *    which wards off any cgroup_attach_task() attempts, or task is a failed
  *    fork, never visible to cgroup_attach_task.
  */
-void cgroup_exit(struct task_struct *tsk, int run_callbacks)
+void cgroup_exit(struct task_struct *tsk, int run_callbacks,
+		 struct cgroup_subsys *failed_ss)
 {
 	struct css_set *cg;
 	int i;
@@ -4678,6 +4689,10 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
 		 */
 		for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
 			struct cgroup_subsys *ss = subsys[i];
+
+			if (ss == failed_ss)
+				break;
+
 			if (ss->exit) {
 				struct cgroup *old_cgrp =
 					rcu_dereference_raw(cg->subsys[i])->cgroup;
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 0e74805..6cdce8f 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -178,7 +178,7 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	return 0;
 }
 
-static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
+static int freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 {
 	struct freezer *freezer;
 
@@ -198,7 +198,7 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 	 * following check.
 	 */
 	if (!freezer->css.cgroup->parent)
-		return;
+		return 0;
 
 	spin_lock_irq(&freezer->lock);
 	BUG_ON(freezer->state == CGROUP_FROZEN);
@@ -207,6 +207,8 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 	if (freezer->state == CGROUP_FREEZING)
 		freeze_task(task);
 	spin_unlock_irq(&freezer->lock);
+
+	return 0;
 }
 
 /*
diff --git a/kernel/exit.c b/kernel/exit.c
index 95a4141..b9728d9 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -990,7 +990,7 @@ NORET_TYPE void do_exit(long code)
 	 */
 	perf_event_exit_task(tsk);
 
-	cgroup_exit(tsk, 1);
+	cgroup_exit(tsk, 1, NULL);
 
 	if (group_dead)
 		disassociate_ctty(1);
diff --git a/kernel/fork.c b/kernel/fork.c
index d4ac9e3..097c597 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1049,6 +1049,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	struct cgroup_subsys *cgroup_failed_ss = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1306,8 +1307,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	/* Now that the task is set up, run cgroup callbacks if
 	 * necessary. We need to run them before the task is visible
 	 * on the tasklist. */
-	cgroup_fork_callbacks(p);
+	retval = cgroup_fork_callbacks(p, &cgroup_failed_ss);
 	cgroup_callbacks_done = 1;
+	if (retval)
+		goto bad_fork_free_pid;
 
 	/* Need tasklist lock for parent etc handling! */
 	write_lock_irq(&tasklist_lock);
@@ -1408,7 +1411,7 @@ bad_fork_cleanup_cgroup:
 #endif
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_change_end(current);
-	cgroup_exit(p, cgroup_callbacks_done);
+	cgroup_exit(p, cgroup_callbacks_done, cgroup_failed_ss);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);
 bad_fork_cleanup_count:
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 7/8] cgroups: allow subsystems to cancel a fork
  2012-01-13 18:13 [PATCH 0/8] cgroups: Task counter subsystem v7 Frederic Weisbecker
                   ` (13 preceding siblings ...)
  2012-01-13 18:13 ` [PATCH 7/8] cgroups: allow subsystems to cancel a fork Frederic Weisbecker
@ 2012-01-13 18:13 ` Frederic Weisbecker
  2012-01-13 18:14 ` [PATCH 8/8] cgroups: Add a task counter subsystem Frederic Weisbecker
  2012-01-13 18:14 ` Frederic Weisbecker
  16 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2012-01-13 18:13 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Kay Sievers, Tim Hockin, Tejun Heo,
	Andrew Morton

Let a subsystem's fork callback return an error value so that it can
cancel a fork.  This is going to be used by the task counter subsystem to
implement the limit.
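
As an illustration only (the subsystem, state structure and helper below
are hypothetical, not part of this patch), a callback using the new
prototype could reject a fork like this:

	static int example_fork(struct cgroup_subsys *ss,
				struct task_struct *task)
	{
		/* hypothetical lookup of this subsystem's per-cgroup state */
		struct example_state *st = example_state_of(task);

		/* a non-zero return makes copy_process() abort the fork */
		if (st->forks_disabled)
			return -EAGAIN;

		return 0;
	}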

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/cgroup.h  |   20 ++++++++++++++------
 kernel/cgroup.c         |   23 +++++++++++++++++++----
 kernel/cgroup_freezer.c |    6 ++++--
 kernel/exit.c           |    2 +-
 kernel/fork.c           |    7 +++++--
 5 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7ad5e40..095e66d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -17,10 +17,11 @@
 #include <linux/rwsem.h>
 #include <linux/idr.h>
 
+struct cgroup_subsys;
+
 #ifdef CONFIG_CGROUPS
 
 struct cgroupfs_root;
-struct cgroup_subsys;
 struct inode;
 struct cgroup;
 struct css_id;
@@ -32,9 +33,11 @@ extern int cgroup_lock_is_held(void);
 extern bool cgroup_lock_live_group(struct cgroup *cgrp);
 extern void cgroup_unlock(void);
 extern void cgroup_fork(struct task_struct *p);
-extern void cgroup_fork_callbacks(struct task_struct *p);
+extern int cgroup_fork_callbacks(struct task_struct *p,
+				 struct cgroup_subsys **failed_ss);
 extern void cgroup_post_fork(struct task_struct *p);
-extern void cgroup_exit(struct task_struct *p, int run_callbacks);
+extern void cgroup_exit(struct task_struct *p, int run_callbacks,
+			struct cgroup_subsys *failed_ss);
 extern int cgroupstats_build(struct cgroupstats *stats,
 				struct dentry *dentry);
 extern int cgroup_load_subsys(struct cgroup_subsys *ss);
@@ -494,7 +497,7 @@ struct cgroup_subsys {
 			      struct cgroup_taskset *tset);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		       struct cgroup_taskset *tset);
-	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
+	int (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			struct cgroup *old_cgrp, struct task_struct *task);
 	int (*populate)(struct cgroup_subsys *ss,
@@ -651,9 +654,14 @@ struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_fork(struct task_struct *p) {}
-static inline void cgroup_fork_callbacks(struct task_struct *p) {}
+static inline int cgroup_fork_callbacks(struct task_struct *p,
+					struct cgroup_subsys **failed_ss)
+{
+	return 0;
+}
 static inline void cgroup_post_fork(struct task_struct *p) {}
-static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
+static inline void cgroup_exit(struct task_struct *p, int callbacks,
+			       struct cgroup_subsys *failed_ss) {}
 
 static inline void cgroup_lock(void) {}
 static inline void cgroup_unlock(void) {}
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 39c7cae..109530a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4568,8 +4568,11 @@ void cgroup_fork(struct task_struct *child)
  * tasklist. No need to take any locks since no-one can
  * be operating on this task.
  */
-void cgroup_fork_callbacks(struct task_struct *child)
+int cgroup_fork_callbacks(struct task_struct *child,
+			  struct cgroup_subsys **failed_ss)
 {
+	int err;
+
 	if (need_forkexit_callback) {
 		int i;
 		/*
@@ -4579,10 +4582,17 @@ void cgroup_fork_callbacks(struct task_struct *child)
 		 */
 		for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
 			struct cgroup_subsys *ss = subsys[i];
-			if (ss->fork)
-				ss->fork(ss, child);
+			if (ss->fork) {
+				err = ss->fork(ss, child);
+				if (err) {
+					*failed_ss = ss;
+					return err;
+				}
+			}
 		}
 	}
+
+	return 0;
 }
 
 /**
@@ -4649,7 +4659,8 @@ void cgroup_post_fork(struct task_struct *child)
  *    which wards off any cgroup_attach_task() attempts, or task is a failed
  *    fork, never visible to cgroup_attach_task.
  */
-void cgroup_exit(struct task_struct *tsk, int run_callbacks)
+void cgroup_exit(struct task_struct *tsk, int run_callbacks,
+		 struct cgroup_subsys *failed_ss)
 {
 	struct css_set *cg;
 	int i;
@@ -4678,6 +4689,10 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
 		 */
 		for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
 			struct cgroup_subsys *ss = subsys[i];
+
+			if (ss == failed_ss)
+				break;
+
 			if (ss->exit) {
 				struct cgroup *old_cgrp =
 					rcu_dereference_raw(cg->subsys[i])->cgroup;
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 0e74805..6cdce8f 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -178,7 +178,7 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	return 0;
 }
 
-static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
+static int freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 {
 	struct freezer *freezer;
 
@@ -198,7 +198,7 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 	 * following check.
 	 */
 	if (!freezer->css.cgroup->parent)
-		return;
+		return 0;
 
 	spin_lock_irq(&freezer->lock);
 	BUG_ON(freezer->state == CGROUP_FROZEN);
@@ -207,6 +207,8 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 	if (freezer->state == CGROUP_FREEZING)
 		freeze_task(task);
 	spin_unlock_irq(&freezer->lock);
+
+	return 0;
 }
 
 /*
diff --git a/kernel/exit.c b/kernel/exit.c
index 95a4141..b9728d9 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -990,7 +990,7 @@ NORET_TYPE void do_exit(long code)
 	 */
 	perf_event_exit_task(tsk);
 
-	cgroup_exit(tsk, 1);
+	cgroup_exit(tsk, 1, NULL);
 
 	if (group_dead)
 		disassociate_ctty(1);
diff --git a/kernel/fork.c b/kernel/fork.c
index d4ac9e3..097c597 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1049,6 +1049,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	struct cgroup_subsys *cgroup_failed_ss = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1306,8 +1307,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	/* Now that the task is set up, run cgroup callbacks if
 	 * necessary. We need to run them before the task is visible
 	 * on the tasklist. */
-	cgroup_fork_callbacks(p);
+	retval = cgroup_fork_callbacks(p, &cgroup_failed_ss);
 	cgroup_callbacks_done = 1;
+	if (retval)
+		goto bad_fork_free_pid;
 
 	/* Need tasklist lock for parent etc handling! */
 	write_lock_irq(&tasklist_lock);
@@ -1408,7 +1411,7 @@ bad_fork_cleanup_cgroup:
 #endif
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_change_end(current);
-	cgroup_exit(p, cgroup_callbacks_done);
+	cgroup_exit(p, cgroup_callbacks_done, cgroup_failed_ss);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);
 bad_fork_cleanup_count:
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 8/8] cgroups: Add a task counter subsystem
  2012-01-13 18:13 [PATCH 0/8] cgroups: Task counter subsystem v7 Frederic Weisbecker
                   ` (14 preceding siblings ...)
  2012-01-13 18:13 ` Frederic Weisbecker
@ 2012-01-13 18:14 ` Frederic Weisbecker
  2012-01-16 12:38   ` Kirill A. Shutemov
  2012-01-13 18:14 ` Frederic Weisbecker
  16 siblings, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2012-01-13 18:14 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Glauber Costa, Cgroups, Kirill A. Shutemov,
	Daniel J Walsh, Daniel P. Berrange, KAMEZAWA Hiroyuki,
	Max Kellermann, Mandeep Singh Baines, Paul Menage, Li Zefan,
	Johannes Weiner, Aditya Kali, Oleg Nesterov, Andrew Morton,
	Kay Sievers, Tim Hockin, Tejun Heo, Containers

Add a new subsystem to limit the number of running tasks,
similar to the NR_PROC rlimit but in the scope of a cgroup.

The user can set an upper bound that is checked every
time a task forks in a cgroup or is moved into a cgroup
with that subsystem bound.

The primary goal is to protect against forkbombs that explode
inside a container. The traditional NR_PROC rlimit is not
effective in that case because if we run containers in parallel
under the same user, one of them could starve all the others
by spawning a high number of tasks close to the user-wide limit.

This is a prevention against forkbombs, so it's not meant to
cure the effects of a forkbomb once the system is already in a
state where it's not responsive. It's aimed at preventing the
system from ever reaching that state and at stopping the spread
of tasks early. While defining the limit on the allowed number
of tasks, it's up to the user to find the right balance between
the resources its containers may need and what it can afford
to provide.

As it's totally dissociated from the NR_PROC rlimit, both
can be complementary: the cgroup task counter can set an upper
bound per container and the rlimit can be an upper bound on the
overall set of containers.

Also this subsystem can be used to kill all the tasks in a cgroup
without racing against concurrent forks: by setting the limit of
tasks to 0, any further fork can be rejected. This is a good
way to kill a forkbomb in a container, or simply to kill any
container, without the need to retry an unbounded number of times.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Containers <containers@lists.linux-foundation.org>
---
 Documentation/cgroups/task_counter.txt |  153 ++++++++++++++++++
 include/linux/cgroup_subsys.h          |    8 +
 init/Kconfig                           |    9 +
 kernel/Makefile                        |    1 +
 kernel/cgroup_task_counter.c           |  272 ++++++++++++++++++++++++++++++++
 5 files changed, 443 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/task_counter.txt
 create mode 100644 kernel/cgroup_task_counter.c

diff --git a/Documentation/cgroups/task_counter.txt b/Documentation/cgroups/task_counter.txt
new file mode 100644
index 0000000..1562d88
--- /dev/null
+++ b/Documentation/cgroups/task_counter.txt
@@ -0,0 +1,153 @@
+Task counter subsystem
+
+1. Description
+
+The task counter subsystem limits the number of tasks running
+inside a given cgroup. It behaves like the NR_PROC rlimit but in
+the scope of a cgroup instead of a user.
+
+It has two typical use cases, although more can probably be found:
+
+1.1 Protection against forkbomb in a container
+
+One use case is to protect against forkbombs that explode inside
+a container when that container is implemented using a cgroup. The
+NR_PROC rlimit is known to be a working protection against this type
+of attack but is not suitable anymore when we run containers in
+parallel under the same user. One container could starve all the
+others by spawning a high number of tasks close to the rlimit
+boundary. So in this case we need the limit to be enforced at a
+per-cgroup granularity.
+
+Note this works by preventing forkbomb propagation. It doesn't cure
+the effects of a forkbomb that has already grown enough to make
+the system barely responsive. While defining the limit on the number
+of tasks, it's up to the admin to find the right balance between the
+possible needs of a container and the resources the system can afford
+to provide.
+
+Also the NR_PROC rlimit and this cgroup subsystem are totally
+dissociated. But they can be complementary. The task counter limits
+the containers and the rlimit can provide an upper bound on the whole
+set of containers.
+
+
+1.2 Kill tasks inside a cgroup
+
+Another use case comes along with the forkbomb prevention: it brings
+the ability to kill all tasks inside a cgroup without races. By
+setting the limit of running tasks to 0, one can prevent any
+further fork inside a cgroup and then kill all of its tasks without
+the need to retry an unbounded number of times due to races between
+kills and forks running in parallel (more details in the "Kill a cgroup
+safely" paragraph).
+
+This is useful to kill a forkbomb for example. When its gazillion
+forks are competing with the kills, one needs to ensure this
+operation won't run in a nearly endless loop of retries.
+
+And more generally it is useful to kill a cgroup in a bounded number
+of passes.
+
+
+2. Interface
+
+When a hierarchy is mounted with the task counter subsystem bound, it
+adds two files into the cgroups directories, except the root one:
+
+- tasks.usage contains the number of tasks running inside a cgroup and
+its children in the hierarchy (see paragraph about Inheritance).
+
+- tasks.limit contains the maximum number of tasks that can run inside
+a cgroup. We check this limit when a task forks or when it is migrated
+to a cgroup.
+
+Note that the tasks.limit value can be forced below tasks.usage, in which
+case any new task in the cgroup will be rejected until the tasks.usage
+value goes below tasks.limit.
+
+For optimization reasons, the root directory of a hierarchy doesn't have
+a task counter.
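+
+As an illustration, assuming the hierarchy is mounted on /cgroups (the
+mount point and cgroup names here are arbitrary):
+
+	mount -t cgroup -o tasks cgroup /cgroups
+	mkdir /cgroups/container0
+	echo 100 > /cgroups/container0/tasks.limit
+	cat /cgroups/container0/tasks.usage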
+
+
+3. Inheritance
+
+When a task is added to a cgroup, by way of a cgroup migration or a fork,
+it increases the task counter of that cgroup and of all its ancestors.
+Hence a cgroup is also subject to the limit of its ancestors.
+
+In the following hierarchy:
+
+
+             A
+             |
+             B
+           /   \
+          C     D
+
+
+We have 1 task running in B, one running in C and none running in D.
+It means we have tasks.usage = 1 in C and tasks.usage = 2 in B because
+B counts its task and those of its children.
+
+Now let's set tasks.limit = 2 in B and tasks.limit = 1 in D.
+If we move a new task in D, it will be refused because the limit in B has
+been reached already.
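+
+For instance, with the hierarchy above created under a mounted task
+counter hierarchy, and with A as the current directory (paths are
+illustrative), the limit set in B also constrains C and D:
+
+	echo 2 > B/tasks.limit
+	echo 1 > B/D/tasks.limit
+	# With tasks.usage already at 2 in B, attaching one more task to D
+	# is rejected even though D's own limit is not reached.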
+
+
+4. Kill a cgroup safely
+
+As explained in the description, this subsystem is also helpful to
+kill all tasks in a cgroup safely, after setting tasks.limit to 0,
+so that we don't race against parallel forks in an unbounded number
+of kill iterations.
+
+But there is a small detail to be aware of when using this feature
+that way.
+
+A typical way to proceed would be:
+
+	echo 0 > tasks.limit
+	for TASK in $(cat cgroup.procs)
+	do
+		kill -KILL $TASK
+	done
+
+However there is a small race window where a task can be in the middle
+of being forked but hasn't completed the fork far enough for the PID of
+the child to appear in the cgroup.procs file.
+
+The only way to get it right is to run a loop that reads tasks.usage, kills
+all the tasks in cgroup.procs and exits the loop only if the value in
+tasks.usage was the same as the number of tasks that were in cgroup.procs,
+i.e. the number of tasks that were killed.
+
+It works because the new child appears in tasks.usage right before we check,
+in the fork path, whether the parent has a pending signal, in which case the
+fork is cancelled anyway. So relying on tasks.usage is fine and non-racy.
+
+This race window is tiny and unlikely to happen, so most of the time a single
+kill iteration should be enough. But it's worth knowing about that corner
+case spotted by Oleg Nesterov.
+
+An example of safe use would be:
+
+	echo 0 > tasks.limit
+	END=false
+
+	while [ $END == false ]
+	do
+		NR_TASKS=$(cat tasks.usage)
+		NR_KILLED=0
+
+		for TASK in $(cat cgroup.procs)
+		do
+			let NR_KILLED=NR_KILLED+1
+			kill -KILL $TASK
+		done
+
+		if [ "$NR_TASKS" = "$NR_KILLED" ]
+		then
+			END=true
+		fi
+	done
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ac663c1..5425822 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -59,8 +59,16 @@ SUBSYS(net_cls)
 SUBSYS(blkio)
 #endif
 
+/* */
+
 #ifdef CONFIG_CGROUP_PERF
 SUBSYS(perf)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_TASK_COUNTER
+SUBSYS(tasks)
+#endif
+
+/* */
diff --git a/init/Kconfig b/init/Kconfig
index 43298f9..6dfc8c3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -690,6 +690,15 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
 	  select this option (if, for some reason, they need to disable it
 	  then swapaccount=0 does the trick).
 
+config CGROUP_TASK_COUNTER
+	bool "Control number of tasks in a cgroup"
+	depends on RESOURCE_COUNTERS
+	help
+	  Let the user set an upper bound on the number of tasks allowed to
+	  run in a cgroup. When a task forks or is migrated to a cgroup that
+	  has this subsystem bound, the limit is checked to either accept or
+	  reject the fork/migration.
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/kernel/Makefile b/kernel/Makefile
index e898c5b..833b692 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_TASK_COUNTER) += cgroup_task_counter.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_task_counter.c b/kernel/cgroup_task_counter.c
new file mode 100644
index 0000000..a4d87ac
--- /dev/null
+++ b/kernel/cgroup_task_counter.c
@@ -0,0 +1,272 @@
+/*
+ * Limits on number of tasks subsystem for cgroups
+ *
+ * Copyright (C) 2011-2012 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
+ *
+ * Thanks to Andrew Morton, Johannes Weiner, Li Zefan, Oleg Nesterov and
+ * Paul Menage for their suggestions.
+ *
+ */
+
+#include <linux/err.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/res_counter.h>
+
+
+struct task_counter {
+	struct res_counter		res;
+	struct cgroup_subsys_state	css;
+};
+
+/*
+ * The root task counter doesn't exist because it's not part of the
+ * whole task counting. We want to optimize the trivial case where
+ * only the root cgroup exists.
+ */
+static struct cgroup_subsys_state root_css;
+
+
+static inline struct task_counter *cgroup_task_counter(struct cgroup *cgrp)
+{
+	if (!cgrp->parent)
+		return NULL;
+
+	return container_of(cgroup_subsys_state(cgrp, tasks_subsys_id),
+			    struct task_counter, css);
+}
+
+static inline struct res_counter *cgroup_task_res_counter(struct cgroup *cgrp)
+{
+	struct task_counter *cnt;
+
+	cnt = cgroup_task_counter(cgrp);
+	if (!cnt)
+		return NULL;
+
+	return &cnt->res;
+}
+
+static struct cgroup_subsys_state *
+task_counter_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct task_counter *cnt;
+	struct res_counter *parent_res;
+
+	if (!cgrp->parent)
+		return &root_css;
+
+	cnt = kzalloc(sizeof(*cnt), GFP_KERNEL);
+	if (!cnt)
+		return ERR_PTR(-ENOMEM);
+
+	parent_res = cgroup_task_res_counter(cgrp->parent);
+
+	res_counter_init(&cnt->res, parent_res);
+
+	return &cnt->css;
+}
+
+/*
+ * Inherit the limit value of the parent. This is not really to enforce
+ * a limit below or equal to the one of the parent which can be changed
+ * concurrently anyway. This is just to honour the clone flag.
+ */
+static void task_counter_post_clone(struct cgroup_subsys *ss,
+				    struct cgroup *cgrp)
+{
+	/* cgrp can't be root, so cgroup_task_res_counter() can't return NULL */
+	res_counter_inherit(cgroup_task_res_counter(cgrp), RES_LIMIT);
+}
+
+static void task_counter_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct task_counter *cnt = cgroup_task_counter(cgrp);
+
+	kfree(cnt);
+}
+
+/* Uncharge the cgroup the task was attached to */
+static void task_counter_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			      struct cgroup *old_cgrp, struct task_struct *task)
+{
+	/* Optimize for the root cgroup case */
+	if (old_cgrp->parent)
+		res_counter_uncharge(cgroup_task_res_counter(old_cgrp), 1);
+}
+
+static void task_counter_cancel_attach_until(struct res_counter *res,
+					     struct cgroup_taskset *tset,
+					     struct task_struct *until)
+{
+	struct task_struct *tsk;
+	struct res_counter *old_res;
+	struct cgroup *old_cgrp;
+	struct res_counter *common_ancestor;
+
+	cgroup_taskset_for_each(tsk, NULL, tset) {
+		if (tsk == until)
+			break;
+		old_cgrp = cgroup_taskset_cur_cgroup(tset);
+		old_res = cgroup_task_res_counter(old_cgrp);
+		common_ancestor = res_counter_common_ancestor(res, old_res);
+		res_counter_uncharge_until(res, common_ancestor, 1);
+	}
+}
+
+/*
+ * This does more than just probing the ability to attach to the dest cgroup.
+ * We can not just _check_ if we can attach to the destination and do the real
+ * attachment later in task_counter_attach() because a task in the dest
+ * cgroup can fork in the meantime and steal the last remaining count.
+ * Thus we need to charge the dest cgroup right now.
+ */
+static int task_counter_can_attach(struct cgroup_subsys *ss,
+				   struct cgroup *cgrp,
+				   struct cgroup_taskset *tset)
+{
+	struct res_counter *res = cgroup_task_res_counter(cgrp);
+	struct res_counter *old_res;
+	struct cgroup *old_cgrp;
+	struct res_counter *common_ancestor;
+	struct task_struct *tsk;
+	int err = 0;
+
+	cgroup_taskset_for_each(tsk, NULL, tset) {
+		old_cgrp = cgroup_taskset_cur_cgroup(tset);
+		old_res = cgroup_task_res_counter(old_cgrp);
+		/*
+		 * When moving a task from a cgroup to another, we don't want
+		 * to charge the common ancestors, even though they will be
+		 * uncharged later from attach_task(), because during that
+		 * short window between charge and uncharge, a task could fork
+		 * in the ancestor and spuriously fail due to the temporary
+		 * charge.
+		 */
+		common_ancestor = res_counter_common_ancestor(res, old_res);
+
+		/*
+		 * If cgrp is the root then res is NULL, however in this case
+		 * the common ancestor is NULL as well, making the below a NOP.
+		 */
+		err = res_counter_charge_until(res, common_ancestor, 1, NULL);
+		if (err) {
+			task_counter_cancel_attach_until(res, tset, tsk);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+/* Uncharge the dest cgroup that we charged in task_counter_can_attach() */
+static void task_counter_cancel_attach(struct cgroup_subsys *ss,
+				       struct cgroup *cgrp,
+				       struct cgroup_taskset *tset)
+{
+	task_counter_cancel_attach_until(cgroup_task_res_counter(cgrp),
+					 tset, NULL);
+}
+
+/*
+ * This uncharges the old cgroups. We can do that now that we are sure the
+ * attachment can't be cancelled anymore, because this uncharge operation
+ * couldn't be reverted later: a task in the old cgroup could fork after
+ * we uncharge and reach the task counter limit, making a return to the
+ * old cgroup impossible.
+ */
+static void task_counter_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+				struct cgroup_taskset *tset)
+{
+	struct res_counter *res = cgroup_task_res_counter(cgrp);
+	struct task_struct *tsk;
+	struct res_counter *old_res;
+	struct cgroup *old_cgrp;
+	struct res_counter *common_ancestor;
+
+	cgroup_taskset_for_each(tsk, NULL, tset) {
+		old_cgrp = cgroup_taskset_cur_cgroup(tset);
+		old_res = cgroup_task_res_counter(old_cgrp);
+		common_ancestor = res_counter_common_ancestor(res, old_res);
+		res_counter_uncharge_until(old_res, common_ancestor, 1);
+	}
+}
+
+static u64 task_counter_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	int type = cft->private;
+
+	return res_counter_read_u64(cgroup_task_res_counter(cgrp), type);
+}
+
+static int task_counter_write_u64(struct cgroup *cgrp, struct cftype *cft,
+				  u64 val)
+{
+	int type = cft->private;
+
+	res_counter_write_u64(cgroup_task_res_counter(cgrp), type, val);
+
+	return 0;
+}
+
+static struct cftype files[] = {
+	{
+		.name		= "limit",
+		.read_u64	= task_counter_read_u64,
+		.write_u64	= task_counter_write_u64,
+		.private	= RES_LIMIT,
+	},
+
+	{
+		.name		= "usage",
+		.read_u64	= task_counter_read_u64,
+		.private	= RES_USAGE,
+	},
+};
+
+static int task_counter_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	if (!cgrp->parent)
+		return 0;
+
+	return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
+}
+
+/*
+ * Charge the task counter with the new child coming, or reject it if we
+ * reached the limit.
+ */
+static int task_counter_fork(struct cgroup_subsys *ss,
+			     struct task_struct *child)
+{
+	struct cgroup_subsys_state *css;
+	struct cgroup *cgrp;
+	int err;
+
+	css = child->cgroups->subsys[tasks_subsys_id];
+	cgrp = css->cgroup;
+
+	/* Optimize for the root cgroup case, which doesn't have a limit */
+	if (!cgrp->parent)
+		return 0;
+
+	err = res_counter_charge(cgroup_task_res_counter(cgrp), 1, NULL);
+	if (err)
+		return -EAGAIN;
+
+	return 0;
+}
+
+struct cgroup_subsys tasks_subsys = {
+	.name			= "tasks",
+	.subsys_id		= tasks_subsys_id,
+	.create			= task_counter_create,
+	.post_clone		= task_counter_post_clone,
+	.destroy		= task_counter_destroy,
+	.exit			= task_counter_exit,
+	.can_attach		= task_counter_can_attach,
+	.cancel_attach		= task_counter_cancel_attach,
+	.attach			= task_counter_attach,
+	.fork			= task_counter_fork,
+	.populate		= task_counter_populate,
+};
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 8/8] cgroups: Add a task counter subsystem
  2012-01-13 18:13 [PATCH 0/8] cgroups: Task counter subsystem v7 Frederic Weisbecker
                   ` (15 preceding siblings ...)
  2012-01-13 18:14 ` [PATCH 8/8] cgroups: Add a task counter subsystem Frederic Weisbecker
@ 2012-01-13 18:14 ` Frederic Weisbecker
  16 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2012-01-13 18:14 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner,
	Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers,
	Tim Hockin, Tejun Heo, Kirill A. Shutemov, Containers

Add a new subsystem to limit the number of running tasks,
similar to the NR_PROC rlimit but in the scope of a cgroup.

The user can set an upper bound that is checked every
time a task forks in a cgroup or is moved into a cgroup
with that subsystem bound.

The primary goal is to protect against forkbombs that explode
inside a container. The traditional NR_PROC rlimit is not
effective in that case because if we run containers in parallel
under the same user, one of them could starve all the others
by spawning a high number of tasks close to the user-wide limit.

This is a prevention against forkbombs, so it's not meant to
cure the effects of a forkbomb once the system is already in a
state where it's not responsive. It's aimed at preventing the
system from ever reaching that state and at stopping the spread
of tasks early. While defining the limit on the allowed number
of tasks, it's up to the user to find the right balance between
the resources its containers may need and what it can afford
to provide.

As it's totally dissociated from the NR_PROC rlimit, both
can be complementary: the cgroup task counter can set an upper
bound per container and the rlimit can be an upper bound on the
overall set of containers.

Also this subsystem can be used to kill all the tasks in a cgroup
without racing against concurrent forks: by setting the limit of
tasks to 0, any further fork can be rejected. This is a good
way to kill a forkbomb in a container, or simply to kill any
container, without the need to retry an unbounded number of times.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Containers <containers@lists.linux-foundation.org>
---
 Documentation/cgroups/task_counter.txt |  153 ++++++++++++++++++
 include/linux/cgroup_subsys.h          |    8 +
 init/Kconfig                           |    9 +
 kernel/Makefile                        |    1 +
 kernel/cgroup_task_counter.c           |  272 ++++++++++++++++++++++++++++++++
 5 files changed, 443 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/task_counter.txt
 create mode 100644 kernel/cgroup_task_counter.c

diff --git a/Documentation/cgroups/task_counter.txt b/Documentation/cgroups/task_counter.txt
new file mode 100644
index 0000000..1562d88
--- /dev/null
+++ b/Documentation/cgroups/task_counter.txt
@@ -0,0 +1,153 @@
+Task counter subsystem
+
+1. Description
+
+The task counter subsystem limits the number of tasks running
+inside a given cgroup. It behaves like the NR_PROC rlimit but in
+the scope of a cgroup instead of a user.
+
+It has two typical use cases, although more can probably be found:
+
+1.1 Protection against forkbomb in a container
+
+One use case is to protect against forkbombs that explode inside
+a container when that container is implemented using a cgroup. The
+NR_PROC rlimit is known to be a working protection against this type
+of attack but is not suitable anymore when we run containers in
+parallel under the same user. One container could starve all the
+others by spawning a high number of tasks close to the rlimit
+boundary. So in this case we need the limit to be enforced at a
+per-cgroup granularity.
+
+Note this works by preventing forkbomb propagation. It doesn't cure
+the effects of a forkbomb that has already grown enough to make
+the system barely responsive. While defining the limit on the number
+of tasks, it's up to the admin to find the right balance between the
+possible needs of a container and the resources the system can afford
+to provide.
+
+Also the NR_PROC rlimit and this cgroup subsystem are totally
+dissociated. But they can be complementary. The task counter limits
+the containers and the rlimit can provide an upper bound on the whole
+set of containers.
+
+
+1.2 Kill tasks inside a cgroup
+
+Another use case comes along with the forkbomb prevention: it brings
+the ability to kill all tasks inside a cgroup without races. By
+setting the limit of running tasks to 0, one can prevent any
+further fork inside a cgroup and then kill all of its tasks without
+the need to retry an unbounded number of times due to races between
+kills and forks running in parallel (more details in the "Kill a cgroup
+safely" paragraph).
+
+This is useful to kill a forkbomb for example. When its gazillion
+forks are competing with the kills, one needs to ensure this
+operation won't run in a nearly endless loop of retries.
+
+And more generally it is useful to kill a cgroup in a bounded number
+of passes.
+
+
+2. Interface
+
+When a hierarchy is mounted with the task counter subsystem bound, it
+adds two files into the cgroups directories, except the root one:
+
+- tasks.usage contains the number of tasks running inside a cgroup and
+its children in the hierarchy (see paragraph about Inheritance).
+
+- tasks.limit contains the maximum number of tasks that can run inside
+a cgroup. We check this limit when a task forks or when it is migrated
+to a cgroup.
+
+Note that the tasks.limit value can be forced below tasks.usage, in which
+case any new task in the cgroup will be rejected until the tasks.usage
+value goes below tasks.limit.
+
+For optimization reasons, the root directory of a hierarchy doesn't have
+a task counter.
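+
+As an illustration, assuming the hierarchy is mounted on /cgroups (the
+mount point and cgroup names here are arbitrary):
+
+	mount -t cgroup -o tasks cgroup /cgroups
+	mkdir /cgroups/container0
+	echo 100 > /cgroups/container0/tasks.limit
+	cat /cgroups/container0/tasks.usage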
+
+
+3. Inheritance
+
+When a task is added to a cgroup, by way of a cgroup migration or a fork,
+it increases the task counter of that cgroup and of all its ancestors.
+Hence a cgroup is also subject to the limit of its ancestors.
+
+In the following hierarchy:
+
+
+             A
+             |
+             B
+           /   \
+          C     D
+
+
+We have 1 task running in B, one running in C and none running in D.
+It means we have tasks.usage = 1 in C and tasks.usage = 2 in B because
+B counts its task and those of its children.
+
+Now let's set tasks.limit = 2 in B and tasks.limit = 1 in D.
+If we move a new task in D, it will be refused because the limit in B has
+been reached already.
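+
+For instance, with the hierarchy above created under a mounted task
+counter hierarchy, and with A as the current directory (paths are
+illustrative), the limit set in B also constrains C and D:
+
+	echo 2 > B/tasks.limit
+	echo 1 > B/D/tasks.limit
+	# With tasks.usage already at 2 in B, attaching one more task to D
+	# is rejected even though D's own limit is not reached.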
+
+
+4. Kill a cgroup safely
+
+As explained in the description, this subsystem is also helpful to
+kill all tasks in a cgroup safely, after setting tasks.limit to 0,
+so that we don't race against parallel forks in an unbounded number
+of kill iterations.
+
+But there is a small detail to be aware of when using this feature
+that way.
+
+A typical way to proceed would be:
+
+	echo 0 > tasks.limit
+	for TASK in $(cat cgroup.procs)
+	do
+		kill -KILL $TASK
+	done
+
+However there is a small race window where a task can be in the middle
+of being forked but hasn't completed the fork far enough for the PID of
+the child to appear in the cgroup.procs file.
+
+The only way to get it right is to run a loop that reads tasks.usage, kills
+all the tasks in cgroup.procs and exits the loop only if the value in
+tasks.usage was the same as the number of tasks that were in cgroup.procs,
+i.e. the number of tasks that were killed.
+
+It works because the new child appears in tasks.usage right before we check,
+in the fork path, whether the parent has a pending signal, in which case the
+fork is cancelled anyway. So relying on tasks.usage is fine and non-racy.
+
+This race window is tiny and unlikely to happen, so most of the time a single
+kill iteration should be enough. But it's worth knowing about that corner
+case spotted by Oleg Nesterov.
+
+An example of safe use would be:
+
+	echo 0 > tasks.limit
+	END=false
+
+	while [ $END == false ]
+	do
+		NR_TASKS=$(cat tasks.usage)
+		NR_KILLED=0
+
+		for TASK in $(cat cgroup.procs)
+		do
+			let NR_KILLED=NR_KILLED+1
+			kill -KILL $TASK
+		done
+
+		if [ "$NR_TASKS" = "$NR_KILLED" ]
+		then
+			END=true
+		fi
+	done
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ac663c1..5425822 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -59,8 +59,16 @@ SUBSYS(net_cls)
 SUBSYS(blkio)
 #endif
 
+/* */
+
 #ifdef CONFIG_CGROUP_PERF
 SUBSYS(perf)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_TASK_COUNTER
+SUBSYS(tasks)
+#endif
+
+/* */
diff --git a/init/Kconfig b/init/Kconfig
index 43298f9..6dfc8c3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -690,6 +690,15 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
 	  select this option (if, for some reason, they need to disable it
 	  then swapaccount=0 does the trick).
 
+config CGROUP_TASK_COUNTER
+	bool "Control number of tasks in a cgroup"
+	depends on RESOURCE_COUNTERS
+	help
+	  Let the user set an upper bound on the number of tasks allowed to
+	  run in a cgroup. When a task forks or is migrated to a cgroup that
+	  has this subsystem bound, the limit is checked to either accept or
+	  reject the fork/migration.
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/kernel/Makefile b/kernel/Makefile
index e898c5b..833b692 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_TASK_COUNTER) += cgroup_task_counter.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_task_counter.c b/kernel/cgroup_task_counter.c
new file mode 100644
index 0000000..a4d87ac
--- /dev/null
+++ b/kernel/cgroup_task_counter.c
@@ -0,0 +1,272 @@
+/*
+ * Limits on number of tasks subsystem for cgroups
+ *
+ * Copyright (C) 2011-2012 Red Hat, Inc., Frederic Weisbecker <fweisbec@redhat.com>
+ *
+ * Thanks to Andrew Morton, Johannes Weiner, Li Zefan, Oleg Nesterov and
+ * Paul Menage for their suggestions.
+ *
+ */
+
+#include <linux/err.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/res_counter.h>
+
+
+struct task_counter {
+	struct res_counter		res;
+	struct cgroup_subsys_state	css;
+};
+
+/*
+ * The root task counter doesn't exist because it's not part of the
+ * whole task counting. We want to optimize the trivial case where
+ * only the root cgroup exists.
+ */
+static struct cgroup_subsys_state root_css;
+
+
+static inline struct task_counter *cgroup_task_counter(struct cgroup *cgrp)
+{
+	if (!cgrp->parent)
+		return NULL;
+
+	return container_of(cgroup_subsys_state(cgrp, tasks_subsys_id),
+			    struct task_counter, css);
+}
+
+static inline struct res_counter *cgroup_task_res_counter(struct cgroup *cgrp)
+{
+	struct task_counter *cnt;
+
+	cnt = cgroup_task_counter(cgrp);
+	if (!cnt)
+		return NULL;
+
+	return &cnt->res;
+}
+
+static struct cgroup_subsys_state *
+task_counter_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct task_counter *cnt;
+	struct res_counter *parent_res;
+
+	if (!cgrp->parent)
+		return &root_css;
+
+	cnt = kzalloc(sizeof(*cnt), GFP_KERNEL);
+	if (!cnt)
+		return ERR_PTR(-ENOMEM);
+
+	parent_res = cgroup_task_res_counter(cgrp->parent);
+
+	res_counter_init(&cnt->res, parent_res);
+
+	return &cnt->css;
+}
+
+/*
+ * Inherit the limit value of the parent. This is not really to enforce
+ * a limit below or equal to the one of the parent which can be changed
+ * concurrently anyway. This is just to honour the clone flag.
+ */
+static void task_counter_post_clone(struct cgroup_subsys *ss,
+				    struct cgroup *cgrp)
+{
+	/* cgrp can't be root, so cgroup_task_res_counter() can't return NULL */
+	res_counter_inherit(cgroup_task_res_counter(cgrp), RES_LIMIT);
+}
+
+static void task_counter_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct task_counter *cnt = cgroup_task_counter(cgrp);
+
+	kfree(cnt);
+}
+
+/* Uncharge the cgroup the task was attached to */
+static void task_counter_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			      struct cgroup *old_cgrp, struct task_struct *task)
+{
+	/* Optimize for the root cgroup case */
+	if (old_cgrp->parent)
+		res_counter_uncharge(cgroup_task_res_counter(old_cgrp), 1);
+}
+
+static void task_counter_cancel_attach_until(struct res_counter *res,
+					     struct cgroup_taskset *tset,
+					     struct task_struct *until)
+{
+	struct task_struct *tsk;
+	struct res_counter *old_res;
+	struct cgroup *old_cgrp;
+	struct res_counter *common_ancestor;
+
+	cgroup_taskset_for_each(tsk, NULL, tset) {
+		if (tsk == until)
+			break;
+		old_cgrp = cgroup_taskset_cur_cgroup(tset);
+		old_res = cgroup_task_res_counter(old_cgrp);
+		common_ancestor = res_counter_common_ancestor(res, old_res);
+		res_counter_uncharge_until(res, common_ancestor, 1);
+	}
+}
+
+/*
+ * This does more than just probing the ability to attach to the dest cgroup.
+ * We can not just _check_ if we can attach to the destination and do the real
+ * attachment later in task_counter_attach() because a task in the dest
+ * cgroup can fork in the meantime and steal the last remaining count.
+ * Thus we need to charge the dest cgroup right now.
+ */
+static int task_counter_can_attach(struct cgroup_subsys *ss,
+				   struct cgroup *cgrp,
+				   struct cgroup_taskset *tset)
+{
+	struct res_counter *res = cgroup_task_res_counter(cgrp);
+	struct res_counter *old_res;
+	struct cgroup *old_cgrp;
+	struct res_counter *common_ancestor;
+	struct task_struct *tsk;
+	int err = 0;
+
+	cgroup_taskset_for_each(tsk, NULL, tset) {
+		old_cgrp = cgroup_taskset_cur_cgroup(tset);
+		old_res = cgroup_task_res_counter(old_cgrp);
+		/*
+		 * When moving a task from a cgroup to another, we don't want
+		 * to charge the common ancestors, even though they will be
+		 * uncharged later from attach_task(), because during that
+		 * short window between charge and uncharge, a task could fork
+		 * in the ancestor and spuriously fail due to the temporary
+		 * charge.
+		 */
+		common_ancestor = res_counter_common_ancestor(res, old_res);
+
+		/*
+		 * If cgrp is the root then res is NULL, however in this case
+		 * the common ancestor is NULL as well, making the below a NOP.
+		 */
+		err = res_counter_charge_until(res, common_ancestor, 1, NULL);
+		if (err) {
+			task_counter_cancel_attach_until(res, tset, tsk);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+/* Uncharge the dest cgroup that we charged in task_counter_can_attach() */
+static void task_counter_cancel_attach(struct cgroup_subsys *ss,
+				       struct cgroup *cgrp,
+				       struct cgroup_taskset *tset)
+{
+	task_counter_cancel_attach_until(cgroup_task_res_counter(cgrp),
+					 tset, NULL);
+}
+
+/*
+ * This uncharges the old cgroups. We can do that now that we are sure the
+ * attachment can't be cancelled anymore, because this uncharge operation
+ * couldn't be reverted later: a task in the old cgroup could fork after
+ * we uncharge and reach the task counter limit, making a return to the
+ * old cgroup impossible.
+ */
+static void task_counter_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+				struct cgroup_taskset *tset)
+{
+	struct res_counter *res = cgroup_task_res_counter(cgrp);
+	struct task_struct *tsk;
+	struct res_counter *old_res;
+	struct cgroup *old_cgrp;
+	struct res_counter *common_ancestor;
+
+	cgroup_taskset_for_each(tsk, NULL, tset) {
+		old_cgrp = cgroup_taskset_cur_cgroup(tset);
+		old_res = cgroup_task_res_counter(old_cgrp);
+		common_ancestor = res_counter_common_ancestor(res, old_res);
+		res_counter_uncharge_until(old_res, common_ancestor, 1);
+	}
+}
+
+static u64 task_counter_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	int type = cft->private;
+
+	return res_counter_read_u64(cgroup_task_res_counter(cgrp), type);
+}
+
+static int task_counter_write_u64(struct cgroup *cgrp, struct cftype *cft,
+				  u64 val)
+{
+	int type = cft->private;
+
+	res_counter_write_u64(cgroup_task_res_counter(cgrp), type, val);
+
+	return 0;
+}
+
+static struct cftype files[] = {
+	{
+		.name		= "limit",
+		.read_u64	= task_counter_read_u64,
+		.write_u64	= task_counter_write_u64,
+		.private	= RES_LIMIT,
+	},
+
+	{
+		.name		= "usage",
+		.read_u64	= task_counter_read_u64,
+		.private	= RES_USAGE,
+	},
+};
+
+static int task_counter_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	if (!cgrp->parent)
+		return 0;
+
+	return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
+}
+
+/*
+ * Charge the task counter with the new child coming, or reject it if we
+ * reached the limit.
+ */
+static int task_counter_fork(struct cgroup_subsys *ss,
+			     struct task_struct *child)
+{
+	struct cgroup_subsys_state *css;
+	struct cgroup *cgrp;
+	int err;
+
+	css = child->cgroups->subsys[tasks_subsys_id];
+	cgrp = css->cgroup;
+
+	/* Optimize for the root cgroup case, which doesn't have a limit */
+	if (!cgrp->parent)
+		return 0;
+
+	err = res_counter_charge(cgroup_task_res_counter(cgrp), 1, NULL);
+	if (err)
+		return -EAGAIN;
+
+	return 0;
+}
+
+struct cgroup_subsys tasks_subsys = {
+	.name			= "tasks",
+	.subsys_id		= tasks_subsys_id,
+	.create			= task_counter_create,
+	.post_clone		= task_counter_post_clone,
+	.destroy		= task_counter_destroy,
+	.exit			= task_counter_exit,
+	.can_attach		= task_counter_can_attach,
+	.cancel_attach		= task_counter_cancel_attach,
+	.attach			= task_counter_attach,
+	.fork			= task_counter_fork,
+	.populate		= task_counter_populate,
+};
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/8] cgroups: add res_counter_write_u64() API
  2012-01-13 18:13 ` Frederic Weisbecker
@ 2012-01-16 12:27   ` Kirill A. Shutemov
  2012-01-25 18:31     ` Frederic Weisbecker
  0 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2012-01-16 12:27 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov,
	Kay Sievers, Tim Hockin, Tejun Heo, Andrew Morton

On Fri, Jan 13, 2012 at 07:13:47PM +0100, Frederic Weisbecker wrote:
> Extend the resource counter API with a mirror of res_counter_read_u64() to
> make it handy to update a resource counter value from a cgroup subsystem
> u64 value file.

I still think it's worth having two versions of res_counter_write_u64().
From my POV, it's clean enough, consistent with res_counter_read_u64()
and faster on 64-bit systems.

What do you think?

---
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..1b3fe05 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -82,6 +82,8 @@ int res_counter_memparse_write_strategy(const char *buf,
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buffer, write_strategy_fn write_strategy);
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 6d269cc..aad4ddb 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -167,12 +167,27 @@ int res_counter_memparse_write_strategy(const char *buf,
 	return 0;
 }
 
+#if BITS_PER_LONG == 32
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&counter->lock, flags);
+	*res_counter_member(counter, member) = val;
+	spin_unlock_irqrestore(&counter->lock, flags);
+}
+#else
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	*res_counter_member(counter, member) = val;
+}
+#endif
+
 int res_counter_write(struct res_counter *counter, int member,
 		      const char *buf, write_strategy_fn write_strategy)
 {
 	char *end;
-	unsigned long flags;
-	unsigned long long tmp, *val;
+	unsigned long long tmp;
 
 	if (write_strategy) {
 		if (write_strategy(buf, &tmp))
@@ -182,9 +197,8 @@ int res_counter_write(struct res_counter *counter, int member,
 		if (*end != '\0')
 			return -EINVAL;
 	}
-	spin_lock_irqsave(&counter->lock, flags);
-	val = res_counter_member(counter, member);
-	*val = tmp;
-	spin_unlock_irqrestore(&counter->lock, flags);
+
+	res_counter_write_u64(counter, member, tmp);
+
 	return 0;
 }
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 8/8] cgroups: Add a task counter subsystem
  2012-01-13 18:14 ` [PATCH 8/8] cgroups: Add a task counter subsystem Frederic Weisbecker
@ 2012-01-16 12:38   ` Kirill A. Shutemov
  0 siblings, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2012-01-16 12:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Glauber Costa, Cgroups, Daniel J Walsh, Daniel P. Berrange,
	KAMEZAWA Hiroyuki, Max Kellermann, Mandeep Singh Baines,
	Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali,
	Oleg Nesterov, Andrew Morton, Kay Sievers, Tim Hockin, Tejun Heo,
	Containers

On Fri, Jan 13, 2012 at 07:14:00PM +0100, Frederic Weisbecker wrote:
> Add a new subsystem to limit the number of running tasks,
> similar to the NR_PROC rlimit but in the scope of a cgroup.
> 
> The user can set an upper bound that is checked every
> time a task forks in a cgroup or is moved into a cgroup
> with that subsystem bound.
> 
> The primary goal is to protect against forkbombs that explode
> inside a container. The traditional NR_PROC rlimit is not
> effective in that case because if we run containers in parallel
> under the same user, one of them could starve all the others
> by spawning a high number of tasks close to the user-wide limit.
> 
> This is a prevention against forkbombs, so it's not meant to
> cure the effects of a forkbomb once the system is already in a
> state where it's not responsive. It's aimed at preventing the
> system from ever reaching that state and at stopping the spread
> of tasks early. While defining the limit on the allowed number
> of tasks, it's up to the user to find the right balance between
> the resources its containers may need and what it can afford
> to provide.
> 
> As it's totally dissociated from the NR_PROC rlimit, both
> can be complementary: the cgroup task counter can set an upper
> bound per container and the rlimit can be an upper bound on the
> overall set of containers.
> 
> Also this subsystem can be used to kill all the tasks in a cgroup
> without racing against concurrent forks: by setting the limit of
> tasks to 0, any further fork can be rejected. This is a good
> way to kill a forkbomb in a container, or simply to kill any
> container, without the need to retry an unbounded number of times.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Aditya Kali <adityakali@google.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: Tim Hockin <thockin@hockin.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Containers <containers@lists.linux-foundation.org>

Acked-by: Kirill A. Shutemov <kirill@shutemov.name>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/8] cgroups: add res_counter_write_u64() API
  2012-01-16 12:27   ` Kirill A. Shutemov
@ 2012-01-25 18:31     ` Frederic Weisbecker
  2012-01-25 18:35       ` Kirill A. Shutemov
  0 siblings, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2012-01-25 18:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: LKML, Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov,
	Kay Sievers, Tim Hockin, Tejun Heo, Andrew Morton

On Mon, Jan 16, 2012 at 02:27:13PM +0200, Kirill A. Shutemov wrote:
> On Fri, Jan 13, 2012 at 07:13:47PM +0100, Frederic Weisbecker wrote:
> > Extend the resource counter API with a mirror of res_counter_read_u64() to
> > make it handy to update a resource counter value from a cgroup subsystem
> > u64 value file.
> 
> I still think it's worth having two versions of res_counter_write_u64().
> From my POV, it's clean enough, consistent with res_counter_read_u64()
> and faster on 64-bit systems.
> 
> What do you think?

Yeah right let's do this. Can I get your signed-off-by to take the below?

Thanks.

> 
> ---
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index c9d625c..1b3fe05 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -82,6 +82,8 @@ int res_counter_memparse_write_strategy(const char *buf,
>  int res_counter_write(struct res_counter *counter, int member,
>  		      const char *buffer, write_strategy_fn write_strategy);
>  
> +void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
> +
>  /*
>   * the field descriptors. one for each member of res_counter
>   */
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 6d269cc..aad4ddb 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -167,12 +167,27 @@ int res_counter_memparse_write_strategy(const char *buf,
>  	return 0;
>  }
>  
> +#if BITS_PER_LONG == 32
> +void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&counter->lock, flags);
> +	*res_counter_member(counter, member) = val;
> +	spin_unlock_irqrestore(&counter->lock, flags);
> +}
> +#else
> +void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
> +{
> +	*res_counter_member(counter, member) = val;
> +}
> +#endif
> +
>  int res_counter_write(struct res_counter *counter, int member,
>  		      const char *buf, write_strategy_fn write_strategy)
>  {
>  	char *end;
> -	unsigned long flags;
> -	unsigned long long tmp, *val;
> +	unsigned long long tmp;
>  
>  	if (write_strategy) {
>  		if (write_strategy(buf, &tmp))
> @@ -182,9 +197,8 @@ int res_counter_write(struct res_counter *counter, int member,
>  		if (*end != '\0')
>  			return -EINVAL;
>  	}
> -	spin_lock_irqsave(&counter->lock, flags);
> -	val = res_counter_member(counter, member);
> -	*val = tmp;
> -	spin_unlock_irqrestore(&counter->lock, flags);
> +
> +	res_counter_write_u64(counter, member, tmp);
> +
>  	return 0;
>  }
> -- 
>  Kirill A. Shutemov
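
To illustrate the intended use (this is not part of the patch), here is
a rough sketch of a subsystem u64 control file wired to the new helper,
using the cgroup cftype API of that time. The accessor
cgroup_task_res_counter() and the file name are made-up placeholders:

	#include <linux/cgroup.h>
	#include <linux/res_counter.h>

	/* hypothetical helper returning the res_counter attached to a cgroup */
	struct res_counter *cgroup_task_res_counter(struct cgroup *cgrp);

	static u64 task_counter_limit_read(struct cgroup *cgrp, struct cftype *cft)
	{
		return res_counter_read_u64(cgroup_task_res_counter(cgrp),
					    RES_LIMIT);
	}

	static int task_counter_limit_write(struct cgroup *cgrp, struct cftype *cft,
					    u64 val)
	{
		/* the helper hides the 32-bit locking, the handler stays trivial */
		res_counter_write_u64(cgroup_task_res_counter(cgrp),
				      RES_LIMIT, val);
		return 0;
	}

	static struct cftype task_counter_files[] = {
		{
			.name		= "limit",
			.read_u64	= task_counter_limit_read,
			.write_u64	= task_counter_limit_write,
		},
	};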

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/8] cgroups: add res_counter_write_u64() API
  2012-01-25 18:31     ` Frederic Weisbecker
@ 2012-01-25 18:35       ` Kirill A. Shutemov
  0 siblings, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2012-01-25 18:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov,
	Kay Sievers, Tim Hockin, Tejun Heo, Andrew Morton

On Wed, Jan 25, 2012 at 07:31:03PM +0100, Frederic Weisbecker wrote:
> On Mon, Jan 16, 2012 at 02:27:13PM +0200, Kirill A. Shutemov wrote:
> > On Fri, Jan 13, 2012 at 07:13:47PM +0100, Frederic Weisbecker wrote:
> > > Extend the resource counter API with a mirror of res_counter_read_u64() to
> > > make it handy to update a resource counter value from a cgroup subsystem
> > > u64 value file.
> > 
> > I still think it is worth having two versions of res_counter_write_u64().
> > From my POV it's clean enough, consistent with res_counter_read_u64(),
> > and faster on 64-bit systems.
> > 
> > What do you think?
> 
> Yeah, you're right, let's do this. Can I get your Signed-off-by so I can take the patch below?

Yes.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2012-01-25 18:35 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-13 18:13 [PATCH 0/8] cgroups: Task counter subsystem v7 Frederic Weisbecker
2012-01-13 18:13 ` Frederic Weisbecker
2012-01-13 18:13 ` [PATCH 1/8] cgroups: add res_counter_write_u64() API Frederic Weisbecker
2012-01-13 18:13 ` Frederic Weisbecker
2012-01-16 12:27   ` Kirill A. Shutemov
2012-01-25 18:31     ` Frederic Weisbecker
2012-01-25 18:35       ` Kirill A. Shutemov
2012-01-13 18:13 ` [PATCH 2/8] cgroups: new resource counter inheritance API Frederic Weisbecker
2012-01-13 18:13 ` Frederic Weisbecker
2012-01-13 18:13 ` [PATCH 3/8] cgroups: ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
2012-01-13 18:13 ` Frederic Weisbecker
2012-01-13 18:13 ` [PATCH 4/8] cgroups: add res counter common ancestor searching Frederic Weisbecker
2012-01-13 18:13 ` Frederic Weisbecker
2012-01-13 18:13 ` [PATCH 5/8] res_counter: allow charge failure pointer to be null Frederic Weisbecker
2012-01-13 18:13 ` Frederic Weisbecker
2012-01-13 18:13 ` [PATCH 6/8] cgroups: pull up res counter charge failure interpretation to caller Frederic Weisbecker
2012-01-13 18:13 ` Frederic Weisbecker
2012-01-13 18:13 ` [PATCH 7/8] cgroups: allow subsystems to cancel a fork Frederic Weisbecker
2012-01-13 18:13 ` Frederic Weisbecker
2012-01-13 18:14 ` [PATCH 8/8] cgroups: Add a task counter subsystem Frederic Weisbecker
2012-01-16 12:38   ` Kirill A. Shutemov
2012-01-13 18:14 ` Frederic Weisbecker
