* [PATCH] cgroup: Remove RCU from task->cgroups
@ 2010-11-21  2:00 Colin Cross
       [not found] ` <1290304824-22722-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
  2010-11-21 23:02 ` Colin Cross
  0 siblings, 2 replies; 55+ messages in thread
From: Colin Cross @ 2010-11-21  2:00 UTC (permalink / raw)
  To: linux-kernel; +Cc: Colin Cross, Paul Menage, Li Zefan, containers

The synchronize_rcu call in cgroup_attach_task can be very
expensive.  All fastpath accesses to task->cgroups already
use task_lock() or cgroup_lock() to protect against updates,
and only the CGROUP_DEBUG files have RCU read-side critical
sections.

This patch replaces rcu_read_lock() with task_lock(current)
around the debug file accesses to current->cgroups and removes
the synchronize_rcu call in cgroup_attach_task.

Signed-off-by: Colin Cross <ccross@android.com>
---
 kernel/cgroup.c |   22 ++++++++--------------
 1 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66a416b..4a40183 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -725,14 +725,11 @@ static struct cgroup *task_cgroup_from_root(struct task_struct *task,
  * cgroup_attach_task(), which overwrites one tasks cgroup pointer with
  * another.  It does so using cgroup_mutex, however there are
  * several performance critical places that need to reference
- * task->cgroup without the expense of grabbing a system global
+ * task->cgroups without the expense of grabbing a system global
  * mutex.  Therefore except as noted below, when dereferencing or, as
- * in cgroup_attach_task(), modifying a task'ss cgroup pointer we use
+ * in cgroup_attach_task(), modifying a task's cgroups pointer we use
  * task_lock(), which acts on a spinlock (task->alloc_lock) already in
  * the task_struct routinely used for such matters.
- *
- * P.S.  One more locking exception.  RCU is used to guard the
- * update of a tasks cgroup pointer by cgroup_attach_task()
  */
 
 /**
@@ -1786,7 +1783,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		retval = -ESRCH;
 		goto out;
 	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
+	tsk->cgroups = newcg;
 	task_unlock(tsk);
 
 	/* Update the css_set linked lists if we're using them */
@@ -1802,7 +1799,6 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
 	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
-	synchronize_rcu();
 	put_css_set(cg);
 
 	/*
@@ -4827,9 +4823,9 @@ static u64 current_css_set_refcount_read(struct cgroup *cont,
 {
 	u64 count;
 
-	rcu_read_lock();
+	task_lock(current);
 	count = atomic_read(&current->cgroups->refcount);
-	rcu_read_unlock();
+	task_unlock(current);
 	return count;
 }
 
@@ -4838,12 +4834,10 @@ static int current_css_set_cg_links_read(struct cgroup *cont,
 					 struct seq_file *seq)
 {
 	struct cg_cgroup_link *link;
-	struct css_set *cg;
 
 	read_lock(&css_set_lock);
-	rcu_read_lock();
-	cg = rcu_dereference(current->cgroups);
-	list_for_each_entry(link, &cg->cg_links, cg_link_list) {
+	task_lock(current);
+	list_for_each_entry(link, &current->cgroups->cg_links, cg_link_list) {
 		struct cgroup *c = link->cgrp;
 		const char *name;
 
@@ -4854,7 +4848,7 @@ static int current_css_set_cg_links_read(struct cgroup *cont,
 		seq_printf(seq, "Root %d group %s\n",
 			   c->root->hierarchy_id, name);
 	}
-	rcu_read_unlock();
+	task_unlock(current);
 	read_unlock(&css_set_lock);
 	return 0;
 }
-- 
1.7.3.1
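
For readers skimming the archive: the two access patterns the commit
message contrasts look roughly like this.  This is an illustrative
sketch against the 2.6.36-era cgroup internals, not code from the
patch; both helper names are hypothetical.

	/* Stable view: task_lock() pins tsk->cgroups itself, because
	 * cgroup_attach_task() takes the same lock before swapping the
	 * pointer. */
	static int css_set_refcount_stable(struct task_struct *tsk)
	{
		int count;

		task_lock(tsk);
		count = atomic_read(&tsk->cgroups->refcount);
		task_unlock(tsk);
		return count;
	}

	/* RCU view: rcu_read_lock() only keeps the css_set memory from
	 * being freed; the pointer itself may be swapped at any time. */
	static int css_set_refcount_rcu(struct task_struct *tsk)
	{
		struct css_set *cg;
		int count;

		rcu_read_lock();
		cg = rcu_dereference(tsk->cgroups);
		count = atomic_read(&cg->refcount);
		rcu_read_unlock();
		return count;
	}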



* Re: [PATCH] cgroup: Remove RCU from task->cgroups
  2010-11-21  2:00 [PATCH] cgroup: Remove RCU from task->cgroups Colin Cross
       [not found] ` <1290304824-22722-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
@ 2010-11-21 23:02 ` Colin Cross
  2010-11-22  4:06   ` [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task Colin Cross
       [not found]   ` <AANLkTikx6d0_VFtZ4zWQucRCf=vFt7N2M6=0jpnKasEE-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 55+ messages in thread
From: Colin Cross @ 2010-11-21 23:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: Colin Cross, Paul Menage, Li Zefan, containers

On Sat, Nov 20, 2010 at 6:00 PM, Colin Cross <ccross@android.com> wrote:
> The synchronize_rcu call in cgroup_attach_task can be very
> expensive.  All fastpath accesses to task->cgroups already
> use task_lock() or cgroup_lock() to protect against updates,
> and only the CGROUP_DEBUG files have RCU read-side critical
> sections.
>
> This patch replaces rcu_read_lock() with task_lock(current)
> around the debug file accesses to current->cgroups and removes
> the synchronize_rcu call in cgroup_attach_task.
>
> Signed-off-by: Colin Cross <ccross@android.com>
> ---
>  kernel/cgroup.c |   22 ++++++++--------------
>  1 files changed, 8 insertions(+), 14 deletions(-)
>

This patch isn't correct, there's an rcu_dereference I missed inside
task_group(), and that's the important one.
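
The rcu_dereference in question is buried in the scheduler's cgroup
lookup.  Roughly, paraphrasing the 2.6.36-era helpers rather than
quoting them exactly:

	/* kernel/sched.c (approximate sketch) */
	static inline struct task_group *task_group(struct task_struct *p)
	{
		struct cgroup_subsys_state *css;

		/* task_subsys_state() expands to an rcu_dereference() of
		 * p->cgroups, so dropping the grace period in
		 * cgroup_attach_task() would let this return a pointer
		 * into a freed css_set. */
		css = task_subsys_state(p, cpu_cgroup_subsys_id);
		return container_of(css, struct task_group, css);
	}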


* [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
  2010-11-21 23:02 ` Colin Cross
@ 2010-11-22  4:06   ` Colin Cross
       [not found]     ` <1290398767-15230-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
                       ` (2 more replies)
       [not found]   ` <AANLkTikx6d0_VFtZ4zWQucRCf=vFt7N2M6=0jpnKasEE-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 3 replies; 55+ messages in thread
From: Colin Cross @ 2010-11-22  4:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Colin Cross, Paul Menage, Li Zefan, containers

The synchronize_rcu call in cgroup_attach_task can be very
expensive.  All fastpath accesses to task->cgroups that expect
task->cgroups not to change already use task_lock() or
cgroup_lock() to protect against updates, and, in cgroup.c,
only the CGROUP_DEBUG files have RCU read-side critical
sections.

sched.c uses RCU read-side-critical sections on task->cgroups,
but only to ensure that a dereference of task->cgroups does
not become invalid, not that it doesn't change.

This patch adds a function put_css_set_rcu, which delays the
put until after a grace period has elapsed.  This ensures that
any RCU read-side critical sections that dereferenced
task->cgroups in sched.c have completed before the css_set is
deleted.  The synchronize_rcu()/put_css_set() combo in
cgroup_attach_task() can then be replaced with
put_css_set_rcu().

Also converts the CGROUP_DEBUG files that access
current->cgroups to use task_lock(current) instead of
rcu_read_lock().

Signed-off-by: Colin Cross <ccross@android.com>

---

This version fixes the problems with the previous patch by
keeping the use of RCU in cgroup_attach_task, but allowing
cgroup_attach_task to return immediately by deferring the
final put_css_set to an rcu callback.

 include/linux/cgroup.h |    4 +++
 kernel/cgroup.c        |   58 ++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ed4ba11..fd26218 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -287,6 +287,10 @@ struct css_set {
 
 	/* For RCU-protected deletion */
 	struct rcu_head rcu_head;
+
+	/* For RCU-delayed puts */
+	atomic_t delayed_put_count;
+	struct work_struct delayed_put_work;
 };
 
 /*
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66a416b..c7348e7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -298,7 +298,8 @@ static int cgroup_init_idr(struct cgroup_subsys *ss,
 
 /* css_set_lock protects the list of css_set objects, and the
  * chain of tasks off each css_set.  Nests outside task->alloc_lock
- * due to cgroup_iter_start() */
+ * due to cgroup_iter_start().  Never locked in irq context, so
+ * the non-irq variants of write_lock and read_lock are used. */
 static DEFINE_RWLOCK(css_set_lock);
 static int css_set_count;
 
@@ -396,6 +397,39 @@ static inline void put_css_set_taskexit(struct css_set *cg)
 	__put_css_set(cg, 1);
 }
 
+/* work function, executes in process context */
+static void __put_css_set_rcu(struct work_struct *work)
+{
+	struct css_set *cg;
+	cg = container_of(work, struct css_set, delayed_put_work);
+
+	while (atomic_add_unless(&cg->delayed_put_count, -1, 0))
+		put_css_set(cg);
+}
+
+/* rcu callback, executes in softirq context */
+static void _put_css_set_rcu(struct rcu_head *obj)
+{
+	struct css_set *cg = container_of(obj, struct css_set, rcu_head);
+
+	/* the rcu callback happens in softirq context, but css_set_lock
+	 * is not irq safe, so bounce to process context.
+	 */
+	schedule_work(&cg->delayed_put_work);
+}
+
+/* put_css_set_rcu - helper function to delay a put until after an rcu
+ * grace period
+ *
+ * free_css_set_rcu can never be called if there are outstanding
+ * put_css_set_rcu calls, so we can reuse cg->rcu_head.
+ */
+static inline void put_css_set_rcu(struct css_set *cg)
+{
+	if (atomic_inc_return(&cg->delayed_put_count) == 1)
+		call_rcu(&cg->rcu_head, _put_css_set_rcu);
+}
+
 /*
  * compare_css_sets - helper function for find_existing_css_set().
  * @cg: candidate css_set being tested
@@ -620,9 +654,11 @@ static struct css_set *find_css_set(
 	}
 
 	atomic_set(&res->refcount, 1);
+	atomic_set(&res->delayed_put_count, 0);
 	INIT_LIST_HEAD(&res->cg_links);
 	INIT_LIST_HEAD(&res->tasks);
 	INIT_HLIST_NODE(&res->hlist);
+	INIT_WORK(&res->delayed_put_work, __put_css_set_rcu);
 
 	/* Copy the set of subsystem state objects generated in
 	 * find_existing_css_set() */
@@ -725,9 +761,9 @@ static struct cgroup *task_cgroup_from_root(struct task_struct *task,
  * cgroup_attach_task(), which overwrites one tasks cgroup pointer with
  * another.  It does so using cgroup_mutex, however there are
  * several performance critical places that need to reference
- * task->cgroup without the expense of grabbing a system global
+ * task->cgroups without the expense of grabbing a system global
  * mutex.  Therefore except as noted below, when dereferencing or, as
- * in cgroup_attach_task(), modifying a task'ss cgroup pointer we use
+ * in cgroup_attach_task(), modifying a task's cgroups pointer we use
  * task_lock(), which acts on a spinlock (task->alloc_lock) already in
  * the task_struct routinely used for such matters.
  *
@@ -1802,8 +1838,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
 	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
-	synchronize_rcu();
-	put_css_set(cg);
+	put_css_set_rcu(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -3900,6 +3935,7 @@ int __init cgroup_init_early(void)
 	INIT_LIST_HEAD(&init_css_set.cg_links);
 	INIT_LIST_HEAD(&init_css_set.tasks);
 	INIT_HLIST_NODE(&init_css_set.hlist);
+	INIT_WORK(&init_css_set.delayed_put_work, __put_css_set_rcu);
 	css_set_count = 1;
 	init_cgroup_root(&rootnode);
 	root_count = 1;
@@ -4827,9 +4863,9 @@ static u64 current_css_set_refcount_read(struct cgroup *cont,
 {
 	u64 count;
 
-	rcu_read_lock();
+	task_lock(current);
 	count = atomic_read(&current->cgroups->refcount);
-	rcu_read_unlock();
+	task_unlock(current);
 	return count;
 }
 
@@ -4838,12 +4874,10 @@ static int current_css_set_cg_links_read(struct cgroup *cont,
 					 struct seq_file *seq)
 {
 	struct cg_cgroup_link *link;
-	struct css_set *cg;
 
 	read_lock(&css_set_lock);
-	rcu_read_lock();
-	cg = rcu_dereference(current->cgroups);
-	list_for_each_entry(link, &cg->cg_links, cg_link_list) {
+	task_lock(current);
+	list_for_each_entry(link, &current->cgroups->cg_links, cg_link_list) {
 		struct cgroup *c = link->cgrp;
 		const char *name;
 
@@ -4854,7 +4888,7 @@ static int current_css_set_cg_links_read(struct cgroup *cont,
 		seq_printf(seq, "Root %d group %s\n",
 			   c->root->hierarchy_id, name);
 	}
-	rcu_read_unlock();
+	task_unlock(current);
 	read_unlock(&css_set_lock);
 	return 0;
 }
-- 
1.7.3.1



* Re: [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
  2010-11-22  4:06   ` [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task Colin Cross
       [not found]     ` <1290398767-15230-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
@ 2010-11-23  8:14     ` Li Zefan
  2010-11-23  8:58       ` Colin Cross
       [not found]       ` <4CEB77E0.10202-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2010-11-24  1:24     ` Paul Menage
  2 siblings, 2 replies; 55+ messages in thread
From: Li Zefan @ 2010-11-23  8:14 UTC (permalink / raw)
  To: Colin Cross; +Cc: linux-kernel, Paul Menage, containers

12:06, Colin Cross wrote:
> The synchronize_rcu call in cgroup_attach_task can be very
> expensive.  All fastpath accesses to task->cgroups that expect
> task->cgroups not to change already use task_lock() or
> cgroup_lock() to protect against updates, and, in cgroup.c,
> only the CGROUP_DEBUG files have RCU read-side critical
> sections.
> 
> sched.c uses RCU read-side-critical sections on task->cgroups,
> but only to ensure that a dereference of task->cgroups does
> not become invalid, not that it doesn't change.
> 

Other cgroup subsystems also use rcu_read_lock to access task->cgroups,
for example net_cls cgroup and device cgroup.

I don't think the performance of task attaching is so critically
important that we have to use call_rcu() instead of synchronize_rcu()?

> This patch adds a function put_css_set_rcu, which delays the
> put until after a grace period has elapsed.  This ensures that
> any RCU read-side critical sections that dereferenced
> task->cgroups in sched.c have completed before the css_set is
> deleted.  The synchronize_rcu()/put_css_set() combo in
> cgroup_attach_task() can then be replaced with
> put_css_set_rcu().
> 

> Also converts the CGROUP_DEBUG files that access
> current->cgroups to use task_lock(current) instead of
> rcu_read_lock().
> 

What for? What do we gain from doing this for those debug
interfaces?

> Signed-off-by: Colin Cross <ccross@android.com>
> 
> ---
> 
> This version fixes the problems with the previous patch by
> keeping the use of RCU in cgroup_attach_task, but allowing
> cgroup_attach_task to return immediately by deferring the
> final put_css_set to an rcu callback.
> 
>  include/linux/cgroup.h |    4 +++
>  kernel/cgroup.c        |   58 ++++++++++++++++++++++++++++++++++++++----------
>  2 files changed, 50 insertions(+), 12 deletions(-)


* Re: [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
  2010-11-23  8:14     ` Li Zefan
@ 2010-11-23  8:58       ` Colin Cross
       [not found]         ` <AANLkTimjpW6NZ6fEiVi0VzjkpQGVob4=VHsohXUiDQkJ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2010-11-23 20:22         ` Colin Cross
       [not found]       ` <4CEB77E0.10202-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  1 sibling, 2 replies; 55+ messages in thread
From: Colin Cross @ 2010-11-23  8:58 UTC (permalink / raw)
  To: Li Zefan; +Cc: linux-kernel, Paul Menage, containers

On Tue, Nov 23, 2010 at 12:14 AM, Li Zefan <lizf@cn.fujitsu.com> wrote:
> 12:06, Colin Cross wrote:
>> The synchronize_rcu call in cgroup_attach_task can be very
>> expensive.  All fastpath accesses to task->cgroups that expect
>> task->cgroups not to change already use task_lock() or
>> cgroup_lock() to protect against updates, and, in cgroup.c,
>> only the CGROUP_DEBUG files have RCU read-side critical
>> sections.
>>
>> sched.c uses RCU read-side-critical sections on task->cgroups,
>> but only to ensure that a dereference of task->cgroups does
>> not become invalid, not that it doesn't change.
>>
>
> Other cgroup subsystems also use rcu_read_lock to access task->cgroups,
> for example net_cls cgroup and device cgroup.
I believe the same comment applies as for sched.c; I'll update the commit message.

> I don't think the performance of task attaching is so critically
> important that we have to use call_rcu() instead of synchronize_rcu()?
On my desktop, moving a task between cgroups averages 100 ms, and on
a Tegra2 SMP ARM platform it takes 20 ms.  Moving a task with many
threads can take hundreds of milliseconds or more.  With this patch it
takes 50 microseconds to move one task, a 400x improvement.

>> This patch adds a function put_css_set_rcu, which delays the
>> put until after a grace period has elapsed.  This ensures that
>> any RCU read-side critical sections that dereferenced
>> task->cgroups in sched.c have completed before the css_set is
>> deleted.  The synchronize_rcu()/put_css_set() combo in
>> cgroup_attach_task() can then be replaced with
>> put_css_set_rcu().
>>
>
>> Also converts the CGROUP_DEBUG files that access
>> current->cgroups to use task_lock(current) instead of
>> rcu_read_lock().
>>
>
> What for? What do we gain from doing this for those debug
> interfaces?
Left over from the previous patch that incorrectly dropped RCU
completely.  I'll put the rcu_read_locks back.

>> Signed-off-by: Colin Cross <ccross@android.com>
>>
>> ---
>>
>> This version fixes the problems with the previous patch by
>> keeping the use of RCU in cgroup_attach_task, but allowing
>> cgroup_attach_task to return immediately by deferring the
>> final put_css_set to an rcu callback.
>>
>>  include/linux/cgroup.h |    4 +++
>>  kernel/cgroup.c        |   58 ++++++++++++++++++++++++++++++++++++++----------
>>  2 files changed, 50 insertions(+), 12 deletions(-)
>


* Re: [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
  2010-11-23  8:58       ` Colin Cross
       [not found]         ` <AANLkTimjpW6NZ6fEiVi0VzjkpQGVob4=VHsohXUiDQkJ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-11-23 20:22         ` Colin Cross
  1 sibling, 0 replies; 55+ messages in thread
From: Colin Cross @ 2010-11-23 20:22 UTC (permalink / raw)
  To: Li Zefan; +Cc: linux-kernel, Paul Menage, containers

On Tue, Nov 23, 2010 at 12:58 AM, Colin Cross <ccross@android.com> wrote:
> On Tue, Nov 23, 2010 at 12:14 AM, Li Zefan <lizf@cn.fujitsu.com> wrote:
>> 12:06, Colin Cross wrote:
>>> The synchronize_rcu call in cgroup_attach_task can be very
>>> expensive.  All fastpath accesses to task->cgroups that expect
>>> task->cgroups not to change already use task_lock() or
>>> cgroup_lock() to protect against updates, and, in cgroup.c,
>>> only the CGROUP_DEBUG files have RCU read-side critical
>>> sections.
>>>
>>> sched.c uses RCU read-side-critical sections on task->cgroups,
>>> but only to ensure that a dereference of task->cgroups does
>>> not become invalid, not that it doesn't change.
>>>
>>
>> Other cgroup subsystems also use rcu_read_lock to access task->cgroups,
>> for example net_cls cgroup and device cgroup.
> I believe the same comment applies as for sched.c; I'll update the commit message.
>
>> I don't think the performance of task attaching is so critically
>> important that we have to use call_rcu() instead of synchronize_rcu()?
> On my desktop, moving a task between cgroups averages 100 ms, and on
> a Tegra2 SMP ARM platform it takes 20 ms.  Moving a task with many
> threads can take hundreds of milliseconds or more.  With this patch it
> takes 50 microseconds to move one task, a 400x improvement.
>
>>> This patch adds a function put_css_set_rcu, which delays the
>>> put until after a grace period has elapsed.  This ensures that
>>> any RCU read-side critical sections that dereferenced
>>> task->cgroups in sched.c have completed before the css_set is
>>> deleted.  The synchronize_rcu()/put_css_set() combo in
>>> cgroup_attach_task() can then be replaced with
>>> put_css_set_rcu().
>>>
>>
>>> Also converts the CGROUP_DEBUG files that access
>>> current->cgroups to use task_lock(current) instead of
>>> rcu_read_lock().
>>>
>>
>> What for? What do we gain from doing this for those debug
>> interfaces?
> Left over from the previous patch that incorrectly dropped RCU
> completely.  I'll put the rcu_read_locks back.
>
>>> Signed-off-by: Colin Cross <ccross@android.com>
>>>
>>> ---
>>>
>>> This version fixes the problems with the previous patch by
>>> keeping the use of RCU in cgroup_attach_task, but allowing
>>> cgroup_attach_task to return immediately by deferring the
>>> final put_css_set to an rcu callback.
>>>
>>>  include/linux/cgroup.h |    4 +++
>>>  kernel/cgroup.c        |   58 ++++++++++++++++++++++++++++++++++++++----------
>>>  2 files changed, 50 insertions(+), 12 deletions(-)
>>
>

This patch has another problem - calling put_css_set_rcu twice before
an rcu grace period has elapsed would not guarantee the appropriate
rcu grace period for the second call.  I'll try a new approach, moving
the parts of put_css_set that need to be protected by rcu into
free_css_set_rcu.
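
To make the hazard concrete, an illustrative timeline (not code from
any patch in this thread), assuming the single rcu_head is shared:

	/*
	 * CPU0: put_css_set_rcu(cg)  delayed_put_count 0 -> 1,
	 *                            call_rcu(&cg->rcu_head, ...)
	 * CPU1: rcu_read_lock(); cg = rcu_dereference(tsk->cgroups); ...
	 * CPU2: put_css_set_rcu(cg)  delayed_put_count 1 -> 2,
	 *                            no second call_rcu is queued
	 *
	 * The grace period started by CPU0's call_rcu is allowed to end
	 * while CPU1's read-side section is still running, yet the one
	 * callback drops both references.  The second put therefore gets
	 * no grace period of its own.
	 */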


* Re: [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
  2010-11-22  4:06   ` [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task Colin Cross
       [not found]     ` <1290398767-15230-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
  2010-11-23  8:14     ` Li Zefan
@ 2010-11-24  1:24     ` Paul Menage
  2010-11-24  2:06       ` Li Zefan
       [not found]       ` <AANLkTi=4-OgPUugnUBaqSU3oC=3wxTjAsOB_Ais3Or+i-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 2 replies; 55+ messages in thread
From: Paul Menage @ 2010-11-24  1:24 UTC (permalink / raw)
  To: Colin Cross; +Cc: linux-kernel, Li Zefan, containers

On Sun, Nov 21, 2010 at 8:06 PM, Colin Cross <ccross@android.com> wrote:
> The synchronize_rcu call in cgroup_attach_task can be very
> expensive.  All fastpath accesses to task->cgroups that expect
> task->cgroups not to change already use task_lock() or
> cgroup_lock() to protect against updates, and, in cgroup.c,
> only the CGROUP_DEBUG files have RCU read-side critical
> sections.

I definitely agree with the goal of using lighter-weight
synchronization than the current synchronize_rcu() call. However,
there are definitely some subtleties to worry about in this code.

One of the reasons originally for the current synchronization was to
avoid the case of calling subsystem destroy() callbacks while there
could still be threads with RCU references to the subsystem state. The
fact that synchronize_rcu() was called within a cgroup_mutex critical
section meant that an rmdir (or any other significant cgroup
management action) couldn't possibly start until any RCU read sections
were done.

I suspect that when we moved a lot of the cgroup teardown code from
cgroup_rmdir() to cgroup_diput() (which also has a synchronize_rcu()
call in it) this restriction could have been eased, but I think I left
it as it was mostly out of paranoia that I was missing/forgetting some
crucial reason for keeping it in place.

I'd suggest trying the following approach, which I suspect is similar
to what you were suggesting in your last email

1) make find_existing_css_set ignore css_set objects with a zero refcount
2) change __put_css_set to be simply

if (atomic_dec_and_test(&cg->refcount)) {
  call_rcu(&cg->rcu_head, free_css_set_rcu);
}

3) move the rest of __put_css_set into a delayed work struct that's
scheduled by free_css_set_rcu

4) Get rid of the taskexit parameter - I think we can do that via a
simple flag that indicates whether any task has ever been moved into
the cgroup.

5) Put extra checks in cgroup_rmdir() such that if it tries to remove
a cgroup that has a non-zero refcount, it scans the cgroup's css_sets
list - if it finds only zero-refcount entries, then wait (via
synchronize_rcu() or some other appropriate means, maybe reusing the
CGRP_WAIT_ON_RMDIR mechanism?) until the css_set objects have been
fully cleaned up and the cgroup's refcounts have been released.
Otherwise the operation of moving the last thread out of a cgroup and
immediately deleting the cgroup would very likely fail with an EBUSY

Paul
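
A minimal sketch of what steps 2 and 3 above might look like
(illustrative only; free_css_set_work and the css_set work field are
assumed names, not taken from any patch in this thread):

	static void free_css_set_work(struct work_struct *work)
	{
		struct css_set *cg = container_of(work, struct css_set, work);

		/* process context: safe to take css_set_lock here and do
		 * the old __put_css_set teardown (unlink the cg_links,
		 * drop the cgroup refcounts, check_for_release()). */
		kfree(cg);
	}

	static void free_css_set_rcu(struct rcu_head *head)
	{
		struct css_set *cg = container_of(head, struct css_set, rcu_head);

		/* softirq context: bounce the lock-taking teardown to a
		 * workqueue */
		INIT_WORK(&cg->work, free_css_set_work);
		schedule_work(&cg->work);
	}

	static void __put_css_set(struct css_set *cg, int taskexit)
	{
		if (atomic_dec_and_test(&cg->refcount))
			call_rcu(&cg->rcu_head, free_css_set_rcu);
	}

The refcount drop stays cheap, the grace period moves into the free
path, and the cleanup that needs locks runs from process context.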


* [PATCH] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
  2010-11-24  1:24     ` Paul Menage
@ 2010-11-24  1:43           ` Colin Cross
       [not found]       ` <AANLkTi=4-OgPUugnUBaqSU3oC=3wxTjAsOB_Ais3Or+i-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 0 replies; 55+ messages in thread
From: Colin Cross @ 2010-11-24  1:43 UTC (permalink / raw)
  To: Paul Menage; +Cc: Colin Cross, Paul Menage, Li Zefan, containers, linux-kernel

synchronize_rcu can be very expensive, averaging 100 ms in
some cases.  In cgroup_attach_task, it is used to prevent
a task->cgroups pointer dereferenced in an RCU read side
critical section from being invalidated by delaying the call
to put_css_set until after an RCU grace period.

To avoid the call to synchronize_rcu, make the put_css_set
call rcu-safe by moving the deletion of the css_set links
into rcu-protected free_css_set_rcu.

The calls to check_for_release in free_css_set_rcu now occur
in softirq context, so convert all uses of the
release_list_lock spinlock to irq safe versions.

The decrement of the cgroup refcount is no longer
synchronous with the call to put_css_set, which can result
in the cgroup refcount staying positive after the last call
to cgroup_attach_task returns.  To allow the cgroup to be
deleted with cgroup_rmdir synchronously after
cgroup_attach_task, introduce a second refcount,
rmdir_count, that is decremented synchronously in
put_css_set.  If cgroup_rmdir is called on a cgroup for
which rmdir_count is zero but count is nonzero, reuse the
rmdir waitqueue to block the rmdir until the rcu callback
is called.

Signed-off-by: Colin Cross <ccross@android.com>
---

This patch is similar to what you described.  The main differences are
that I used a new atomic to handle the rmdir case, and I converted
check_for_release to be callable in softirq context rather than scheduling
work in free_css_set_rcu.  Your css_set scanning in rmdir sounds better,
I'll take another look at that.  Is there any problem with disabling irqs
with spin_lock_irqsave in check_for_release?

 include/linux/cgroup.h |    6 ++
 kernel/cgroup.c        |  124 ++++++++++++++++++++++++++++--------------------
 2 files changed, 78 insertions(+), 52 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ed4ba11..3b6e73d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -202,6 +202,12 @@ struct cgroup {
 	atomic_t count;
 
 	/*
+	 * separate refcount for rmdir on a cgroup.  When rmdir_count is 0,
+	 * rmdir should block until count is 0.
+	 */
+	atomic_t rmdir_count;
+
+	/*
 	 * We link our 'sibling' struct into our parent's 'children'.
 	 * Our children link their 'sibling' into our 'children'.
 	 */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66a416b..fa3c0ac 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -267,6 +267,33 @@ static void cgroup_release_agent(struct work_struct *work);
 static DECLARE_WORK(release_agent_work, cgroup_release_agent);
 static void check_for_release(struct cgroup *cgrp);
 
+/*
+ * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
+ * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
+ * reference to css->refcnt. In general, this refcnt is expected to goes down
+ * to zero, soon.
+ *
+ * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
+ */
+DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
+
+static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
+{
+	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
+		wake_up_all(&cgroup_rmdir_waitq);
+}
+
+void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
+{
+	css_get(css);
+}
+
+void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
+{
+	cgroup_wakeup_rmdir_waiter(css->cgroup);
+	css_put(css);
+}
+
 /* Link structure for associating css_set objects with cgroups */
 struct cg_cgroup_link {
 	/*
@@ -329,6 +356,22 @@ static struct hlist_head *css_set_hash(struct cgroup_subsys_state *css[])
 static void free_css_set_rcu(struct rcu_head *obj)
 {
 	struct css_set *cg = container_of(obj, struct css_set, rcu_head);
+	struct cg_cgroup_link *link;
+	struct cg_cgroup_link *saved_link;
+
+	/* Nothing else can have a reference to cg, no need for css_set_lock */
+	list_for_each_entry_safe(link, saved_link, &cg->cg_links,
+				 cg_link_list) {
+		struct cgroup *cgrp = link->cgrp;
+		list_del(&link->cg_link_list);
+		list_del(&link->cgrp_link_list);
+		if (atomic_dec_and_test(&cgrp->count)) {
+			check_for_release(cgrp);
+			cgroup_wakeup_rmdir_waiter(cgrp);
+		}
+		kfree(link);
+	}
+
 	kfree(cg);
 }
 
@@ -355,23 +398,20 @@ static void __put_css_set(struct css_set *cg, int taskexit)
 		return;
 	}
 
-	/* This css_set is dead. unlink it and release cgroup refcounts */
 	hlist_del(&cg->hlist);
 	css_set_count--;
 
+	/* This css_set is now unreachable from the css_set_table, but RCU
+	 * read-side critical sections may still have a reference to it.
+	 * Decrement the cgroup rmdir_count so that rmdir's on an empty
+	 * cgroup can block until the free_css_set_rcu callback */
 	list_for_each_entry_safe(link, saved_link, &cg->cg_links,
 				 cg_link_list) {
 		struct cgroup *cgrp = link->cgrp;
-		list_del(&link->cg_link_list);
-		list_del(&link->cgrp_link_list);
-		if (atomic_dec_and_test(&cgrp->count) &&
-		    notify_on_release(cgrp)) {
-			if (taskexit)
-				set_bit(CGRP_RELEASABLE, &cgrp->flags);
-			check_for_release(cgrp);
-		}
-
-		kfree(link);
+		if (taskexit)
+			set_bit(CGRP_RELEASABLE, &cgrp->flags);
+		atomic_dec(&cgrp->rmdir_count);
+		smp_mb();
 	}
 
 	write_unlock(&css_set_lock);
@@ -571,6 +611,8 @@ static void link_css_set(struct list_head *tmp_cg_links,
 				cgrp_link_list);
 	link->cg = cg;
 	link->cgrp = cgrp;
+	atomic_inc(&cgrp->rmdir_count);
+	smp_mb(); /* make sure rmdir_count increments first */
 	atomic_inc(&cgrp->count);
 	list_move(&link->cgrp_link_list, &cgrp->css_sets);
 	/*
@@ -725,9 +767,9 @@ static struct cgroup *task_cgroup_from_root(struct task_struct *task,
  * cgroup_attach_task(), which overwrites one tasks cgroup pointer with
  * another.  It does so using cgroup_mutex, however there are
  * several performance critical places that need to reference
- * task->cgroup without the expense of grabbing a system global
+ * task->cgroups without the expense of grabbing a system global
  * mutex.  Therefore except as noted below, when dereferencing or, as
- * in cgroup_attach_task(), modifying a task'ss cgroup pointer we use
+ * in cgroup_attach_task(), modifying a task's cgroups pointer we use
  * task_lock(), which acts on a spinlock (task->alloc_lock) already in
  * the task_struct routinely used for such matters.
  *
@@ -909,33 +951,6 @@ static void cgroup_d_remove_dir(struct dentry *dentry)
 }
 
 /*
- * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
- * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
- * reference to css->refcnt. In general, this refcnt is expected to goes down
- * to zero, soon.
- *
- * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
- */
-DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
-
-static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
-{
-	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
-		wake_up_all(&cgroup_rmdir_waitq);
-}
-
-void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
-{
-	css_get(css);
-}
-
-void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
-{
-	cgroup_wakeup_rmdir_waiter(css->cgroup);
-	css_put(css);
-}
-
-/*
  * Call with cgroup_mutex held. Drops reference counts on modules, including
  * any duplicate ones that parse_cgroupfs_options took. If this function
  * returns an error, no reference counts are touched.
@@ -1802,7 +1817,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
 	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
-	synchronize_rcu();
+	/* put_css_set will not destroy cg until after an RCU grace period */
 	put_css_set(cg);
 
 	/*
@@ -3566,11 +3581,12 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
 	DEFINE_WAIT(wait);
 	struct cgroup_event *event, *tmp;
 	int ret;
+	unsigned long flags;
 
 	/* the vfs holds both inode->i_mutex already */
 again:
 	mutex_lock(&cgroup_mutex);
-	if (atomic_read(&cgrp->count) != 0) {
+	if (atomic_read(&cgrp->rmdir_count) != 0) {
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
@@ -3603,13 +3619,13 @@ again:
 
 	mutex_lock(&cgroup_mutex);
 	parent = cgrp->parent;
-	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
+	if (atomic_read(&cgrp->rmdir_count) || !list_empty(&cgrp->children)) {
 		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
 	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
-	if (!cgroup_clear_css_refs(cgrp)) {
+	if (atomic_read(&cgrp->count) != 0 || !cgroup_clear_css_refs(cgrp)) {
 		mutex_unlock(&cgroup_mutex);
 		/*
 		 * Because someone may call cgroup_wakeup_rmdir_waiter() before
@@ -3627,11 +3643,11 @@ again:
 	finish_wait(&cgroup_rmdir_waitq, &wait);
 	clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 
-	spin_lock(&release_list_lock);
+	spin_lock_irqsave(&release_list_lock, flags);
 	set_bit(CGRP_REMOVED, &cgrp->flags);
 	if (!list_empty(&cgrp->release_list))
 		list_del(&cgrp->release_list);
-	spin_unlock(&release_list_lock);
+	spin_unlock_irqrestore(&release_list_lock, flags);
 
 	cgroup_lock_hierarchy(cgrp->root);
 	/* delete this cgroup from parent->children */
@@ -4389,6 +4405,8 @@ int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task)
 
 static void check_for_release(struct cgroup *cgrp)
 {
+	unsigned long flags;
+
 	/* All of these checks rely on RCU to keep the cgroup
 	 * structure alive */
 	if (cgroup_is_releasable(cgrp) && !atomic_read(&cgrp->count)
@@ -4397,13 +4415,13 @@ static void check_for_release(struct cgroup *cgrp)
 		 * already queued for a userspace notification, queue
 		 * it now */
 		int need_schedule_work = 0;
-		spin_lock(&release_list_lock);
+		spin_lock_irqsave(&release_list_lock, flags);
 		if (!cgroup_is_removed(cgrp) &&
 		    list_empty(&cgrp->release_list)) {
 			list_add(&cgrp->release_list, &release_list);
 			need_schedule_work = 1;
 		}
-		spin_unlock(&release_list_lock);
+		spin_unlock_irqrestore(&release_list_lock, flags);
 		if (need_schedule_work)
 			schedule_work(&release_agent_work);
 	}
@@ -4453,9 +4471,11 @@ EXPORT_SYMBOL_GPL(__css_put);
  */
 static void cgroup_release_agent(struct work_struct *work)
 {
+	unsigned long flags;
+
 	BUG_ON(work != &release_agent_work);
 	mutex_lock(&cgroup_mutex);
-	spin_lock(&release_list_lock);
+	spin_lock_irqsave(&release_list_lock, flags);
 	while (!list_empty(&release_list)) {
 		char *argv[3], *envp[3];
 		int i;
@@ -4464,7 +4484,7 @@ static void cgroup_release_agent(struct work_struct *work)
 						    struct cgroup,
 						    release_list);
 		list_del_init(&cgrp->release_list);
-		spin_unlock(&release_list_lock);
+		spin_unlock_irqrestore(&release_list_lock, flags);
 		pathbuf = kmalloc(PAGE_SIZE, GFP_KERNEL);
 		if (!pathbuf)
 			goto continue_free;
@@ -4494,9 +4514,9 @@ static void cgroup_release_agent(struct work_struct *work)
  continue_free:
 		kfree(pathbuf);
 		kfree(agentbuf);
-		spin_lock(&release_list_lock);
+		spin_lock_irqsave(&release_list_lock, flags);
 	}
-	spin_unlock(&release_list_lock);
+	spin_unlock_irqrestore(&release_list_lock, flags);
 	mutex_unlock(&cgroup_mutex);
 }
 
-- 
1.7.3.1


* Re: [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
  2010-11-24  1:24     ` Paul Menage
@ 2010-11-24  2:06       ` Li Zefan
  2010-11-24  2:10         ` Colin Cross
                           ` (2 more replies)
       [not found]       ` <AANLkTi=4-OgPUugnUBaqSU3oC=3wxTjAsOB_Ais3Or+i-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 3 replies; 55+ messages in thread
From: Li Zefan @ 2010-11-24  2:06 UTC (permalink / raw)
  To: Paul Menage; +Cc: Colin Cross, linux-kernel, containers

Paul Menage wrote:
> On Sun, Nov 21, 2010 at 8:06 PM, Colin Cross <ccross@android.com> wrote:
>> The synchronize_rcu call in cgroup_attach_task can be very
>> expensive.  All fastpath accesses to task->cgroups that expect
>> task->cgroups not to change already use task_lock() or
>> cgroup_lock() to protect against updates, and, in cgroup.c,
>> only the CGROUP_DEBUG files have RCU read-side critical
>> sections.
> 
> I definitely agree with the goal of using lighter-weight
> synchronization than the current synchronize_rcu() call. However,
> there are definitely some subtleties to worry about in this code.
> 
> One of the reasons originally for the current synchronization was to
> avoid the case of calling subsystem destroy() callbacks while there
> could still be threads with RCU references to the subsystem state. The
> fact that synchronize_rcu() was called within a cgroup_mutex critical
> section meant that an rmdir (or any other significant cgroup
> management action) couldn't possibly start until any RCU read sections
> were done.
> 
> I suspect that when we moved a lot of the cgroup teardown code from
> cgroup_rmdir() to cgroup_diput() (which also has a synchronize_rcu()
> call in it) this restriction could have been eased, but I think I left
> it as it was mostly out of paranoia that I was missing/forgetting some
> crucial reason for keeping it in place.
> 
> I'd suggest trying the following approach, which I suspect is similar
> to what you were suggesting in your last email
> 
> 1) make find_existing_css_set ignore css_set objects with a zero refcount
> 2) change __put_css_set to be simply
> 
> if (atomic_dec_and_test(&cg->refcount)) {
>   call_rcu(&cg->rcu_head, free_css_set_rcu);
> }

If we do this, it's no longer safe to use get_css_set(), which just
increments the refcount without checking if it's zero.
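
[Editor's note: a minimal sketch of the zero-aware get Li is alluding
to, built on the usual atomic_inc_not_zero() idiom.  The name
get_css_set_safe() is hypothetical and appears in no patch in this
thread.  Under Paul's proposal, every lookup that could race with the
final put would need this form, since the bare atomic_inc() in
get_css_set() is only safe while the css_set is known to be live.]

/* Returns false if cg's refcount has already hit zero; the caller
 * must not touch cg in that case. */
static inline bool get_css_set_safe(struct css_set *cg)
{
	return atomic_inc_not_zero(&cg->refcount);
}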

> 
> 3) move the rest of __put_css_set into a delayed work struct that's
> scheduled by free_css_set_rcu
> 
> 4) Get rid of the taskexit parameter - I think we can do that via a
> simple flag that indicates whether any task has ever been moved into
> the cgroup.
> 
> 5) Put extra checks in cgroup_rmdir() such that if it tries to remove
> a cgroup that has a non-zero refcount, it scans the cgroup's css_sets
> list - if it finds only zero-refcount entries, then wait (via
> synchronize_rcu() or some other appropriate means, maybe reusing the
> CGRP_WAIT_ON_RMDIR mechanism?) until the css_set objects have been
> fully cleaned up and the cgroup's refcounts have been released.
> Otherwise the operation of moving the last thread out of a cgroup and
> immediately deleting the cgroup would very likely fail with an EBUSY
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
  2010-11-24  2:06       ` Li Zefan
@ 2010-11-24  2:10         ` Colin Cross
  2010-11-24  5:37           ` [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup Colin Cross
                             ` (2 more replies)
       [not found]         ` <4CEC7329.7070909-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2010-11-24 18:58         ` Paul Menage
  2 siblings, 3 replies; 55+ messages in thread
From: Colin Cross @ 2010-11-24  2:10 UTC (permalink / raw)
  To: Li Zefan; +Cc: Paul Menage, linux-kernel, containers

On Tue, Nov 23, 2010 at 6:06 PM, Li Zefan <lizf@cn.fujitsu.com> wrote:
> Paul Menage wrote:
>> On Sun, Nov 21, 2010 at 8:06 PM, Colin Cross <ccross@android.com> wrote:
>>> The synchronize_rcu call in cgroup_attach_task can be very
>>> expensive.  All fastpath accesses to task->cgroups that expect
>>> task->cgroups not to change already use task_lock() or
>>> cgroup_lock() to protect against updates, and, in cgroup.c,
>>> only the CGROUP_DEBUG files have RCU read-side critical
>>> sections.
>>
>> I definitely agree with the goal of using lighter-weight
>> synchronization than the current synchronize_rcu() call. However,
>> there are definitely some subtleties to worry about in this code.
>>
>> One of the reasons originally for the current synchronization was to
>> avoid the case of calling subsystem destroy() callbacks while there
>> could still be threads with RCU references to the subsystem state. The
>> fact that synchronize_rcu() was called within a cgroup_mutex critical
>> section meant that an rmdir (or any other significant cgroup
>> management action) couldn't possibly start until any RCU read sections
>> were done.
>>
>> I suspect that when we moved a lot of the cgroup teardown code from
>> cgroup_rmdir() to cgroup_diput() (which also has a synchronize_rcu()
>> call in it) this restriction could have been eased, but I think I left
>> it as it was mostly out of paranoia that I was missing/forgetting some
>> crucial reason for keeping it in place.
>>
>> I'd suggest trying the following approach, which I suspect is similar
>> to what you were suggesting in your last email
>>
>> 1) make find_existing_css_set ignore css_set objects with a zero refcount
>> 2) change __put_css_set to be simply
>>
>> if (atomic_dec_and_test(&cg->refcount)) {
>>   call_rcu(&cg->rcu_head, free_css_set_rcu);
>> }
>
> If we do this, it's no longer safe to use get_css_set(), which just
> increments the refcount without checking if it's zero.

I used an alternate approach: the css_set is removed from the hash table
in put_css_set, but the rest of the teardown is delayed to
free_css_set_rcu.  That way, nothing can look the css_set up again, so
nothing can call get_css_set on it.
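
[Editor's note: a condensed sketch of the ordering Colin describes.
The real put_css_set() in the patches below avoids taking the lock on
the uncontended path with an atomic_dec_and_lock()-style fast path
(see the comment in the diff); that race handling is elided here.]

	write_lock(&css_set_lock);
	if (!atomic_dec_and_test(&cg->refcount)) {
		write_unlock(&css_set_lock);
		return;
	}
	/* Unhash first: no lookup can find cg after this point, so no
	 * new reference can be taken and get_css_set() stays safe. */
	hlist_del(&cg->hlist);
	css_set_count--;
	write_unlock(&css_set_lock);

	/* RCU readers may still hold cg; defer the actual teardown. */
	call_rcu(&cg->rcu_head, free_css_set_rcu);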

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
  2010-11-24  1:43           ` Colin Cross
  (?)
@ 2010-11-24  2:29           ` Colin Cross
  -1 siblings, 0 replies; 55+ messages in thread
From: Colin Cross @ 2010-11-24  2:29 UTC (permalink / raw)
  To: Paul Menage; +Cc: Colin Cross, Li Zefan, containers, linux-kernel

On Tue, Nov 23, 2010 at 5:43 PM, Colin Cross <ccross@android.com> wrote:
> This patch is similar to what you described.  The main differences are
> that I used a new atomic to handle the rmdir case, and I converted
> check_for_release to be callable in softirq context rather than schedule
> work in free_css_set_rcu.  Your css_set scanning in rmdir sounds better,
> I'll take another look at that.  Is there any problem with disabling irqs
> with spin_lock_irqsave in check_for_release?

free_css_set_rcu needs to take a write lock on css_set_lock to protect the
list_del(&link->cgrp_link_list).  I'll convert it to schedule work, and change
the spin_lock_irqsave back to spin_lock.
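
[Editor's note: that conversion is what lands in [PATCH 2/2] below.
An RCU callback runs in softirq context, where taking the write side
of css_set_lock would force every user of that lock to become at
least bottom-half safe; bouncing through a workqueue moves the
list_del and the lock acquisition into process context:]

static void free_css_set_rcu(struct rcu_head *obj)
{
	struct css_set *cg = container_of(obj, struct css_set, rcu_head);

	/* softirq context: punt the real teardown to process context */
	INIT_WORK(&cg->work, free_css_set_work);
	schedule_work(&cg->work);
}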

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
  2010-11-24  2:10         ` Colin Cross
@ 2010-11-24  5:37           ` Colin Cross
       [not found]             ` <1290577024-12347-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
                               ` (2 more replies)
       [not found]           ` <AANLkTi=6nwDCdzDz7E2EaAw2pf3KUVjmKMRqGfz5zVhP-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2010-11-24  5:37           ` Colin Cross
  2 siblings, 3 replies; 55+ messages in thread
From: Colin Cross @ 2010-11-24  5:37 UTC (permalink / raw)
  To: Paul Menage; +Cc: Colin Cross, Paul Menage, Li Zefan, containers, linux-kernel

Changes the meaning of CGRP_RELEASABLE to be set on any cgroup
that has ever had a task or cgroup in it, or had css_get called
on it.  The bit is set in cgroup_attach_task, cgroup_create,
and __css_get.  It is not necessary to set the bit in
cgroup_fork, as the task is either in the root cgroup, which
can never be released, or the task it was forked from has
already set the bit in cgroup_attach_task.

Signed-off-by: Colin Cross <ccross@android.com>
---
 include/linux/cgroup.h |   12 +--------
 kernel/cgroup.c        |   54 ++++++++++++++++++++---------------------------
 2 files changed, 25 insertions(+), 41 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ed4ba11..9e13078 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -84,12 +84,6 @@ enum {
 	CSS_REMOVED, /* This CSS is dead */
 };
 
-/* Caller must verify that the css is not for root cgroup */
-static inline void __css_get(struct cgroup_subsys_state *css, int count)
-{
-	atomic_add(count, &css->refcnt);
-}
-
 /*
  * Call css_get() to hold a reference on the css; it can be used
  * for a reference obtained via:
@@ -97,6 +91,7 @@ static inline void __css_get(struct cgroup_subsys_state *css, int count)
  * - task->cgroups for a locked task
  */
 
+extern void __css_get(struct cgroup_subsys_state *css, int count);
 static inline void css_get(struct cgroup_subsys_state *css)
 {
 	/* We don't need to reference count the root state */
@@ -143,10 +138,7 @@ static inline void css_put(struct cgroup_subsys_state *css)
 enum {
 	/* Control Group is dead */
 	CGRP_REMOVED,
-	/*
-	 * Control Group has previously had a child cgroup or a task,
-	 * but no longer (only if CGRP_NOTIFY_ON_RELEASE is set)
-	 */
+	/* Control Group has ever had a child cgroup or a task */
 	CGRP_RELEASABLE,
 	/* Control Group requires release notifications to userspace */
 	CGRP_NOTIFY_ON_RELEASE,
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66a416b..34e855e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -338,7 +338,15 @@ static void free_css_set_rcu(struct rcu_head *obj)
  * compiled into their kernel but not actually in use */
 static int use_task_css_set_links __read_mostly;
 
-static void __put_css_set(struct css_set *cg, int taskexit)
+/*
+ * refcounted get/put for css_set objects
+ */
+static inline void get_css_set(struct css_set *cg)
+{
+	atomic_inc(&cg->refcount);
+}
+
+static void put_css_set(struct css_set *cg)
 {
 	struct cg_cgroup_link *link;
 	struct cg_cgroup_link *saved_link;
@@ -364,12 +372,8 @@ static void __put_css_set(struct css_set *cg, int taskexit)
 		struct cgroup *cgrp = link->cgrp;
 		list_del(&link->cg_link_list);
 		list_del(&link->cgrp_link_list);
-		if (atomic_dec_and_test(&cgrp->count) &&
-		    notify_on_release(cgrp)) {
-			if (taskexit)
-				set_bit(CGRP_RELEASABLE, &cgrp->flags);
+		if (atomic_dec_and_test(&cgrp->count))
 			check_for_release(cgrp);
-		}
 
 		kfree(link);
 	}
@@ -379,24 +383,6 @@ static void __put_css_set(struct css_set *cg, int taskexit)
 }
 
 /*
- * refcounted get/put for css_set objects
- */
-static inline void get_css_set(struct css_set *cg)
-{
-	atomic_inc(&cg->refcount);
-}
-
-static inline void put_css_set(struct css_set *cg)
-{
-	__put_css_set(cg, 0);
-}
-
-static inline void put_css_set_taskexit(struct css_set *cg)
-{
-	__put_css_set(cg, 1);
-}
-
-/*
  * compare_css_sets - helper function for find_existing_css_set().
  * @cg: candidate css_set being tested
  * @old_cg: existing css_set for a task
@@ -1801,7 +1787,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	set_bit(CGRP_RELEASABLE, &cgrp->flags);
 	synchronize_rcu();
 	put_css_set(cg);
 
@@ -3427,6 +3413,8 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
 	if (err < 0)
 		goto err_remove;
 
+	set_bit(CGRP_RELEASABLE, &parent->flags);
+
 	/* The cgroup directory was pre-locked for us */
 	BUG_ON(!mutex_is_locked(&cgrp->dentry->d_inode->i_mutex));
 
@@ -3645,7 +3633,6 @@ again:
 	cgroup_d_remove_dir(d);
 	dput(d);
 
-	set_bit(CGRP_RELEASABLE, &parent->flags);
 	check_for_release(parent);
 
 	/*
@@ -4240,7 +4227,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
 	tsk->cgroups = &init_css_set;
 	task_unlock(tsk);
 	if (cg)
-		put_css_set_taskexit(cg);
+		put_css_set(cg);
 }
 
 /**
@@ -4410,6 +4397,14 @@ static void check_for_release(struct cgroup *cgrp)
 }
 
 /* Caller must verify that the css is not for root cgroup */
+void __css_get(struct cgroup_subsys_state *css, int count)
+{
+	atomic_add(count, &css->refcnt);
+	set_bit(CGRP_RELEASABLE, &css->cgroup->flags);
+}
+EXPORT_SYMBOL_GPL(__css_get);
+
+/* Caller must verify that the css is not for root cgroup */
 void __css_put(struct cgroup_subsys_state *css, int count)
 {
 	struct cgroup *cgrp = css->cgroup;
@@ -4417,10 +4412,7 @@ void __css_put(struct cgroup_subsys_state *css, int count)
 	rcu_read_lock();
 	val = atomic_sub_return(count, &css->refcnt);
 	if (val == 1) {
-		if (notify_on_release(cgrp)) {
-			set_bit(CGRP_RELEASABLE, &cgrp->flags);
-			check_for_release(cgrp);
-		}
+		check_for_release(cgrp);
 		cgroup_wakeup_rmdir_waiter(cgrp);
 	}
 	rcu_read_unlock();
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
  2010-11-24  2:10         ` Colin Cross
  2010-11-24  5:37           ` [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup Colin Cross
       [not found]           ` <AANLkTi=6nwDCdzDz7E2EaAw2pf3KUVjmKMRqGfz5zVhP-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-11-24  5:37           ` Colin Cross
  2011-01-28  1:17             ` Bryan Huntsman
       [not found]             ` <1290577024-12347-2-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
  2 siblings, 2 replies; 55+ messages in thread
From: Colin Cross @ 2010-11-24  5:37 UTC (permalink / raw)
  To: Paul Menage; +Cc: Colin Cross, Paul Menage, Li Zefan, containers, linux-kernel

synchronize_rcu can be very expensive, averaging 100 ms in
some cases.  In cgroup_attach_task, it is used to prevent
a task->cgroups pointer dereferenced in an RCU read side
critical section from being invalidated, by delaying the
call to put_css_set until after an RCU grace period.

To avoid the call to synchronize_rcu, make the put_css_set
call rcu-safe by moving the deletion of the css_set links
into free_css_set_work, scheduled by the rcu callback
free_css_set_rcu.

The decrement of the cgroup refcount is no longer
synchronous with the call to put_css_set, which can result
in the cgroup refcount staying positive after the last call
to cgroup_attach_task returns.  To allow the cgroup to be
deleted with cgroup_rmdir synchronously after
cgroup_attach_task, have rmdir check the refcount of all
associated css_sets.  If cgroup_rmdir is called on a cgroup
for which the css_sets all have refcount zero but the
cgroup refcount is nonzero, reuse the rmdir waitqueue to
block the rmdir until free_css_set_work is called.

Signed-off-by: Colin Cross <ccross@android.com>
---
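
[Editor's note: the rmdir handshake this commit message describes, in
condensed form.  Both halves are simplified from the diff below;
locking and error paths are omitted.]

/* rmdir side: every css_set attached to cgrp has refcount 0
 * (cgroup_css_sets_empty()), but the deferred frees have not yet
 * dropped cgrp->count.  rmdir sets CGRP_WAIT_ON_RMDIR, sleeps on
 * cgroup_rmdir_waitq, and retries. */

/* deferred-free side, in free_css_set_work(): */
	if (atomic_dec_and_test(&cgrp->count)) {
		check_for_release(cgrp);
		cgroup_wakeup_rmdir_waiter(cgrp);	/* unblocks rmdir */
	}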
 include/linux/cgroup.h |    1 +
 kernel/cgroup.c        |  120 +++++++++++++++++++++++++++++-------------------
 2 files changed, 74 insertions(+), 47 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9e13078..49fdff0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -279,6 +279,7 @@ struct css_set {
 
 	/* For RCU-protected deletion */
 	struct rcu_head rcu_head;
+	struct work_struct work;
 };
 
 /*
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 34e855e..e752c83 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -267,6 +267,33 @@ static void cgroup_release_agent(struct work_struct *work);
 static DECLARE_WORK(release_agent_work, cgroup_release_agent);
 static void check_for_release(struct cgroup *cgrp);
 
+/*
+ * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
+ * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
+ * reference to css->refcnt. In general, this refcnt is expected to goes down
+ * to zero, soon.
+ *
+ * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
+ */
+DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
+
+static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
+{
+	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
+		wake_up_all(&cgroup_rmdir_waitq);
+}
+
+void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
+{
+	css_get(css);
+}
+
+void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
+{
+	cgroup_wakeup_rmdir_waiter(css->cgroup);
+	css_put(css);
+}
+
 /* Link structure for associating css_set objects with cgroups */
 struct cg_cgroup_link {
 	/*
@@ -326,10 +353,35 @@ static struct hlist_head *css_set_hash(struct cgroup_subsys_state *css[])
 	return &css_set_table[index];
 }
 
+static void free_css_set_work(struct work_struct *work)
+{
+	struct css_set *cg = container_of(work, struct css_set, work);
+	struct cg_cgroup_link *link;
+	struct cg_cgroup_link *saved_link;
+
+	write_lock(&css_set_lock);
+	list_for_each_entry_safe(link, saved_link, &cg->cg_links,
+				 cg_link_list) {
+		struct cgroup *cgrp = link->cgrp;
+		list_del(&link->cg_link_list);
+		list_del(&link->cgrp_link_list);
+		if (atomic_dec_and_test(&cgrp->count)) {
+			check_for_release(cgrp);
+			cgroup_wakeup_rmdir_waiter(cgrp);
+		}
+		kfree(link);
+	}
+	write_unlock(&css_set_lock);
+
+	kfree(cg);
+}
+
 static void free_css_set_rcu(struct rcu_head *obj)
 {
 	struct css_set *cg = container_of(obj, struct css_set, rcu_head);
-	kfree(cg);
+
+	INIT_WORK(&cg->work, free_css_set_work);
+	schedule_work(&cg->work);
 }
 
 /* We don't maintain the lists running through each css_set to its
@@ -348,8 +400,6 @@ static inline void get_css_set(struct css_set *cg)
 
 static void put_css_set(struct css_set *cg)
 {
-	struct cg_cgroup_link *link;
-	struct cg_cgroup_link *saved_link;
 	/*
 	 * Ensure that the refcount doesn't hit zero while any readers
 	 * can see it. Similar to atomic_dec_and_lock(), but for an
@@ -363,21 +413,9 @@ static void put_css_set(struct css_set *cg)
 		return;
 	}
 
-	/* This css_set is dead. unlink it and release cgroup refcounts */
 	hlist_del(&cg->hlist);
 	css_set_count--;
 
-	list_for_each_entry_safe(link, saved_link, &cg->cg_links,
-				 cg_link_list) {
-		struct cgroup *cgrp = link->cgrp;
-		list_del(&link->cg_link_list);
-		list_del(&link->cgrp_link_list);
-		if (atomic_dec_and_test(&cgrp->count))
-			check_for_release(cgrp);
-
-		kfree(link);
-	}
-
 	write_unlock(&css_set_lock);
 	call_rcu(&cg->rcu_head, free_css_set_rcu);
 }
@@ -711,9 +749,9 @@ static struct cgroup *task_cgroup_from_root(struct task_struct *task,
  * cgroup_attach_task(), which overwrites one tasks cgroup pointer with
  * another.  It does so using cgroup_mutex, however there are
  * several performance critical places that need to reference
- * task->cgroup without the expense of grabbing a system global
+ * task->cgroups without the expense of grabbing a system global
  * mutex.  Therefore except as noted below, when dereferencing or, as
- * in cgroup_attach_task(), modifying a task'ss cgroup pointer we use
+ * in cgroup_attach_task(), modifying a task's cgroups pointer we use
  * task_lock(), which acts on a spinlock (task->alloc_lock) already in
  * the task_struct routinely used for such matters.
  *
@@ -895,33 +933,6 @@ static void cgroup_d_remove_dir(struct dentry *dentry)
 }
 
 /*
- * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
- * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
- * reference to css->refcnt. In general, this refcnt is expected to goes down
- * to zero, soon.
- *
- * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
- */
-DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
-
-static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
-{
-	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
-		wake_up_all(&cgroup_rmdir_waitq);
-}
-
-void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
-{
-	css_get(css);
-}
-
-void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
-{
-	cgroup_wakeup_rmdir_waiter(css->cgroup);
-	css_put(css);
-}
-
-/*
  * Call with cgroup_mutex held. Drops reference counts on modules, including
  * any duplicate ones that parse_cgroupfs_options took. If this function
  * returns an error, no reference counts are touched.
@@ -1788,7 +1799,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
 	set_bit(CGRP_RELEASABLE, &cgrp->flags);
-	synchronize_rcu();
+	/* put_css_set will not destroy cg until after an RCU grace period */
 	put_css_set(cg);
 
 	/*
@@ -3546,6 +3557,21 @@ static int cgroup_clear_css_refs(struct cgroup *cgrp)
 	return !failed;
 }
 
+/* checks if all of the css_sets attached to a cgroup have a refcount of 0.
+ * Must be called with css_set_lock held */
+static int cgroup_css_sets_empty(struct cgroup *cgrp)
+{
+	struct cg_cgroup_link *link;
+
+	list_for_each_entry(link, &cgrp->css_sets, cgrp_link_list) {
+		struct css_set *cg = link->cg;
+		if (atomic_read(&cg->refcount) > 0)
+			return 0;
+	}
+
+	return 1;
+}
+
 static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
 {
 	struct cgroup *cgrp = dentry->d_fsdata;
@@ -3558,7 +3584,7 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
 	/* the vfs holds both inode->i_mutex already */
 again:
 	mutex_lock(&cgroup_mutex);
-	if (atomic_read(&cgrp->count) != 0) {
+	if (!cgroup_css_sets_empty(cgrp)) {
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
@@ -3591,7 +3617,7 @@ again:
 
 	mutex_lock(&cgroup_mutex);
 	parent = cgrp->parent;
-	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
+	if (!cgroup_css_sets_empty(cgrp) || !list_empty(&cgrp->children)) {
 		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
  2010-11-24  2:06       ` Li Zefan
  2010-11-24  2:10         ` Colin Cross
       [not found]         ` <4CEC7329.7070909-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2010-11-24 18:58         ` Paul Menage
  2 siblings, 0 replies; 55+ messages in thread
From: Paul Menage @ 2010-11-24 18:58 UTC (permalink / raw)
  To: Li Zefan; +Cc: Colin Cross, linux-kernel, containers

On Tue, Nov 23, 2010 at 6:06 PM, Li Zefan <lizf@cn.fujitsu.com> wrote:
>
> If we do this, it's no longer safe to use get_css_set(), which just
> increments the refcount without checking if it's zero.

I don't believe that it's currently safe to use get_css_set() on a
zero-refcount css_set anyway - a css_set starts with refcount 1 and
can be freed the moment that its refcount reaches 0. You can only use
get_css_set() on a stable reference to an existing css_set, which by
definition has a refcount of at least 1. Were you thinking of any
particular existing uses?

Paul
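
[Editor's note: an illustration of the "stable reference" rule Paul
states.  Here the stability comes from task_lock(), matching the
css_get() comment in patch 1/2 ("task->cgroups for a locked task");
tsk stands for any task_struct the caller already holds.]

	struct css_set *cg;

	task_lock(tsk);
	cg = tsk->cgroups;	/* tsk pins cg, so refcount >= 1 */
	get_css_set(cg);	/* cannot race with the final put */
	task_unlock(tsk);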

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
  2010-11-24  5:37           ` [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup Colin Cross
       [not found]             ` <1290577024-12347-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
@ 2010-11-24 23:54             ` Paul Menage
       [not found]               ` <AANLkTimFqJ+qPidS_81DKd7ExSxDG7GNi0gjcUEEq_7j-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2011-01-28  1:17             ` Bryan Huntsman
  2 siblings, 1 reply; 55+ messages in thread
From: Paul Menage @ 2010-11-24 23:54 UTC (permalink / raw)
  To: Colin Cross; +Cc: Li Zefan, containers, linux-kernel

On Tue, Nov 23, 2010 at 9:37 PM, Colin Cross <ccross@android.com> wrote:
> @@ -364,12 +372,8 @@ static void __put_css_set(struct css_set *cg, int taskexit)
>                struct cgroup *cgrp = link->cgrp;
>                list_del(&link->cg_link_list);
>                list_del(&link->cgrp_link_list);
> -               if (atomic_dec_and_test(&cgrp->count) &&
> -                   notify_on_release(cgrp)) {
> -                       if (taskexit)
> -                               set_bit(CGRP_RELEASABLE, &cgrp->flags);
> +               if (atomic_dec_and_test(&cgrp->count))
>                        check_for_release(cgrp);
> -               }

We seem to have lost some notify_on_release() checks - maybe move that
to check_for_release()?
>  /* Caller must verify that the css is not for root cgroup */
> +void __css_get(struct cgroup_subsys_state *css, int count)
> +{
> +       atomic_add(count, &css->refcnt);
> +       set_bit(CGRP_RELEASABLE, &css->cgroup->flags);
> +}

Is css_get() the right place to be putting this? It's not clear to me
why a subsystem taking a refcount on a cgroup's state should render it
releasable when it drops that refcount.

Should we maybe clear the CGRP_RELEASABLE flag right before doing the
userspace callback?
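
That is, something like the following sketch (hypothetical placement,
not code from any patch in this thread):

        /* in cgroup_release_agent(), just before the exec: */
        clear_bit(CGRP_RELEASABLE, &cgrp->flags);
        call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);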

Paul

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
@ 2010-11-25  0:11                   ` Colin Cross
  0 siblings, 0 replies; 55+ messages in thread
From: Colin Cross @ 2010-11-25  0:11 UTC (permalink / raw)
  To: Paul Menage; +Cc: Li Zefan, containers, linux-kernel

On Wed, Nov 24, 2010 at 3:54 PM, Paul Menage <menage@google.com> wrote:
> On Tue, Nov 23, 2010 at 9:37 PM, Colin Cross <ccross@android.com> wrote:
>> @@ -364,12 +372,8 @@ static void __put_css_set(struct css_set *cg, int taskexit)
>>                struct cgroup *cgrp = link->cgrp;
>>                list_del(&link->cg_link_list);
>>                list_del(&link->cgrp_link_list);
>> -               if (atomic_dec_and_test(&cgrp->count) &&
>> -                   notify_on_release(cgrp)) {
>> -                       if (taskexit)
>> -                               set_bit(CGRP_RELEASABLE, &cgrp->flags);
>> +               if (atomic_dec_and_test(&cgrp->count))
>>                        check_for_release(cgrp);
>> -               }
>
> We seem to have lost some notify_on_release() checks - maybe move that
> to check_for_release()?
check_for_release immediately calls cgroup_is_releasable, which checks
for the same bit as notify_on_release.  There's no need for
CGRP_RELEASABLE to depend on notify_on_release, or to check
notify_on_release before calling check_for_release.
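
For reference, cgroup_is_releasable() looks roughly like this in
kernels of this vintage:

        static int cgroup_is_releasable(const struct cgroup *cgrp)
        {
                const int bits =
                        (1 << CGRP_RELEASABLE) |
                        (1 << CGRP_NOTIFY_ON_RELEASE);
                return (cgrp->flags & bits) == bits;
        }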

>>  /* Caller must verify that the css is not for root cgroup */
>> +void __css_get(struct cgroup_subsys_state *css, int count)
>> +{
>> +       atomic_add(count, &css->refcnt);
>> +       set_bit(CGRP_RELEASABLE, &css->cgroup->flags);
>> +}
>
> Is css_get() the right place to be putting this? It's not clear to me
> why a subsystem taking a refcount on a cgroup's state should render it
> releasable when it drops that refcount.
I matched the existing behavior, __css_put sets CGRP_RELEASABLE when
refcnt goes to 0.

> Should we maybe clear the CGRP_RELEASABLE flag right before doing the
> userspace callback?
Actually, I think CGRP_RELEASABLE can be dropped entirely.
check_for_release is only called from __css_put, cgroup_rmdir, and
__put_css_set (or free_css_set_work after my second patch).  Those all
imply that __css_get, get_css_set, or cgroup_create have been
previously called, which are the functions that set CGRP_RELEASABLE.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
  2010-11-25  0:11                   ` Colin Cross
  (?)
@ 2010-11-25  0:18                   ` Colin Cross
  -1 siblings, 0 replies; 55+ messages in thread
From: Colin Cross @ 2010-11-25  0:18 UTC (permalink / raw)
  To: Paul Menage; +Cc: Li Zefan, containers, linux-kernel

On Wed, Nov 24, 2010 at 4:11 PM, Colin Cross <ccross@android.com> wrote:
> On Wed, Nov 24, 2010 at 3:54 PM, Paul Menage <menage@google.com> wrote:
>> On Tue, Nov 23, 2010 at 9:37 PM, Colin Cross <ccross@android.com> wrote:
>>> @@ -364,12 +372,8 @@ static void __put_css_set(struct css_set *cg, int taskexit)
>>>                struct cgroup *cgrp = link->cgrp;
>>>                list_del(&link->cg_link_list);
>>>                list_del(&link->cgrp_link_list);
>>> -               if (atomic_dec_and_test(&cgrp->count) &&
>>> -                   notify_on_release(cgrp)) {
>>> -                       if (taskexit)
>>> -                               set_bit(CGRP_RELEASABLE, &cgrp->flags);
>>> +               if (atomic_dec_and_test(&cgrp->count))
>>>                        check_for_release(cgrp);
>>> -               }
>>
>> We seem to have lost some notify_on_release() checks - maybe move that
>> to check_for_release()?
> check_for_release immediately calls cgroup_is_releasable, which checks
> for the same bit as notify_on_release.  There's no need for
> CGRP_RELEASABLE to depend on notify_on_release, or to check
> notify_on_release before calling check_for_release.
>
>>>  /* Caller must verify that the css is not for root cgroup */
>>> +void __css_get(struct cgroup_subsys_state *css, int count)
>>> +{
>>> +       atomic_add(count, &css->refcnt);
>>> +       set_bit(CGRP_RELEASABLE, &css->cgroup->flags);
>>> +}
>>
>> Is css_get() the right place to be putting this? It's not clear to me
>> why a subsystem taking a refcount on a cgroup's state should render it
>> releasable when it drops that refcount.
> I matched the existing behavior, __css_put sets CGRP_RELEASABLE when
> refcnt goes to 0.
>
>> Should we maybe clear the CGRP_RELEASABLE flag right before doing the
>> userspace callback?
> Actually, I think CGRP_RELEASABLE can be dropped entirely.
> check_for_release is only called from __css_put, cgroup_rmdir, and
> __put_css_set (or free_css_set_work after my second patch).  Those all
> imply that __css_get, get_css_set, or cgroup_create have been
> previously called, which are the functions that set CGRP_RELEASABLE.
Nevermind, that's not true - get_css_set does not set CGRP_RELEASABLE,
cgroup_attach_task does.

If CGRP_RELEASABLE is not cleared before the callback, the
release_agent would be run once when the last task was removed from
the cgroup, and then again if a task failed to be added to the empty
cgroup because the task was exiting, so clearing the flag sounds like
a good idea.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
  2010-11-25  0:11                   ` Colin Cross
                                     ` (2 preceding siblings ...)
  (?)
@ 2010-11-25  0:21                   ` Paul Menage
       [not found]                     ` <AANLkTimJA52-GTM=AzS+tkOugrsi6Keh0_j87vK1BkGv-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2010-12-03  3:07                     ` Colin Cross
  -1 siblings, 2 replies; 55+ messages in thread
From: Paul Menage @ 2010-11-25  0:21 UTC (permalink / raw)
  To: Colin Cross; +Cc: Li Zefan, containers, linux-kernel

On Wed, Nov 24, 2010 at 4:11 PM, Colin Cross <ccross@android.com> wrote:
>>
>> We seem to have lost some notify_on_release() checks - maybe move that
>> to check_for_release()?
> check_for_release immediately calls cgroup_is_releasable, which checks
> for the same bit as notify_on_release.  There's no need for
> CGRP_RELEASABLE to depend on notify_on_release, or to check
> notify_on_release before calling check_for_release.

OK.

> I matched the existing behavior, __css_put sets CGRP_RELEASABLE when
> refcnt goes to 0.
>

Ah, we do appear to have had that behaviour for a while. I don't
remember the justification for it at this point :-)

> check_for_release is only called from __css_put, cgroup_rmdir, and
> __put_css_set (or free_css_set_work after my second patch).  Those all
> imply that __css_get, get_css_set, or cgroup_create have been
> previously called, which are the functions that set CGRP_RELEASABLE.

Not in one case - if we create a new cgroup and try to move a thread
into it, but the thread is exiting as we move it, we'll call
put_css_set() on the new css_set, which will drop the refcount on the
target cgroup back to 0. We wouldn't want the auto-release
notification to kick in in that situation, I think.
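
An abridged sketch of the path I mean, following the
cgroup_attach_task() code of this era (error handling trimmed):

        newcg = find_css_set(cg, cgrp);         /* pins the target cgroup */
        put_css_set(cg);
        if (!newcg)
                return -ENOMEM;

        task_lock(tsk);
        if (tsk->flags & PF_EXITING) {          /* task exiting under us */
                task_unlock(tsk);
                put_css_set(newcg);     /* cgrp->count can drop back to 0 */
                return -ESRCH;
        }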

Paul

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
  2010-11-25  0:21                   ` Paul Menage
       [not found]                     ` <AANLkTimJA52-GTM=AzS+tkOugrsi6Keh0_j87vK1BkGv-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-12-03  3:07                     ` Colin Cross
       [not found]                       ` <AANLkTim67fLN+PYz-P0TM0QRmvQKP80tyXSNKNSZhFZ2-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2010-12-17  0:54                       ` Paul Menage
  1 sibling, 2 replies; 55+ messages in thread
From: Colin Cross @ 2010-12-03  3:07 UTC (permalink / raw)
  To: Paul Menage; +Cc: Li Zefan, containers, linux-kernel

On Wed, Nov 24, 2010 at 4:21 PM, Paul Menage <menage@google.com> wrote:
> On Wed, Nov 24, 2010 at 4:11 PM, Colin Cross <ccross@android.com> wrote:
>>>
>>> We seem to have lost some notify_on_release() checks - maybe move that
>>> to check_for_release()?
>> check_for_release immediately calls cgroup_is_releasable, which checks
>> for the same bit as notify_on_release.  There's no need for
>> CGRP_RELEASABLE to depend on notify_on_release, or to check
>> notify_on_release before calling check_for_release.
>
> OK.
>
>> I matched the existing behavior, __css_put sets CGRP_RELEASABLE when
>> refcnt goes to 0.
>>
>
> Ah, we do appear to have had that behaviour for a while. I don't
> remember the justification for it at this point :-)
>
>> check_for_release is only called from __css_put, cgroup_rmdir, and
>> __put_css_set (or free_css_set_work after my second patch).  Those all
>> imply that __css_get, get_css_set, or cgroup_create have been
>> previously called, which are the functions that set CGRP_RELEASABLE.
>
> Not in one case - if we create a new cgroup and try to move a thread
> into it, but the thread is exiting as we move it, we'll call
> put_css_set() on the new css_set, which will drop the refcount on the
> target cgroup back to 0. We wouldn't want the auto-release
> notification to kick in in that situation, I think.

Clearing the CGRP_RELEASABLE bit any time after the tests in
check_for_release introduces a race if __css_get is called between the
check and clearing the bit - the cgroup will have an entry, but the
bit will not be set.  Without additional locking in __css_get, I don't
see any way to safely clear CGRP_RELEASABLE.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
  2010-12-03  3:07                     ` Colin Cross
       [not found]                       ` <AANLkTim67fLN+PYz-P0TM0QRmvQKP80tyXSNKNSZhFZ2-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-12-17  0:54                       ` Paul Menage
       [not found]                         ` <AANLkTinZarXbEyb1xfJWjG4gN2qhTVTXTdso4Cym5M9T-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2010-12-17  1:12                         ` Colin Cross
  1 sibling, 2 replies; 55+ messages in thread
From: Paul Menage @ 2010-12-17  0:54 UTC (permalink / raw)
  To: Colin Cross; +Cc: Li Zefan, containers, linux-kernel

On Thu, Dec 2, 2010 at 7:07 PM, Colin Cross <ccross@android.com> wrote:
>> Not in one case - if we create a new cgroup and try to move a thread
>> into it, but the thread is exiting as we move it, we'll call
>> put_css_set() on the new css_set, which will drop the refcount on the
>> target cgroup back to 0. We wouldn't want the auto-release
>> notification to kick in in that situation, I think.
>
> Clearing the CGRP_RELEASABLE bit any time after the tests in
> check_for_release introduces a race if __css_get is called between the
> check and clearing the bit - the cgroup will have an entry, but the
> bit will not be set.  Without additional locking in __css_get, I don't
> see any way to safely clear CGRP_RELEASABLE.

I don't quite follow your argument here. Are you saying that the
problem is that you could end up spawning a release agent for a cgroup
that was no longer releasable since it now had a process in it again?
If so, then I don't think that's a problem - spurious release agent
invocations for non-empty cgroups will always happen occasionally due
to races between the kernel and userspace. But a failed move of a task
into a previously-empty cgroup shouldn't trigger the agent.

Paul

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
  2010-12-17  0:54                       ` Paul Menage
       [not found]                         ` <AANLkTinZarXbEyb1xfJWjG4gN2qhTVTXTdso4Cym5M9T-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-12-17  1:12                         ` Colin Cross
  1 sibling, 0 replies; 55+ messages in thread
From: Colin Cross @ 2010-12-17  1:12 UTC (permalink / raw)
  To: Paul Menage; +Cc: Li Zefan, containers, linux-kernel

On Thu, Dec 16, 2010 at 4:54 PM, Paul Menage <menage@google.com> wrote:
> On Thu, Dec 2, 2010 at 7:07 PM, Colin Cross <ccross@android.com> wrote:
>>> Not in one case - if we create a new cgroup and try to move a thread
>>> into it, but the thread is exiting as we move it, we'll call
>>> put_css_set() on the new css_set, which will drop the refcount on the
>>> target cgroup back to 0. We wouldn't want the auto-release
>>> notification to kick in in that situation, I think.
>>
>> Clearing the CGRP_RELEASABLE bit any time after the tests in
>> check_for_release introduces a race if __css_get is called between the
>> check and clearing the bit - the cgroup will have an entry, but the
>> bit will not be set.  Without additional locking in __css_get, I don't
>> see any way to safely clear CGRP_RELEASABLE.
>
> I don't quite follow your argument here. Are you saying that the
> problem is that you could end up spawning a release agent for a cgroup
> that was no longer releasable since it now had a process in it again?
> If so, then I don't think that's a problem - spurious release agent
> invocations for non-empty cgroups will always happen occasionally due
> to races between the kernel and userspace. But a failed move of a task
> into a previously-empty cgroup shouldn't trigger the agent.

No, if you add a new process to the group while check_for_release is
running, the bit could get set by the add for the new process, then
cleared by the concurrently running check_for_release.  The release
agent would be spawned with a process in the group, which is fine, but
the RELEASABLE bit would be clear.  When the new process was later
removed, check_for_release would not call the release agent at all.
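
As a rough timeline of that interleaving (hypothetical two-CPU
schedule):

        CPU0: check_for_release           CPU1: attach new task
        -----------------------           ---------------------
        sees RELEASABLE set, group
        empty; decides to notify
                                          __css_get() re-sets
                                          CGRP_RELEASABLE
        clear_bit(CGRP_RELEASABLE)
        spawns agent (group now
        non-empty - harmless)
                                          new task later removed;
                                          bit is clear, so no agent run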

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
  2010-11-24  1:43           ` Colin Cross
  (?)
  (?)
@ 2011-01-22  1:17           ` Bryan Huntsman
  2011-01-22  2:04             ` Colin Cross
       [not found]             ` <4D3A3024.9040402-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
  -1 siblings, 2 replies; 55+ messages in thread
From: Bryan Huntsman @ 2011-01-22  1:17 UTC (permalink / raw)
  To: Colin Cross; +Cc: Paul Menage, Li Zefan, containers, linux-kernel, mbohan

On 11/23/2010 05:43 PM, Colin Cross wrote:
> synchronize_rcu can be very expensive, averaging 100 ms in
> some cases.  In cgroup_attach_task, it is used to prevent
> a task->cgroups pointer dereferenced in an RCU read side
> critical section from being invalidated by delaying the call
> to put_css_set until after an RCU grace period.
> 
> To avoid the call to synchronize_rcu, make the put_css_set
> call rcu-safe by moving the deletion of the css_set links
> into rcu-protected free_css_set_rcu.
> 
> The calls to check_for_release in free_css_set_rcu now occur
> in softirq context, so convert all uses of the
> release_list_lock spinlock to irq safe versions.
> 
> The decrement of the cgroup refcount is no longer
> synchronous with the call to put_css_set, which can result
> in the cgroup refcount staying positive after the last call
> to cgroup_attach_task returns.  To allow the cgroup to be
> deleted with cgroup_rmdir synchronously after
> cgroup_attach_task, introduce a second refcount,
> rmdir_count, that is decremented synchronously in
> put_css_set.  If cgroup_rmdir is called on a cgroup for
> which rmdir_count is zero but count is nonzero, reuse the
> rmdir waitqueue to block the rmdir until the rcu callback
> is called.
> 
> Signed-off-by: Colin Cross <ccross@android.com>
> ---
> 
> This patch is similar to what you described.  The main differences are
> that I used a new atomic to handle the rmdir case, and I converted
> check_for_release to be callable in softirq context rather than schedule
> work in free_css_set_rcu.  Your css_set scanning in rmdir sounds better,
> I'll take another look at that.  Is there any problem with disabling irqs
> with spin_lock_irqsave in check_for_release?
> 
>  include/linux/cgroup.h |    6 ++
>  kernel/cgroup.c        |  124 ++++++++++++++++++++++++++++--------------------
>  2 files changed, 78 insertions(+), 52 deletions(-)
> 

Colin, what became of this patch?  I see this in your Tegra tree for
Android.

http://android.git.kernel.org/?p=kernel/tegra.git;a=commit;h=05946a1e0fdb011ac0e6638ee60b181c584f127b

I looked in linux-next but didn't see it there.  This resolves a
performance issue on MSM SMP, so I'm curious if this is going
upstream.  Thanks.

- Bryan


-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
  2011-01-22  1:17           ` Bryan Huntsman
@ 2011-01-22  2:04             ` Colin Cross
       [not found]             ` <4D3A3024.9040402-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
  1 sibling, 0 replies; 55+ messages in thread
From: Colin Cross @ 2011-01-22  2:04 UTC (permalink / raw)
  To: Bryan Huntsman; +Cc: Paul Menage, Li Zefan, containers, linux-kernel, mbohan

On Fri, Jan 21, 2011 at 5:17 PM, Bryan Huntsman <bryanh@codeaurora.org> wrote:
> On 11/23/2010 05:43 PM, Colin Cross wrote:
>> synchronize_rcu can be very expensive, averaging 100 ms in
>> some cases.  In cgroup_attach_task, it is used to prevent
>> a task->cgroups pointer dereferenced in an RCU read side
>> critical section from being invalidated by delaying the call
>> to put_css_set until after an RCU grace period.
>>
>> To avoid the call to synchronize_rcu, make the put_css_set
>> call rcu-safe by moving the deletion of the css_set links
>> into rcu-protected free_css_set_rcu.
>>
>> The calls to check_for_release in free_css_set_rcu now occur
>> in softirq context, so convert all uses of the
>> release_list_lock spinlock to irq safe versions.
>>
>> The decrement of the cgroup refcount is no longer
>> synchronous with the call to put_css_set, which can result
>> in the cgroup refcount staying positive after the last call
>> to cgroup_attach_task returns.  To allow the cgroup to be
>> deleted with cgroup_rmdir synchronously after
>> cgroup_attach_task, introduce a second refcount,
>> rmdir_count, that is decremented synchronously in
>> put_css_set.  If cgroup_rmdir is called on a cgroup for
>> which rmdir_count is zero but count is nonzero, reuse the
>> rmdir waitqueue to block the rmdir until the rcu callback
>> is called.
>>
>> Signed-off-by: Colin Cross <ccross@android.com>
>> ---
>>
>> This patch is similar to what you described.  The main differences are
>> that I used a new atomic to handle the rmdir case, and I converted
>> check_for_release to be callable in softirq context rather than schedule
>> work in free_css_set_rcu.  Your css_set scanning in rmdir sounds better,
>> I'll take another look at that.  Is there any problem with disabling irqs
>> with spin_lock_irqsave in check_for_release?
>>
>>  include/linux/cgroup.h |    6 ++
>>  kernel/cgroup.c        |  124 ++++++++++++++++++++++++++++--------------------
>>  2 files changed, 78 insertions(+), 52 deletions(-)
>>
>
> Colin, what became of this patch?  I see this in your Tegra tree for
> Android.
>
> http://android.git.kernel.org/?p=kernel/tegra.git;a=commit;h=05946a1e0fdb011ac0e6638ee60b181c584f127b
>
> I looked in linux-next but didn't see it there.  This resolves a
> performance issue on MSM SMP so I'm curious if this is going upstream.
>  Thanks.
>

It's been posted and there are no outstanding comments that I'm
working on, but the patches haven't been picked up.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
  2010-11-24  1:43           ` Colin Cross
                             ` (2 preceding siblings ...)
  (?)
@ 2011-01-28  1:17           ` Bryan Huntsman
  -1 siblings, 0 replies; 55+ messages in thread
From: Bryan Huntsman @ 2011-01-28  1:17 UTC (permalink / raw)
  To: Colin Cross; +Cc: Paul Menage, Li Zefan, containers, linux-kernel, mbohan

On 11/23/2010 05:43 PM, Colin Cross wrote:
> synchronize_rcu can be very expensive, averaging 100 ms in
> some cases.  In cgroup_attach_task, it is used to prevent
> a task->cgroups pointer dereferenced in an RCU read side
> critical section from being invalidated by delaying the call
> to put_css_set until after an RCU grace period.
> 
> To avoid the call to synchronize_rcu, make the put_css_set
> call rcu-safe by moving the deletion of the css_set links
> into rcu-protected free_css_set_rcu.
> 
> The calls to check_for_release in free_css_set_rcu now occur
> in softirq context, so convert all uses of the
> release_list_lock spinlock to irq safe versions.
> 
> The decrement of the cgroup refcount is no longer
> synchronous with the call to put_css_set, which can result
> in the cgroup refcount staying positive after the last call
> to cgroup_attach_task returns.  To allow the cgroup to be
> deleted with cgroup_rmdir synchronously after
> cgroup_attach_task, introduce a second refcount,
> rmdir_count, that is decremented synchronously in
> put_css_set.  If cgroup_rmdir is called on a cgroup for
> which rmdir_count is zero but count is nonzero, reuse the
> rmdir waitqueue to block the rmdir until the rcu callback
> is called.
> 
> Signed-off-by: Colin Cross <ccross@android.com>
> ---
> 
> This patch is similar to what you described.  The main differences are
> that I used a new atomic to handle the rmdir case, and I converted
> check_for_release to be callable in softirq context rather than scheduling
> work in free_css_set_rcu.  Your css_set scanning in rmdir sounds better,
> I'll take another look at that.  Is there any problem with disabling irqs
> with spin_lock_irqsave in check_for_release?
> 
>  include/linux/cgroup.h |    6 ++
>  kernel/cgroup.c        |  124 ++++++++++++++++++++++++++++--------------------
>  2 files changed, 78 insertions(+), 52 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index ed4ba11..3b6e73d 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -202,6 +202,12 @@ struct cgroup {
>  	atomic_t count;
>  
>  	/*
> +	 * separate refcount for rmdir on a cgroup.  When rmdir_count is 0,
> +	 * rmdir should block until count is 0.
> +	 */
> +	atomic_t rmdir_count;
> +
> +	/*
>  	 * We link our 'sibling' struct into our parent's 'children'.
>  	 * Our children link their 'sibling' into our 'children'.
>  	 */
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 66a416b..fa3c0ac 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -267,6 +267,33 @@ static void cgroup_release_agent(struct work_struct *work);
>  static DECLARE_WORK(release_agent_work, cgroup_release_agent);
>  static void check_for_release(struct cgroup *cgrp);
>  
> +/*
> + * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
> + * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
> + * reference to css->refcnt. In general, this refcnt is expected to goes down
> + * to zero, soon.
> + *
> + * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
> + */
> +DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> +
> +static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
> +{
> +	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> +		wake_up_all(&cgroup_rmdir_waitq);
> +}
> +
> +void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
> +{
> +	css_get(css);
> +}
> +
> +void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
> +{
> +	cgroup_wakeup_rmdir_waiter(css->cgroup);
> +	css_put(css);
> +}
> +
>  /* Link structure for associating css_set objects with cgroups */
>  struct cg_cgroup_link {
>  	/*
> @@ -329,6 +356,22 @@ static struct hlist_head *css_set_hash(struct cgroup_subsys_state *css[])
>  static void free_css_set_rcu(struct rcu_head *obj)
>  {
>  	struct css_set *cg = container_of(obj, struct css_set, rcu_head);
> +	struct cg_cgroup_link *link;
> +	struct cg_cgroup_link *saved_link;
> +
> +	/* Nothing else can have a reference to cg, no need for css_set_lock */
> +	list_for_each_entry_safe(link, saved_link, &cg->cg_links,
> +				 cg_link_list) {
> +		struct cgroup *cgrp = link->cgrp;
> +		list_del(&link->cg_link_list);
> +		list_del(&link->cgrp_link_list);
> +		if (atomic_dec_and_test(&cgrp->count)) {
> +			check_for_release(cgrp);
> +			cgroup_wakeup_rmdir_waiter(cgrp);
> +		}
> +		kfree(link);
> +	}
> +
>  	kfree(cg);
>  }
>  
> @@ -355,23 +398,20 @@ static void __put_css_set(struct css_set *cg, int taskexit)
>  		return;
>  	}
>  
> -	/* This css_set is dead. unlink it and release cgroup refcounts */
>  	hlist_del(&cg->hlist);
>  	css_set_count--;
>  
> +	/* This css_set is now unreachable from the css_set_table, but RCU
> +	 * read-side critical sections may still have a reference to it.
> +	 * Decrement the cgroup rmdir_count so that rmdir's on an empty
> +	 * cgroup can block until the free_css_set_rcu callback */
>  	list_for_each_entry_safe(link, saved_link, &cg->cg_links,
>  				 cg_link_list) {
>  		struct cgroup *cgrp = link->cgrp;
> -		list_del(&link->cg_link_list);
> -		list_del(&link->cgrp_link_list);
> -		if (atomic_dec_and_test(&cgrp->count) &&
> -		    notify_on_release(cgrp)) {
> -			if (taskexit)
> -				set_bit(CGRP_RELEASABLE, &cgrp->flags);
> -			check_for_release(cgrp);
> -		}
> -
> -		kfree(link);
> +		if (taskexit)
> +			set_bit(CGRP_RELEASABLE, &cgrp->flags);
> +		atomic_dec(&cgrp->rmdir_count);
> +		smp_mb();
>  	}
>  
>  	write_unlock(&css_set_lock);
> @@ -571,6 +611,8 @@ static void link_css_set(struct list_head *tmp_cg_links,
>  				cgrp_link_list);
>  	link->cg = cg;
>  	link->cgrp = cgrp;
> +	atomic_inc(&cgrp->rmdir_count);
> +	smp_mb(); /* make sure rmdir_count increments first */
>  	atomic_inc(&cgrp->count);
>  	list_move(&link->cgrp_link_list, &cgrp->css_sets);
>  	/*
> @@ -725,9 +767,9 @@ static struct cgroup *task_cgroup_from_root(struct task_struct *task,
>   * cgroup_attach_task(), which overwrites one tasks cgroup pointer with
>   * another.  It does so using cgroup_mutex, however there are
>   * several performance critical places that need to reference
> - * task->cgroup without the expense of grabbing a system global
> + * task->cgroups without the expense of grabbing a system global
>   * mutex.  Therefore except as noted below, when dereferencing or, as
> - * in cgroup_attach_task(), modifying a task'ss cgroup pointer we use
> + * in cgroup_attach_task(), modifying a task's cgroups pointer we use
>   * task_lock(), which acts on a spinlock (task->alloc_lock) already in
>   * the task_struct routinely used for such matters.
>   *
> @@ -909,33 +951,6 @@ static void cgroup_d_remove_dir(struct dentry *dentry)
>  }
>  
>  /*
> - * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
> - * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
> - * reference to css->refcnt. In general, this refcnt is expected to goes down
> - * to zero, soon.
> - *
> - * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
> - */
> -DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> -
> -static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
> -{
> -	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> -		wake_up_all(&cgroup_rmdir_waitq);
> -}
> -
> -void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
> -{
> -	css_get(css);
> -}
> -
> -void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
> -{
> -	cgroup_wakeup_rmdir_waiter(css->cgroup);
> -	css_put(css);
> -}
> -
> -/*
>   * Call with cgroup_mutex held. Drops reference counts on modules, including
>   * any duplicate ones that parse_cgroupfs_options took. If this function
>   * returns an error, no reference counts are touched.
> @@ -1802,7 +1817,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  			ss->attach(ss, cgrp, oldcgrp, tsk, false);
>  	}
>  	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> -	synchronize_rcu();
> +	/* put_css_set will not destroy cg until after an RCU grace period */
>  	put_css_set(cg);
>  
>  	/*
> @@ -3566,11 +3581,12 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
>  	DEFINE_WAIT(wait);
>  	struct cgroup_event *event, *tmp;
>  	int ret;
> +	unsigned long flags;
>  
>  	/* the vfs holds both inode->i_mutex already */
>  again:
>  	mutex_lock(&cgroup_mutex);
> -	if (atomic_read(&cgrp->count) != 0) {
> +	if (atomic_read(&cgrp->rmdir_count) != 0) {
>  		mutex_unlock(&cgroup_mutex);
>  		return -EBUSY;
>  	}
> @@ -3603,13 +3619,13 @@ again:
>  
>  	mutex_lock(&cgroup_mutex);
>  	parent = cgrp->parent;
> -	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> +	if (atomic_read(&cgrp->rmdir_count) || !list_empty(&cgrp->children)) {
>  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		mutex_unlock(&cgroup_mutex);
>  		return -EBUSY;
>  	}
>  	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> -	if (!cgroup_clear_css_refs(cgrp)) {
> +	if (atomic_read(&cgrp->count) != 0 || !cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
>  		/*
>  		 * Because someone may call cgroup_wakeup_rmdir_waiter() before
> @@ -3627,11 +3643,11 @@ again:
>  	finish_wait(&cgroup_rmdir_waitq, &wait);
>  	clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  
> -	spin_lock(&release_list_lock);
> +	spin_lock_irqsave(&release_list_lock, flags);
>  	set_bit(CGRP_REMOVED, &cgrp->flags);
>  	if (!list_empty(&cgrp->release_list))
>  		list_del(&cgrp->release_list);
> -	spin_unlock(&release_list_lock);
> +	spin_unlock_irqrestore(&release_list_lock, flags);
>  
>  	cgroup_lock_hierarchy(cgrp->root);
>  	/* delete this cgroup from parent->children */
> @@ -4389,6 +4405,8 @@ int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task)
>  
>  static void check_for_release(struct cgroup *cgrp)
>  {
> +	unsigned long flags;
> +
>  	/* All of these checks rely on RCU to keep the cgroup
>  	 * structure alive */
>  	if (cgroup_is_releasable(cgrp) && !atomic_read(&cgrp->count)
> @@ -4397,13 +4415,13 @@ static void check_for_release(struct cgroup *cgrp)
>  		 * already queued for a userspace notification, queue
>  		 * it now */
>  		int need_schedule_work = 0;
> -		spin_lock(&release_list_lock);
> +		spin_lock_irqsave(&release_list_lock, flags);
>  		if (!cgroup_is_removed(cgrp) &&
>  		    list_empty(&cgrp->release_list)) {
>  			list_add(&cgrp->release_list, &release_list);
>  			need_schedule_work = 1;
>  		}
> -		spin_unlock(&release_list_lock);
> +		spin_unlock_irqrestore(&release_list_lock, flags);
>  		if (need_schedule_work)
>  			schedule_work(&release_agent_work);
>  	}
> @@ -4453,9 +4471,11 @@ EXPORT_SYMBOL_GPL(__css_put);
>   */
>  static void cgroup_release_agent(struct work_struct *work)
>  {
> +	unsigned long flags;
> +
>  	BUG_ON(work != &release_agent_work);
>  	mutex_lock(&cgroup_mutex);
> -	spin_lock(&release_list_lock);
> +	spin_lock_irqsave(&release_list_lock, flags);
>  	while (!list_empty(&release_list)) {
>  		char *argv[3], *envp[3];
>  		int i;
> @@ -4464,7 +4484,7 @@ static void cgroup_release_agent(struct work_struct *work)
>  						    struct cgroup,
>  						    release_list);
>  		list_del_init(&cgrp->release_list);
> -		spin_unlock(&release_list_lock);
> +		spin_unlock_irqrestore(&release_list_lock, flags);
>  		pathbuf = kmalloc(PAGE_SIZE, GFP_KERNEL);
>  		if (!pathbuf)
>  			goto continue_free;
> @@ -4494,9 +4514,9 @@ static void cgroup_release_agent(struct work_struct *work)
>   continue_free:
>  		kfree(pathbuf);
>  		kfree(agentbuf);
> -		spin_lock(&release_list_lock);
> +		spin_lock_irqsave(&release_list_lock, flags);
>  	}
> -	spin_unlock(&release_list_lock);
> +	spin_unlock_irqrestore(&release_list_lock, flags);
>  	mutex_unlock(&cgroup_mutex);
>  }
>  

Tested-by: Mike Bohan <mbohan@codeaurora.org>

I'm responding on Mike's behalf and adding him to this thread.  This
patch improves launch time of a test app from ~700ms to ~250ms on MSM,
with much lower variance across tests.  We also see UI latency
improvements, but have not quantified the gains.

- Bryan

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

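Colin's question above about spin_lock_irqsave comes down to a standard
locking rule: once release_list_lock can be taken from softirq context
(the RCU callback now calls check_for_release), every process-context
acquisition must disable at least bottom halves, or a softirq arriving
on the same CPU while the lock is held will spin on it forever.  A
minimal sketch of the pattern, with hypothetical demo_* names rather
than anything from the patch:

#include <linux/spinlock.h>
#include <linux/list.h>

static DEFINE_SPINLOCK(demo_lock);
static LIST_HEAD(demo_list);

/* Process-context path (e.g. a cgroupfs write handler). */
static void demo_queue(struct list_head *item)
{
	unsigned long flags;

	/*
	 * With a plain spin_lock() here, a softirq that runs
	 * demo_drain() on this CPU while we hold demo_lock would
	 * spin on it forever: a self-deadlock.  Disabling
	 * interrupts for the critical section closes the window.
	 */
	spin_lock_irqsave(&demo_lock, flags);
	list_add_tail(item, &demo_list);
	spin_unlock_irqrestore(&demo_lock, flags);
}

/* Softirq-context path (e.g. an RCU callback). */
static void demo_drain(void)
{
	/*
	 * A plain spin_lock() is fine here: softirqs do not preempt
	 * other softirqs on the same CPU, and this lock is never
	 * taken from hardirq context.
	 */
	spin_lock(&demo_lock);
	while (!list_empty(&demo_list))
		list_del(demo_list.next);
	spin_unlock(&demo_lock);
}

spin_lock_irqsave() is the conservative form the patch uses;
spin_lock_bh() would arguably suffice as long as the lock is never
taken from hardirq context.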

* Re: [PATCH 2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
  2010-11-24  5:37           ` Colin Cross
@ 2011-01-28  1:17             ` Bryan Huntsman
       [not found]             ` <1290577024-12347-2-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
  1 sibling, 0 replies; 55+ messages in thread
From: Bryan Huntsman @ 2011-01-28  1:17 UTC (permalink / raw)
  To: Colin Cross; +Cc: Paul Menage, Li Zefan, containers, linux-kernel, mbohan

On 11/23/2010 09:37 PM, Colin Cross wrote:
> synchronize_rcu can be very expensive, averaging 100 ms in
> some cases.  In cgroup_attach_task, it is used to prevent
> a task->cgroups pointer dereferenced in an RCU read side
> critical section from being invalidated, by delaying the
> call to put_css_set until after an RCU grace period.
> 
> To avoid the call to synchronize_rcu, make the put_css_set
> call rcu-safe by moving the deletion of the css_set links
> into free_css_set_work, scheduled by the rcu callback
> free_css_set_rcu.
> 
> The decrement of the cgroup refcount is no longer
> synchronous with the call to put_css_set, which can result
> in the cgroup refcount staying positive after the last call
> to cgroup_attach_task returns.  To allow the cgroup to be
> deleted with cgroup_rmdir synchronously after
> cgroup_attach_task, have rmdir check the refcount of all
> associated css_sets.  If cgroup_rmdir is called on a cgroup
> for which the css_sets all have refcount zero but the
> cgroup refcount is nonzero, reuse the rmdir waitqueue to
> block the rmdir until free_css_set_work is called.
> 
> Signed-off-by: Colin Cross <ccross@android.com>
> ---
>  include/linux/cgroup.h |    1 +
>  kernel/cgroup.c        |  120 +++++++++++++++++++++++++++++-------------------
>  2 files changed, 74 insertions(+), 47 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 9e13078..49fdff0 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -279,6 +279,7 @@ struct css_set {
>  
>  	/* For RCU-protected deletion */
>  	struct rcu_head rcu_head;
> +	struct work_struct work;
>  };
>  
>  /*
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 34e855e..e752c83 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -267,6 +267,33 @@ static void cgroup_release_agent(struct work_struct *work);
>  static DECLARE_WORK(release_agent_work, cgroup_release_agent);
>  static void check_for_release(struct cgroup *cgrp);
>  
> +/*
> + * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
> + * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
> + * reference to css->refcnt. In general, this refcnt is expected to goes down
> + * to zero, soon.
> + *
> + * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
> + */
> +DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> +
> +static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
> +{
> +	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> +		wake_up_all(&cgroup_rmdir_waitq);
> +}
> +
> +void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
> +{
> +	css_get(css);
> +}
> +
> +void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
> +{
> +	cgroup_wakeup_rmdir_waiter(css->cgroup);
> +	css_put(css);
> +}
> +
>  /* Link structure for associating css_set objects with cgroups */
>  struct cg_cgroup_link {
>  	/*
> @@ -326,10 +353,35 @@ static struct hlist_head *css_set_hash(struct cgroup_subsys_state *css[])
>  	return &css_set_table[index];
>  }
>  
> +static void free_css_set_work(struct work_struct *work)
> +{
> +	struct css_set *cg = container_of(work, struct css_set, work);
> +	struct cg_cgroup_link *link;
> +	struct cg_cgroup_link *saved_link;
> +
> +	write_lock(&css_set_lock);
> +	list_for_each_entry_safe(link, saved_link, &cg->cg_links,
> +				 cg_link_list) {
> +		struct cgroup *cgrp = link->cgrp;
> +		list_del(&link->cg_link_list);
> +		list_del(&link->cgrp_link_list);
> +		if (atomic_dec_and_test(&cgrp->count)) {
> +			check_for_release(cgrp);
> +			cgroup_wakeup_rmdir_waiter(cgrp);
> +		}
> +		kfree(link);
> +	}
> +	write_unlock(&css_set_lock);
> +
> +	kfree(cg);
> +}
> +
>  static void free_css_set_rcu(struct rcu_head *obj)
>  {
>  	struct css_set *cg = container_of(obj, struct css_set, rcu_head);
> -	kfree(cg);
> +
> +	INIT_WORK(&cg->work, free_css_set_work);
> +	schedule_work(&cg->work);
>  }
>  
>  /* We don't maintain the lists running through each css_set to its
> @@ -348,8 +400,6 @@ static inline void get_css_set(struct css_set *cg)
>  
>  static void put_css_set(struct css_set *cg)
>  {
> -	struct cg_cgroup_link *link;
> -	struct cg_cgroup_link *saved_link;
>  	/*
>  	 * Ensure that the refcount doesn't hit zero while any readers
>  	 * can see it. Similar to atomic_dec_and_lock(), but for an
> @@ -363,21 +413,9 @@ static void put_css_set(struct css_set *cg)
>  		return;
>  	}
>  
> -	/* This css_set is dead. unlink it and release cgroup refcounts */
>  	hlist_del(&cg->hlist);
>  	css_set_count--;
>  
> -	list_for_each_entry_safe(link, saved_link, &cg->cg_links,
> -				 cg_link_list) {
> -		struct cgroup *cgrp = link->cgrp;
> -		list_del(&link->cg_link_list);
> -		list_del(&link->cgrp_link_list);
> -		if (atomic_dec_and_test(&cgrp->count))
> -			check_for_release(cgrp);
> -
> -		kfree(link);
> -	}
> -
>  	write_unlock(&css_set_lock);
>  	call_rcu(&cg->rcu_head, free_css_set_rcu);
>  }
> @@ -711,9 +749,9 @@ static struct cgroup *task_cgroup_from_root(struct task_struct *task,
>   * cgroup_attach_task(), which overwrites one tasks cgroup pointer with
>   * another.  It does so using cgroup_mutex, however there are
>   * several performance critical places that need to reference
> - * task->cgroup without the expense of grabbing a system global
> + * task->cgroups without the expense of grabbing a system global
>   * mutex.  Therefore except as noted below, when dereferencing or, as
> - * in cgroup_attach_task(), modifying a task'ss cgroup pointer we use
> + * in cgroup_attach_task(), modifying a task's cgroups pointer we use
>   * task_lock(), which acts on a spinlock (task->alloc_lock) already in
>   * the task_struct routinely used for such matters.
>   *
> @@ -895,33 +933,6 @@ static void cgroup_d_remove_dir(struct dentry *dentry)
>  }
>  
>  /*
> - * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
> - * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
> - * reference to css->refcnt. In general, this refcnt is expected to goes down
> - * to zero, soon.
> - *
> - * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
> - */
> -DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> -
> -static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
> -{
> -	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> -		wake_up_all(&cgroup_rmdir_waitq);
> -}
> -
> -void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
> -{
> -	css_get(css);
> -}
> -
> -void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
> -{
> -	cgroup_wakeup_rmdir_waiter(css->cgroup);
> -	css_put(css);
> -}
> -
> -/*
>   * Call with cgroup_mutex held. Drops reference counts on modules, including
>   * any duplicate ones that parse_cgroupfs_options took. If this function
>   * returns an error, no reference counts are touched.
> @@ -1788,7 +1799,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  			ss->attach(ss, cgrp, oldcgrp, tsk, false);
>  	}
>  	set_bit(CGRP_RELEASABLE, &cgrp->flags);
> -	synchronize_rcu();
> +	/* put_css_set will not destroy cg until after an RCU grace period */
>  	put_css_set(cg);
>  
>  	/*
> @@ -3546,6 +3557,21 @@ static int cgroup_clear_css_refs(struct cgroup *cgrp)
>  	return !failed;
>  }
>  
> +/* checks if all of the css_sets attached to a cgroup have a refcount of 0.
> + * Must be called with css_set_lock held */
> +static int cgroup_css_sets_empty(struct cgroup *cgrp)
> +{
> +	struct cg_cgroup_link *link;
> +
> +	list_for_each_entry(link, &cgrp->css_sets, cgrp_link_list) {
> +		struct css_set *cg = link->cg;
> +		if (atomic_read(&cg->refcount) > 0)
> +			return 0;
> +	}
> +
> +	return 1;
> +}
> +
>  static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
>  {
>  	struct cgroup *cgrp = dentry->d_fsdata;
> @@ -3558,7 +3584,7 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
>  	/* the vfs holds both inode->i_mutex already */
>  again:
>  	mutex_lock(&cgroup_mutex);
> -	if (atomic_read(&cgrp->count) != 0) {
> +	if (!cgroup_css_sets_empty(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
>  		return -EBUSY;
>  	}
> @@ -3591,7 +3617,7 @@ again:
>  
>  	mutex_lock(&cgroup_mutex);
>  	parent = cgrp->parent;
> -	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> +	if (!cgroup_css_sets_empty(cgrp) || !list_empty(&cgrp->children)) {
>  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		mutex_unlock(&cgroup_mutex);
>  		return -EBUSY;

Tested-by: Mike Bohan <mbohan@codeaurora.org>

I'm responding on Mike's behalf and adding him to this thread.  This
patch improves launch time of a test app from ~700ms to ~250ms on MSM,
with much lower variance across tests.  We also see UI latency
improvements, but have not quantified the gains.

- Bryan

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

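The free_css_set_rcu() above illustrates a general pattern for teardown
that is too heavy for an RCU callback: callbacks run in softirq context
and may not sleep, so the callback only bounces the real work to
process context.  A generic sketch of the pattern (the demo_* names are
hypothetical, not from the patch):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/workqueue.h>
#include <linux/slab.h>

struct demo_obj {
	struct rcu_head rcu_head;
	struct work_struct work;
	/* ... payload ... */
};

/* Process context: sleeping locks, kfree() etc. are all legal here. */
static void demo_free_work(struct work_struct *work)
{
	struct demo_obj *obj = container_of(work, struct demo_obj, work);

	/* heavyweight teardown goes here */
	kfree(obj);
}

/* Softirq context, after the grace period: may not sleep, so only
 * schedule the real teardown. */
static void demo_free_rcu(struct rcu_head *head)
{
	struct demo_obj *obj = container_of(head, struct demo_obj,
					    rcu_head);

	INIT_WORK(&obj->work, demo_free_work);
	schedule_work(&obj->work);
}

/* Called when the last reference is dropped. */
static void demo_put(struct demo_obj *obj)
{
	/* readers may still hold obj; free it after a grace period */
	call_rcu(&obj->rcu_head, demo_free_rcu);
}

The trade-off is that the cgroup refcount now drops a grace period plus
a workqueue dispatch after the last put_css_set(), which is exactly why
the rest of the patch teaches cgroup_rmdir() to wait.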

* Re: [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
  2010-11-24  5:37           ` [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup Colin Cross
       [not found]             ` <1290577024-12347-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
  2010-11-24 23:54             ` Paul Menage
@ 2011-01-28  1:17             ` Bryan Huntsman
       [not found]               ` <4D42192C.9000701-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
  2011-01-28  1:30               ` Paul Menage
  2 siblings, 2 replies; 55+ messages in thread
From: Bryan Huntsman @ 2011-01-28  1:17 UTC (permalink / raw)
  To: Colin Cross; +Cc: Paul Menage, Li Zefan, containers, linux-kernel, mbohan

On 11/23/2010 09:37 PM, Colin Cross wrote:
> Changes the meaning of CGRP_RELEASABLE to be set on any cgroup
> that has ever had a task or cgroup in it, or had css_get called
> on it.  The bit is set in cgroup_attach_task, cgroup_create,
> and __css_get.  It is not necessary to set the bit in
> cgroup_fork, as the task is either in the root cgroup, which
> can never be released, or the task it was forked from already
> set the bit in cgroup_attach_task.
> 
> Signed-off-by: Colin Cross <ccross@android.com>
> ---
>  include/linux/cgroup.h |   12 +--------
>  kernel/cgroup.c        |   54 ++++++++++++++++++++---------------------------
>  2 files changed, 25 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index ed4ba11..9e13078 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -84,12 +84,6 @@ enum {
>  	CSS_REMOVED, /* This CSS is dead */
>  };
>  
> -/* Caller must verify that the css is not for root cgroup */
> -static inline void __css_get(struct cgroup_subsys_state *css, int count)
> -{
> -	atomic_add(count, &css->refcnt);
> -}
> -
>  /*
>   * Call css_get() to hold a reference on the css; it can be used
>   * for a reference obtained via:
> @@ -97,6 +91,7 @@ static inline void __css_get(struct cgroup_subsys_state *css, int count)
>   * - task->cgroups for a locked task
>   */
>  
> +extern void __css_get(struct cgroup_subsys_state *css, int count);
>  static inline void css_get(struct cgroup_subsys_state *css)
>  {
>  	/* We don't need to reference count the root state */
> @@ -143,10 +138,7 @@ static inline void css_put(struct cgroup_subsys_state *css)
>  enum {
>  	/* Control Group is dead */
>  	CGRP_REMOVED,
> -	/*
> -	 * Control Group has previously had a child cgroup or a task,
> -	 * but no longer (only if CGRP_NOTIFY_ON_RELEASE is set)
> -	 */
> +	/* Control Group has ever had a child cgroup or a task */
>  	CGRP_RELEASABLE,
>  	/* Control Group requires release notifications to userspace */
>  	CGRP_NOTIFY_ON_RELEASE,
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 66a416b..34e855e 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -338,7 +338,15 @@ static void free_css_set_rcu(struct rcu_head *obj)
>   * compiled into their kernel but not actually in use */
>  static int use_task_css_set_links __read_mostly;
>  
> -static void __put_css_set(struct css_set *cg, int taskexit)
> +/*
> + * refcounted get/put for css_set objects
> + */
> +static inline void get_css_set(struct css_set *cg)
> +{
> +	atomic_inc(&cg->refcount);
> +}
> +
> +static void put_css_set(struct css_set *cg)
>  {
>  	struct cg_cgroup_link *link;
>  	struct cg_cgroup_link *saved_link;
> @@ -364,12 +372,8 @@ static void __put_css_set(struct css_set *cg, int taskexit)
>  		struct cgroup *cgrp = link->cgrp;
>  		list_del(&link->cg_link_list);
>  		list_del(&link->cgrp_link_list);
> -		if (atomic_dec_and_test(&cgrp->count) &&
> -		    notify_on_release(cgrp)) {
> -			if (taskexit)
> -				set_bit(CGRP_RELEASABLE, &cgrp->flags);
> +		if (atomic_dec_and_test(&cgrp->count))
>  			check_for_release(cgrp);
> -		}
>  
>  		kfree(link);
>  	}
> @@ -379,24 +383,6 @@ static void __put_css_set(struct css_set *cg, int taskexit)
>  }
>  
>  /*
> - * refcounted get/put for css_set objects
> - */
> -static inline void get_css_set(struct css_set *cg)
> -{
> -	atomic_inc(&cg->refcount);
> -}
> -
> -static inline void put_css_set(struct css_set *cg)
> -{
> -	__put_css_set(cg, 0);
> -}
> -
> -static inline void put_css_set_taskexit(struct css_set *cg)
> -{
> -	__put_css_set(cg, 1);
> -}
> -
> -/*
>   * compare_css_sets - helper function for find_existing_css_set().
>   * @cg: candidate css_set being tested
>   * @old_cg: existing css_set for a task
> @@ -1801,7 +1787,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  		if (ss->attach)
>  			ss->attach(ss, cgrp, oldcgrp, tsk, false);
>  	}
> -	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> +	set_bit(CGRP_RELEASABLE, &cgrp->flags);
>  	synchronize_rcu();
>  	put_css_set(cg);
>  
> @@ -3427,6 +3413,8 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
>  	if (err < 0)
>  		goto err_remove;
>  
> +	set_bit(CGRP_RELEASABLE, &parent->flags);
> +
>  	/* The cgroup directory was pre-locked for us */
>  	BUG_ON(!mutex_is_locked(&cgrp->dentry->d_inode->i_mutex));
>  
> @@ -3645,7 +3633,6 @@ again:
>  	cgroup_d_remove_dir(d);
>  	dput(d);
>  
> -	set_bit(CGRP_RELEASABLE, &parent->flags);
>  	check_for_release(parent);
>  
>  	/*
> @@ -4240,7 +4227,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
>  	tsk->cgroups = &init_css_set;
>  	task_unlock(tsk);
>  	if (cg)
> -		put_css_set_taskexit(cg);
> +		put_css_set(cg);
>  }
>  
>  /**
> @@ -4410,6 +4397,14 @@ static void check_for_release(struct cgroup *cgrp)
>  }
>  
>  /* Caller must verify that the css is not for root cgroup */
> +void __css_get(struct cgroup_subsys_state *css, int count)
> +{
> +	atomic_add(count, &css->refcnt);
> +	set_bit(CGRP_RELEASABLE, &css->cgroup->flags);
> +}
> +EXPORT_SYMBOL_GPL(__css_get);
> +
> +/* Caller must verify that the css is not for root cgroup */
>  void __css_put(struct cgroup_subsys_state *css, int count)
>  {
>  	struct cgroup *cgrp = css->cgroup;
> @@ -4417,10 +4412,7 @@ void __css_put(struct cgroup_subsys_state *css, int count)
>  	rcu_read_lock();
>  	val = atomic_sub_return(count, &css->refcnt);
>  	if (val == 1) {
> -		if (notify_on_release(cgrp)) {
> -			set_bit(CGRP_RELEASABLE, &cgrp->flags);
> -			check_for_release(cgrp);
> -		}
> +		check_for_release(cgrp);
>  		cgroup_wakeup_rmdir_waiter(cgrp);
>  	}
>  	rcu_read_unlock();

Tested-by: Mike Bohan <mbohan@codeaurora.org>

I'm responding on Mike's behalf and adding him to this thread.  This
patch improves launch time of a test app from ~700ms to ~250ms on MSM,
with much lower variance across tests.  We also see UI latency
improvements, but have not quantified the gains.

- Bryan

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

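With CGRP_RELEASABLE now set once a cgroup has ever been used, rather
than when it empties, the release decision reduces to the sticky flag,
the user-visible notify_on_release setting, and the refcount.  As a
simplified sketch (a hypothetical helper; the real
check_for_release()/cgroup_is_releasable() pair also tests children
and CGRP_REMOVED):

#include <linux/cgroup.h>

static bool demo_should_notify(struct cgroup *cgrp)
{
	return test_bit(CGRP_RELEASABLE, &cgrp->flags) &&	/* ever used */
	       test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags) &&
	       atomic_read(&cgrp->count) == 0;			/* now empty */
}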

* Re: [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
  2011-01-28  1:17             ` Bryan Huntsman
       [not found]               ` <4D42192C.9000701-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
@ 2011-01-28  1:30               ` Paul Menage
  2011-01-28  1:48                 ` Michael Bohan
       [not found]                 ` <AANLkTin7B51maXHRH+FNmZ14bmWmEp9P2=2QTNqgq_Fi-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 55+ messages in thread
From: Paul Menage @ 2011-01-28  1:30 UTC (permalink / raw)
  To: Bryan Huntsman; +Cc: Colin Cross, Li Zefan, containers, linux-kernel, mbohan

On Thu, Jan 27, 2011 at 5:17 PM, Bryan Huntsman <bryanh@codeaurora.org> wrote:
>
> Tested-by: Mike Bohan <mbohan@codeaurora.org>
>
> I'm responding on Mike's behalf and adding him to this thread.  This
> patch improves launch time of a test app from ~700ms to ~250ms on MSM,
> with much lower variance across tests.  We also see UI latency
> improvements, but have not quantified the gains.
>

Is this attached to the wrong patch? I'd thought that it was the other
patch (removing the synchronize_rcu()) that's the performance booster.
This one is more about preserving the semantics of the notification
API.

Paul


* Re: [PATCH 1/2] cgroup: Set CGRP_RELEASABLE when adding to a cgroup
  2011-01-28  1:30               ` Paul Menage
@ 2011-01-28  1:48                 ` Michael Bohan
       [not found]                 ` <AANLkTin7B51maXHRH+FNmZ14bmWmEp9P2=2QTNqgq_Fi-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 0 replies; 55+ messages in thread
From: Michael Bohan @ 2011-01-28  1:48 UTC (permalink / raw)
  To: Paul Menage
  Cc: Bryan Huntsman, Colin Cross, Li Zefan, containers, linux-kernel

On 1/27/2011 5:30 PM, Paul Menage wrote:
> On Thu, Jan 27, 2011 at 5:17 PM, Bryan Huntsman<bryanh@codeaurora.org>  wrote:
>>
>> Tested-by: Mike Bohan<mbohan@codeaurora.org>
>>
>> I'm responding on Mike's behalf and adding him to this thread.  This
>> patch improves launch time of a test app from ~700ms to ~250ms on MSM,
>> with much lower variance across tests.  We also see UI latency
>> improvements, but have not quantified the gains.
>>
>
> Is this attached to the wrong patch? I'd thought that it was the other
> patch (removing the synchronize_rcu()) that's the performance booster.
> This one is more about preserving the semantics of the notification
> API.

You are correct. "[PATCH 2/2] cgroup: Remove call to synchronize_rcu in 
cgroup_attach_task" improved the performance.

To be more precise, I tested this patch (i.e. "cgroup: Set 
CGRP_RELEASABLE when adding to a cgroup") to the degree that it didn't 
appear to cause any stability or functional regressions when performing 
the simple benchmark procedure described above. I also tested "[PATCH 
2/2] cgroup: Remove call to synchronize_rcu in cgroup_attach_task" 
independently of this patch to verify that it alone improved the 
performance.

Thanks,
Mike

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum


