All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHSET] cgroup: allow dropping RCU read lock while iterating
@ 2013-05-21  1:50 Tejun Heo
       [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2013-05-21  1:50 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Currently all cgroup iterators require the whole traversal to be
contained in a single RCU read critical section, which can be too
restrictive as there are times when blocking operations are necessary
during traversal.  This forces controllers to implement specific
workarounds in those cases - building separate iteration list, punting
actual operations to work items and so on.

This patchset updates cgroup iterators so that they allow dropping RCU
read lock while iteration is in progress so that controllers which
require sleeping during iteration don't need to implement their own
mechanisms.

Dropping RCU read lock during iteration is unsafe because
cgroup->sibling.next can't be trusted once RCU read lock is dropped.
The sibling list is a RCU list and when a cgroup is removed the next
pointer is retained to keep RCU traversal working.  If the next
sibling is removed while RCU read lock is dropped, the removed current
cgroup's next won't be updated and the next sibling may complete its
grace period and get freed leaving the next pointer dangling.

Working around the problem is relatiely simple.  Whether
->sibling.next can be trusted can be trusted can be decided by looking
at CGRP_REMOVED - as cgroup removals are fully serialized, the flag is
guaranteed to be visible before the next sibling finishes its grace
period.  For those cases, each cgroup is assigned a monotonically
increasing serial number.  Because new cgroups are always appeneded to
the children list, it's guaranteed that all children list are sorted
in the ascending order of the serial numbers.  When the next pointer
can't be trusted, the next sibling can be located by walking the
parent's children list from the beginning looking for the first cgroup
with higher serial number.

The above is implemented in cgroup_next_sibling() and all iterators
are updated to use it to find out the next sibling thus allowing
droppping RCU read lock while iteration is in progress.  This patchset
replaces separate iteration list in device_cgroup with direct
descendant walk and there will be further patches making use of this
update.

This patchset contains the following five patches.

 0001-cgroup-fix-a-subtle-bug-in-descendant-pre-order-walk.patch
 0002-cgroup-make-cgroup_is_removed-static.patch
 0003-cgroup-add-cgroup-serial_nr-and-implement-cgroup_nex.patch
 0004-cgroup-update-iterators-to-use-cgroup_next_sibling.patch
 0005-device_cgroup-simplify-cgroup-tree-walk-in-propagate.patch

0001 fixes a subtle iteration bug.  Will be applied to for-3.10-fixes.

0002 is a trivial prep patch.

0003 implements cgroup_next_sibling() which can find out the next
sibling regardless of the state of the current cgroup.

0004 updates all iterators to use cgroup_next_sibling().

0005 replaces iteration list work around in device_cgroup with direct
iteration.

This patchset is on top of cgroup/for-3.11 23958e729e ("cgroup.h:
remove some functions that are now gone") and available in the
following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-interruptible-iter

diffstat follows.

 include/linux/cgroup.h   |   31 +++++++++++---
 kernel/cgroup.c          |   98 ++++++++++++++++++++++++++++++++++++++++-------
 security/device_cgroup.c |   56 ++++++++------------------
 3 files changed, 128 insertions(+), 57 deletions(-)

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk
       [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-05-21  1:50   ` Tejun Heo
  2013-05-21  1:50   ` Tejun Heo
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2013-05-21  1:50 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	stable-u79uwXL29TY76Z2rM5mHXA, mhocko-AlSwsSmVLrQ,
	hannes-druUgvl0LCNAfugRpC6u6w, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

When cgroup_next_descendant_pre() initiates a walk, it checks whether
the subtree root doesn't have any children and if not returns NULL.
Later code assumes that the subtree isn't empty.  This is broken
because the subtree may become empty inbetween, which can lead to the
traversal escaping the subtree by walking to the sibling of the
subtree root.

There's no reason to have the early exit path.  Remove it along with
the later assumption that the subtree isn't empty.  This simplifies
the code a bit and fixes the subtle bug.

While at it, fix the comment of cgroup_for_each_descendant_pre() which
was incorrectly referring to ->css_offline() instead of
->css_online().

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 include/linux/cgroup.h | 2 +-
 kernel/cgroup.c        | 9 +++------
 2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4f6f513..1df5f69 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -709,7 +709,7 @@ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos);
  *
  * If a subsystem synchronizes against the parent in its ->css_online() and
  * before starting iterating, and synchronizes against @pos on each
- * iteration, any descendant cgroup which finished ->css_offline() is
+ * iteration, any descendant cgroup which finished ->css_online() is
  * guaranteed to be visible in the future iterations.
  *
  * In other words, the following guarantees that a descendant can't escape
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6b2b1d9..f20f80c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2990,11 +2990,8 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
 	/* if first iteration, pretend we just visited @cgroup */
-	if (!pos) {
-		if (list_empty(&cgroup->children))
-			return NULL;
+	if (!pos)
 		pos = cgroup;
-	}
 
 	/* visit the first child if exists */
 	next = list_first_or_null_rcu(&pos->children, struct cgroup, sibling);
@@ -3002,14 +2999,14 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
 		return next;
 
 	/* no child, visit my or the closest ancestor's next sibling */
-	do {
+	while (pos != cgroup) {
 		next = list_entry_rcu(pos->sibling.next, struct cgroup,
 				      sibling);
 		if (&next->sibling != &pos->parent->children)
 			return next;
 
 		pos = pos->parent;
-	} while (pos != cgroup);
+	}
 
 	return NULL;
 }
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk
       [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-05-21  1:50   ` [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk Tejun Heo
@ 2013-05-21  1:50   ` Tejun Heo
       [not found]     ` <1369101025-28335-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-05-21  1:50   ` [PATCH 2/5] cgroup: make cgroup_is_removed() static Tejun Heo
                     ` (10 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2013-05-21  1:50 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	hannes-druUgvl0LCNAfugRpC6u6w,
	bsingharora-Re5JQEeQqe8AvxtiuMwx3w,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Tejun Heo,
	stable-u79uwXL29TY76Z2rM5mHXA

When cgroup_next_descendant_pre() initiates a walk, it checks whether
the subtree root doesn't have any children and if not returns NULL.
Later code assumes that the subtree isn't empty.  This is broken
because the subtree may become empty inbetween, which can lead to the
traversal escaping the subtree by walking to the sibling of the
subtree root.

There's no reason to have the early exit path.  Remove it along with
the later assumption that the subtree isn't empty.  This simplifies
the code a bit and fixes the subtle bug.

While at it, fix the comment of cgroup_for_each_descendant_pre() which
was incorrectly referring to ->css_offline() instead of
->css_online().

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 include/linux/cgroup.h | 2 +-
 kernel/cgroup.c        | 9 +++------
 2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4f6f513..1df5f69 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -709,7 +709,7 @@ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos);
  *
  * If a subsystem synchronizes against the parent in its ->css_online() and
  * before starting iterating, and synchronizes against @pos on each
- * iteration, any descendant cgroup which finished ->css_offline() is
+ * iteration, any descendant cgroup which finished ->css_online() is
  * guaranteed to be visible in the future iterations.
  *
  * In other words, the following guarantees that a descendant can't escape
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6b2b1d9..f20f80c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2990,11 +2990,8 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
 	/* if first iteration, pretend we just visited @cgroup */
-	if (!pos) {
-		if (list_empty(&cgroup->children))
-			return NULL;
+	if (!pos)
 		pos = cgroup;
-	}
 
 	/* visit the first child if exists */
 	next = list_first_or_null_rcu(&pos->children, struct cgroup, sibling);
@@ -3002,14 +2999,14 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
 		return next;
 
 	/* no child, visit my or the closest ancestor's next sibling */
-	do {
+	while (pos != cgroup) {
 		next = list_entry_rcu(pos->sibling.next, struct cgroup,
 				      sibling);
 		if (&next->sibling != &pos->parent->children)
 			return next;
 
 		pos = pos->parent;
-	} while (pos != cgroup);
+	}
 
 	return NULL;
 }
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/5] cgroup: make cgroup_is_removed() static
       [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (2 preceding siblings ...)
  2013-05-21  1:50   ` [PATCH 2/5] cgroup: make cgroup_is_removed() static Tejun Heo
@ 2013-05-21  1:50   ` Tejun Heo
  2013-05-21  1:50   ` [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling() Tejun Heo
                     ` (8 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2013-05-21  1:50 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

cgroup_is_removed() no longer has external users and it shouldn't grow
any - controllers should deal with cgroup_subsys_state on/offline
state instead of cgroup removal state.  Make it static.

While at it, make it return bool.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 include/linux/cgroup.h | 1 -
 kernel/cgroup.c        | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1df5f69..8d9f3c9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -538,7 +538,6 @@ static inline const char *cgroup_name(const struct cgroup *cgrp)
 int cgroup_add_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_rm_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 
-int cgroup_is_removed(const struct cgroup *cgrp);
 bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor);
 
 int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f20f80c..3222f93 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -226,7 +226,7 @@ static int css_refcnt(struct cgroup_subsys_state *css)
 }
 
 /* convenient tests for these bits */
-inline int cgroup_is_removed(const struct cgroup *cgrp)
+static inline bool cgroup_is_removed(const struct cgroup *cgrp)
 {
 	return test_bit(CGRP_REMOVED, &cgrp->flags);
 }
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/5] cgroup: make cgroup_is_removed() static
       [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-05-21  1:50   ` [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk Tejun Heo
  2013-05-21  1:50   ` Tejun Heo
@ 2013-05-21  1:50   ` Tejun Heo
       [not found]     ` <1369101025-28335-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-05-21  1:50   ` Tejun Heo
                     ` (9 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2013-05-21  1:50 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	hannes-druUgvl0LCNAfugRpC6u6w,
	bsingharora-Re5JQEeQqe8AvxtiuMwx3w,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Tejun Heo

cgroup_is_removed() no longer has external users and it shouldn't grow
any - controllers should deal with cgroup_subsys_state on/offline
state instead of cgroup removal state.  Make it static.

While at it, make it return bool.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 include/linux/cgroup.h | 1 -
 kernel/cgroup.c        | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1df5f69..8d9f3c9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -538,7 +538,6 @@ static inline const char *cgroup_name(const struct cgroup *cgrp)
 int cgroup_add_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_rm_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 
-int cgroup_is_removed(const struct cgroup *cgrp);
 bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor);
 
 int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f20f80c..3222f93 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -226,7 +226,7 @@ static int css_refcnt(struct cgroup_subsys_state *css)
 }
 
 /* convenient tests for these bits */
-inline int cgroup_is_removed(const struct cgroup *cgrp)
+static inline bool cgroup_is_removed(const struct cgroup *cgrp)
 {
 	return test_bit(CGRP_REMOVED, &cgrp->flags);
 }
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
       [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (4 preceding siblings ...)
  2013-05-21  1:50   ` [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling() Tejun Heo
@ 2013-05-21  1:50   ` Tejun Heo
  2013-05-21  1:50   ` [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling() Tejun Heo
                     ` (6 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2013-05-21  1:50 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section.  This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive.  This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.

It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible.  A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling.  IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.

This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next.  This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order.  A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.

This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.

Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 include/linux/cgroup.h | 10 ++++++++
 kernel/cgroup.c        | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8d9f3c9..ee041a0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -189,6 +189,14 @@ struct cgroup {
 	struct dentry *dentry;		/* cgroup fs entry, RCU protected */
 
 	/*
+	 * Monotonically increasing unique serial number which defines a
+	 * uniform order among all cgroups.  It's guaranteed that all
+	 * ->children lists are in the ascending order of ->serial_nr.
+	 * It's used to allow interrupting and resuming iterations.
+	 */
+	u64 serial_nr;
+
+	/*
 	 * This is a copy of dentry->d_name, and it's needed because
 	 * we can't use dentry->d_name in cgroup_path().
 	 *
@@ -675,6 +683,8 @@ static inline struct cgroup* task_cgroup(struct task_struct *task,
 	return task_subsys_state(task, subsys_id)->cgroup;
 }
 
+struct cgroup *cgroup_next_sibling(struct cgroup *pos);
+
 /**
  * cgroup_for_each_child - iterate through children of a cgroup
  * @pos: the cgroup * to use as the loop cursor
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3222f93..bc757d7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2975,6 +2975,55 @@ static void cgroup_enable_task_cg_lists(void)
 }
 
 /**
+ * cgroup_next_sibling - find the next sibling of a given cgroup
+ * @pos: the current cgroup
+ *
+ * This function returns the next sibling of @pos and should be called
+ * under RCU read lock.  The only requirement is that @pos is accessible.
+ * The next sibling is guaranteed to be returned regardless of @pos's
+ * state.
+ */
+struct cgroup *cgroup_next_sibling(struct cgroup *pos)
+{
+	struct cgroup *next;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	/*
+	 * @pos could already have been removed.  Once a cgroup is removed,
+	 * its ->sibling.next is no longer updated when its next sibling
+	 * changes.  As CGRP_REMOVED is set on removal which is fully
+	 * serialized, if we see it unasserted, it's guaranteed that the
+	 * next sibling hasn't finished its grace period even if it's
+	 * already removed, and thus safe to dereference from this RCU
+	 * critical section.  If ->sibling.next is inaccessible,
+	 * cgroup_is_removed() is guaranteed to be visible as %true here.
+	 */
+	if (likely(!cgroup_is_removed(pos))) {
+		next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
+		if (&next->sibling != &pos->parent->children)
+			return next;
+		return NULL;
+	}
+
+	/*
+	 * Can't dereference the next pointer.  Each cgroup is given a
+	 * monotonically increasing unique serial number and always
+	 * appended to the sibling list, so the next one can be found by
+	 * walking the parent's children until we see a cgroup with higher
+	 * serial number than @pos's.
+	 *
+	 * While this path can be slow, it's taken only when either the
+	 * current cgroup is removed or iteration and reomval race.
+	 */
+	list_for_each_entry_rcu(next, &pos->parent->children, sibling)
+		if (next->serial_nr > pos->serial_nr)
+			return next;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(cgroup_next_sibling);
+
+/**
  * cgroup_next_descendant_pre - find the next descendant for pre-order walk
  * @pos: the current position (%NULL to initiate traversal)
  * @cgroup: cgroup whose descendants to walk
@@ -4136,6 +4185,7 @@ static void offline_css(struct cgroup_subsys *ss, struct cgroup *cgrp)
 static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
 			     umode_t mode)
 {
+	static atomic64_t serial_nr_cursor = ATOMIC64_INIT(0);
 	struct cgroup *cgrp;
 	struct cgroup_name *name;
 	struct cgroupfs_root *root = parent->root;
@@ -4216,6 +4266,14 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
 		goto err_free_all;
 	lockdep_assert_held(&dentry->d_inode->i_mutex);
 
+	/*
+	 * Assign a monotonically increasing serial number.  With the list
+	 * appending below, it guarantees that sibling cgroups are always
+	 * sorted in the ascending serial number order on the parent's
+	 * ->children.
+	 */
+	cgrp->serial_nr = atomic64_inc_return(&serial_nr_cursor);
+
 	/* allocation complete, commit to creation */
 	list_add_tail(&cgrp->allcg_node, &root->allcg_list);
 	list_add_tail_rcu(&cgrp->sibling, &cgrp->parent->children);
@@ -4303,6 +4361,10 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	 * removed.  This makes future css_tryget() and child creation
 	 * attempts fail thus maintaining the removal conditions verified
 	 * above.
+	 *
+	 * Note that CGRP_REMVOED clearing is depended upon by
+	 * cgroup_next_sibling() to resume iteration after dropping RCU
+	 * read lock.  See cgroup_next_sibling() for details.
 	 */
 	for_each_subsys(cgrp->root, ss) {
 		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
       [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (3 preceding siblings ...)
  2013-05-21  1:50   ` Tejun Heo
@ 2013-05-21  1:50   ` Tejun Heo
       [not found]     ` <1369101025-28335-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-05-21  1:50   ` Tejun Heo
                     ` (7 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2013-05-21  1:50 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	hannes-druUgvl0LCNAfugRpC6u6w,
	bsingharora-Re5JQEeQqe8AvxtiuMwx3w,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Tejun Heo

Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section.  This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive.  This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.

It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible.  A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling.  IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.

This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next.  This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order.  A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.

This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.

Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 include/linux/cgroup.h | 10 ++++++++
 kernel/cgroup.c        | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8d9f3c9..ee041a0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -189,6 +189,14 @@ struct cgroup {
 	struct dentry *dentry;		/* cgroup fs entry, RCU protected */
 
 	/*
+	 * Monotonically increasing unique serial number which defines a
+	 * uniform order among all cgroups.  It's guaranteed that all
+	 * ->children lists are in the ascending order of ->serial_nr.
+	 * It's used to allow interrupting and resuming iterations.
+	 */
+	u64 serial_nr;
+
+	/*
 	 * This is a copy of dentry->d_name, and it's needed because
 	 * we can't use dentry->d_name in cgroup_path().
 	 *
@@ -675,6 +683,8 @@ static inline struct cgroup* task_cgroup(struct task_struct *task,
 	return task_subsys_state(task, subsys_id)->cgroup;
 }
 
+struct cgroup *cgroup_next_sibling(struct cgroup *pos);
+
 /**
  * cgroup_for_each_child - iterate through children of a cgroup
  * @pos: the cgroup * to use as the loop cursor
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3222f93..bc757d7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2975,6 +2975,55 @@ static void cgroup_enable_task_cg_lists(void)
 }
 
 /**
+ * cgroup_next_sibling - find the next sibling of a given cgroup
+ * @pos: the current cgroup
+ *
+ * This function returns the next sibling of @pos and should be called
+ * under RCU read lock.  The only requirement is that @pos is accessible.
+ * The next sibling is guaranteed to be returned regardless of @pos's
+ * state.
+ */
+struct cgroup *cgroup_next_sibling(struct cgroup *pos)
+{
+	struct cgroup *next;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	/*
+	 * @pos could already have been removed.  Once a cgroup is removed,
+	 * its ->sibling.next is no longer updated when its next sibling
+	 * changes.  As CGRP_REMOVED is set on removal which is fully
+	 * serialized, if we see it unasserted, it's guaranteed that the
+	 * next sibling hasn't finished its grace period even if it's
+	 * already removed, and thus safe to dereference from this RCU
+	 * critical section.  If ->sibling.next is inaccessible,
+	 * cgroup_is_removed() is guaranteed to be visible as %true here.
+	 */
+	if (likely(!cgroup_is_removed(pos))) {
+		next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
+		if (&next->sibling != &pos->parent->children)
+			return next;
+		return NULL;
+	}
+
+	/*
+	 * Can't dereference the next pointer.  Each cgroup is given a
+	 * monotonically increasing unique serial number and always
+	 * appended to the sibling list, so the next one can be found by
+	 * walking the parent's children until we see a cgroup with higher
+	 * serial number than @pos's.
+	 *
+	 * While this path can be slow, it's taken only when either the
+	 * current cgroup is removed or iteration and reomval race.
+	 */
+	list_for_each_entry_rcu(next, &pos->parent->children, sibling)
+		if (next->serial_nr > pos->serial_nr)
+			return next;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(cgroup_next_sibling);
+
+/**
  * cgroup_next_descendant_pre - find the next descendant for pre-order walk
  * @pos: the current position (%NULL to initiate traversal)
  * @cgroup: cgroup whose descendants to walk
@@ -4136,6 +4185,7 @@ static void offline_css(struct cgroup_subsys *ss, struct cgroup *cgrp)
 static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
 			     umode_t mode)
 {
+	static atomic64_t serial_nr_cursor = ATOMIC64_INIT(0);
 	struct cgroup *cgrp;
 	struct cgroup_name *name;
 	struct cgroupfs_root *root = parent->root;
@@ -4216,6 +4266,14 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
 		goto err_free_all;
 	lockdep_assert_held(&dentry->d_inode->i_mutex);
 
+	/*
+	 * Assign a monotonically increasing serial number.  With the list
+	 * appending below, it guarantees that sibling cgroups are always
+	 * sorted in the ascending serial number order on the parent's
+	 * ->children.
+	 */
+	cgrp->serial_nr = atomic64_inc_return(&serial_nr_cursor);
+
 	/* allocation complete, commit to creation */
 	list_add_tail(&cgrp->allcg_node, &root->allcg_list);
 	list_add_tail_rcu(&cgrp->sibling, &cgrp->parent->children);
@@ -4303,6 +4361,10 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	 * removed.  This makes future css_tryget() and child creation
 	 * attempts fail thus maintaining the removal conditions verified
 	 * above.
+	 *
+	 * Note that CGRP_REMVOED clearing is depended upon by
+	 * cgroup_next_sibling() to resume iteration after dropping RCU
+	 * read lock.  See cgroup_next_sibling() for details.
 	 */
 	for_each_subsys(cgrp->root, ss) {
 		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling()
       [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (6 preceding siblings ...)
  2013-05-21  1:50   ` [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling() Tejun Heo
@ 2013-05-21  1:50   ` Tejun Heo
  2013-05-21  1:50   ` [PATCH 5/5] device_cgroup: simplify cgroup tree walk in propagate_exception() Tejun Heo
                     ` (4 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2013-05-21  1:50 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

This patch converts cgroup_for_each_child(),
cgroup_next_descendant_pre/post() and thus
cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling()
instead of manually dereferencing ->sibling.next.

The only reason the iterators couldn't allow dropping RCU read lock
while iteration is in progress was because they couldn't determine the
next sibling safely once RCU read lock is dropped.  Using
cgroup_next_sibling() removes that problem and enables all iterators
to allow dropping RCU read lock in the middle.  Comments are updated
accordingly.

This makes the iterators easier to use and will simplify controllers.

Note that @cgroup argument is renamed to @cgrp in
cgroup_for_each_child() because it conflicts with "struct cgroup" used
in the new macro body.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 include/linux/cgroup.h | 18 ++++++++++++++----
 kernel/cgroup.c        | 25 +++++++++++++++++++------
 2 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ee041a0..d0ad379 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -688,9 +688,9 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos);
 /**
  * cgroup_for_each_child - iterate through children of a cgroup
  * @pos: the cgroup * to use as the loop cursor
- * @cgroup: cgroup whose children to walk
+ * @cgrp: cgroup whose children to walk
  *
- * Walk @cgroup's children.  Must be called under rcu_read_lock().  A child
+ * Walk @cgrp's children.  Must be called under rcu_read_lock().  A child
  * cgroup which hasn't finished ->css_online() or already has finished
  * ->css_offline() may show up during traversal and it's each subsystem's
  * responsibility to verify that each @pos is alive.
@@ -698,9 +698,15 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos);
  * If a subsystem synchronizes against the parent in its ->css_online() and
  * before starting iterating, a cgroup which finished ->css_online() is
  * guaranteed to be visible in the future iterations.
+ *
+ * It is allowed to temporarily drop RCU read lock during iteration.  The
+ * caller is responsible for ensuring that @pos remains accessible until
+ * the start of the next iteration by, for example, bumping the css refcnt.
  */
-#define cgroup_for_each_child(pos, cgroup)				\
-	list_for_each_entry_rcu(pos, &(cgroup)->children, sibling)
+#define cgroup_for_each_child(pos, cgrp)				\
+	for ((pos) = list_first_or_null_rcu(&(cgrp)->children,		\
+					    struct cgroup, sibling);	\
+	     (pos); (pos) = cgroup_next_sibling((pos)))
 
 struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
 					  struct cgroup *cgroup);
@@ -759,6 +765,10 @@ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos);
  * Alternatively, a subsystem may choose to use a single global lock to
  * synchronize ->css_online() and ->css_offline() against tree-walking
  * operations.
+ *
+ * It is allowed to temporarily drop RCU read lock during iteration.  The
+ * caller is responsible for ensuring that @pos remains accessible until
+ * the start of the next iteration by, for example, bumping the css refcnt.
  */
 #define cgroup_for_each_descendant_pre(pos, cgroup)			\
 	for (pos = cgroup_next_descendant_pre(NULL, (cgroup)); (pos);	\
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bc757d7..21b1ee4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3030,6 +3030,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_sibling);
  *
  * To be used by cgroup_for_each_descendant_pre().  Find the next
  * descendant to visit for pre-order traversal of @cgroup's descendants.
+ *
+ * While this function requires RCU read locking, it doesn't require the
+ * whole traversal to be contained in a single RCU critical section.  This
+ * function will return the correct next descendant as long as both @pos
+ * and @cgroup are accessible and @pos is a descendant of @cgroup.
  */
 struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
 					  struct cgroup *cgroup)
@@ -3049,11 +3054,9 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
 
 	/* no child, visit my or the closest ancestor's next sibling */
 	while (pos != cgroup) {
-		next = list_entry_rcu(pos->sibling.next, struct cgroup,
-				      sibling);
-		if (&next->sibling != &pos->parent->children)
+		next = cgroup_next_sibling(pos);
+		if (next)
 			return next;
-
 		pos = pos->parent;
 	}
 
@@ -3068,6 +3071,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre);
  * Return the rightmost descendant of @pos.  If there's no descendant,
  * @pos is returned.  This can be used during pre-order traversal to skip
  * subtree of @pos.
+ *
+ * While this function requires RCU read locking, it doesn't require the
+ * whole traversal to be contained in a single RCU critical section.  This
+ * function will return the correct rightmost descendant as long as @pos is
+ * accessible.
  */
 struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos)
 {
@@ -3107,6 +3115,11 @@ static struct cgroup *cgroup_leftmost_descendant(struct cgroup *pos)
  *
  * To be used by cgroup_for_each_descendant_post().  Find the next
  * descendant to visit for post-order traversal of @cgroup's descendants.
+ *
+ * While this function requires RCU read locking, it doesn't require the
+ * whole traversal to be contained in a single RCU critical section.  This
+ * function will return the correct next descendant as long as both @pos
+ * and @cgroup are accessible and @pos is a descendant of @cgroup.
  */
 struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
 					   struct cgroup *cgroup)
@@ -3122,8 +3135,8 @@ struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
 	}
 
 	/* if there's an unvisited sibling, visit its leftmost descendant */
-	next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
-	if (&next->sibling != &pos->parent->children)
+	next = cgroup_next_sibling(pos);
+	if (next)
 		return cgroup_leftmost_descendant(next);
 
 	/* no sibling left, visit parent */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling()
       [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (5 preceding siblings ...)
  2013-05-21  1:50   ` Tejun Heo
@ 2013-05-21  1:50   ` Tejun Heo
       [not found]     ` <1369101025-28335-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-05-21  1:50   ` Tejun Heo
                     ` (5 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2013-05-21  1:50 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	hannes-druUgvl0LCNAfugRpC6u6w,
	bsingharora-Re5JQEeQqe8AvxtiuMwx3w,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Tejun Heo

This patch converts cgroup_for_each_child(),
cgroup_next_descendant_pre/post() and thus
cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling()
instead of manually dereferencing ->sibling.next.

The only reason the iterators couldn't allow dropping RCU read lock
while iteration is in progress was because they couldn't determine the
next sibling safely once RCU read lock is dropped.  Using
cgroup_next_sibling() removes that problem and enables all iterators
to allow dropping RCU read lock in the middle.  Comments are updated
accordingly.

This makes the iterators easier to use and will simplify controllers.

Note that @cgroup argument is renamed to @cgrp in
cgroup_for_each_child() because it conflicts with "struct cgroup" used
in the new macro body.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 include/linux/cgroup.h | 18 ++++++++++++++----
 kernel/cgroup.c        | 25 +++++++++++++++++++------
 2 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ee041a0..d0ad379 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -688,9 +688,9 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos);
 /**
  * cgroup_for_each_child - iterate through children of a cgroup
  * @pos: the cgroup * to use as the loop cursor
- * @cgroup: cgroup whose children to walk
+ * @cgrp: cgroup whose children to walk
  *
- * Walk @cgroup's children.  Must be called under rcu_read_lock().  A child
+ * Walk @cgrp's children.  Must be called under rcu_read_lock().  A child
  * cgroup which hasn't finished ->css_online() or already has finished
  * ->css_offline() may show up during traversal and it's each subsystem's
  * responsibility to verify that each @pos is alive.
@@ -698,9 +698,15 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos);
  * If a subsystem synchronizes against the parent in its ->css_online() and
  * before starting iterating, a cgroup which finished ->css_online() is
  * guaranteed to be visible in the future iterations.
+ *
+ * It is allowed to temporarily drop RCU read lock during iteration.  The
+ * caller is responsible for ensuring that @pos remains accessible until
+ * the start of the next iteration by, for example, bumping the css refcnt.
  */
-#define cgroup_for_each_child(pos, cgroup)				\
-	list_for_each_entry_rcu(pos, &(cgroup)->children, sibling)
+#define cgroup_for_each_child(pos, cgrp)				\
+	for ((pos) = list_first_or_null_rcu(&(cgrp)->children,		\
+					    struct cgroup, sibling);	\
+	     (pos); (pos) = cgroup_next_sibling((pos)))
 
 struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
 					  struct cgroup *cgroup);
@@ -759,6 +765,10 @@ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos);
  * Alternatively, a subsystem may choose to use a single global lock to
  * synchronize ->css_online() and ->css_offline() against tree-walking
  * operations.
+ *
+ * It is allowed to temporarily drop RCU read lock during iteration.  The
+ * caller is responsible for ensuring that @pos remains accessible until
+ * the start of the next iteration by, for example, bumping the css refcnt.
  */
 #define cgroup_for_each_descendant_pre(pos, cgroup)			\
 	for (pos = cgroup_next_descendant_pre(NULL, (cgroup)); (pos);	\
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bc757d7..21b1ee4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3030,6 +3030,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_sibling);
  *
  * To be used by cgroup_for_each_descendant_pre().  Find the next
  * descendant to visit for pre-order traversal of @cgroup's descendants.
+ *
+ * While this function requires RCU read locking, it doesn't require the
+ * whole traversal to be contained in a single RCU critical section.  This
+ * function will return the correct next descendant as long as both @pos
+ * and @cgroup are accessible and @pos is a descendant of @cgroup.
  */
 struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
 					  struct cgroup *cgroup)
@@ -3049,11 +3054,9 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
 
 	/* no child, visit my or the closest ancestor's next sibling */
 	while (pos != cgroup) {
-		next = list_entry_rcu(pos->sibling.next, struct cgroup,
-				      sibling);
-		if (&next->sibling != &pos->parent->children)
+		next = cgroup_next_sibling(pos);
+		if (next)
 			return next;
-
 		pos = pos->parent;
 	}
 
@@ -3068,6 +3071,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre);
  * Return the rightmost descendant of @pos.  If there's no descendant,
  * @pos is returned.  This can be used during pre-order traversal to skip
  * subtree of @pos.
+ *
+ * While this function requires RCU read locking, it doesn't require the
+ * whole traversal to be contained in a single RCU critical section.  This
+ * function will return the correct rightmost descendant as long as @pos is
+ * accessible.
  */
 struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos)
 {
@@ -3107,6 +3115,11 @@ static struct cgroup *cgroup_leftmost_descendant(struct cgroup *pos)
  *
  * To be used by cgroup_for_each_descendant_post().  Find the next
  * descendant to visit for post-order traversal of @cgroup's descendants.
+ *
+ * While this function requires RCU read locking, it doesn't require the
+ * whole traversal to be contained in a single RCU critical section.  This
+ * function will return the correct next descendant as long as both @pos
+ * and @cgroup are accessible and @pos is a descendant of @cgroup.
  */
 struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
 					   struct cgroup *cgroup)
@@ -3122,8 +3135,8 @@ struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
 	}
 
 	/* if there's an unvisited sibling, visit its leftmost descendant */
-	next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
-	if (&next->sibling != &pos->parent->children)
+	next = cgroup_next_sibling(pos);
+	if (next)
 		return cgroup_leftmost_descendant(next);
 
 	/* no sibling left, visit parent */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 5/5] device_cgroup: simplify cgroup tree walk in propagate_exception()
       [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (7 preceding siblings ...)
  2013-05-21  1:50   ` Tejun Heo
@ 2013-05-21  1:50   ` Tejun Heo
  2013-05-21  1:50   ` Tejun Heo
                     ` (3 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2013-05-21  1:50 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

During a config change, propagate_exception() needs to traverse the
subtree to update config on the subtree.  Because such config updates
need to allocate memory, it couldn't directly use
cgroup_for_each_descendant_pre() which required the whole iteration to
be contained in a single RCU read critical section.  To work around
the limitation, propagate_exception() built a linked list of
descendant cgroups while read-locking RCU and then walked the list
afterwards, which is safe as the whole iteration is protected by
devcgroup_mutex.  This works but is cumbersome.

With the recent updates, cgroup iterators now allow dropping RCU read
lock while iteration is in progress making this workaround no longer
necessary.  This patch replaces dev_cgroup->propagate_pending list and
get_online_devcg() with direct cgroup_for_each_descendant_pre() walk.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 security/device_cgroup.c | 56 ++++++++++++++++--------------------------------
 1 file changed, 18 insertions(+), 38 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index dd0dc57..e8aad69 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -49,8 +49,6 @@ struct dev_cgroup {
 	struct cgroup_subsys_state css;
 	struct list_head exceptions;
 	enum devcg_behavior behavior;
-	/* temporary list for pending propagation operations */
-	struct list_head propagate_pending;
 };
 
 static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s)
@@ -241,7 +239,6 @@ static struct cgroup_subsys_state *devcgroup_css_alloc(struct cgroup *cgroup)
 	if (!dev_cgroup)
 		return ERR_PTR(-ENOMEM);
 	INIT_LIST_HEAD(&dev_cgroup->exceptions);
-	INIT_LIST_HEAD(&dev_cgroup->propagate_pending);
 	dev_cgroup->behavior = DEVCG_DEFAULT_NONE;
 
 	return &dev_cgroup->css;
@@ -445,34 +442,6 @@ static void revalidate_active_exceptions(struct dev_cgroup *devcg)
 }
 
 /**
- * get_online_devcg - walks the cgroup tree and fills a list with the online
- * 		      groups
- * @root: cgroup used as starting point
- * @online: list that will be filled with online groups
- *
- * Must be called with devcgroup_mutex held. Grabs RCU lock.
- * Because devcgroup_mutex is held, no devcg will become online or offline
- * during the tree walk (see devcgroup_online, devcgroup_offline)
- * A separated list is needed because propagate_behavior() and
- * propagate_exception() need to allocate memory and can block.
- */
-static void get_online_devcg(struct cgroup *root, struct list_head *online)
-{
-	struct cgroup *pos;
-	struct dev_cgroup *devcg;
-
-	lockdep_assert_held(&devcgroup_mutex);
-
-	rcu_read_lock();
-	cgroup_for_each_descendant_pre(pos, root) {
-		devcg = cgroup_to_devcgroup(pos);
-		if (is_devcg_online(devcg))
-			list_add_tail(&devcg->propagate_pending, online);
-	}
-	rcu_read_unlock();
-}
-
-/**
  * propagate_exception - propagates a new exception to the children
  * @devcg_root: device cgroup that added a new exception
  * @ex: new exception to be propagated
@@ -482,15 +451,24 @@ static void get_online_devcg(struct cgroup *root, struct list_head *online)
 static int propagate_exception(struct dev_cgroup *devcg_root,
 			       struct dev_exception_item *ex)
 {
-	struct cgroup *root = devcg_root->css.cgroup;
-	struct dev_cgroup *devcg, *parent, *tmp;
+	struct cgroup *root = devcg_root->css.cgroup, *pos;
 	int rc = 0;
-	LIST_HEAD(pending);
 
-	get_online_devcg(root, &pending);
+	rcu_read_lock();
 
-	list_for_each_entry_safe(devcg, tmp, &pending, propagate_pending) {
-		parent = cgroup_to_devcgroup(devcg->css.cgroup->parent);
+	cgroup_for_each_descendant_pre(pos, root) {
+		struct dev_cgroup *devcg = cgroup_to_devcgroup(pos);
+
+		/*
+		 * Because devcgroup_mutex is held, no devcg will become
+		 * online or offline during the tree walk (see on/offline
+		 * methods), and online ones are safe to access outside RCU
+		 * read lock without bumping refcnt.
+		 */
+		if (!is_devcg_online(devcg))
+			continue;
+
+		rcu_read_unlock();
 
 		/*
 		 * in case both root's behavior and devcg is allow, a new
@@ -512,8 +490,10 @@ static int propagate_exception(struct dev_cgroup *devcg_root,
 		}
 		revalidate_active_exceptions(devcg);
 
-		list_del_init(&devcg->propagate_pending);
+		rcu_read_lock();
 	}
+
+	rcu_read_unlock();
 	return rc;
 }
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 5/5] device_cgroup: simplify cgroup tree walk in propagate_exception()
       [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (8 preceding siblings ...)
  2013-05-21  1:50   ` [PATCH 5/5] device_cgroup: simplify cgroup tree walk in propagate_exception() Tejun Heo
@ 2013-05-21  1:50   ` Tejun Heo
       [not found]     ` <1369101025-28335-6-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-05-21  3:20   ` [PATCHSET] cgroup: allow dropping RCU read lock while iterating Tejun Heo
                     ` (2 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2013-05-21  1:50 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	hannes-druUgvl0LCNAfugRpC6u6w,
	bsingharora-Re5JQEeQqe8AvxtiuMwx3w,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Tejun Heo

During a config change, propagate_exception() needs to traverse the
subtree to update config on the subtree.  Because such config updates
need to allocate memory, it couldn't directly use
cgroup_for_each_descendant_pre() which required the whole iteration to
be contained in a single RCU read critical section.  To work around
the limitation, propagate_exception() built a linked list of
descendant cgroups while read-locking RCU and then walked the list
afterwards, which is safe as the whole iteration is protected by
devcgroup_mutex.  This works but is cumbersome.

With the recent updates, cgroup iterators now allow dropping RCU read
lock while iteration is in progress making this workaround no longer
necessary.  This patch replaces dev_cgroup->propagate_pending list and
get_online_devcg() with direct cgroup_for_each_descendant_pre() walk.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 security/device_cgroup.c | 56 ++++++++++++++++--------------------------------
 1 file changed, 18 insertions(+), 38 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index dd0dc57..e8aad69 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -49,8 +49,6 @@ struct dev_cgroup {
 	struct cgroup_subsys_state css;
 	struct list_head exceptions;
 	enum devcg_behavior behavior;
-	/* temporary list for pending propagation operations */
-	struct list_head propagate_pending;
 };
 
 static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s)
@@ -241,7 +239,6 @@ static struct cgroup_subsys_state *devcgroup_css_alloc(struct cgroup *cgroup)
 	if (!dev_cgroup)
 		return ERR_PTR(-ENOMEM);
 	INIT_LIST_HEAD(&dev_cgroup->exceptions);
-	INIT_LIST_HEAD(&dev_cgroup->propagate_pending);
 	dev_cgroup->behavior = DEVCG_DEFAULT_NONE;
 
 	return &dev_cgroup->css;
@@ -445,34 +442,6 @@ static void revalidate_active_exceptions(struct dev_cgroup *devcg)
 }
 
 /**
- * get_online_devcg - walks the cgroup tree and fills a list with the online
- * 		      groups
- * @root: cgroup used as starting point
- * @online: list that will be filled with online groups
- *
- * Must be called with devcgroup_mutex held. Grabs RCU lock.
- * Because devcgroup_mutex is held, no devcg will become online or offline
- * during the tree walk (see devcgroup_online, devcgroup_offline)
- * A separated list is needed because propagate_behavior() and
- * propagate_exception() need to allocate memory and can block.
- */
-static void get_online_devcg(struct cgroup *root, struct list_head *online)
-{
-	struct cgroup *pos;
-	struct dev_cgroup *devcg;
-
-	lockdep_assert_held(&devcgroup_mutex);
-
-	rcu_read_lock();
-	cgroup_for_each_descendant_pre(pos, root) {
-		devcg = cgroup_to_devcgroup(pos);
-		if (is_devcg_online(devcg))
-			list_add_tail(&devcg->propagate_pending, online);
-	}
-	rcu_read_unlock();
-}
-
-/**
  * propagate_exception - propagates a new exception to the children
  * @devcg_root: device cgroup that added a new exception
  * @ex: new exception to be propagated
@@ -482,15 +451,24 @@ static void get_online_devcg(struct cgroup *root, struct list_head *online)
 static int propagate_exception(struct dev_cgroup *devcg_root,
 			       struct dev_exception_item *ex)
 {
-	struct cgroup *root = devcg_root->css.cgroup;
-	struct dev_cgroup *devcg, *parent, *tmp;
+	struct cgroup *root = devcg_root->css.cgroup, *pos;
 	int rc = 0;
-	LIST_HEAD(pending);
 
-	get_online_devcg(root, &pending);
+	rcu_read_lock();
 
-	list_for_each_entry_safe(devcg, tmp, &pending, propagate_pending) {
-		parent = cgroup_to_devcgroup(devcg->css.cgroup->parent);
+	cgroup_for_each_descendant_pre(pos, root) {
+		struct dev_cgroup *devcg = cgroup_to_devcgroup(pos);
+
+		/*
+		 * Because devcgroup_mutex is held, no devcg will become
+		 * online or offline during the tree walk (see on/offline
+		 * methods), and online ones are safe to access outside RCU
+		 * read lock without bumping refcnt.
+		 */
+		if (!is_devcg_online(devcg))
+			continue;
+
+		rcu_read_unlock();
 
 		/*
 		 * in case both root's behavior and devcg is allow, a new
@@ -512,8 +490,10 @@ static int propagate_exception(struct dev_cgroup *devcg_root,
 		}
 		revalidate_active_exceptions(devcg);
 
-		list_del_init(&devcg->propagate_pending);
+		rcu_read_lock();
 	}
+
+	rcu_read_unlock();
 	return rc;
 }
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCHSET] cgroup: allow dropping RCU read lock while iterating
       [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (10 preceding siblings ...)
  2013-05-21  3:20   ` [PATCHSET] cgroup: allow dropping RCU read lock while iterating Tejun Heo
@ 2013-05-21  3:20   ` Tejun Heo
  2013-05-22 14:53   ` Aristeu Rozanski
  12 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2013-05-21  3:20 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA

(oops, used Serge's old address, using the new canonical one)

Serge, the original thread can be read from

 http://thread.gmane.org/gmane.linux.kernel.containers/26037

Li, the cpuset patchset you posted a couple weeks ago should be able
to make use of this, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHSET] cgroup: allow dropping RCU read lock while iterating
       [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (9 preceding siblings ...)
  2013-05-21  1:50   ` Tejun Heo
@ 2013-05-21  3:20   ` Tejun Heo
  2013-05-21  3:20   ` Tejun Heo
  2013-05-22 14:53   ` Aristeu Rozanski
  12 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2013-05-21  3:20 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA,
	serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw, mhocko-AlSwsSmVLrQ,
	hannes-druUgvl0LCNAfugRpC6u6w,
	bsingharora-Re5JQEeQqe8AvxtiuMwx3w,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A

(oops, used Serge's old address, using the new canonical one)

Serge, the original thread can be read from

 http://thread.gmane.org/gmane.linux.kernel.containers/26037

Li, the cpuset patchset you posted a couple weeks ago should be able
to make use of this, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
       [not found]     ` <1369101025-28335-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-05-21 14:33       ` Serge Hallyn
@ 2013-05-21 14:33       ` Serge Hallyn
  2013-05-22 14:36       ` Aristeu Rozanski
  2013-05-22 18:41       ` Michal Hocko
  3 siblings, 0 replies; 32+ messages in thread
From: Serge Hallyn @ 2013-05-21 14:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: mhocko-AlSwsSmVLrQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hannes-druUgvl0LCNAfugRpC6u6w

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Currently, there's no easy way to find out the next sibling cgroup
> unless it's known that the current cgroup is accessed from the
> parent's children list in a single RCU critical section.  This in turn
> forces all iterators to require whole iteration to be enclosed in a
> single RCU critical section, which sometimes is too restrictive.  This
> patch implements cgroup_next_sibling() which can reliably determine
> the next sibling regardless of the state of the current cgroup as long
> as it's accessible.
> 
> It currently is impossible to determine the next sibling after
> dropping RCU read lock because the cgroup being iterated could be
> removed anytime and if RCU read lock is dropped, nothing guarantess
> its ->sibling.next pointer is accessible.  A removed cgroup would
> continue to point to its next sibling for RCU accesses but stop
> receiving updates from the sibling.  IOW, the next sibling could be
> removed and then complete its grace period while RCU read lock is
> dropped, making it unsafe to dereference ->sibling.next after dropping
> and re-acquiring RCU read lock.
> 
> This can be solved by adding a way to traverse to the next sibling
> without dereferencing ->sibling.next.  This patch adds a monotonically
> increasing cgroup serial number, cgroup->serial_nr, which guarantees
> that all cgroup->children lists are kept in increasing serial_nr
> order.  A new function, cgroup_next_sibling(), is implemented, which,
> if CGRP_REMOVED is not set on the current cgroup, follows
> ->sibling.next; otherwise, traverses the parent's ->children list
> until it sees a sibling with higher ->serial_nr.
> 
> This allows the function to always return the next sibling regardless
> of the state of the current cgroup without adding overhead in the fast
> path.
> 
> Further patches will update the iterators to use cgroup_next_sibling()
> so that they allow dropping RCU read lock and blocking while iteration
> is in progress which in turn will be used to simplify controllers.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Looks sensible, thanks.

(ultra-nit: typo @ "reomval")

Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

> ---
>  include/linux/cgroup.h | 10 ++++++++
>  kernel/cgroup.c        | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 72 insertions(+)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 8d9f3c9..ee041a0 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -189,6 +189,14 @@ struct cgroup {
>  	struct dentry *dentry;		/* cgroup fs entry, RCU protected */
>  
>  	/*
> +	 * Monotonically increasing unique serial number which defines a
> +	 * uniform order among all cgroups.  It's guaranteed that all
> +	 * ->children lists are in the ascending order of ->serial_nr.
> +	 * It's used to allow interrupting and resuming iterations.
> +	 */
> +	u64 serial_nr;
> +
> +	/*
>  	 * This is a copy of dentry->d_name, and it's needed because
>  	 * we can't use dentry->d_name in cgroup_path().
>  	 *
> @@ -675,6 +683,8 @@ static inline struct cgroup* task_cgroup(struct task_struct *task,
>  	return task_subsys_state(task, subsys_id)->cgroup;
>  }
>  
> +struct cgroup *cgroup_next_sibling(struct cgroup *pos);
> +
>  /**
>   * cgroup_for_each_child - iterate through children of a cgroup
>   * @pos: the cgroup * to use as the loop cursor
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 3222f93..bc757d7 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2975,6 +2975,55 @@ static void cgroup_enable_task_cg_lists(void)
>  }
>  
>  /**
> + * cgroup_next_sibling - find the next sibling of a given cgroup
> + * @pos: the current cgroup
> + *
> + * This function returns the next sibling of @pos and should be called
> + * under RCU read lock.  The only requirement is that @pos is accessible.
> + * The next sibling is guaranteed to be returned regardless of @pos's
> + * state.
> + */
> +struct cgroup *cgroup_next_sibling(struct cgroup *pos)
> +{
> +	struct cgroup *next;
> +
> +	WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +	/*
> +	 * @pos could already have been removed.  Once a cgroup is removed,
> +	 * its ->sibling.next is no longer updated when its next sibling
> +	 * changes.  As CGRP_REMOVED is set on removal which is fully
> +	 * serialized, if we see it unasserted, it's guaranteed that the
> +	 * next sibling hasn't finished its grace period even if it's
> +	 * already removed, and thus safe to dereference from this RCU
> +	 * critical section.  If ->sibling.next is inaccessible,
> +	 * cgroup_is_removed() is guaranteed to be visible as %true here.
> +	 */
> +	if (likely(!cgroup_is_removed(pos))) {
> +		next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
> +		if (&next->sibling != &pos->parent->children)
> +			return next;
> +		return NULL;
> +	}
> +
> +	/*
> +	 * Can't dereference the next pointer.  Each cgroup is given a
> +	 * monotonically increasing unique serial number and always
> +	 * appended to the sibling list, so the next one can be found by
> +	 * walking the parent's children until we see a cgroup with higher
> +	 * serial number than @pos's.
> +	 *
> +	 * While this path can be slow, it's taken only when either the
> +	 * current cgroup is removed or iteration and reomval race.
> +	 */
> +	list_for_each_entry_rcu(next, &pos->parent->children, sibling)
> +		if (next->serial_nr > pos->serial_nr)
> +			return next;
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(cgroup_next_sibling);
> +
> +/**
>   * cgroup_next_descendant_pre - find the next descendant for pre-order walk
>   * @pos: the current position (%NULL to initiate traversal)
>   * @cgroup: cgroup whose descendants to walk
> @@ -4136,6 +4185,7 @@ static void offline_css(struct cgroup_subsys *ss, struct cgroup *cgrp)
>  static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
>  			     umode_t mode)
>  {
> +	static atomic64_t serial_nr_cursor = ATOMIC64_INIT(0);
>  	struct cgroup *cgrp;
>  	struct cgroup_name *name;
>  	struct cgroupfs_root *root = parent->root;
> @@ -4216,6 +4266,14 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
>  		goto err_free_all;
>  	lockdep_assert_held(&dentry->d_inode->i_mutex);
>  
> +	/*
> +	 * Assign a monotonically increasing serial number.  With the list
> +	 * appending below, it guarantees that sibling cgroups are always
> +	 * sorted in the ascending serial number order on the parent's
> +	 * ->children.
> +	 */
> +	cgrp->serial_nr = atomic64_inc_return(&serial_nr_cursor);
> +
>  	/* allocation complete, commit to creation */
>  	list_add_tail(&cgrp->allcg_node, &root->allcg_list);
>  	list_add_tail_rcu(&cgrp->sibling, &cgrp->parent->children);
> @@ -4303,6 +4361,10 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
>  	 * removed.  This makes future css_tryget() and child creation
>  	 * attempts fail thus maintaining the removal conditions verified
>  	 * above.
> +	 *
> +	 * Note that CGRP_REMVOED clearing is depended upon by
> +	 * cgroup_next_sibling() to resume iteration after dropping RCU
> +	 * read lock.  See cgroup_next_sibling() for details.
>  	 */
>  	for_each_subsys(cgrp->root, ss) {
>  		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> -- 
> 1.8.1.4
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
       [not found]     ` <1369101025-28335-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-05-21 14:33       ` Serge Hallyn
  2013-05-21 14:33       ` Serge Hallyn
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Serge Hallyn @ 2013-05-21 14:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Currently, there's no easy way to find out the next sibling cgroup
> unless it's known that the current cgroup is accessed from the
> parent's children list in a single RCU critical section.  This in turn
> forces all iterators to require whole iteration to be enclosed in a
> single RCU critical section, which sometimes is too restrictive.  This
> patch implements cgroup_next_sibling() which can reliably determine
> the next sibling regardless of the state of the current cgroup as long
> as it's accessible.
> 
> It currently is impossible to determine the next sibling after
> dropping RCU read lock because the cgroup being iterated could be
> removed anytime and if RCU read lock is dropped, nothing guarantess
> its ->sibling.next pointer is accessible.  A removed cgroup would
> continue to point to its next sibling for RCU accesses but stop
> receiving updates from the sibling.  IOW, the next sibling could be
> removed and then complete its grace period while RCU read lock is
> dropped, making it unsafe to dereference ->sibling.next after dropping
> and re-acquiring RCU read lock.
> 
> This can be solved by adding a way to traverse to the next sibling
> without dereferencing ->sibling.next.  This patch adds a monotonically
> increasing cgroup serial number, cgroup->serial_nr, which guarantees
> that all cgroup->children lists are kept in increasing serial_nr
> order.  A new function, cgroup_next_sibling(), is implemented, which,
> if CGRP_REMOVED is not set on the current cgroup, follows
> ->sibling.next; otherwise, traverses the parent's ->children list
> until it sees a sibling with higher ->serial_nr.
> 
> This allows the function to always return the next sibling regardless
> of the state of the current cgroup without adding overhead in the fast
> path.
> 
> Further patches will update the iterators to use cgroup_next_sibling()
> so that they allow dropping RCU read lock and blocking while iteration
> is in progress which in turn will be used to simplify controllers.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Looks sensible, thanks.

(ultra-nit: typo @ "reomval")

Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

> ---
>  include/linux/cgroup.h | 10 ++++++++
>  kernel/cgroup.c        | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 72 insertions(+)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 8d9f3c9..ee041a0 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -189,6 +189,14 @@ struct cgroup {
>  	struct dentry *dentry;		/* cgroup fs entry, RCU protected */
>  
>  	/*
> +	 * Monotonically increasing unique serial number which defines a
> +	 * uniform order among all cgroups.  It's guaranteed that all
> +	 * ->children lists are in the ascending order of ->serial_nr.
> +	 * It's used to allow interrupting and resuming iterations.
> +	 */
> +	u64 serial_nr;
> +
> +	/*
>  	 * This is a copy of dentry->d_name, and it's needed because
>  	 * we can't use dentry->d_name in cgroup_path().
>  	 *
> @@ -675,6 +683,8 @@ static inline struct cgroup* task_cgroup(struct task_struct *task,
>  	return task_subsys_state(task, subsys_id)->cgroup;
>  }
>  
> +struct cgroup *cgroup_next_sibling(struct cgroup *pos);
> +
>  /**
>   * cgroup_for_each_child - iterate through children of a cgroup
>   * @pos: the cgroup * to use as the loop cursor
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 3222f93..bc757d7 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2975,6 +2975,55 @@ static void cgroup_enable_task_cg_lists(void)
>  }
>  
>  /**
> + * cgroup_next_sibling - find the next sibling of a given cgroup
> + * @pos: the current cgroup
> + *
> + * This function returns the next sibling of @pos and should be called
> + * under RCU read lock.  The only requirement is that @pos is accessible.
> + * The next sibling is guaranteed to be returned regardless of @pos's
> + * state.
> + */
> +struct cgroup *cgroup_next_sibling(struct cgroup *pos)
> +{
> +	struct cgroup *next;
> +
> +	WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +	/*
> +	 * @pos could already have been removed.  Once a cgroup is removed,
> +	 * its ->sibling.next is no longer updated when its next sibling
> +	 * changes.  As CGRP_REMOVED is set on removal which is fully
> +	 * serialized, if we see it unasserted, it's guaranteed that the
> +	 * next sibling hasn't finished its grace period even if it's
> +	 * already removed, and thus safe to dereference from this RCU
> +	 * critical section.  If ->sibling.next is inaccessible,
> +	 * cgroup_is_removed() is guaranteed to be visible as %true here.
> +	 */
> +	if (likely(!cgroup_is_removed(pos))) {
> +		next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
> +		if (&next->sibling != &pos->parent->children)
> +			return next;
> +		return NULL;
> +	}
> +
> +	/*
> +	 * Can't dereference the next pointer.  Each cgroup is given a
> +	 * monotonically increasing unique serial number and always
> +	 * appended to the sibling list, so the next one can be found by
> +	 * walking the parent's children until we see a cgroup with higher
> +	 * serial number than @pos's.
> +	 *
> +	 * While this path can be slow, it's taken only when either the
> +	 * current cgroup is removed or iteration and reomval race.
> +	 */
> +	list_for_each_entry_rcu(next, &pos->parent->children, sibling)
> +		if (next->serial_nr > pos->serial_nr)
> +			return next;
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(cgroup_next_sibling);
> +
> +/**
>   * cgroup_next_descendant_pre - find the next descendant for pre-order walk
>   * @pos: the current position (%NULL to initiate traversal)
>   * @cgroup: cgroup whose descendants to walk
> @@ -4136,6 +4185,7 @@ static void offline_css(struct cgroup_subsys *ss, struct cgroup *cgrp)
>  static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
>  			     umode_t mode)
>  {
> +	static atomic64_t serial_nr_cursor = ATOMIC64_INIT(0);
>  	struct cgroup *cgrp;
>  	struct cgroup_name *name;
>  	struct cgroupfs_root *root = parent->root;
> @@ -4216,6 +4266,14 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
>  		goto err_free_all;
>  	lockdep_assert_held(&dentry->d_inode->i_mutex);
>  
> +	/*
> +	 * Assign a monotonically increasing serial number.  With the list
> +	 * appending below, it guarantees that sibling cgroups are always
> +	 * sorted in the ascending serial number order on the parent's
> +	 * ->children.
> +	 */
> +	cgrp->serial_nr = atomic64_inc_return(&serial_nr_cursor);
> +
>  	/* allocation complete, commit to creation */
>  	list_add_tail(&cgrp->allcg_node, &root->allcg_list);
>  	list_add_tail_rcu(&cgrp->sibling, &cgrp->parent->children);
> @@ -4303,6 +4361,10 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
>  	 * removed.  This makes future css_tryget() and child creation
>  	 * attempts fail thus maintaining the removal conditions verified
>  	 * above.
> +	 *
> +	 * Note that CGRP_REMVOED clearing is depended upon by
> +	 * cgroup_next_sibling() to resume iteration after dropping RCU
> +	 * read lock.  See cgroup_next_sibling() for details.
>  	 */
>  	for_each_subsys(cgrp->root, ss) {
>  		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> -- 
> 1.8.1.4
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling()
       [not found]     ` <1369101025-28335-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-05-21 22:31       ` Serge Hallyn
  2013-05-21 22:31       ` Serge Hallyn
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 32+ messages in thread
From: Serge Hallyn @ 2013-05-21 22:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hannes-druUgvl0LCNAfugRpC6u6w, mhocko-AlSwsSmVLrQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA, serue-r/Jw6+rmf7HQT0dZR+AlfA

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> This patch converts cgroup_for_each_child(),
> cgroup_next_descendant_pre/post() and thus
> cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling()
> instead of manually dereferencing ->sibling.next.
> 
> The only reason the iterators couldn't allow dropping RCU read lock
> while iteration is in progress was because they couldn't determine the
> next sibling safely once RCU read lock is dropped.  Using
> cgroup_next_sibling() removes that problem and enables all iterators
> to allow dropping RCU read lock in the middle.  Comments are updated
> accordingly.
> 
> This makes the iterators easier to use and will simplify controllers.
> 
> Note that @cgroup argument is renamed to @cgrp in
> cgroup_for_each_child() because it conflicts with "struct cgroup" used
> in the new macro body.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

I didn't test, but it looks correct, thanks.

Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

> ---
>  include/linux/cgroup.h | 18 ++++++++++++++----
>  kernel/cgroup.c        | 25 +++++++++++++++++++------
>  2 files changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index ee041a0..d0ad379 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -688,9 +688,9 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos);
>  /**
>   * cgroup_for_each_child - iterate through children of a cgroup
>   * @pos: the cgroup * to use as the loop cursor
> - * @cgroup: cgroup whose children to walk
> + * @cgrp: cgroup whose children to walk
>   *
> - * Walk @cgroup's children.  Must be called under rcu_read_lock().  A child
> + * Walk @cgrp's children.  Must be called under rcu_read_lock().  A child
>   * cgroup which hasn't finished ->css_online() or already has finished
>   * ->css_offline() may show up during traversal and it's each subsystem's
>   * responsibility to verify that each @pos is alive.
> @@ -698,9 +698,15 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos);
>   * If a subsystem synchronizes against the parent in its ->css_online() and
>   * before starting iterating, a cgroup which finished ->css_online() is
>   * guaranteed to be visible in the future iterations.
> + *
> + * It is allowed to temporarily drop RCU read lock during iteration.  The
> + * caller is responsible for ensuring that @pos remains accessible until
> + * the start of the next iteration by, for example, bumping the css refcnt.
>   */
> -#define cgroup_for_each_child(pos, cgroup)				\
> -	list_for_each_entry_rcu(pos, &(cgroup)->children, sibling)
> +#define cgroup_for_each_child(pos, cgrp)				\
> +	for ((pos) = list_first_or_null_rcu(&(cgrp)->children,		\
> +					    struct cgroup, sibling);	\
> +	     (pos); (pos) = cgroup_next_sibling((pos)))
>  
>  struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
>  					  struct cgroup *cgroup);
> @@ -759,6 +765,10 @@ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos);
>   * Alternatively, a subsystem may choose to use a single global lock to
>   * synchronize ->css_online() and ->css_offline() against tree-walking
>   * operations.
> + *
> + * It is allowed to temporarily drop RCU read lock during iteration.  The
> + * caller is responsible for ensuring that @pos remains accessible until
> + * the start of the next iteration by, for example, bumping the css refcnt.
>   */
>  #define cgroup_for_each_descendant_pre(pos, cgroup)			\
>  	for (pos = cgroup_next_descendant_pre(NULL, (cgroup)); (pos);	\
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index bc757d7..21b1ee4 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -3030,6 +3030,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_sibling);
>   *
>   * To be used by cgroup_for_each_descendant_pre().  Find the next
>   * descendant to visit for pre-order traversal of @cgroup's descendants.
> + *
> + * While this function requires RCU read locking, it doesn't require the
> + * whole traversal to be contained in a single RCU critical section.  This
> + * function will return the correct next descendant as long as both @pos
> + * and @cgroup are accessible and @pos is a descendant of @cgroup.
>   */
>  struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
>  					  struct cgroup *cgroup)
> @@ -3049,11 +3054,9 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
>  
>  	/* no child, visit my or the closest ancestor's next sibling */
>  	while (pos != cgroup) {
> -		next = list_entry_rcu(pos->sibling.next, struct cgroup,
> -				      sibling);
> -		if (&next->sibling != &pos->parent->children)
> +		next = cgroup_next_sibling(pos);
> +		if (next)
>  			return next;
> -
>  		pos = pos->parent;
>  	}
>  
> @@ -3068,6 +3071,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre);
>   * Return the rightmost descendant of @pos.  If there's no descendant,
>   * @pos is returned.  This can be used during pre-order traversal to skip
>   * subtree of @pos.
> + *
> + * While this function requires RCU read locking, it doesn't require the
> + * whole traversal to be contained in a single RCU critical section.  This
> + * function will return the correct rightmost descendant as long as @pos is
> + * accessible.
>   */
>  struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos)
>  {
> @@ -3107,6 +3115,11 @@ static struct cgroup *cgroup_leftmost_descendant(struct cgroup *pos)
>   *
>   * To be used by cgroup_for_each_descendant_post().  Find the next
>   * descendant to visit for post-order traversal of @cgroup's descendants.
> + *
> + * While this function requires RCU read locking, it doesn't require the
> + * whole traversal to be contained in a single RCU critical section.  This
> + * function will return the correct next descendant as long as both @pos
> + * and @cgroup are accessible and @pos is a descendant of @cgroup.
>   */
>  struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
>  					   struct cgroup *cgroup)
> @@ -3122,8 +3135,8 @@ struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
>  	}
>  
>  	/* if there's an unvisited sibling, visit its leftmost descendant */
> -	next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
> -	if (&next->sibling != &pos->parent->children)
> +	next = cgroup_next_sibling(pos);
> +	if (next)
>  		return cgroup_leftmost_descendant(next);
>  
>  	/* no sibling left, visit parent */
> -- 
> 1.8.1.4
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling()
       [not found]     ` <1369101025-28335-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-05-21 22:31       ` Serge Hallyn
@ 2013-05-21 22:31       ` Serge Hallyn
  2013-05-22  9:09       ` Li Zefan
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 32+ messages in thread
From: Serge Hallyn @ 2013-05-21 22:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> This patch converts cgroup_for_each_child(),
> cgroup_next_descendant_pre/post() and thus
> cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling()
> instead of manually dereferencing ->sibling.next.
> 
> The only reason the iterators couldn't allow dropping RCU read lock
> while iteration is in progress was because they couldn't determine the
> next sibling safely once RCU read lock is dropped.  Using
> cgroup_next_sibling() removes that problem and enables all iterators
> to allow dropping RCU read lock in the middle.  Comments are updated
> accordingly.
> 
> This makes the iterators easier to use and will simplify controllers.
> 
> Note that @cgroup argument is renamed to @cgrp in
> cgroup_for_each_child() because it conflicts with "struct cgroup" used
> in the new macro body.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

I didn't test, but it looks correct, thanks.

Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

> ---
>  include/linux/cgroup.h | 18 ++++++++++++++----
>  kernel/cgroup.c        | 25 +++++++++++++++++++------
>  2 files changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index ee041a0..d0ad379 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -688,9 +688,9 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos);
>  /**
>   * cgroup_for_each_child - iterate through children of a cgroup
>   * @pos: the cgroup * to use as the loop cursor
> - * @cgroup: cgroup whose children to walk
> + * @cgrp: cgroup whose children to walk
>   *
> - * Walk @cgroup's children.  Must be called under rcu_read_lock().  A child
> + * Walk @cgrp's children.  Must be called under rcu_read_lock().  A child
>   * cgroup which hasn't finished ->css_online() or already has finished
>   * ->css_offline() may show up during traversal and it's each subsystem's
>   * responsibility to verify that each @pos is alive.
> @@ -698,9 +698,15 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos);
>   * If a subsystem synchronizes against the parent in its ->css_online() and
>   * before starting iterating, a cgroup which finished ->css_online() is
>   * guaranteed to be visible in the future iterations.
> + *
> + * It is allowed to temporarily drop RCU read lock during iteration.  The
> + * caller is responsible for ensuring that @pos remains accessible until
> + * the start of the next iteration by, for example, bumping the css refcnt.
>   */
> -#define cgroup_for_each_child(pos, cgroup)				\
> -	list_for_each_entry_rcu(pos, &(cgroup)->children, sibling)
> +#define cgroup_for_each_child(pos, cgrp)				\
> +	for ((pos) = list_first_or_null_rcu(&(cgrp)->children,		\
> +					    struct cgroup, sibling);	\
> +	     (pos); (pos) = cgroup_next_sibling((pos)))
>  
>  struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
>  					  struct cgroup *cgroup);
> @@ -759,6 +765,10 @@ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos);
>   * Alternatively, a subsystem may choose to use a single global lock to
>   * synchronize ->css_online() and ->css_offline() against tree-walking
>   * operations.
> + *
> + * It is allowed to temporarily drop RCU read lock during iteration.  The
> + * caller is responsible for ensuring that @pos remains accessible until
> + * the start of the next iteration by, for example, bumping the css refcnt.
>   */
>  #define cgroup_for_each_descendant_pre(pos, cgroup)			\
>  	for (pos = cgroup_next_descendant_pre(NULL, (cgroup)); (pos);	\
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index bc757d7..21b1ee4 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -3030,6 +3030,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_sibling);
>   *
>   * To be used by cgroup_for_each_descendant_pre().  Find the next
>   * descendant to visit for pre-order traversal of @cgroup's descendants.
> + *
> + * While this function requires RCU read locking, it doesn't require the
> + * whole traversal to be contained in a single RCU critical section.  This
> + * function will return the correct next descendant as long as both @pos
> + * and @cgroup are accessible and @pos is a descendant of @cgroup.
>   */
>  struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
>  					  struct cgroup *cgroup)
> @@ -3049,11 +3054,9 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
>  
>  	/* no child, visit my or the closest ancestor's next sibling */
>  	while (pos != cgroup) {
> -		next = list_entry_rcu(pos->sibling.next, struct cgroup,
> -				      sibling);
> -		if (&next->sibling != &pos->parent->children)
> +		next = cgroup_next_sibling(pos);
> +		if (next)
>  			return next;
> -
>  		pos = pos->parent;
>  	}
>  
> @@ -3068,6 +3071,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre);
>   * Return the rightmost descendant of @pos.  If there's no descendant,
>   * @pos is returned.  This can be used during pre-order traversal to skip
>   * subtree of @pos.
> + *
> + * While this function requires RCU read locking, it doesn't require the
> + * whole traversal to be contained in a single RCU critical section.  This
> + * function will return the correct rightmost descendant as long as @pos is
> + * accessible.
>   */
>  struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos)
>  {
> @@ -3107,6 +3115,11 @@ static struct cgroup *cgroup_leftmost_descendant(struct cgroup *pos)
>   *
>   * To be used by cgroup_for_each_descendant_post().  Find the next
>   * descendant to visit for post-order traversal of @cgroup's descendants.
> + *
> + * While this function requires RCU read locking, it doesn't require the
> + * whole traversal to be contained in a single RCU critical section.  This
> + * function will return the correct next descendant as long as both @pos
> + * and @cgroup are accessible and @pos is a descendant of @cgroup.
>   */
>  struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
>  					   struct cgroup *cgroup)
> @@ -3122,8 +3135,8 @@ struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
>  	}
>  
>  	/* if there's an unvisited sibling, visit its leftmost descendant */
> -	next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
> -	if (&next->sibling != &pos->parent->children)
> +	next = cgroup_next_sibling(pos);
> +	if (next)
>  		return cgroup_leftmost_descendant(next);
>  
>  	/* no sibling left, visit parent */
> -- 
> 1.8.1.4
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/5] device_cgroup: simplify cgroup tree walk in propagate_exception()
       [not found]     ` <1369101025-28335-6-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-05-21 22:35       ` Serge Hallyn
  0 siblings, 0 replies; 32+ messages in thread
From: Serge Hallyn @ 2013-05-21 22:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hannes-druUgvl0LCNAfugRpC6u6w, mhocko-AlSwsSmVLrQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA, serue-r/Jw6+rmf7HQT0dZR+AlfA

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> During a config change, propagate_exception() needs to traverse the
> subtree to update config on the subtree.  Because such config updates
> need to allocate memory, it couldn't directly use
> cgroup_for_each_descendant_pre() which required the whole iteration to
> be contained in a single RCU read critical section.  To work around
> the limitation, propagate_exception() built a linked list of
> descendant cgroups while read-locking RCU and then walked the list
> afterwards, which is safe as the whole iteration is protected by
> devcgroup_mutex.  This works but is cumbersome.
> 
> With the recent updates, cgroup iterators now allow dropping RCU read
> lock while iteration is in progress making this workaround no longer
> necessary.  This patch replaces dev_cgroup->propagate_pending list and
> get_online_devcg() with direct cgroup_for_each_descendant_pre() walk.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

> ---
>  security/device_cgroup.c | 56 ++++++++++++++++--------------------------------
>  1 file changed, 18 insertions(+), 38 deletions(-)
> 
> diff --git a/security/device_cgroup.c b/security/device_cgroup.c
> index dd0dc57..e8aad69 100644
> --- a/security/device_cgroup.c
> +++ b/security/device_cgroup.c
> @@ -49,8 +49,6 @@ struct dev_cgroup {
>  	struct cgroup_subsys_state css;
>  	struct list_head exceptions;
>  	enum devcg_behavior behavior;
> -	/* temporary list for pending propagation operations */
> -	struct list_head propagate_pending;
>  };
>  
>  static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s)
> @@ -241,7 +239,6 @@ static struct cgroup_subsys_state *devcgroup_css_alloc(struct cgroup *cgroup)
>  	if (!dev_cgroup)
>  		return ERR_PTR(-ENOMEM);
>  	INIT_LIST_HEAD(&dev_cgroup->exceptions);
> -	INIT_LIST_HEAD(&dev_cgroup->propagate_pending);
>  	dev_cgroup->behavior = DEVCG_DEFAULT_NONE;
>  
>  	return &dev_cgroup->css;
> @@ -445,34 +442,6 @@ static void revalidate_active_exceptions(struct dev_cgroup *devcg)
>  }
>  
>  /**
> - * get_online_devcg - walks the cgroup tree and fills a list with the online
> - * 		      groups
> - * @root: cgroup used as starting point
> - * @online: list that will be filled with online groups
> - *
> - * Must be called with devcgroup_mutex held. Grabs RCU lock.
> - * Because devcgroup_mutex is held, no devcg will become online or offline
> - * during the tree walk (see devcgroup_online, devcgroup_offline)
> - * A separated list is needed because propagate_behavior() and
> - * propagate_exception() need to allocate memory and can block.
> - */
> -static void get_online_devcg(struct cgroup *root, struct list_head *online)
> -{
> -	struct cgroup *pos;
> -	struct dev_cgroup *devcg;
> -
> -	lockdep_assert_held(&devcgroup_mutex);
> -
> -	rcu_read_lock();
> -	cgroup_for_each_descendant_pre(pos, root) {
> -		devcg = cgroup_to_devcgroup(pos);
> -		if (is_devcg_online(devcg))
> -			list_add_tail(&devcg->propagate_pending, online);
> -	}
> -	rcu_read_unlock();
> -}
> -
> -/**
>   * propagate_exception - propagates a new exception to the children
>   * @devcg_root: device cgroup that added a new exception
>   * @ex: new exception to be propagated
> @@ -482,15 +451,24 @@ static void get_online_devcg(struct cgroup *root, struct list_head *online)
>  static int propagate_exception(struct dev_cgroup *devcg_root,
>  			       struct dev_exception_item *ex)
>  {
> -	struct cgroup *root = devcg_root->css.cgroup;
> -	struct dev_cgroup *devcg, *parent, *tmp;
> +	struct cgroup *root = devcg_root->css.cgroup, *pos;
>  	int rc = 0;
> -	LIST_HEAD(pending);
>  
> -	get_online_devcg(root, &pending);
> +	rcu_read_lock();
>  
> -	list_for_each_entry_safe(devcg, tmp, &pending, propagate_pending) {
> -		parent = cgroup_to_devcgroup(devcg->css.cgroup->parent);
> +	cgroup_for_each_descendant_pre(pos, root) {
> +		struct dev_cgroup *devcg = cgroup_to_devcgroup(pos);
> +
> +		/*
> +		 * Because devcgroup_mutex is held, no devcg will become
> +		 * online or offline during the tree walk (see on/offline
> +		 * methods), and online ones are safe to access outside RCU
> +		 * read lock without bumping refcnt.
> +		 */
> +		if (!is_devcg_online(devcg))
> +			continue;
> +
> +		rcu_read_unlock();
>  
>  		/*
>  		 * in case both root's behavior and devcg is allow, a new
> @@ -512,8 +490,10 @@ static int propagate_exception(struct dev_cgroup *devcg_root,
>  		}
>  		revalidate_active_exceptions(devcg);
>  
> -		list_del_init(&devcg->propagate_pending);
> +		rcu_read_lock();
>  	}
> +
> +	rcu_read_unlock();
>  	return rc;
>  }
>  
> -- 
> 1.8.1.4
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling()
       [not found]     ` <1369101025-28335-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                         ` (2 preceding siblings ...)
  2013-05-22  9:09       ` Li Zefan
@ 2013-05-22  9:09       ` Li Zefan
  2013-05-22 18:46       ` Michal Hocko
  4 siblings, 0 replies; 32+ messages in thread
From: Li Zefan @ 2013-05-22  9:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA

> @@ -3068,6 +3071,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre);
>   * Return the rightmost descendant of @pos.  If there's no descendant,
>   * @pos is returned.  This can be used during pre-order traversal to skip
>   * subtree of @pos.
> + *
> + * While this function requires RCU read locking, it doesn't require the
> + * whole traversal to be contained in a single RCU critical section.  This
> + * function will return the correct rightmost descendant as long as @pos is
> + * accessible.
>   */
>  struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos)
>  {

Forgot to convert cgroup_rightmost_descendat() to use cgroup_next_sibling()?

        do {
                last = pos;
                /* ->prev isn't RCU safe, walk ->next till the end */
                pos = NULL;
                list_for_each_entry_rcu(tmp, &last->children, sibling)
                        pos = tmp;
        } while (pos);

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling()
       [not found]     ` <1369101025-28335-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-05-21 22:31       ` Serge Hallyn
  2013-05-21 22:31       ` Serge Hallyn
@ 2013-05-22  9:09       ` Li Zefan
       [not found]         ` <519C8B2E.5040606-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
  2013-05-22  9:09       ` Li Zefan
  2013-05-22 18:46       ` Michal Hocko
  4 siblings, 1 reply; 32+ messages in thread
From: Li Zefan @ 2013-05-22  9:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	hannes-druUgvl0LCNAfugRpC6u6w,
	bsingharora-Re5JQEeQqe8AvxtiuMwx3w,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A

> @@ -3068,6 +3071,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre);
>   * Return the rightmost descendant of @pos.  If there's no descendant,
>   * @pos is returned.  This can be used during pre-order traversal to skip
>   * subtree of @pos.
> + *
> + * While this function requires RCU read locking, it doesn't require the
> + * whole traversal to be contained in a single RCU critical section.  This
> + * function will return the correct rightmost descendant as long as @pos is
> + * accessible.
>   */
>  struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos)
>  {

Forgot to convert cgroup_rightmost_descendat() to use cgroup_next_sibling()?

        do {
                last = pos;
                /* ->prev isn't RCU safe, walk ->next till the end */
                pos = NULL;
                list_for_each_entry_rcu(tmp, &last->children, sibling)
                        pos = tmp;
        } while (pos);



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling()
       [not found]         ` <519C8B2E.5040606-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
@ 2013-05-22  9:17           ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2013-05-22  9:17 UTC (permalink / raw)
  To: Li Zefan
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, May 22, 2013 at 05:09:02PM +0800, Li Zefan wrote:
> > @@ -3068,6 +3071,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre);
> >   * Return the rightmost descendant of @pos.  If there's no descendant,
> >   * @pos is returned.  This can be used during pre-order traversal to skip
> >   * subtree of @pos.
> > + *
> > + * While this function requires RCU read locking, it doesn't require the
> > + * whole traversal to be contained in a single RCU critical section.  This
> > + * function will return the correct rightmost descendant as long as @pos is
> > + * accessible.
> >   */
> >  struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos)
> >  {
> 
> Forgot to convert cgroup_rightmost_descendat() to use cgroup_next_sibling()?
> 
>         do {
>                 last = pos;
>                 /* ->prev isn't RCU safe, walk ->next till the end */
>                 pos = NULL;
>                 list_for_each_entry_rcu(tmp, &last->children, sibling)
>                         pos = tmp;
>         } while (pos);

It's always walking from the parent->children with RCU read locked, so
it doesn't need to be converted.  We can still convert it for
consistency and I had that in the patch originally but dropped it as
the conversion seemed a bit misleading as it covers a case which can
never happen.  Hmmm....

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
       [not found]     ` <1369101025-28335-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-05-21 14:33       ` Serge Hallyn
  2013-05-21 14:33       ` Serge Hallyn
@ 2013-05-22 14:36       ` Aristeu Rozanski
       [not found]         ` <20130522143636.GC16739-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2013-05-22 18:41       ` Michal Hocko
  3 siblings, 1 reply; 32+ messages in thread
From: Aristeu Rozanski @ 2013-05-22 14:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Hi Tejun,
On Tue, May 21, 2013 at 10:50:23AM +0900, Tejun Heo wrote:
> +	 * current cgroup is removed or iteration and reomval race.
                                                      ^^^^^^^
small typo

-- 
Aristeu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
       [not found]         ` <20130522143636.GC16739-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-05-22 14:38           ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2013-05-22 14:38 UTC (permalink / raw)
  To: Aristeu Rozanski
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, May 22, 2013 at 10:36:36AM -0400, Aristeu Rozanski wrote:
> Hi Tejun,
> On Tue, May 21, 2013 at 10:50:23AM +0900, Tejun Heo wrote:
> > +	 * current cgroup is removed or iteration and reomval race.
>                                                       ^^^^^^^
> small typo

Yeap, Serge spotted it too.  Updated.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHSET] cgroup: allow dropping RCU read lock while iterating
       [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (11 preceding siblings ...)
  2013-05-21  3:20   ` Tejun Heo
@ 2013-05-22 14:53   ` Aristeu Rozanski
  12 siblings, 0 replies; 32+ messages in thread
From: Aristeu Rozanski @ 2013-05-22 14:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Hi Tejun,
On Tue, May 21, 2013 at 10:50:20AM +0900, Tejun Heo wrote:
> Currently all cgroup iterators require the whole traversal to be
> contained in a single RCU read critical section, which can be too
> restrictive as there are times when blocking operations are necessary
> during traversal.  This forces controllers to implement specific
> workarounds in those cases - building separate iteration list, punting
> actual operations to work items and so on.
> 
> This patchset updates cgroup iterators so that they allow dropping RCU
> read lock while iteration is in progress so that controllers which
> require sleeping during iteration don't need to implement their own
> mechanisms.
> 
> Dropping RCU read lock during iteration is unsafe because
> cgroup->sibling.next can't be trusted once RCU read lock is dropped.
> The sibling list is a RCU list and when a cgroup is removed the next
> pointer is retained to keep RCU traversal working.  If the next
> sibling is removed while RCU read lock is dropped, the removed current
> cgroup's next won't be updated and the next sibling may complete its
> grace period and get freed leaving the next pointer dangling.
> 
> Working around the problem is relatiely simple.  Whether
> ->sibling.next can be trusted can be trusted can be decided by looking
> at CGRP_REMOVED - as cgroup removals are fully serialized, the flag is
> guaranteed to be visible before the next sibling finishes its grace
> period.  For those cases, each cgroup is assigned a monotonically
> increasing serial number.  Because new cgroups are always appeneded to
> the children list, it's guaranteed that all children list are sorted
> in the ascending order of the serial numbers.  When the next pointer
> can't be trusted, the next sibling can be located by walking the
> parent's children list from the beginning looking for the first cgroup
> with higher serial number.
> 
> The above is implemented in cgroup_next_sibling() and all iterators
> are updated to use it to find out the next sibling thus allowing
> droppping RCU read lock while iteration is in progress.  This patchset
> replaces separate iteration list in device_cgroup with direct
> descendant walk and there will be further patches making use of this
> update.
> 
> This patchset contains the following five patches.
> 
>  0001-cgroup-fix-a-subtle-bug-in-descendant-pre-order-walk.patch
>  0002-cgroup-make-cgroup_is_removed-static.patch
>  0003-cgroup-add-cgroup-serial_nr-and-implement-cgroup_nex.patch
>  0004-cgroup-update-iterators-to-use-cgroup_next_sibling.patch
>  0005-device_cgroup-simplify-cgroup-tree-walk-in-propagate.patch
> 
> 0001 fixes a subtle iteration bug.  Will be applied to for-3.10-fixes.
> 
> 0002 is a trivial prep patch.
> 
> 0003 implements cgroup_next_sibling() which can find out the next
> sibling regardless of the state of the current cgroup.
> 
> 0004 updates all iterators to use cgroup_next_sibling().
> 
> 0005 replaces iteration list work around in device_cgroup with direct
> iteration.
> 
> This patchset is on top of cgroup/for-3.11 23958e729e ("cgroup.h:
> remove some functions that are now gone") and available in the
> following git branch.
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-interruptible-iter

patchset looks good to me. ran some tests in a kernel with it without problems.

Acked-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
Aristeu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk
       [not found]     ` <1369101025-28335-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-05-22 18:22       ` Michal Hocko
@ 2013-05-22 18:22       ` Michal Hocko
  2013-05-24  1:51       ` Tejun Heo
  2 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2013-05-22 18:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	stable-u79uwXL29TY76Z2rM5mHXA, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue 21-05-13 10:50:21, Tejun Heo wrote:
> When cgroup_next_descendant_pre() initiates a walk, it checks whether
> the subtree root doesn't have any children and if not returns NULL.
> Later code assumes that the subtree isn't empty.  This is broken
> because the subtree may become empty inbetween, which can lead to the
> traversal escaping the subtree by walking to the sibling of the
> subtree root.
> 
> There's no reason to have the early exit path.  Remove it along with
> the later assumption that the subtree isn't empty.  This simplifies
> the code a bit and fixes the subtle bug.
> 
> While at it, fix the comment of cgroup_for_each_descendant_pre() which
> was incorrectly referring to ->css_offline() instead of
> ->css_online().
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Well spotted and looks good to me
Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>

> ---
>  include/linux/cgroup.h | 2 +-
>  kernel/cgroup.c        | 9 +++------
>  2 files changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4f6f513..1df5f69 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -709,7 +709,7 @@ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos);
>   *
>   * If a subsystem synchronizes against the parent in its ->css_online() and
>   * before starting iterating, and synchronizes against @pos on each
> - * iteration, any descendant cgroup which finished ->css_offline() is
> + * iteration, any descendant cgroup which finished ->css_online() is
>   * guaranteed to be visible in the future iterations.
>   *
>   * In other words, the following guarantees that a descendant can't escape
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 6b2b1d9..f20f80c 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2990,11 +2990,8 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
>  	WARN_ON_ONCE(!rcu_read_lock_held());
>  
>  	/* if first iteration, pretend we just visited @cgroup */
> -	if (!pos) {
> -		if (list_empty(&cgroup->children))
> -			return NULL;
> +	if (!pos)
>  		pos = cgroup;
> -	}
>  
>  	/* visit the first child if exists */
>  	next = list_first_or_null_rcu(&pos->children, struct cgroup, sibling);
> @@ -3002,14 +2999,14 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
>  		return next;
>  
>  	/* no child, visit my or the closest ancestor's next sibling */
> -	do {
> +	while (pos != cgroup) {
>  		next = list_entry_rcu(pos->sibling.next, struct cgroup,
>  				      sibling);
>  		if (&next->sibling != &pos->parent->children)
>  			return next;
>  
>  		pos = pos->parent;
> -	} while (pos != cgroup);
> +	}
>  
>  	return NULL;
>  }
> -- 
> 1.8.1.4
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk
       [not found]     ` <1369101025-28335-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-05-22 18:22       ` Michal Hocko
  2013-05-22 18:22       ` Michal Hocko
  2013-05-24  1:51       ` Tejun Heo
  2 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2013-05-22 18:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
	bsingharora-Re5JQEeQqe8AvxtiuMwx3w,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	stable-u79uwXL29TY76Z2rM5mHXA

On Tue 21-05-13 10:50:21, Tejun Heo wrote:
> When cgroup_next_descendant_pre() initiates a walk, it checks whether
> the subtree root doesn't have any children and if not returns NULL.
> Later code assumes that the subtree isn't empty.  This is broken
> because the subtree may become empty inbetween, which can lead to the
> traversal escaping the subtree by walking to the sibling of the
> subtree root.
> 
> There's no reason to have the early exit path.  Remove it along with
> the later assumption that the subtree isn't empty.  This simplifies
> the code a bit and fixes the subtle bug.
> 
> While at it, fix the comment of cgroup_for_each_descendant_pre() which
> was incorrectly referring to ->css_offline() instead of
> ->css_online().
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Well spotted and looks good to me
Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>

> ---
>  include/linux/cgroup.h | 2 +-
>  kernel/cgroup.c        | 9 +++------
>  2 files changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4f6f513..1df5f69 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -709,7 +709,7 @@ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos);
>   *
>   * If a subsystem synchronizes against the parent in its ->css_online() and
>   * before starting iterating, and synchronizes against @pos on each
> - * iteration, any descendant cgroup which finished ->css_offline() is
> + * iteration, any descendant cgroup which finished ->css_online() is
>   * guaranteed to be visible in the future iterations.
>   *
>   * In other words, the following guarantees that a descendant can't escape
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 6b2b1d9..f20f80c 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2990,11 +2990,8 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
>  	WARN_ON_ONCE(!rcu_read_lock_held());
>  
>  	/* if first iteration, pretend we just visited @cgroup */
> -	if (!pos) {
> -		if (list_empty(&cgroup->children))
> -			return NULL;
> +	if (!pos)
>  		pos = cgroup;
> -	}
>  
>  	/* visit the first child if exists */
>  	next = list_first_or_null_rcu(&pos->children, struct cgroup, sibling);
> @@ -3002,14 +2999,14 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
>  		return next;
>  
>  	/* no child, visit my or the closest ancestor's next sibling */
> -	do {
> +	while (pos != cgroup) {
>  		next = list_entry_rcu(pos->sibling.next, struct cgroup,
>  				      sibling);
>  		if (&next->sibling != &pos->parent->children)
>  			return next;
>  
>  		pos = pos->parent;
> -	} while (pos != cgroup);
> +	}
>  
>  	return NULL;
>  }
> -- 
> 1.8.1.4
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
       [not found]     ` <1369101025-28335-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                         ` (2 preceding siblings ...)
  2013-05-22 14:36       ` Aristeu Rozanski
@ 2013-05-22 18:41       ` Michal Hocko
  3 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2013-05-22 18:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue 21-05-13 10:50:23, Tejun Heo wrote:
> Currently, there's no easy way to find out the next sibling cgroup
> unless it's known that the current cgroup is accessed from the
> parent's children list in a single RCU critical section.  This in turn
> forces all iterators to require whole iteration to be enclosed in a
> single RCU critical section, which sometimes is too restrictive.  This
> patch implements cgroup_next_sibling() which can reliably determine
> the next sibling regardless of the state of the current cgroup as long
> as it's accessible.
> 
> It currently is impossible to determine the next sibling after
> dropping RCU read lock because the cgroup being iterated could be
> removed anytime and if RCU read lock is dropped, nothing guarantess
> its ->sibling.next pointer is accessible.  A removed cgroup would
> continue to point to its next sibling for RCU accesses but stop
> receiving updates from the sibling.  IOW, the next sibling could be
> removed and then complete its grace period while RCU read lock is
> dropped, making it unsafe to dereference ->sibling.next after dropping
> and re-acquiring RCU read lock.
> 
> This can be solved by adding a way to traverse to the next sibling
> without dereferencing ->sibling.next.  This patch adds a monotonically
> increasing cgroup serial number, cgroup->serial_nr, which guarantees
> that all cgroup->children lists are kept in increasing serial_nr
> order.  A new function, cgroup_next_sibling(), is implemented, which,
> if CGRP_REMOVED is not set on the current cgroup, follows
> ->sibling.next; otherwise, traverses the parent's ->children list
> until it sees a sibling with higher ->serial_nr.
> 
> This allows the function to always return the next sibling regardless
> of the state of the current cgroup without adding overhead in the fast
> path.
> 
> Further patches will update the iterators to use cgroup_next_sibling()
> so that they allow dropping RCU read lock and blocking while iteration
> is in progress which in turn will be used to simplify controllers.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

OK, I was about to object that the given pos could be freed already but
I can see that the next patch documents that the caller has to make sure
that pos will not go away (by elevating css count) before rcu is
dropped.

Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>

> ---
>  include/linux/cgroup.h | 10 ++++++++
>  kernel/cgroup.c        | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 72 insertions(+)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 8d9f3c9..ee041a0 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -189,6 +189,14 @@ struct cgroup {
>  	struct dentry *dentry;		/* cgroup fs entry, RCU protected */
>  
>  	/*
> +	 * Monotonically increasing unique serial number which defines a
> +	 * uniform order among all cgroups.  It's guaranteed that all
> +	 * ->children lists are in the ascending order of ->serial_nr.
> +	 * It's used to allow interrupting and resuming iterations.
> +	 */
> +	u64 serial_nr;
> +
> +	/*
>  	 * This is a copy of dentry->d_name, and it's needed because
>  	 * we can't use dentry->d_name in cgroup_path().
>  	 *
> @@ -675,6 +683,8 @@ static inline struct cgroup* task_cgroup(struct task_struct *task,
>  	return task_subsys_state(task, subsys_id)->cgroup;
>  }
>  
> +struct cgroup *cgroup_next_sibling(struct cgroup *pos);
> +
>  /**
>   * cgroup_for_each_child - iterate through children of a cgroup
>   * @pos: the cgroup * to use as the loop cursor
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 3222f93..bc757d7 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2975,6 +2975,55 @@ static void cgroup_enable_task_cg_lists(void)
>  }
>  
>  /**
> + * cgroup_next_sibling - find the next sibling of a given cgroup
> + * @pos: the current cgroup
> + *
> + * This function returns the next sibling of @pos and should be called
> + * under RCU read lock.  The only requirement is that @pos is accessible.
> + * The next sibling is guaranteed to be returned regardless of @pos's
> + * state.
> + */
> +struct cgroup *cgroup_next_sibling(struct cgroup *pos)
> +{
> +	struct cgroup *next;
> +
> +	WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +	/*
> +	 * @pos could already have been removed.  Once a cgroup is removed,
> +	 * its ->sibling.next is no longer updated when its next sibling
> +	 * changes.  As CGRP_REMOVED is set on removal which is fully
> +	 * serialized, if we see it unasserted, it's guaranteed that the
> +	 * next sibling hasn't finished its grace period even if it's
> +	 * already removed, and thus safe to dereference from this RCU
> +	 * critical section.  If ->sibling.next is inaccessible,
> +	 * cgroup_is_removed() is guaranteed to be visible as %true here.
> +	 */
> +	if (likely(!cgroup_is_removed(pos))) {
> +		next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
> +		if (&next->sibling != &pos->parent->children)
> +			return next;
> +		return NULL;
> +	}
> +
> +	/*
> +	 * Can't dereference the next pointer.  Each cgroup is given a
> +	 * monotonically increasing unique serial number and always
> +	 * appended to the sibling list, so the next one can be found by
> +	 * walking the parent's children until we see a cgroup with higher
> +	 * serial number than @pos's.
> +	 *
> +	 * While this path can be slow, it's taken only when either the
> +	 * current cgroup is removed or iteration and reomval race.
> +	 */
> +	list_for_each_entry_rcu(next, &pos->parent->children, sibling)
> +		if (next->serial_nr > pos->serial_nr)
> +			return next;
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(cgroup_next_sibling);
> +
> +/**
>   * cgroup_next_descendant_pre - find the next descendant for pre-order walk
>   * @pos: the current position (%NULL to initiate traversal)
>   * @cgroup: cgroup whose descendants to walk
> @@ -4136,6 +4185,7 @@ static void offline_css(struct cgroup_subsys *ss, struct cgroup *cgrp)
>  static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
>  			     umode_t mode)
>  {
> +	static atomic64_t serial_nr_cursor = ATOMIC64_INIT(0);
>  	struct cgroup *cgrp;
>  	struct cgroup_name *name;
>  	struct cgroupfs_root *root = parent->root;
> @@ -4216,6 +4266,14 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
>  		goto err_free_all;
>  	lockdep_assert_held(&dentry->d_inode->i_mutex);
>  
> +	/*
> +	 * Assign a monotonically increasing serial number.  With the list
> +	 * appending below, it guarantees that sibling cgroups are always
> +	 * sorted in the ascending serial number order on the parent's
> +	 * ->children.
> +	 */
> +	cgrp->serial_nr = atomic64_inc_return(&serial_nr_cursor);
> +
>  	/* allocation complete, commit to creation */
>  	list_add_tail(&cgrp->allcg_node, &root->allcg_list);
>  	list_add_tail_rcu(&cgrp->sibling, &cgrp->parent->children);
> @@ -4303,6 +4361,10 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
>  	 * removed.  This makes future css_tryget() and child creation
>  	 * attempts fail thus maintaining the removal conditions verified
>  	 * above.
> +	 *
> +	 * Note that CGRP_REMVOED clearing is depended upon by
> +	 * cgroup_next_sibling() to resume iteration after dropping RCU
> +	 * read lock.  See cgroup_next_sibling() for details.
>  	 */
>  	for_each_subsys(cgrp->root, ss) {
>  		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> -- 
> 1.8.1.4
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling()
       [not found]     ` <1369101025-28335-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                         ` (3 preceding siblings ...)
  2013-05-22  9:09       ` Li Zefan
@ 2013-05-22 18:46       ` Michal Hocko
  4 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2013-05-22 18:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue 21-05-13 10:50:24, Tejun Heo wrote:
> This patch converts cgroup_for_each_child(),
> cgroup_next_descendant_pre/post() and thus
> cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling()
> instead of manually dereferencing ->sibling.next.
> 
> The only reason the iterators couldn't allow dropping RCU read lock
> while iteration is in progress was because they couldn't determine the
> next sibling safely once RCU read lock is dropped.  Using
> cgroup_next_sibling() removes that problem and enables all iterators
> to allow dropping RCU read lock in the middle.  Comments are updated
> accordingly.
> 
> This makes the iterators easier to use and will simplify controllers.
> 
> Note that @cgroup argument is renamed to @cgrp in
> cgroup_for_each_child() because it conflicts with "struct cgroup" used
> in the new macro body.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Looks good to me
Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>

> ---
>  include/linux/cgroup.h | 18 ++++++++++++++----
>  kernel/cgroup.c        | 25 +++++++++++++++++++------
>  2 files changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index ee041a0..d0ad379 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -688,9 +688,9 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos);
>  /**
>   * cgroup_for_each_child - iterate through children of a cgroup
>   * @pos: the cgroup * to use as the loop cursor
> - * @cgroup: cgroup whose children to walk
> + * @cgrp: cgroup whose children to walk
>   *
> - * Walk @cgroup's children.  Must be called under rcu_read_lock().  A child
> + * Walk @cgrp's children.  Must be called under rcu_read_lock().  A child
>   * cgroup which hasn't finished ->css_online() or already has finished
>   * ->css_offline() may show up during traversal and it's each subsystem's
>   * responsibility to verify that each @pos is alive.
> @@ -698,9 +698,15 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos);
>   * If a subsystem synchronizes against the parent in its ->css_online() and
>   * before starting iterating, a cgroup which finished ->css_online() is
>   * guaranteed to be visible in the future iterations.
> + *
> + * It is allowed to temporarily drop RCU read lock during iteration.  The
> + * caller is responsible for ensuring that @pos remains accessible until
> + * the start of the next iteration by, for example, bumping the css refcnt.
>   */
> -#define cgroup_for_each_child(pos, cgroup)				\
> -	list_for_each_entry_rcu(pos, &(cgroup)->children, sibling)
> +#define cgroup_for_each_child(pos, cgrp)				\
> +	for ((pos) = list_first_or_null_rcu(&(cgrp)->children,		\
> +					    struct cgroup, sibling);	\
> +	     (pos); (pos) = cgroup_next_sibling((pos)))
>  
>  struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
>  					  struct cgroup *cgroup);
> @@ -759,6 +765,10 @@ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos);
>   * Alternatively, a subsystem may choose to use a single global lock to
>   * synchronize ->css_online() and ->css_offline() against tree-walking
>   * operations.
> + *
> + * It is allowed to temporarily drop RCU read lock during iteration.  The
> + * caller is responsible for ensuring that @pos remains accessible until
> + * the start of the next iteration by, for example, bumping the css refcnt.
>   */
>  #define cgroup_for_each_descendant_pre(pos, cgroup)			\
>  	for (pos = cgroup_next_descendant_pre(NULL, (cgroup)); (pos);	\
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index bc757d7..21b1ee4 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -3030,6 +3030,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_sibling);
>   *
>   * To be used by cgroup_for_each_descendant_pre().  Find the next
>   * descendant to visit for pre-order traversal of @cgroup's descendants.
> + *
> + * While this function requires RCU read locking, it doesn't require the
> + * whole traversal to be contained in a single RCU critical section.  This
> + * function will return the correct next descendant as long as both @pos
> + * and @cgroup are accessible and @pos is a descendant of @cgroup.
>   */
>  struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
>  					  struct cgroup *cgroup)
> @@ -3049,11 +3054,9 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
>  
>  	/* no child, visit my or the closest ancestor's next sibling */
>  	while (pos != cgroup) {
> -		next = list_entry_rcu(pos->sibling.next, struct cgroup,
> -				      sibling);
> -		if (&next->sibling != &pos->parent->children)
> +		next = cgroup_next_sibling(pos);
> +		if (next)
>  			return next;
> -
>  		pos = pos->parent;
>  	}
>  
> @@ -3068,6 +3071,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre);
>   * Return the rightmost descendant of @pos.  If there's no descendant,
>   * @pos is returned.  This can be used during pre-order traversal to skip
>   * subtree of @pos.
> + *
> + * While this function requires RCU read locking, it doesn't require the
> + * whole traversal to be contained in a single RCU critical section.  This
> + * function will return the correct rightmost descendant as long as @pos is
> + * accessible.
>   */
>  struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos)
>  {
> @@ -3107,6 +3115,11 @@ static struct cgroup *cgroup_leftmost_descendant(struct cgroup *pos)
>   *
>   * To be used by cgroup_for_each_descendant_post().  Find the next
>   * descendant to visit for post-order traversal of @cgroup's descendants.
> + *
> + * While this function requires RCU read locking, it doesn't require the
> + * whole traversal to be contained in a single RCU critical section.  This
> + * function will return the correct next descendant as long as both @pos
> + * and @cgroup are accessible and @pos is a descendant of @cgroup.
>   */
>  struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
>  					   struct cgroup *cgroup)
> @@ -3122,8 +3135,8 @@ struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
>  	}
>  
>  	/* if there's an unvisited sibling, visit its leftmost descendant */
> -	next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
> -	if (&next->sibling != &pos->parent->children)
> +	next = cgroup_next_sibling(pos);
> +	if (next)
>  		return cgroup_leftmost_descendant(next);
>  
>  	/* no sibling left, visit parent */
> -- 
> 1.8.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk
       [not found]     ` <1369101025-28335-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-05-22 18:22       ` Michal Hocko
  2013-05-22 18:22       ` Michal Hocko
@ 2013-05-24  1:51       ` Tejun Heo
  2 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2013-05-24  1:51 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	stable-u79uwXL29TY76Z2rM5mHXA, mhocko-AlSwsSmVLrQ,
	hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue, May 21, 2013 at 10:50:21AM +0900, Tejun Heo wrote:
> When cgroup_next_descendant_pre() initiates a walk, it checks whether
> the subtree root doesn't have any children and if not returns NULL.
> Later code assumes that the subtree isn't empty.  This is broken
> because the subtree may become empty inbetween, which can lead to the
> traversal escaping the subtree by walking to the sibling of the
> subtree root.
> 
> There's no reason to have the early exit path.  Remove it along with
> the later assumption that the subtree isn't empty.  This simplifies
> the code a bit and fixes the subtle bug.
> 
> While at it, fix the comment of cgroup_for_each_descendant_pre() which
> was incorrectly referring to ->css_offline() instead of
> ->css_online().
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Applied to cgroup/for-3.10-fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/5] cgroup: make cgroup_is_removed() static
       [not found]     ` <1369101025-28335-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-05-24  1:56       ` Tejun Heo
       [not found]         ` <20130524015613.GB19755-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2013-05-24  1:56 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue, May 21, 2013 at 10:50:22AM +0900, Tejun Heo wrote:
> cgroup_is_removed() no longer has external users and it shouldn't grow
> any - controllers should deal with cgroup_subsys_state on/offline
> state instead of cgroup removal state.  Make it static.
> 
> While at it, make it return bool.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Applied 2-5 to cgroup/for-3.11.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/5] cgroup: make cgroup_is_removed() static
       [not found]         ` <20130524015613.GB19755-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  2013-05-24  3:32           ` Li Zefan
@ 2013-05-24  3:32           ` Li Zefan
  1 sibling, 0 replies; 32+ messages in thread
From: Li Zefan @ 2013-05-24  3:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On 2013/5/24 9:56, Tejun Heo wrote:
> On Tue, May 21, 2013 at 10:50:22AM +0900, Tejun Heo wrote:
>> cgroup_is_removed() no longer has external users and it shouldn't grow
>> any - controllers should deal with cgroup_subsys_state on/offline
>> state instead of cgroup removal state.  Make it static.
>>
>> While at it, make it return bool.
>>
>> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> 
> Applied 2-5 to cgroup/for-3.11.
> 

Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

Has been kind of busy with internal work. I'll update the cpuset patchset
to use this.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/5] cgroup: make cgroup_is_removed() static
       [not found]         ` <20130524015613.GB19755-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2013-05-24  3:32           ` Li Zefan
  2013-05-24  3:32           ` Li Zefan
  1 sibling, 0 replies; 32+ messages in thread
From: Li Zefan @ 2013-05-24  3:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	hannes-druUgvl0LCNAfugRpC6u6w,
	bsingharora-Re5JQEeQqe8AvxtiuMwx3w,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A

On 2013/5/24 9:56, Tejun Heo wrote:
> On Tue, May 21, 2013 at 10:50:22AM +0900, Tejun Heo wrote:
>> cgroup_is_removed() no longer has external users and it shouldn't grow
>> any - controllers should deal with cgroup_subsys_state on/offline
>> state instead of cgroup removal state.  Make it static.
>>
>> While at it, make it return bool.
>>
>> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> 
> Applied 2-5 to cgroup/for-3.11.
> 

Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

Has been kind of busy with internal work. I'll update the cpuset patchset
to use this.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2013-05-24  3:32 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-21  1:50 [PATCHSET] cgroup: allow dropping RCU read lock while iterating Tejun Heo
     [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-05-21  1:50   ` [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk Tejun Heo
2013-05-21  1:50   ` Tejun Heo
     [not found]     ` <1369101025-28335-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-05-22 18:22       ` Michal Hocko
2013-05-22 18:22       ` Michal Hocko
2013-05-24  1:51       ` Tejun Heo
2013-05-21  1:50   ` [PATCH 2/5] cgroup: make cgroup_is_removed() static Tejun Heo
     [not found]     ` <1369101025-28335-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-05-24  1:56       ` Tejun Heo
     [not found]         ` <20130524015613.GB19755-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-05-24  3:32           ` Li Zefan
2013-05-24  3:32           ` Li Zefan
2013-05-21  1:50   ` Tejun Heo
2013-05-21  1:50   ` [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling() Tejun Heo
     [not found]     ` <1369101025-28335-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-05-21 14:33       ` Serge Hallyn
2013-05-21 14:33       ` Serge Hallyn
2013-05-22 14:36       ` Aristeu Rozanski
     [not found]         ` <20130522143636.GC16739-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-05-22 14:38           ` Tejun Heo
2013-05-22 18:41       ` Michal Hocko
2013-05-21  1:50   ` Tejun Heo
2013-05-21  1:50   ` [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling() Tejun Heo
     [not found]     ` <1369101025-28335-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-05-21 22:31       ` Serge Hallyn
2013-05-21 22:31       ` Serge Hallyn
2013-05-22  9:09       ` Li Zefan
     [not found]         ` <519C8B2E.5040606-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-05-22  9:17           ` Tejun Heo
2013-05-22  9:09       ` Li Zefan
2013-05-22 18:46       ` Michal Hocko
2013-05-21  1:50   ` Tejun Heo
2013-05-21  1:50   ` [PATCH 5/5] device_cgroup: simplify cgroup tree walk in propagate_exception() Tejun Heo
2013-05-21  1:50   ` Tejun Heo
     [not found]     ` <1369101025-28335-6-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-05-21 22:35       ` Serge Hallyn
2013-05-21  3:20   ` [PATCHSET] cgroup: allow dropping RCU read lock while iterating Tejun Heo
2013-05-21  3:20   ` Tejun Heo
2013-05-22 14:53   ` Aristeu Rozanski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.