* [PATCH v4 -mm] rework mem_cgroup iterator
From: Michal Hocko @ 2013-02-14 13:26 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan, Andrew Morton

Hi,
this is the fourth iteration of the patch series previously posted here:
http://lkml.org/lkml/2013/1/3/293. I am adding Andrew to the CC as I
feel this is finally getting to an acceptable state. It still needs
review because some things have changed a lot since the last version,
but we are getting there...

The patch set tries to make mem_cgroup_iter saner in the way it walks
hierarchies. css->id based traversal is far from ideal because it is
not deterministic; it depends on the creation ordering. In addition,
css_id is considered a burden by the cgroup maintainers because it is
quite some code and memcg is its last user. After this series only the
swap accounting uses css_id, and that will be addressed in a follow-up.
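
For context, the usual calling pattern of the iterator being reworked
looks roughly like the reclaim loop below (a sketch only; the cookie
fields match mem_cgroup_reclaim_cookie, while shrink_group() is a
hypothetical stand-in for the per-group work):

	struct mem_cgroup_reclaim_cookie reclaim = {
		.zone = zone,
		.priority = priority,
	};
	struct mem_cgroup *memcg;

	memcg = mem_cgroup_iter(root, NULL, &reclaim);
	do {
		shrink_group(memcg);	/* hypothetical per-group work */
		memcg = mem_cgroup_iter(root, memcg, &reclaim);
	} while (memcg);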

The diffstat (if we exclude removed/added comments) looks quite
promising. We got rid of some code:
$ git diff mmotm... | grep -v "^[+-][[:space:]]*[/ ]\*" | diffstat
 b/include/linux/cgroup.h |    3 ---
 kernel/cgroup.c          |   33 ---------------------------------
 mm/memcontrol.c          |    4 +++-
 3 files changed, 3 insertions(+), 37 deletions(-)

The first patch is just preparatory and it changes when we release the
css of the previously returned memcg. Nothing controversial.

The second patch is the core of the patch set. It replaces the css_id
based css_get_next by the generic cgroup pre-order walk. This brings
some challenges for caching the last visited group during reclaim
(mem_cgroup_per_zone::reclaim_iter). We have to use memcg pointers
directly now, which means that we have to keep a reference to those
groups' css to keep them alive.
I also folded the iter_lock introduced by
https://lkml.org/lkml/2013/1/3/295 in the previous version into this
patch. Johannes felt the race I was describing should be mostly
harmless, and I haven't been able to trigger it, so the lock doesn't
deserve its own patch. It is still needed temporarily, though, because
the reference counting on iter->last_visited depends on it. It will go
away with the next patch.
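
For reference, a sketch of the cached state after this patch (it
mirrors the struct in the patch itself; the lock is removed again by
the third patch):

	struct mem_cgroup_reclaim_iter {
		/* was: int position, a css_id */
		struct mem_cgroup *last_visited;	/* holds a css ref */
		/* scan generation, increased every round-trip */
		unsigned int generation;
		/* temporary, goes away with the next patch */
		spinlock_t iter_lock;
	};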

The next patch fixes up an unbounded cgroup removal holdoff caused by
the elevated css refcount. The issue has been observed by Ying Han.
Johannes wasn't impressed by the previous version of the fix
(https://lkml.org/lkml/2013/2/8/379), which cleaned up pending
references during mem_cgroup_css_offline when a group was removed. He
suggested a different approach where the iterator checks whether a
cached memcg is still valid. More on that in the patch, but the basic
idea is that every memcg tracks the number of removed subgroups and the
iterator records this number when a group is cached. The numbers are
compared before iter->last_visited is used, and the iteration is
restarted from the root if the cached group might be invalid.
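
A minimal sketch of that validity check (illustrative only; the names
follow the third patch, where dead_count is incremented in every
ancestor when a group is removed):

	static bool iter_cache_valid(struct mem_cgroup *root,
				     struct mem_cgroup_reclaim_iter *iter)
	{
		/* any removal in root's subtree bumps root->dead_count */
		return iter->last_dead_count ==
			atomic_read(&root->dead_count);
	}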

The fourth and fifth patches are an attempt to simplify
mem_cgroup_iter. The css juggling is removed and the iteration logic is
moved to a helper so that the reference counting and the iteration are
separated.

The last patch just removes css_get_next as it has no users any longer.

I have dropped Acks from patches that needed a rework since the last
time, so please have a look at them again (especially 1-4). I hope I
haven't screwed anything up during the rebase.

My testing looked as follows:
        A (use_hierarchy=1, limit_in_bytes=150M)
       /|\
      1 2 3

Child groups were created so that their number never exceeded 3, and
their limits were random between 50-100M. Each group hosted a kernel
build (starting with tar -xf so the tree is not shared, and make
-jNUM_CPUS/3), was terminated after a random time (up to 5 minutes),
and then removed.
This should exercise both leaf and hierarchical reclaim as well as
races with cgroup removals, and the debugging messages I added on top
proved that. 100 groups were created during the test.

Shortlog says:
Michal Hocko (6):
      memcg: keep prev's css alive for the whole mem_cgroup_iter
      memcg: rework mem_cgroup_iter to use cgroup iterators
      memcg: relax memcg iter caching
      memcg: simplify mem_cgroup_iter
      memcg: further simplify mem_cgroup_iter
      cgroup: remove css_get_next

Full diffstat says:
 include/linux/cgroup.h |    7 ---
 kernel/cgroup.c        |   49 ----------------
 mm/memcontrol.c        |  145 ++++++++++++++++++++++++++++++++++++++++--------
 3 files changed, 121 insertions(+), 80 deletions(-)

Any suggestions are welcome of course.

Thanks!


* [PATCH v4 1/6] memcg: keep prev's css alive for the whole mem_cgroup_iter
From: Michal Hocko @ 2013-02-14 13:26 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan, Andrew Morton

css reference counting keeps the cgroup alive even after it has been
removed. mem_cgroup_iter relies on this fact and takes a reference to
the returned group. The reference is then released on the next
iteration or in mem_cgroup_iter_break.
mem_cgroup_iter currently releases the reference right after it gets
the last css_id.
This is correct because neither prev's memcg nor its cgroup are
accessed after that. This will change in the next patch, so we need to
keep the group alive a bit longer. Let's move the css_put to the end of
the function.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ea8951..f3d1bfe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1141,12 +1141,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	if (prev && !reclaim)
 		id = css_id(&prev->css);
 
-	if (prev && prev != root)
-		css_put(&prev->css);
-
 	if (!root->use_hierarchy && root != root_mem_cgroup) {
 		if (prev)
-			return NULL;
+			goto out_css_put;
 		return root;
 	}
 
@@ -1162,7 +1159,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			mz = mem_cgroup_zoneinfo(root, nid, zid);
 			iter = &mz->reclaim_iter[reclaim->priority];
 			if (prev && reclaim->generation != iter->generation)
-				return NULL;
+				goto out_css_put;
 			id = iter->position;
 		}
 
@@ -1184,8 +1181,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		}
 
 		if (prev && !css)
-			return NULL;
+			goto out_css_put;
 	}
+out_css_put:
+	if (prev && prev != root)
+		css_put(&prev->css);
+
 	return memcg;
 }
 
-- 
1.7.10.4


* [PATCH v4 2/6] memcg: rework mem_cgroup_iter to use cgroup iterators
From: Michal Hocko @ 2013-02-14 13:26 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan, Andrew Morton

mem_cgroup_iter currently relies on css->id when walking down a group
hierarchy tree. This is really awkward because the tree walk depends on
the groups' creation ordering. The only guarantee is that a parent node
is visited before its children.
Example:
 1) mkdir -p a a/d a/b/c
 2) mkdir -p a/b/c a/d
will create the same trees, but the tree walks will differ:
 1) a, d, b, c
 2) a, b, c, d

574bd9f7 (cgroup: implement generic child / descendant walk macros)
introduced generic cgroup tree walkers which provide either pre-order
or post-order tree walks. This patch converts the css->id based
iteration to the pre-order tree walk to keep the semantics of the
original iterator, where a parent is always visited before its subtree.

cgroup_for_each_descendant_pre suggests using post_create and
pre_destroy for proper synchronization with group addition and removal,
respectively. This implementation doesn't use those because a new
memory cgroup is initialized sufficiently for iteration in
mem_cgroup_css_alloc already, and css reference counting enforces that
the group is alive for both the last seen cgroup and the found one, or
signals that the group is dead and should be skipped.

If the reclaim cookie is used we need to store the last visited group
in the iterator, so we have to be careful that it doesn't disappear in
the meantime. An elevated reference count on the css keeps it alive
even after the group has been removed (parked waiting for the last dput
so that it can be freed).

A per node-zone-priority iter_lock has been introduced to ensure that
css_tryget and the update of iter->last_visited happen atomically.
Otherwise two racing walkers could both take a reference and only one
release it, leading to a css leak (which pins the cgroup dentry).
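
For illustration, a single step of the new walk in isolation (a sketch
that mirrors the hunk below; walk_step() is not a separate helper in
this patch):

	static struct cgroup_subsys_state *
	walk_step(struct mem_cgroup *root, struct mem_cgroup *last_visited)
	{
		struct cgroup *prev, *next;

		/* root is not visited by the cgroup iterators */
		if (!last_visited)
			return &root->css;

		/* NULL pos makes cgroup_next_descendant_pre start the walk */
		prev = (last_visited == root) ? NULL
			: last_visited->css.cgroup;
		next = cgroup_next_descendant_pre(prev, root->css.cgroup);
		if (!next)
			return NULL;	/* whole subtree visited */

		return cgroup_subsys_state(next, mem_cgroup_subsys_id);
	}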

V3
- introduce iter_lock
V2
- use css_{get,put} for iter->last_visited rather than
  mem_cgroup_{get,put} because it is stronger wrt. the cgroup life cycle
- cgroup_next_descendant_pre expects a NULL pos for the first iteration,
  otherwise it might loop endlessly on an intermediate node without
  any children.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   86 +++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 68 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f3d1bfe..e9f5c47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -144,10 +144,12 @@ struct mem_cgroup_stat_cpu {
 };
 
 struct mem_cgroup_reclaim_iter {
-	/* css_id of the last scanned hierarchy member */
-	int position;
+	/* last scanned hierarchy member with elevated css ref count */
+	struct mem_cgroup *last_visited;
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
+	/* lock to protect the position and generation */
+	spinlock_t iter_lock;
 };
 
 /*
@@ -1130,7 +1132,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				   struct mem_cgroup_reclaim_cookie *reclaim)
 {
 	struct mem_cgroup *memcg = NULL;
-	int id = 0;
+	struct mem_cgroup *last_visited = NULL;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1139,7 +1141,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		root = root_mem_cgroup;
 
 	if (prev && !reclaim)
-		id = css_id(&prev->css);
+		last_visited = prev;
 
 	if (!root->use_hierarchy && root != root_mem_cgroup) {
 		if (prev)
@@ -1147,9 +1149,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		return root;
 	}
 
+	rcu_read_lock();
 	while (!memcg) {
 		struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
-		struct cgroup_subsys_state *css;
+		struct cgroup_subsys_state *css = NULL;
 
 		if (reclaim) {
 			int nid = zone_to_nid(reclaim->zone);
@@ -1158,31 +1161,74 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 
 			mz = mem_cgroup_zoneinfo(root, nid, zid);
 			iter = &mz->reclaim_iter[reclaim->priority];
-			if (prev && reclaim->generation != iter->generation)
-				goto out_css_put;
-			id = iter->position;
+			spin_lock(&iter->iter_lock);
+			last_visited = iter->last_visited;
+			if (prev && reclaim->generation != iter->generation) {
+				if (last_visited) {
+					css_put(&last_visited->css);
+					iter->last_visited = NULL;
+				}
+				spin_unlock(&iter->iter_lock);
+				goto out_unlock;
+			}
 		}
 
-		rcu_read_lock();
-		css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
-		if (css) {
-			if (css == &root->css || css_tryget(css))
-				memcg = mem_cgroup_from_css(css);
-		} else
-			id = 0;
-		rcu_read_unlock();
+		/*
+		 * Root is not visited by cgroup iterators so it needs an
+		 * explicit visit.
+		 */
+		if (!last_visited) {
+			css = &root->css;
+		} else {
+			struct cgroup *prev_cgroup, *next_cgroup;
+
+			prev_cgroup = (last_visited == root) ? NULL
+				: last_visited->css.cgroup;
+			next_cgroup = cgroup_next_descendant_pre(prev_cgroup,
+					root->css.cgroup);
+			if (next_cgroup)
+				css = cgroup_subsys_state(next_cgroup,
+						mem_cgroup_subsys_id);
+		}
+
+		/*
+		 * Even if we found a group we have to make sure it is alive.
+		 * css && !memcg means that the groups should be skipped and
+		 * we should continue the tree walk.
+		 * last_visited css is safe to use because it is protected by
+		 * css_get and the tree walk is rcu safe.
+		 */
+		if (css == &root->css || (css && css_tryget(css)))
+			memcg = mem_cgroup_from_css(css);
 
 		if (reclaim) {
-			iter->position = id;
+			struct mem_cgroup *curr = memcg;
+
+			if (last_visited)
+				css_put(&last_visited->css);
+
+			if (css && !memcg)
+				curr = mem_cgroup_from_css(css);
+
+			/* make sure that the cached memcg is not removed */
+			if (curr)
+				css_get(&curr->css);
+			iter->last_visited = curr;
+
 			if (!css)
 				iter->generation++;
 			else if (!prev && memcg)
 				reclaim->generation = iter->generation;
+			spin_unlock(&iter->iter_lock);
+		} else if (css && !memcg) {
+			last_visited = mem_cgroup_from_css(css);
 		}
 
 		if (prev && !css)
-			goto out_css_put;
+			goto out_unlock;
 	}
+out_unlock:
+	rcu_read_unlock();
 out_css_put:
 	if (prev && prev != root)
 		css_put(&prev->css);
@@ -6052,8 +6098,12 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 		return 1;
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+		int prio;
+
 		mz = &pn->zoneinfo[zone];
 		lruvec_init(&mz->lruvec);
+		for (prio = 0; prio < DEF_PRIORITY + 1; prio++)
+			spin_lock_init(&mz->reclaim_iter[prio].iter_lock);
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->memcg = memcg;
-- 
1.7.10.4


* [PATCH v4 3/6] memcg: relax memcg iter caching
From: Michal Hocko @ 2013-02-14 13:26 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan, Andrew Morton

Now that the per-node-zone-priority iterator caches memory cgroups
rather than their css ids, we have to be careful to remove them from
the iterator when they are on the way out. Otherwise they might live
for an unbounded amount of time even though their group is already gone
(until the global/targeted reclaim triggers the zone under the given
priority, finds out the group is dead and lets it finally rest).

We can fix this issue by relaxing the rules for last_visited. Instead
of taking a reference to the css before it is stored in
iter->last_visited, we can just store its pointer and track the number
of removed groups in each memcg's subhierarchy.

This number is stored in the iterator every time a memcg is cached. If
the count doesn't match the current walker root's, we start from the
root again. The group counter is incremented up the hierarchy every
time a group is removed.

The iter_lock can be dropped because racing iterators can no longer
leak the reference: the reference count is not elevated for
last_visited when it is cached.

The locking rules got a bit more complicated by this change, though.
The iterator primarily relies on the RCU read lock, which makes sure
that once we see a valid last_visited pointer it remains valid for the
whole RCU read section. smp_rmb makes sure that dead_count is read
before last_visited and last_dead_count, while smp_wmb makes sure that
last_visited is updated before last_dead_count, so an up-to-date
last_dead_count cannot point to an outdated last_visited. css_tryget
then makes sure that last_visited is still alive in case the iteration
races with the cached group's removal (the css is invalidated before
mem_cgroup_css_offline increments dead_count).

In short:
mem_cgroup_iter
 rcu_read_lock()
 dead_count = atomic_read(parent->dead_count)
 smp_rmb()
 if (dead_count != iter->last_dead_count)
 	last_visited POSSIBLY INVALID -> last_visited = NULL
 if (!css_tryget(iter->last_visited))
 	last_visited DEAD -> last_visited = NULL
 next = find_next(last_visited)
 css_tryget(next)
 css_put(last_visited) 	// css would be invalidated and parent->dead_count
 			// incremented if this was the last reference
 iter->last_visited = next
 smp_wmb()
 iter->last_dead_count = dead_count
 rcu_read_unlock()

cgroup_rmdir
 cgroup_destroy_locked
  atomic_add(CSS_DEACT_BIAS, &css->refcnt) // subsequent css_tryget fail
   mem_cgroup_css_offline
    mem_cgroup_invalidate_reclaim_iterators
     while(parent = parent_mem_cgroup)
     	atomic_inc(parent->dead_count)
  css_put(css) // last reference held by cgroup core

Spotted-by: Ying Han <yinghan@google.com>
Original-idea-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   69 +++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 52 insertions(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e9f5c47..88d5882 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -144,12 +144,15 @@ struct mem_cgroup_stat_cpu {
 };
 
 struct mem_cgroup_reclaim_iter {
-	/* last scanned hierarchy member with elevated css ref count */
+	/*
+	 * last scanned hierarchy member. Valid only if last_dead_count
+	 * matches memcg->dead_count of the hierarchy root group.
+	 */
 	struct mem_cgroup *last_visited;
+	unsigned long last_dead_count;
+
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
-	/* lock to protect the position and generation */
-	spinlock_t iter_lock;
 };
 
 /*
@@ -357,6 +360,7 @@ struct mem_cgroup {
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
 
+	atomic_t	dead_count;
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
 	struct tcp_memcontrol tcp_mem;
 #endif
@@ -1133,6 +1137,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 {
 	struct mem_cgroup *memcg = NULL;
 	struct mem_cgroup *last_visited = NULL;
+	unsigned long uninitialized_var(dead_count);
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1161,16 +1166,33 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 
 			mz = mem_cgroup_zoneinfo(root, nid, zid);
 			iter = &mz->reclaim_iter[reclaim->priority];
-			spin_lock(&iter->iter_lock);
 			last_visited = iter->last_visited;
 			if (prev && reclaim->generation != iter->generation) {
-				if (last_visited) {
-					css_put(&last_visited->css);
-					iter->last_visited = NULL;
-				}
-				spin_unlock(&iter->iter_lock);
+				iter->last_visited = NULL;
 				goto out_unlock;
 			}
+
+			/*
+			 * If the dead_count mismatches, a destruction
+			 * has happened or is happening concurrently.
+			 * If the dead_count matches, a destruction
+			 * might still happen concurrently, but since
+			 * we checked under RCU, that destruction
+			 * won't free the object until we release the
+			 * RCU reader lock.  Thus, the dead_count
+			 * check verifies the pointer is still valid,
+			 * css_tryget() verifies the cgroup pointed to
+			 * is alive.
+			 */
+			dead_count = atomic_read(&root->dead_count);
+			smp_rmb();
+			last_visited = iter->last_visited;
+			if (last_visited) {
+				if ((dead_count != iter->last_dead_count) ||
+					!css_tryget(&last_visited->css)) {
+					last_visited = NULL;
+				}
+			}
 		}
 
 		/*
@@ -1210,16 +1232,14 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			if (css && !memcg)
 				curr = mem_cgroup_from_css(css);
 
-			/* make sure that the cached memcg is not removed */
-			if (curr)
-				css_get(&curr->css);
 			iter->last_visited = curr;
+			smp_wmb();
+			iter->last_dead_count = dead_count;
 
 			if (!css)
 				iter->generation++;
 			else if (!prev && memcg)
 				reclaim->generation = iter->generation;
-			spin_unlock(&iter->iter_lock);
 		} else if (css && !memcg) {
 			last_visited = mem_cgroup_from_css(css);
 		}
@@ -6098,12 +6118,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 		return 1;
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-		int prio;
-
 		mz = &pn->zoneinfo[zone];
 		lruvec_init(&mz->lruvec);
-		for (prio = 0; prio < DEF_PRIORITY + 1; prio++)
-			spin_lock_init(&mz->reclaim_iter[prio].iter_lock);
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->memcg = memcg;
@@ -6375,10 +6391,29 @@ free_out:
 	return ERR_PTR(error);
 }
 
+/*
+ * Announce to all parents that a group from their hierarchy is gone.
+ */
+static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;
+
+	while ((parent = parent_mem_cgroup(parent)))
+		atomic_inc(&parent->dead_count);
+
+	/*
+	 * If the root memcg is not hierarchical we have to check it
+	 * explicitly.
+	 */
+	if (!root_mem_cgroup->use_hierarchy)
+		atomic_inc(&root_mem_cgroup->dead_count);
+}
+
 static void mem_cgroup_css_offline(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
+	mem_cgroup_invalidate_reclaim_iterators(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 }
-- 
1.7.10.4


* [PATCH v4 4/6] memcg: simplify mem_cgroup_iter
From: Michal Hocko @ 2013-02-14 13:26 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan, Andrew Morton

The current implementation of mem_cgroup_iter has to consider both css
and memcg to find out whether no group has been found (css==NULL - aka
the loop is completed) and whether no memcg is associated with the
found node (!memcg - aka css_tryget failed because the group is no
longer alive). This leads to awkward tweaks like testing for
css && !memcg to skip the current node.

It would be much easier if we got rid of the css variable altogether
and only relied on memcg. In order to do that, the iteration part has
to skip dead nodes. This sounds natural to me, and as a nice side
effect we get a simple invariant: memcg is always alive when non-NULL,
and all nodes have been visited otherwise.

We could get rid of the surrounding while loop, but we keep it for now
to make the review easier. It will go away in the following patch.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   52 +++++++++++++++++++++++++---------------------------
 1 file changed, 25 insertions(+), 27 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 88d5882..36938f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1157,7 +1157,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	rcu_read_lock();
 	while (!memcg) {
 		struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
-		struct cgroup_subsys_state *css = NULL;
 
 		if (reclaim) {
 			int nid = zone_to_nid(reclaim->zone);
@@ -1200,51 +1199,50 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 * explicit visit.
 		 */
 		if (!last_visited) {
-			css = &root->css;
+			memcg = root;
 		} else {
 			struct cgroup *prev_cgroup, *next_cgroup;
 
 			prev_cgroup = (last_visited == root) ? NULL
 				: last_visited->css.cgroup;
-			next_cgroup = cgroup_next_descendant_pre(prev_cgroup,
-					root->css.cgroup);
-			if (next_cgroup)
-				css = cgroup_subsys_state(next_cgroup,
-						mem_cgroup_subsys_id);
-		}
+skip_node:
+			next_cgroup = cgroup_next_descendant_pre(
+					prev_cgroup, root->css.cgroup);
 
-		/*
-		 * Even if we found a group we have to make sure it is alive.
-		 * css && !memcg means that the groups should be skipped and
-		 * we should continue the tree walk.
-		 * last_visited css is safe to use because it is protected by
-		 * css_get and the tree walk is rcu safe.
-		 */
-		if (css == &root->css || (css && css_tryget(css)))
-			memcg = mem_cgroup_from_css(css);
+			/*
+			 * Even if we found a group we have to make sure it is
+			 * alive. css && !memcg means that the groups should be
+			 * skipped and we should continue the tree walk.
+			 * last_visited css is safe to use because it is
+			 * protected by css_get and the tree walk is rcu safe.
+			 */
+			if (next_cgroup) {
+				struct mem_cgroup *mem = mem_cgroup_from_cont(
+						next_cgroup);
+				if (css_tryget(&mem->css))
+					memcg = mem;
+				else {
+					prev_cgroup = next_cgroup;
+					goto skip_node;
+				}
+			}
+		}
 
 		if (reclaim) {
-			struct mem_cgroup *curr = memcg;
-
 			if (last_visited)
 				css_put(&last_visited->css);
 
-			if (css && !memcg)
-				curr = mem_cgroup_from_css(css);
-
-			iter->last_visited = curr;
+			iter->last_visited = memcg;
 			smp_wmb();
 			iter->last_dead_count = dead_count;
 
-			if (!css)
+			if (!memcg)
 				iter->generation++;
 			else if (!prev && memcg)
 				reclaim->generation = iter->generation;
-		} else if (css && !memcg) {
-			last_visited = mem_cgroup_from_css(css);
 		}
 
-		if (prev && !css)
+		if (prev && !memcg)
 			goto out_unlock;
 	}
 out_unlock:
-- 
1.7.10.4


* [PATCH v4 5/6] memcg: further simplify mem_cgroup_iter
From: Michal Hocko @ 2013-02-14 13:26 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan, Andrew Morton

mem_cgroup_iter currently does two things: it takes care of the
housekeeping (reference counting, reclaim cookie) and it iterates
through the hierarchy tree (using the generic cgroup tree walk).
The code would be much easier to follow if we moved the iteration out
of the function (to __mem_cgroup_iter_next) so that the distinction is
clearer.
This patch doesn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   79 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 36938f6..0d7357b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1114,6 +1114,51 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
 	return memcg;
 }
 
+/*
+ * Returns a next (in a pre-order walk) alive memcg (with elevated css
+ * ref. count) or NULL if the whole root's subtree has been visited.
+ *
+ * helper function to be used by mem_cgroup_iter
+ */
+static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root,
+		struct mem_cgroup *last_visited)
+{
+	struct cgroup *prev_cgroup, *next_cgroup;
+
+	/*
+	 * Root is not visited by cgroup iterators so it needs an
+	 * explicit visit.
+	 */
+	if (!last_visited)
+		return root;
+
+	prev_cgroup = (last_visited == root) ? NULL
+		: last_visited->css.cgroup;
+skip_node:
+	next_cgroup = cgroup_next_descendant_pre(
+			prev_cgroup, root->css.cgroup);
+
+	/*
+	 * Even if we found a group we have to make sure it is
+	 * alive. css && !memcg means that the groups should be
+	 * skipped and we should continue the tree walk.
+	 * last_visited css is safe to use because it is
+	 * protected by css_get and the tree walk is rcu safe.
+	 */
+	if (next_cgroup) {
+		struct mem_cgroup *mem = mem_cgroup_from_cont(
+				next_cgroup);
+		if (css_tryget(&mem->css))
+			return mem;
+		else {
+			prev_cgroup = next_cgroup;
+			goto skip_node;
+		}
+	}
+
+	return NULL;
+}
+
 /**
  * mem_cgroup_iter - iterate over memory cgroup hierarchy
  * @root: hierarchy root
@@ -1194,39 +1239,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			}
 		}
 
-		/*
-		 * Root is not visited by cgroup iterators so it needs an
-		 * explicit visit.
-		 */
-		if (!last_visited) {
-			memcg = root;
-		} else {
-			struct cgroup *prev_cgroup, *next_cgroup;
-
-			prev_cgroup = (last_visited == root) ? NULL
-				: last_visited->css.cgroup;
-skip_node:
-			next_cgroup = cgroup_next_descendant_pre(
-					prev_cgroup, root->css.cgroup);
-
-			/*
-			 * Even if we found a group we have to make sure it is
-			 * alive. css && !memcg means that the groups should be
-			 * skipped and we should continue the tree walk.
-			 * last_visited css is safe to use because it is
-			 * protected by css_get and the tree walk is rcu safe.
-			 */
-			if (next_cgroup) {
-				struct mem_cgroup *mem = mem_cgroup_from_cont(
-						next_cgroup);
-				if (css_tryget(&mem->css))
-					memcg = mem;
-				else {
-					prev_cgroup = next_cgroup;
-					goto skip_node;
-				}
-			}
-		}
+		memcg = __mem_cgroup_iter_next(root, last_visited);
 
 		if (reclaim) {
 			if (last_visited)
-- 
1.7.10.4


* [PATCH v4 6/6] cgroup: remove css_get_next
  2013-02-14 13:26 ` Michal Hocko
@ 2013-02-14 13:26   ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2013-02-14 13:26 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan, Andrew Morton

Now that we have generic and well-ordered cgroup tree walkers, there is
no need to keep css_get_next in place.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
---
 include/linux/cgroup.h |    7 -------
 kernel/cgroup.c        |   49 ------------------------------------------------
 2 files changed, 56 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7d73905..a4d86b0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -685,13 +685,6 @@ void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
 
 struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
 
-/*
- * Get a cgroup whose id is greater than or equal to id under tree of root.
- * Returning a cgroup_subsys_state or NULL.
- */
-struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
-		struct cgroup_subsys_state *root, int *foundid);
-
 /* Returns true if root is ancestor of cg */
 bool css_is_ancestor(struct cgroup_subsys_state *cg,
 		     const struct cgroup_subsys_state *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f34c41b..3013ec4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5384,55 +5384,6 @@ struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
 }
 EXPORT_SYMBOL_GPL(css_lookup);
 
-/**
- * css_get_next - lookup next cgroup under specified hierarchy.
- * @ss: pointer to subsystem
- * @id: current position of iteration.
- * @root: pointer to css. search tree under this.
- * @foundid: position of found object.
- *
- * Search next css under the specified hierarchy of rootid. Calling under
- * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
- */
-struct cgroup_subsys_state *
-css_get_next(struct cgroup_subsys *ss, int id,
-	     struct cgroup_subsys_state *root, int *foundid)
-{
-	struct cgroup_subsys_state *ret = NULL;
-	struct css_id *tmp;
-	int tmpid;
-	int rootid = css_id(root);
-	int depth = css_depth(root);
-
-	if (!rootid)
-		return NULL;
-
-	BUG_ON(!ss->use_id);
-	WARN_ON_ONCE(!rcu_read_lock_held());
-
-	/* fill start point for scan */
-	tmpid = id;
-	while (1) {
-		/*
-		 * scan next entry from bitmap(tree), tmpid is updated after
-		 * idr_get_next().
-		 */
-		tmp = idr_get_next(&ss->idr, &tmpid);
-		if (!tmp)
-			break;
-		if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
-			ret = rcu_dereference(tmp->css);
-			if (ret) {
-				*foundid = tmpid;
-				break;
-			}
-		}
-		/* continue to scan from next id */
-		tmpid = tmpid + 1;
-	}
-	return ret;
-}
-
 /*
  * get corresponding css from file open on cgroupfs directory
  */
-- 
1.7.10.4
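
As a reading aid (not part of the patch): the idiom that replaces
css_get_next in this series is the generic pre-order walker introduced
by 574bd9f7. A minimal sketch of its use for the memcg subsystem,
assuming the cgroup API quoted elsewhere in this thread (visit() is a
hypothetical callback):

	struct cgroup *pos;

	rcu_read_lock();
	cgroup_for_each_descendant_pre(pos, root->css.cgroup) {
		struct cgroup_subsys_state *css =
			cgroup_subsys_state(pos, mem_cgroup_subsys_id);

		/* the group may be dying; css_tryget() decides */
		if (css_tryget(css)) {
			visit(mem_cgroup_from_css(css));
			css_put(css);
		}
	}
	rcu_read_unlock();

Unlike css_get_next, the pre-order walk never visits the hierarchy root
itself, which is why mem_cgroup_iter gives the root an explicit first
visit.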


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 2/6] memcg: rework mem_cgroup_iter to use cgroup iterators
  2013-02-14 13:26   ` Michal Hocko
@ 2013-02-15  8:03     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 28+ messages in thread
From: Kamezawa Hiroyuki @ 2013-02-15  8:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Johannes Weiner, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan, Andrew Morton

(2013/02/14 22:26), Michal Hocko wrote:
> mem_cgroup_iter currently relies on css->id when walking down a group
> hierarchy tree. This is really awkward because the tree walk depends on
> the groups creation ordering. The only guarantee is that a parent node
> is visited before its children.
> Example
>   1) mkdir -p a a/d a/b/c
>   2) mkdir -p a/b/c a/d
> Will create the same trees but the tree walks will be different:
>   1) a, d, b, c
>   2) a, b, c, d
> 
> 574bd9f7 (cgroup: implement generic child / descendant walk macros) has
> introduced generic cgroup tree walkers which provide either pre-order
> or post-order tree walk. This patch converts css->id based iteration
> to pre-order tree walk to keep the semantics of the original iterator
> where parent is always visited before its subtree.
> 
> cgroup_for_each_descendant_pre suggests using post_create and
> pre_destroy for proper synchronization with group addition resp.
> removal. This implementation doesn't use those because a new memory
> cgroup is already initialized sufficiently for iteration in
> mem_cgroup_css_alloc and css reference counting enforces that the group
> is alive for both the last seen cgroup and the found one; otherwise it
> signals that the group is dead and should be skipped.
> 
> If the reclaim cookie is used we need to store the last visited group
> into the iterator so we have to be careful that it doesn't disappear in
> the mean time. Elevated reference count on the css keeps it alive even
> though the group have been removed (parked waiting for the last dput so
> that it can be freed).
> 
> Per node-zone-prio iter_lock has been introduced to ensure that
> css_tryget and the iter->last_visited update happen atomically. Otherwise
> two racing walkers could both take a reference and only one release it,
> leading to a css leak (which pins cgroup dentry).
> 
> V3
> - introduce iter_lock
> V2
> - use css_{get,put} for iter->last_visited rather than
>    mem_cgroup_{get,put} because it is stronger wrt. cgroup life cycle
> - cgroup_next_descendant_pre expects NULL pos for the first iteration,
>    otherwise it might loop endlessly for an intermediate node without any
>    children.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>   mm/memcontrol.c |   86 +++++++++++++++++++++++++++++++++++++++++++------------
>   1 file changed, 68 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f3d1bfe..e9f5c47 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -144,10 +144,12 @@ struct mem_cgroup_stat_cpu {
>   };
>   
>   struct mem_cgroup_reclaim_iter {
> -	/* css_id of the last scanned hierarchy member */
> -	int position;
> +	/* last scanned hierarchy member with elevated css ref count */
> +	struct mem_cgroup *last_visited;
>   	/* scan generation, increased every round-trip */
>   	unsigned int generation;
> +	/* lock to protect the position and generation */
> +	spinlock_t iter_lock;
>   };
>   
>   /*
> @@ -1130,7 +1132,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>   				   struct mem_cgroup_reclaim_cookie *reclaim)
>   {
>   	struct mem_cgroup *memcg = NULL;
> -	int id = 0;
> +	struct mem_cgroup *last_visited = NULL;
>   
>   	if (mem_cgroup_disabled())
>   		return NULL;
> @@ -1139,7 +1141,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>   		root = root_mem_cgroup;
>   
>   	if (prev && !reclaim)
> -		id = css_id(&prev->css);
> +		last_visited = prev;
>   
>   	if (!root->use_hierarchy && root != root_mem_cgroup) {
>   		if (prev)
> @@ -1147,9 +1149,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>   		return root;
>   	}
>   
> +	rcu_read_lock();
>   	while (!memcg) {
>   		struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
> -		struct cgroup_subsys_state *css;
> +		struct cgroup_subsys_state *css = NULL;
>   
>   		if (reclaim) {
>   			int nid = zone_to_nid(reclaim->zone);
> @@ -1158,31 +1161,74 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>   
>   			mz = mem_cgroup_zoneinfo(root, nid, zid);
>   			iter = &mz->reclaim_iter[reclaim->priority];
> -			if (prev && reclaim->generation != iter->generation)
> -				goto out_css_put;
> -			id = iter->position;
> +			spin_lock(&iter->iter_lock);
> +			last_visited = iter->last_visited;
> +			if (prev && reclaim->generation != iter->generation) {
> +				if (last_visited) {
> +					css_put(&last_visited->css);
> +					iter->last_visited = NULL;
> +				}
> +				spin_unlock(&iter->iter_lock);
> +				goto out_unlock;
> +			}
>   		}
>   
> -		rcu_read_lock();
> -		css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
> -		if (css) {
> -			if (css == &root->css || css_tryget(css))
> -				memcg = mem_cgroup_from_css(css);
> -		} else
> -			id = 0;
> -		rcu_read_unlock();
> +		/*
> +		 * Root is not visited by cgroup iterators so it needs an
> +		 * explicit visit.
> +		 */
> +		if (!last_visited) {
> +			css = &root->css;
> +		} else {
> +			struct cgroup *prev_cgroup, *next_cgroup;
> +
> +			prev_cgroup = (last_visited == root) ? NULL
> +				: last_visited->css.cgroup;
> +			next_cgroup = cgroup_next_descendant_pre(prev_cgroup,
> +					root->css.cgroup);
> +			if (next_cgroup)
> +				css = cgroup_subsys_state(next_cgroup,
> +						mem_cgroup_subsys_id);
> +		}
> +
> +		/*
> +		 * Even if we found a group we have to make sure it is alive.
> +		 * css && !memcg means that the groups should be skipped and
> +		 * we should continue the tree walk.
> +		 * last_visited css is safe to use because it is protected by
> +		 * css_get and the tree walk is rcu safe.
> +		 */
> +		if (css == &root->css || (css && css_tryget(css)))
> +			memcg = mem_cgroup_from_css(css);
>   
>   		if (reclaim) {
> -			iter->position = id;
> +			struct mem_cgroup *curr = memcg;
> +
> +			if (last_visited)
> +				css_put(&last_visited->css);
> +
> +			if (css && !memcg)
> +				curr = mem_cgroup_from_css(css);
> +
> +			/* make sure that the cached memcg is not removed */
> +			if (curr)
> +				css_get(&curr->css);
I'm sorry if I'm missing something...

This curr is curr == memcg == mem_cgroup_from_css(css), on which
css_tryget() has already been done.

Double refcounted?

Thanks,
-Kame


> +			iter->last_visited = curr;
> +
>   			if (!css)
>   				iter->generation++;
>   			else if (!prev && memcg)
>   				reclaim->generation = iter->generation;
> +			spin_unlock(&iter->iter_lock);
> +		} else if (css && !memcg) {
> +			last_visited = mem_cgroup_from_css(css);
>   		}
>   
>   		if (prev && !css)
> -			goto out_css_put;
> +			goto out_unlock;
>   	}
> +out_unlock:
> +	rcu_read_unlock();
>   out_css_put:
>   	if (prev && prev != root)
>   		css_put(&prev->css);
> @@ -6052,8 +6098,12 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
>   		return 1;
>   
>   	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> +		int prio;
> +
>   		mz = &pn->zoneinfo[zone];
>   		lruvec_init(&mz->lruvec);
> +		for (prio = 0; prio < DEF_PRIORITY + 1; prio++)
> +			spin_lock_init(&mz->reclaim_iter[prio].iter_lock);
>   		mz->usage_in_excess = 0;
>   		mz->on_tree = false;
>   		mz->memcg = memcg;
> 
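
As a reading aid (not part of the thread): the css leak that the
iter_lock in the quoted changelog guards against comes from two walkers
racing on the same per-node-zone-priority iterator. A sketch of the
problematic interleaving without the lock:

	/*
	 * walker A                       walker B
	 * -----------------------        -----------------------
	 * css_tryget(&next->css)
	 *                                css_tryget(&next2->css)
	 * iter->last_visited = next
	 *                                iter->last_visited = next2
	 *
	 * The reference A took for its cached pointer is overwritten
	 * before anyone css_put()s it, so next's css (and the cgroup
	 * dentry it pins) is never released.
	 */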



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 3/6] memcg: relax memcg iter caching
  2013-02-14 13:26   ` Michal Hocko
@ 2013-02-15  8:08     ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2013-02-15  8:08 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan, Andrew Morton

On Thu 14-02-13 14:26:33, Michal Hocko wrote:
> Now that per-node-zone-priority iterator caches memory cgroups rather
> than their css ids we have to be careful and remove them from the
> iterator when they are on the way out; otherwise they might live for
> an unbounded amount of time even though their group is already gone (until
> the global/targeted reclaim triggers the zone under priority to find out
> the group is dead and lets it find its final rest).
> 
> We can fix this issue by relaxing rules for the last_visited memcg.
> Instead of taking a reference to the css before it is stored into
> iter->last_visited we can just store its pointer and track the number of
> removed groups from each memcg's subhierarchy.
> 
> This number is stored into the iterator every time a memcg is
> cached. If the iter count doesn't match the current walker root's one, we
> start from the root again. The group counter is incremented up
> the hierarchy every time a group is removed.
> 
> The iter_lock can be dropped because racing iterators cannot leak
> the reference anymore as the reference count is not elevated for
> last_visited when it is cached.
> 
> Locking rules got a bit complicated by this change though. The iterator
> primarily relies on rcu read lock which makes sure that once we see
> a valid last_visited pointer then it will be valid for the whole RCU
> walk. smp_rmb makes sure that dead_count is read before last_visited
> and last_dead_count while smp_wmb makes sure that last_visited is
> updated before last_dead_count so the up-to-date last_dead_count cannot
> point to an outdated last_visited. css_tryget then makes sure that
> the last_visited is still alive in case the iteration races with the
> cached group removal (css is invalidated before mem_cgroup_css_offline
> increments dead_count).
> 
> In short:
> mem_cgroup_iter
>  rcu_read_lock()
>  dead_count = atomic_read(parent->dead_count)
>  smp_rmb()
>  if (dead_count != iter->last_dead_count)
>  	last_visited POSSIBLY INVALID -> last_visited = NULL
>  if (!css_tryget(iter->last_visited))
>  	last_visited DEAD -> last_visited = NULL
>  next = find_next(last_visited)
>  css_tryget(next)
>  css_put(last_visited) 	// css would be invalidated and parent->dead_count
>  			// incremented if this was the last reference
>  iter->last_visited = next
>  smp_wmb()
>  iter->last_dead_count = dead_count
>  rcu_read_unlock()
> 
> cgroup_rmdir
>  cgroup_destroy_locked
>   atomic_add(CSS_DEACT_BIAS, &css->refcnt) // subsequent css_tryget fail
>    mem_cgroup_css_offline
>     mem_cgroup_invalidate_reclaim_iterators
>      while(parent = parent_mem_cgroup)
>      	atomic_inc(parent->dead_count)
>   css_put(css) // last reference held by cgroup core
> 
> Spotted-by: Ying Han <yinghan@google.com>
> Original-idea-by: Johannes Weiner <hannes@cmpxchg.org>

I think Johannes deserves s-o-b rather than o-i-b here. I will post this
change in the next version.

> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/memcontrol.c |   69 +++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 52 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e9f5c47..88d5882 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -144,12 +144,15 @@ struct mem_cgroup_stat_cpu {
>  };
>  
>  struct mem_cgroup_reclaim_iter {
> -	/* last scanned hierarchy member with elevated css ref count */
> +	/*
> +	 * last scanned hierarchy member. Valid only if last_dead_count
> +	 * matches memcg->dead_count of the hierarchy root group.
> +	 */
>  	struct mem_cgroup *last_visited;
> +	unsigned long last_dead_count;
> +
>  	/* scan generation, increased every round-trip */
>  	unsigned int generation;
> -	/* lock to protect the position and generation */
> -	spinlock_t iter_lock;
>  };
>  
>  /*
> @@ -357,6 +360,7 @@ struct mem_cgroup {
>  	struct mem_cgroup_stat_cpu nocpu_base;
>  	spinlock_t pcp_counter_lock;
>  
> +	atomic_t	dead_count;
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
>  	struct tcp_memcontrol tcp_mem;
>  #endif
> @@ -1133,6 +1137,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  {
>  	struct mem_cgroup *memcg = NULL;
>  	struct mem_cgroup *last_visited = NULL;
> +	unsigned long uninitialized_var(dead_count);
>  
>  	if (mem_cgroup_disabled())
>  		return NULL;
> @@ -1161,16 +1166,33 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  
>  			mz = mem_cgroup_zoneinfo(root, nid, zid);
>  			iter = &mz->reclaim_iter[reclaim->priority];
> -			spin_lock(&iter->iter_lock);
>  			last_visited = iter->last_visited;
>  			if (prev && reclaim->generation != iter->generation) {
> -				if (last_visited) {
> -					css_put(&last_visited->css);
> -					iter->last_visited = NULL;
> -				}
> -				spin_unlock(&iter->iter_lock);
> +				iter->last_visited = NULL;
>  				goto out_unlock;
>  			}
> +
> +			/*
> +                         * If the dead_count mismatches, a destruction
> +                         * has happened or is happening concurrently.
> +                         * If the dead_count matches, a destruction
> +                         * might still happen concurrently, but since
> +                         * we checked under RCU, that destruction
> +                         * won't free the object until we release the
> +                         * RCU reader lock.  Thus, the dead_count
> +                         * check verifies the pointer is still valid,
> +                         * css_tryget() verifies the cgroup pointed to
> +                         * is alive.
> +			 */
> +			dead_count = atomic_read(&root->dead_count);
> +			smp_rmb();
> +			last_visited = iter->last_visited;
> +			if (last_visited) {
> +				if ((dead_count != iter->last_dead_count) ||
> +					!css_tryget(&last_visited->css)) {
> +					last_visited = NULL;
> +				}
> +			}
>  		}
>  
>  		/*
> @@ -1210,16 +1232,14 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  			if (css && !memcg)
>  				curr = mem_cgroup_from_css(css);
>  
> -			/* make sure that the cached memcg is not removed */
> -			if (curr)
> -				css_get(&curr->css);
>  			iter->last_visited = curr;
> +			smp_wmb();
> +			iter->last_dead_count = dead_count;
>  
>  			if (!css)
>  				iter->generation++;
>  			else if (!prev && memcg)
>  				reclaim->generation = iter->generation;
> -			spin_unlock(&iter->iter_lock);
>  		} else if (css && !memcg) {
>  			last_visited = mem_cgroup_from_css(css);
>  		}
> @@ -6098,12 +6118,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
>  		return 1;
>  
>  	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> -		int prio;
> -
>  		mz = &pn->zoneinfo[zone];
>  		lruvec_init(&mz->lruvec);
> -		for (prio = 0; prio < DEF_PRIORITY + 1; prio++)
> -			spin_lock_init(&mz->reclaim_iter[prio].iter_lock);
>  		mz->usage_in_excess = 0;
>  		mz->on_tree = false;
>  		mz->memcg = memcg;
> @@ -6375,10 +6391,29 @@ free_out:
>  	return ERR_PTR(error);
>  }
>  
> +/*
> + * Announce to all parents that a group from their hierarchy is gone.
> + */
> +static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup *parent = memcg;
> +
> +	while ((parent = parent_mem_cgroup(parent)))
> +		atomic_inc(&parent->dead_count);
> +
> +	/*
> +	 * if the root memcg is not hierarchical we have to check it
> +	 * explicitely.
> +	 * explicitly.
> +	if (!root_mem_cgroup->use_hierarchy)
> +		atomic_inc(&root_mem_cgroup->dead_count);
> +}
> +
>  static void mem_cgroup_css_offline(struct cgroup *cont)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
>  
> +	mem_cgroup_invalidate_reclaim_iterators(memcg);
>  	mem_cgroup_reparent_charges(memcg);
>  	mem_cgroup_destroy_all_caches(memcg);
>  }
> -- 
> 1.7.10.4
> 

-- 
Michal Hocko
SUSE Labs
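
As a reading aid (not part of the thread): the smp_rmb/smp_wmb pairing
in the quoted changelog is a publish/validate pattern. Reduced to its
core (a sketch of the quoted diff, error paths omitted):

	/* caching side: publish the pointer before the count */
	iter->last_visited = curr;
	smp_wmb();
	iter->last_dead_count = dead_count;

	/* iterating side: read the count before the pointer */
	dead_count = atomic_read(&root->dead_count);
	smp_rmb();
	last_visited = iter->last_visited;
	if (last_visited &&
	    (dead_count != iter->last_dead_count ||
	     !css_tryget(&last_visited->css)))
		last_visited = NULL;	/* stale or dying: restart from root */

The ordering guarantees that a reader who sees an up-to-date
last_dead_count cannot see an outdated last_visited; css_tryget catches
the remaining window where the cached group is already dying.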

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 2/6] memcg: rework mem_cgroup_iter to use cgroup iterators
  2013-02-15  8:03     ` Kamezawa Hiroyuki
@ 2013-02-15  8:11       ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2013-02-15  8:11 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, linux-kernel, Johannes Weiner, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan, Andrew Morton

On Fri 15-02-13 17:03:09, KAMEZAWA Hiroyuki wrote:
[...]
> > @@ -1158,31 +1161,74 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> >   
> >   			mz = mem_cgroup_zoneinfo(root, nid, zid);
> >   			iter = &mz->reclaim_iter[reclaim->priority];
> > -			if (prev && reclaim->generation != iter->generation)
> > -				goto out_css_put;
> > -			id = iter->position;
> > +			spin_lock(&iter->iter_lock);
> > +			last_visited = iter->last_visited;
> > +			if (prev && reclaim->generation != iter->generation) {
> > +				if (last_visited) {
> > +					css_put(&last_visited->css);
> > +					iter->last_visited = NULL;
> > +				}
> > +				spin_unlock(&iter->iter_lock);
> > +				goto out_unlock;
> > +			}
[...]
> > +		/*
> > +		 * Even if we found a group we have to make sure it is alive.
> > +		 * css && !memcg means that the groups should be skipped and
> > +		 * we should continue the tree walk.
> > +		 * last_visited css is safe to use because it is protected by
> > +		 * css_get and the tree walk is rcu safe.
> > +		 */
> > +		if (css == &root->css || (css && css_tryget(css)))
> > +			memcg = mem_cgroup_from_css(css);
> >   
> >   		if (reclaim) {
> > -			iter->position = id;
> > +			struct mem_cgroup *curr = memcg;
> > +
> > +			if (last_visited)
> > +				css_put(&last_visited->css);
> > +
> > +			if (css && !memcg)
> > +				curr = mem_cgroup_from_css(css);
> > +
> > +			/* make sure that the cached memcg is not removed */
> > +			if (curr)
> > +				css_get(&curr->css);
> I'm sorry if I miss something...
> 
> This curr is  curr == memcg = mem_cgroup_from_css(css) <= already try_get() done.
> double refcounted ?

Yes, we take two references here. One is for the returned memcg - it will
be released either by mem_cgroup_iter_break or by the next iteration round
(where it becomes prev) - and the other is for last_visited, which is
released when a new memcg is cached.
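
Spelled out as a sketch of the reference flow (not actual code):

	/* round N: two references with two distinct owners */
	css_tryget(&memcg->css);	/* #1: held by the returned memcg */
	css_get(&curr->css);		/* #2: held by iter->last_visited */
	iter->last_visited = curr;

	/* later */
	css_put(&prev->css);		/* #1: dropped by the caller via prev
					 * or by mem_cgroup_iter_break */
	css_put(&last_visited->css);	/* #2: dropped when a new memcg
					 * replaces the cached one */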
 
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 2/6] memcg: rework mem_cgroup_iter to use cgroup iterators
  2013-02-15  8:11       ` Michal Hocko
@ 2013-02-15  8:13         ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 28+ messages in thread
From: Kamezawa Hiroyuki @ 2013-02-15  8:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Johannes Weiner, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan, Andrew Morton

(2013/02/15 17:11), Michal Hocko wrote:
> On Fri 15-02-13 17:03:09, KAMEZAWA Hiroyuki wrote:
> > > +			css_get(&curr->css);
>> I'm sorry if I miss something...
>>
>> This curr is  curr == memcg = mem_cgroup_from_css(css) <= already try_get() done.
>> double refcounted ?
>
> Yes we get 2 references here. One for the returned memcg - which will be
> released either by mem_cgroup_iter_break or a next iteration round
> (where it would be prev) and the other is for last_visited which is
> released when a new memcg is cached.
>
Sure.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 3/6] memcg: relax memcg iter caching
  2013-02-14 13:26   ` Michal Hocko
@ 2013-02-15  8:19     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 28+ messages in thread
From: Kamezawa Hiroyuki @ 2013-02-15  8:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Johannes Weiner, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan, Andrew Morton

(2013/02/14 22:26), Michal Hocko wrote:
> Now that per-node-zone-priority iterator caches memory cgroups rather
> than their css ids we have to be careful and remove them from the
> iterator when they are on the way out; otherwise they might live for
> an unbounded amount of time even though their group is already gone (until
> the global/targeted reclaim triggers the zone under priority to find out
> the group is dead and lets it find its final rest).
> 
> We can fix this issue by relaxing rules for the last_visited memcg.
> Instead of taking a reference to the css before it is stored into
> iter->last_visited we can just store its pointer and track the number of
> removed groups from each memcg's subhierarchy.
> 
> This number is stored into the iterator every time a memcg is
> cached. If the iter count doesn't match the current walker root's one, we
> start from the root again. The group counter is incremented up
> the hierarchy every time a group is removed.
> 
> The iter_lock can be dropped because racing iterators cannot leak
> the reference anymore as the reference count is not elevated for
> last_visited when it is cached.
> 
> Locking rules got a bit complicated by this change though. The iterator
> primarily relies on rcu read lock which makes sure that once we see
> a valid last_visited pointer then it will be valid for the whole RCU
> walk. smp_rmb makes sure that dead_count is read before last_visited
> and last_dead_count while smp_wmb makes sure that last_visited is
> updated before last_dead_count so the up-to-date last_dead_count cannot
> point to an outdated last_visited. css_tryget then makes sure that
> the last_visited is still alive in case the iteration races with the
> cached group removal (css is invalidated before mem_cgroup_css_offline
> increments dead_count).
> 
> In short:
> mem_cgroup_iter
>   rcu_read_lock()
>   dead_count = atomic_read(parent->dead_count)
>   smp_rmb()
>   if (dead_count != iter->last_dead_count)
>   	last_visited POSSIBLY INVALID -> last_visited = NULL
>   if (!css_tryget(iter->last_visited))
>   	last_visited DEAD -> last_visited = NULL
>   next = find_next(last_visited)
>   css_tryget(next)
>   css_put(last_visited) 	// css would be invalidated and parent->dead_count
>   			// incremented if this was the last reference
>   iter->last_visited = next
>   smp_wmb()
>   iter->last_dead_count = dead_count
>   rcu_read_unlock()
> 
> cgroup_rmdir
>   cgroup_destroy_locked
>    atomic_add(CSS_DEACT_BIAS, &css->refcnt) // subsequent css_tryget fail
>     mem_cgroup_css_offline
>      mem_cgroup_invalidate_reclaim_iterators
>       while(parent = parent_mem_cgroup)
>       	atomic_inc(parent->dead_count)
>    css_put(css) // last reference held by cgroup core
> 
> Spotted-by: Ying Han <yinghan@google.com>
> Original-idea-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Interesting. Thank you for your hard work.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 4/6] memcg: simplify mem_cgroup_iter
  2013-02-14 13:26   ` Michal Hocko
@ 2013-02-15  8:24     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 28+ messages in thread
From: Kamezawa Hiroyuki @ 2013-02-15  8:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Johannes Weiner, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan, Andrew Morton

(2013/02/14 22:26), Michal Hocko wrote:
> The current implementation of mem_cgroup_iter has to consider both css
> and memcg to find out whether no group has been found (css==NULL - aka
> the loop is completed) and whether no memcg is associated with the
> found node (!memcg - aka css_tryget failed because the group is no
> longer alive). This leads to awkward tweaks like testing for css &&
> !memcg to skip the current node.
> 
> It will be much easier if we get rid of the css variable altogether
> and rely only on memcg. In order to do that the iteration part has to
> skip dead nodes. This sounds natural to me, and as a nice side effect
> we will get a simple invariant: memcg is always alive when non-NULL,
> and all nodes have been visited otherwise.
> 
> We could get rid of the surrounding while loop but keep it in for now to
> make review easier. It will go away in the following patch.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
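
For illustration, a sketch of the dead-node skipping described above,
built on the 3.9-era cgroup_next_descendant_pre() pre-order walker.
The function name iter_next_live is invented here and this is not the
verbatim patch; the caller is expected to hold rcu_read_lock() and to
have the root pinned already:

static struct mem_cgroup *
iter_next_live(struct mem_cgroup *root, struct mem_cgroup *last_visited)
{
	struct cgroup *prev_cgroup, *next_cgroup;

	/* the root itself is not visited by the descendant walk */
	if (!last_visited)
		return root;

	prev_cgroup = (last_visited == root) ? NULL
					     : last_visited->css.cgroup;
skip_node:
	next_cgroup = cgroup_next_descendant_pre(prev_cgroup,
						 root->css.cgroup);
	if (next_cgroup) {
		struct mem_cgroup *memcg = mem_cgroup_from_cont(next_cgroup);

		/*
		 * Skip dead nodes so that a non-NULL return is always a
		 * live, css_tryget-pinned group and NULL means the whole
		 * tree has been visited.
		 */
		if (css_tryget(&memcg->css))
			return memcg;
		prev_cgroup = next_cgroup;
		goto skip_node;
	}
	return NULL;
}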
nice.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>





* Re: [PATCH v4 -mm] rework mem_cgroup iterator
  2013-02-14 13:26 ` Michal Hocko
@ 2013-03-13 15:08   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2013-03-13 15:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan, linux-mm

Hi Andrew,
it seems that there were no other objections[1] to this version of the
patchset. Could you take it into your tree and target it for 3.10?

---
[1] The only one from Johannes (https://lkml.org/lkml/2013/2/8/379) has
been addressed

On Thu 14-02-13 14:26:30, Michal Hocko wrote:
> [the v4 cover letter, quoted verbatim from the original posting at
> the top of this thread, snipped]
> 
> My testing looked as follows:
>         A (use_hierarchy=1, limit_in_bytes=150M)
>        /|\
>       1 2 3
> 
> Child groups were created so that their number was never higher than
> 3 and their limits were random, between 50M and 100M. Each group
> hosted a kernel build (starting with tar -xf so the tree is not
> shared, followed by make -jNUM_CPUs/3) which was terminated after a
> random time (up to 5 minutes), and then the group was removed.
> This should exercise both leaf and hierarchical reclaim as well as
> races with cgroup removals; the debugging messages I added on top
> proved that it does. 100 groups were created during the test.
> 
> Shortlog says:
> Michal Hocko (6):
>       memcg: keep prev's css alive for the whole mem_cgroup_iter
>       memcg: rework mem_cgroup_iter to use cgroup iterators
>       memcg: relax memcg iter caching
>       memcg: simplify mem_cgroup_iter
>       memcg: further simplify mem_cgroup_iter
>       cgroup: remove css_get_next
> 
> Full diffstat says:
>  include/linux/cgroup.h |    7 ---
>  kernel/cgroup.c        |   49 ----------------
>  mm/memcontrol.c        |  145 ++++++++++++++++++++++++++++++++++++++++--------
>  3 files changed, 121 insertions(+), 80 deletions(-)
> 
> Any suggestions are welcome of course.
> 
> Thanks!
> 

-- 
Michal Hocko
SUSE Labs


Thread overview: 28+ messages
-- links below jump to the message on this page --
2013-02-14 13:26 [PATCH v4 -mm] rework mem_cgroup iterator Michal Hocko
2013-02-14 13:26 ` [PATCH v4 1/6] memcg: keep prev's css alive for the whole mem_cgroup_iter Michal Hocko
2013-02-14 13:26 ` [PATCH v4 2/6] memcg: rework mem_cgroup_iter to use cgroup iterators Michal Hocko
2013-02-15  8:03   ` Kamezawa Hiroyuki
2013-02-15  8:11     ` Michal Hocko
2013-02-15  8:13       ` Kamezawa Hiroyuki
2013-02-14 13:26 ` [PATCH v4 3/6] memcg: relax memcg iter caching Michal Hocko
2013-02-15  8:08   ` Michal Hocko
2013-02-15  8:19   ` Kamezawa Hiroyuki
2013-02-14 13:26 ` [PATCH v4 4/6] memcg: simplify mem_cgroup_iter Michal Hocko
2013-02-15  8:24   ` Kamezawa Hiroyuki
2013-02-14 13:26 ` [PATCH v4 5/6] memcg: further " Michal Hocko
2013-02-14 13:26 ` [PATCH v4 6/6] cgroup: remove css_get_next Michal Hocko
2013-03-13 15:08 ` [PATCH v4 -mm] rework mem_cgroup iterator Michal Hocko
