* [PATCH v3 0/7] rework mem_cgroup iterator
From: Michal Hocko @ 2013-01-03 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan

Hi all,
this is the third version of the patchset, previously posted here:
https://lkml.org/lkml/2012/11/26/616

The patch set tries to make mem_cgroup_iter saner in the way it walks
hierarchies. The css->id based traversal is far from ideal because it is
not deterministic; it depends on the creation ordering of the groups.

The diffstat doesn't look as promising as in the previous versions
anymore, but I think it is worth the resulting outcome (and the sanity ;)).

The first patch fixes a potential misbehavior which I haven't seen in
practice, but the fix is needed for the later patches anyway. We could
take it on its own as well, but I do not have any bug report to base the
fix on. The second one is also preparatory and is new to the series.

The third patch is the core of the patchset. It replaces the css_id
based css_get_next with the generic cgroup pre-order iterator, which
means that css_id is no longer used by memcg. This brings some
challenges for caching the last visited group during reclaim
(mem_cgroup_per_zone::reclaim_iter). We have to use memcg pointers
directly now, which means that we have to keep a reference to those
groups' css to keep them alive.
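
Concretely, the cached position changes from a css id to a pointer with a
held css reference. The shape is roughly the following (a sketch only; see
patches 1 and 3 for the real definition):

	struct mem_cgroup_reclaim_iter {
		/* last scanned hierarchy member, css reference held */
		struct mem_cgroup *last_visited;
		/* scan generation, increased every round-trip */
		unsigned int generation;
		/* lock to protect the position and generation */
		spinlock_t iter_lock;
	};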

The next patch fixes an unbounded cgroup removal holdoff caused by the
elevated css refcount and cleans up the cached entries on group removal.
Thanks to Ying, who spotted this during testing of the previous version
of the patchset.
I could have folded it into the previous patch, but I felt that would be
too big to review. If people feel it would be better that way, I have no
problem squashing them together.

The fifth and sixth patches are an attempt at simplifying
mem_cgroup_iter. The css juggling is removed and the iteration logic is
moved to a helper so that the reference counting and the iteration are
separated.

The last patch just removes css_get_next as it no longer has any users.

I am also thinking that leaf-to-root iteration would make more sense,
but that patch is not included in the series yet because I have to think
some more about the justification.

As with the previous version, I have tested with a quite simple
hierarchy:
        A (limit = 280M, use_hierarchy=true)
      / | \
     B  C  D (all have 100M limit)

A separate kernel build was run in each leaf group. This triggers both
children-only and hierarchical reclaim, which run in parallel, so the
reclaim_iter caching is exercised a lot. I will hammer it some more, but
the series should already be in quite a good shape.

Michal Hocko (7):
      memcg: synchronize per-zone iterator access by a spinlock
      memcg: keep prev's css alive for the whole mem_cgroup_iter
      memcg: rework mem_cgroup_iter to use cgroup iterators
      memcg: remove memcg from the reclaim iterators
      memcg: simplify mem_cgroup_iter
      memcg: further simplify mem_cgroup_iter
      cgroup: remove css_get_next

And the diffstat says:
 include/linux/cgroup.h |    7 --
 kernel/cgroup.c        |   49 ------------
 mm/memcontrol.c        |  199 ++++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 175 insertions(+), 80 deletions(-)


* [PATCH v3 1/7] memcg: synchronize per-zone iterator access by a spinlock
From: Michal Hocko @ 2013-01-03 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan

The per-zone per-priority iterator is aimed at coordinating concurrent
reclaimers on the same hierarchy (or the global reclaim when all groups
are reclaimed) so that all groups get reclaimed as evenly as possible.
iter->position holds the css->id of the last visited group and
iter->generation signals a completed tree walk (it is incremented when
the walk completes).
Concurrent reclaimers are supposed to provide a reclaim cookie which
holds the reclaim priority and the last generation they saw. If the
cookie's generation doesn't match the iterator's view, then another
concurrent reclaimer has already done the job and the tree walk is done
for that priority.
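
For reference, this is roughly how a reclaimer drives the iterator with
such a cookie (a simplified sketch modelled on shrink_zone(); the scan
details are omitted):

	struct mem_cgroup_reclaim_cookie reclaim = {
		.zone = zone,
		.priority = priority,
	};
	struct mem_cgroup *memcg;

	/* the first call passes prev == NULL and records the generation */
	memcg = mem_cgroup_iter(root, NULL, &reclaim);
	do {
		/* shrink the LRU lists charged to this group ... */
		memcg = mem_cgroup_iter(root, memcg, &reclaim);
	} while (memcg);	/* NULL once the walk for this priority is done */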

This scheme works nicely in most cases but it is not raceless. Two
racing reclaimers can see the same iter->position and so bang on the
same group. The iter->generation increment is not serialized either, so
a reclaimer can see an updated iter->position with an old generation and
the iteration might be restarted from the root of the hierarchy.

The simplest way to fix this issue is to synchronize access to the
iterator by a lock. This implementation uses a per-zone per-priority
spinlock which serializes only directly racing reclaimers which use
reclaim cookies, so the effect of the new locking should be really
minimal.

I have to note that I haven't seen this as a real issue so far. The
primary motivation for the change is different. The following patch
will change the way the iterator is implemented: the css->id iteration
will be replaced by the generic cgroup iteration, which requires storing
a mem_cgroup pointer in the iterator. That in turn requires reference
counting, and then concurrent access becomes a problem.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ea8951..e71cfde 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -148,6 +148,8 @@ struct mem_cgroup_reclaim_iter {
 	int position;
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
+	/* lock to protect the position and generation */
+	spinlock_t iter_lock;
 };
 
 /*
@@ -1161,8 +1163,11 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 
 			mz = mem_cgroup_zoneinfo(root, nid, zid);
 			iter = &mz->reclaim_iter[reclaim->priority];
-			if (prev && reclaim->generation != iter->generation)
+			spin_lock(&iter->iter_lock);
+			if (prev && reclaim->generation != iter->generation) {
+				spin_unlock(&iter->iter_lock);
 				return NULL;
+			}
 			id = iter->position;
 		}
 
@@ -1181,6 +1186,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				iter->generation++;
 			else if (!prev && memcg)
 				reclaim->generation = iter->generation;
+			spin_unlock(&iter->iter_lock);
 		}
 
 		if (prev && !css)
@@ -6051,8 +6057,12 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 		return 1;
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+		int prio;
+
 		mz = &pn->zoneinfo[zone];
 		lruvec_init(&mz->lruvec);
+		for (prio = 0; prio < DEF_PRIORITY + 1; prio++)
+			spin_lock_init(&mz->reclaim_iter[prio].iter_lock);
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->memcg = memcg;
-- 
1.7.10.4


* [PATCH v3 2/7] memcg: keep prev's css alive for the whole mem_cgroup_iter
From: Michal Hocko @ 2013-01-03 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan

css reference counting keeps the cgroup alive even after it has been
removed. mem_cgroup_iter relies on this fact and takes a reference to
the returned group. The reference is then released on the next iteration
or in mem_cgroup_iter_break.
mem_cgroup_iter currently releases the reference right after it gets the
last css_id.
This is correct because neither prev's memcg nor its cgroup are accessed
after that point. This will change in the next patch, so we need to keep
the group alive a bit longer; let's move the css_put to the end of the
function.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e71cfde..90a3b1d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1143,12 +1143,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	if (prev && !reclaim)
 		id = css_id(&prev->css);
 
-	if (prev && prev != root)
-		css_put(&prev->css);
-
 	if (!root->use_hierarchy && root != root_mem_cgroup) {
 		if (prev)
-			return NULL;
+			goto out_css_put;
 		return root;
 	}
 
@@ -1166,7 +1163,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			spin_lock(&iter->iter_lock);
 			if (prev && reclaim->generation != iter->generation) {
 				spin_unlock(&iter->iter_lock);
-				return NULL;
+				goto out_css_put;
 			}
 			id = iter->position;
 		}
@@ -1190,8 +1187,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		}
 
 		if (prev && !css)
-			return NULL;
+			goto out_css_put;
 	}
+out_css_put:
+	if (prev && prev != root)
+		css_put(&prev->css);
+
 	return memcg;
 }
 
-- 
1.7.10.4


* [PATCH v3 3/7] memcg: rework mem_cgroup_iter to use cgroup iterators
From: Michal Hocko @ 2013-01-03 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan

mem_cgroup_iter currently relies on css->id when walking down a group
hierarchy tree. This is really awkward because the tree walk depends on
the creation ordering of the groups. The only guarantee is that a parent
node is visited before its children.
Example
 1) mkdir -p a a/d a/b/c
 2) mkdir -p a a/b/c a/d
Both create the same tree, but the tree walks differ:
 1) a, d, b, c
 2) a, b, c, d

Commit 574bd9f7 (cgroup: implement generic child / descendant walk
macros) introduced generic cgroup tree walkers which provide either a
pre-order or a post-order tree walk. This patch converts the css->id
based iteration to a pre-order tree walk to keep the semantics of the
original iterator, where a parent is always visited before its subtree.
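
For illustration, the new walker is used like this (a minimal sketch;
the walk does not visit the subtree root itself):

	struct cgroup *pos;

	rcu_read_lock();
	cgroup_for_each_descendant_pre(pos, root->css.cgroup) {
		/*
		 * pos visits every descendant, a parent always before
		 * its children and each subtree walked contiguously.
		 */
	}
	rcu_read_unlock();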

cgroup_for_each_descendant_pre suggests using post_create and
pre_destroy for proper synchronization with group addition resp.
removal. This implementation doesn't use those because a new memory
cgroup is fully initialized in mem_cgroup_create, and css reference
counting either keeps the group alive (for both the last seen cgroup and
the found one) or signals that the group is dead and should be skipped.

If the reclaim cookie is used, we need to store the last visited group
in the iterator, so we have to be careful that it doesn't disappear in
the meantime. An elevated reference count on the css keeps it alive even
after the group has been removed (parked waiting for the last dput so
that it can be freed).

V2
- use css_{get,put} for iter->last_visited rather than
  mem_cgroup_{get,put} because it is stronger wrt. cgroup life cycle
- cgroup_next_descendant_pre expects a NULL pos for the first iteration,
  otherwise it might loop endlessly for an intermediate node without any
  children.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   74 ++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 57 insertions(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90a3b1d..e9f5c47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -144,8 +144,8 @@ struct mem_cgroup_stat_cpu {
 };
 
 struct mem_cgroup_reclaim_iter {
-	/* css_id of the last scanned hierarchy member */
-	int position;
+	/* last scanned hierarchy member with elevated css ref count */
+	struct mem_cgroup *last_visited;
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
 	/* lock to protect the position and generation */
@@ -1132,7 +1132,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				   struct mem_cgroup_reclaim_cookie *reclaim)
 {
 	struct mem_cgroup *memcg = NULL;
-	int id = 0;
+	struct mem_cgroup *last_visited = NULL;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1141,7 +1141,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		root = root_mem_cgroup;
 
 	if (prev && !reclaim)
-		id = css_id(&prev->css);
+		last_visited = prev;
 
 	if (!root->use_hierarchy && root != root_mem_cgroup) {
 		if (prev)
@@ -1149,9 +1149,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		return root;
 	}
 
+	rcu_read_lock();
 	while (!memcg) {
 		struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
-		struct cgroup_subsys_state *css;
+		struct cgroup_subsys_state *css = NULL;
 
 		if (reclaim) {
 			int nid = zone_to_nid(reclaim->zone);
@@ -1161,34 +1162,73 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			mz = mem_cgroup_zoneinfo(root, nid, zid);
 			iter = &mz->reclaim_iter[reclaim->priority];
 			spin_lock(&iter->iter_lock);
+			last_visited = iter->last_visited;
 			if (prev && reclaim->generation != iter->generation) {
+				if (last_visited) {
+					css_put(&last_visited->css);
+					iter->last_visited = NULL;
+				}
 				spin_unlock(&iter->iter_lock);
-				goto out_css_put;
+				goto out_unlock;
 			}
-			id = iter->position;
 		}
 
-		rcu_read_lock();
-		css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
-		if (css) {
-			if (css == &root->css || css_tryget(css))
-				memcg = mem_cgroup_from_css(css);
-		} else
-			id = 0;
-		rcu_read_unlock();
+		/*
+		 * Root is not visited by cgroup iterators so it needs an
+		 * explicit visit.
+		 */
+		if (!last_visited) {
+			css = &root->css;
+		} else {
+			struct cgroup *prev_cgroup, *next_cgroup;
+
+			prev_cgroup = (last_visited == root) ? NULL
+				: last_visited->css.cgroup;
+			next_cgroup = cgroup_next_descendant_pre(prev_cgroup,
+					root->css.cgroup);
+			if (next_cgroup)
+				css = cgroup_subsys_state(next_cgroup,
+						mem_cgroup_subsys_id);
+		}
+
+		/*
+		 * Even if we found a group we have to make sure it is alive.
+		 * css && !memcg means that the groups should be skipped and
+		 * we should continue the tree walk.
+		 * last_visited css is safe to use because it is protected by
+		 * css_get and the tree walk is rcu safe.
+		 */
+		if (css == &root->css || (css && css_tryget(css)))
+			memcg = mem_cgroup_from_css(css);
 
 		if (reclaim) {
-			iter->position = id;
+			struct mem_cgroup *curr = memcg;
+
+			if (last_visited)
+				css_put(&last_visited->css);
+
+			if (css && !memcg)
+				curr = mem_cgroup_from_css(css);
+
+			/* make sure that the cached memcg is not removed */
+			if (curr)
+				css_get(&curr->css);
+			iter->last_visited = curr;
+
 			if (!css)
 				iter->generation++;
 			else if (!prev && memcg)
 				reclaim->generation = iter->generation;
 			spin_unlock(&iter->iter_lock);
+		} else if (css && !memcg) {
+			last_visited = mem_cgroup_from_css(css);
 		}
 
 		if (prev && !css)
-			goto out_css_put;
+			goto out_unlock;
 	}
+out_unlock:
+	rcu_read_unlock();
 out_css_put:
 	if (prev && prev != root)
 		css_put(&prev->css);
-- 
1.7.10.4


* [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
From: Michal Hocko @ 2013-01-03 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan

Now that the per-node-zone-priority iterator caches memory cgroups
rather than their css ids, we have to be careful to remove them from the
iterator when they are on their way out, otherwise they might hang
around for an unbounded amount of time (until the global/targeted
reclaim triggers the zone under that priority, finds out the group is
dead and lets it find its final rest).

This is solved by hooking into mem_cgroup_css_offline and checking all
per-node-zone-priority iterators up the hierarchy to the root cgroup. If
the current memcg is found in the respective iter->last_visited, it is
replaced by its pre-order predecessor in the same sub-hierarchy.

This guarantees that no group gets more reclaim than necessary and that
the next iteration will continue without noticing that the removed group
has disappeared.

Spotted-by: Ying Han <yinghan@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e9f5c47..4f81abd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6375,10 +6375,99 @@ free_out:
 	return ERR_PTR(error);
 }
 
+/*
+ * Helper to find memcg's previous group under the given root
+ * hierarchy.
+ */
+struct mem_cgroup *__find_prev_memcg(struct mem_cgroup *root,
+		struct mem_cgroup *memcg)
+{
+	struct cgroup *memcg_cgroup = memcg->css.cgroup;
+	struct cgroup *root_cgroup = root->css.cgroup;
+	struct cgroup *prev_cgroup = NULL;
+	struct cgroup *iter;
+
+	cgroup_for_each_descendant_pre(iter, root_cgroup) {
+		if (iter == memcg_cgroup)
+			break;
+		prev_cgroup = iter;
+	}
+
+	return (prev_cgroup) ? mem_cgroup_from_cont(prev_cgroup) : NULL;
+}
+
+/*
+ * Remove the given memcg under given root from all per-node per-zone
+ * per-priority chached iterators.
+ */
+static void mem_cgroup_uncache_reclaim_iters(struct mem_cgroup *root,
+		struct mem_cgroup *memcg)
+{
+	int node;
+
+	for_each_node(node) {
+		struct mem_cgroup_per_node *pn = root->info.nodeinfo[node];
+		int zone;
+
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			struct mem_cgroup_per_zone *mz;
+			int prio;
+
+			mz = &pn->zoneinfo[zone];
+			for (prio = 0; prio < DEF_PRIORITY + 1; prio++) {
+				struct mem_cgroup_reclaim_iter *iter;
+
+				/*
+				 * Just drop the reference on the removed memcg
+				 * cached last_visited. No need to lock iter as
+				 * the memcg is on the way out and cannot be
+				 * reclaimed.
+				 */
+				iter = &mz->reclaim_iter[prio];
+				if (root == memcg) {
+					if (iter->last_visited)
+						css_put(&iter->last_visited->css);
+					continue;
+				}
+
+				rcu_read_lock();
+				spin_lock(&iter->iter_lock);
+				if (iter->last_visited == memcg) {
+					iter->last_visited = __find_prev_memcg(
+							root, memcg);
+					css_put(&memcg->css);
+				}
+				spin_unlock(&iter->iter_lock);
+				rcu_read_unlock();
+			}
+		}
+	}
+}
+
+/*
+ * Remove the given memcg from all cached reclaim iterators.
+ */
+static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;
+
+	do {
+		mem_cgroup_uncache_reclaim_iters(parent, memcg);
+	} while ((parent = parent_mem_cgroup(parent)));
+
+	/*
+	 * if the root memcg is not hierarchical we have to check it
+	 * explicitely.
+	 */
+	if (!root_mem_cgroup->use_hierarchy)
+		mem_cgroup_uncache_reclaim_iters(root_mem_cgroup, memcg);
+}
+
 static void mem_cgroup_css_offline(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
+	mem_cgroup_uncache_from_reclaim(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 }
-- 
1.7.10.4


* [PATCH v3 5/7] memcg: simplify mem_cgroup_iter
From: Michal Hocko @ 2013-01-03 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan

The current implementation of mem_cgroup_iter has to consider both css
and memcg to find out whether no group has been found (css==NULL - aka
the loop is completed) and whether no memcg is associated with the found
node (!memcg - aka css_tryget failed because the group is no longer
alive). This leads to awkward tweaks like testing for css && !memcg to
skip the current node.

It would be much easier if we got rid of the css variable altogether and
relied only on memcg. In order to do that, the iteration part has to
skip dead nodes. This sounds natural to me, and as a nice side effect we
get a simple invariant: memcg is always alive when non-NULL, and
otherwise all nodes have been visited.

We could get rid of the surrounding while loop, but let's keep it for
now to make the review easier. It will go away in the following patch.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   56 +++++++++++++++++++++++++++----------------------------
 1 file changed, 27 insertions(+), 29 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4f81abd..d8c6e5e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1152,7 +1152,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	rcu_read_lock();
 	while (!memcg) {
 		struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
-		struct cgroup_subsys_state *css = NULL;
 
 		if (reclaim) {
 			int nid = zone_to_nid(reclaim->zone);
@@ -1178,53 +1177,52 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 * explicit visit.
 		 */
 		if (!last_visited) {
-			css = &root->css;
+			memcg = root;
 		} else {
 			struct cgroup *prev_cgroup, *next_cgroup;
 
 			prev_cgroup = (last_visited == root) ? NULL
 				: last_visited->css.cgroup;
-			next_cgroup = cgroup_next_descendant_pre(prev_cgroup,
-					root->css.cgroup);
-			if (next_cgroup)
-				css = cgroup_subsys_state(next_cgroup,
-						mem_cgroup_subsys_id);
-		}
+skip_node:
+			next_cgroup = cgroup_next_descendant_pre(
+					prev_cgroup, root->css.cgroup);
 
-		/*
-		 * Even if we found a group we have to make sure it is alive.
-		 * css && !memcg means that the groups should be skipped and
-		 * we should continue the tree walk.
-		 * last_visited css is safe to use because it is protected by
-		 * css_get and the tree walk is rcu safe.
-		 */
-		if (css == &root->css || (css && css_tryget(css)))
-			memcg = mem_cgroup_from_css(css);
+			/*
+			 * Even if we found a group we have to make sure it is
+			 * alive. css && !memcg means that the groups should be
+			 * skipped and we should continue the tree walk.
+			 * last_visited css is safe to use because it is
+			 * protected by css_get and the tree walk is rcu safe.
+			 */
+			if (next_cgroup) {
+				struct mem_cgroup *mem = mem_cgroup_from_cont(
+						next_cgroup);
+				if (css_tryget(&mem->css))
+					memcg = mem;
+				else {
+					prev_cgroup = next_cgroup;
+					goto skip_node;
+				}
+			}
+		}
 
 		if (reclaim) {
-			struct mem_cgroup *curr = memcg;
-
 			if (last_visited)
 				css_put(&last_visited->css);
 
-			if (css && !memcg)
-				curr = mem_cgroup_from_css(css);
-
 			/* make sure that the cached memcg is not removed */
-			if (curr)
-				css_get(&curr->css);
-			iter->last_visited = curr;
+			if (memcg)
+				css_get(&memcg->css);
+			iter->last_visited = memcg;
 
-			if (!css)
+			if (!memcg)
 				iter->generation++;
 			else if (!prev && memcg)
 				reclaim->generation = iter->generation;
 			spin_unlock(&iter->iter_lock);
-		} else if (css && !memcg) {
-			last_visited = mem_cgroup_from_css(css);
 		}
 
-		if (prev && !css)
+		if (prev && !memcg)
 			goto out_unlock;
 	}
 out_unlock:
-- 
1.7.10.4


* [PATCH v3 6/7] memcg: further simplify mem_cgroup_iter
From: Michal Hocko @ 2013-01-03 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan

mem_cgroup_iter basically does two things currently. It takes care of
the housekeeping (reference counting, the reclaim cookie) and it
iterates through the hierarchy tree (using the generic cgroup tree
walk).
The code would be much easier to follow if we moved the iteration
outside of the function (to __mem_cgroup_iter_next) so that the
distinction is clearer.
This patch doesn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   79 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d8c6e5e..d80fcff 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1110,6 +1110,51 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
 	return memcg;
 }
 
+/*
+ * Returns a next (in a pre-order walk) alive memcg (with elevated css
+ * ref. count) or NULL if the whole root's subtree has been visited.
+ *
+ * helper function to be used by mem_cgroup_iter
+ */
+static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root,
+		struct mem_cgroup *last_visited)
+{
+	struct cgroup *prev_cgroup, *next_cgroup;
+
+	/*
+	 * Root is not visited by cgroup iterators so it needs an
+	 * explicit visit.
+	 */
+	if (!last_visited)
+		return root;
+
+	prev_cgroup = (last_visited == root) ? NULL
+		: last_visited->css.cgroup;
+skip_node:
+	next_cgroup = cgroup_next_descendant_pre(
+			prev_cgroup, root->css.cgroup);
+
+	/*
+	 * Even if we found a group we have to make sure it is
+	 * alive. css && !memcg means that the groups should be
+	 * skipped and we should continue the tree walk.
+	 * last_visited css is safe to use because it is
+	 * protected by css_get and the tree walk is rcu safe.
+	 */
+	if (next_cgroup) {
+		struct mem_cgroup *mem = mem_cgroup_from_cont(
+				next_cgroup);
+		if (css_tryget(&mem->css))
+			return mem;
+		else {
+			prev_cgroup = next_cgroup;
+			goto skip_node;
+		}
+	}
+
+	return NULL;
+}
+
 /**
  * mem_cgroup_iter - iterate over memory cgroup hierarchy
  * @root: hierarchy root
@@ -1172,39 +1217,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			}
 		}
 
-		/*
-		 * Root is not visited by cgroup iterators so it needs an
-		 * explicit visit.
-		 */
-		if (!last_visited) {
-			memcg = root;
-		} else {
-			struct cgroup *prev_cgroup, *next_cgroup;
-
-			prev_cgroup = (last_visited == root) ? NULL
-				: last_visited->css.cgroup;
-skip_node:
-			next_cgroup = cgroup_next_descendant_pre(
-					prev_cgroup, root->css.cgroup);
-
-			/*
-			 * Even if we found a group we have to make sure it is
-			 * alive. css && !memcg means that the groups should be
-			 * skipped and we should continue the tree walk.
-			 * last_visited css is safe to use because it is
-			 * protected by css_get and the tree walk is rcu safe.
-			 */
-			if (next_cgroup) {
-				struct mem_cgroup *mem = mem_cgroup_from_cont(
-						next_cgroup);
-				if (css_tryget(&mem->css))
-					memcg = mem;
-				else {
-					prev_cgroup = next_cgroup;
-					goto skip_node;
-				}
-			}
-		}
+		memcg = __mem_cgroup_iter_next(root, last_visited);
 
 		if (reclaim) {
 			if (last_visited)
-- 
1.7.10.4


* [PATCH v3 7/7] cgroup: remove css_get_next
From: Michal Hocko @ 2013-01-03 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan

Now that we have generic and well-ordered cgroup tree walkers, there is
no need to keep css_get_next around.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/cgroup.h |    7 -------
 kernel/cgroup.c        |   49 ------------------------------------------------
 2 files changed, 56 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7d73905..a4d86b0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -685,13 +685,6 @@ void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
 
 struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
 
-/*
- * Get a cgroup whose id is greater than or equal to id under tree of root.
- * Returning a cgroup_subsys_state or NULL.
- */
-struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
-		struct cgroup_subsys_state *root, int *foundid);
-
 /* Returns true if root is ancestor of cg */
 bool css_is_ancestor(struct cgroup_subsys_state *cg,
 		     const struct cgroup_subsys_state *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f34c41b..3013ec4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5384,55 +5384,6 @@ struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
 }
 EXPORT_SYMBOL_GPL(css_lookup);
 
-/**
- * css_get_next - lookup next cgroup under specified hierarchy.
- * @ss: pointer to subsystem
- * @id: current position of iteration.
- * @root: pointer to css. search tree under this.
- * @foundid: position of found object.
- *
- * Search next css under the specified hierarchy of rootid. Calling under
- * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
- */
-struct cgroup_subsys_state *
-css_get_next(struct cgroup_subsys *ss, int id,
-	     struct cgroup_subsys_state *root, int *foundid)
-{
-	struct cgroup_subsys_state *ret = NULL;
-	struct css_id *tmp;
-	int tmpid;
-	int rootid = css_id(root);
-	int depth = css_depth(root);
-
-	if (!rootid)
-		return NULL;
-
-	BUG_ON(!ss->use_id);
-	WARN_ON_ONCE(!rcu_read_lock_held());
-
-	/* fill start point for scan */
-	tmpid = id;
-	while (1) {
-		/*
-		 * scan next entry from bitmap(tree), tmpid is updated after
-		 * idr_get_next().
-		 */
-		tmp = idr_get_next(&ss->idr, &tmpid);
-		if (!tmp)
-			break;
-		if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
-			ret = rcu_dereference(tmp->css);
-			if (ret) {
-				*foundid = tmpid;
-				break;
-			}
-		}
-		/* continue to scan from next id */
-		tmpid = tmpid + 1;
-	}
-	return ret;
-}
-
 /*
  * get corresponding css from file open on cgroupfs directory
  */
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 7/7] cgroup: remove css_get_next
  2013-01-03 17:54   ` Michal Hocko
@ 2013-01-04  3:42     ` Li Zefan
  -1 siblings, 0 replies; 78+ messages in thread
From: Li Zefan @ 2013-01-04  3:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, Tejun Heo, Glauber Costa

On 2013/1/4 1:54, Michal Hocko wrote:
> Now that we have generic and well ordered cgroup tree walkers there is
> no need to keep css_get_next in the place.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Li Zefan <lizefan@huawei.com>

> ---
>  include/linux/cgroup.h |    7 -------
>  kernel/cgroup.c        |   49 ------------------------------------------------
>  2 files changed, 56 deletions(-)



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-01-03 17:54   ` Michal Hocko
@ 2013-01-07  6:18     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 78+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-07  6:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Johannes Weiner, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

(2013/01/04 2:54), Michal Hocko wrote:
> Now that per-node-zone-priority iterator caches memory cgroups rather
> than their css ids we have to be careful and remove them from the
> iterator when they are on the way out otherwise they might hang for
> unbounded amount of time (until the global/targeted reclaim triggers the
> zone under priority to find out the group is dead and let it to find the
> final rest).
> 
> This is solved by hooking into mem_cgroup_css_offline and checking all
> per-node-zone-priority iterators up the way to the root cgroup. If the
> current memcg is found in the respective iter->last_visited then it is
> replaced by the previous one in the same sub-hierarchy.
> 
> This guarantees that no group gets more reclaiming than necessary and
> the next iteration will continue without noticing that the removed group
> has disappeared.
> 
> Spotted-by: Ying Han <yinghan@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>


Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 0/7] rework mem_cgroup iterator
  2013-01-03 17:54 ` Michal Hocko
@ 2013-01-23 12:52   ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-01-23 12:52 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	Tejun Heo, Glauber Costa, Li Zefan

Are there any comments? Ying, Johannes?
I would be happy if this could go into 3.9.

On Thu 03-01-13 18:54:14, Michal Hocko wrote:
> Hi all,
> this is a third version of the patchset previously posted here:
> https://lkml.org/lkml/2012/11/26/616
> 
> The patch set tries to make mem_cgroup_iter saner in the way how it
> walks hierarchies. css->id based traversal is far from being ideal as it
> is not deterministic because it depends on the creation ordering.
> 
> Diffstat doesn't look that promising as in previous versions anymore but
> I think it is worth the resulting outcome (and the sanity ;)).
> 
> The first patch fixes a potential misbehaving which I haven't seen but
> the fix is needed for the later patches anyway. We could take it alone
> as well but I do not have any bug report to base the fix on. The second
> one is also preparatory and it is new to the series.
> 
> The third patch is the core of the patchset and it replaces css_get_next
> based on css_id by the generic cgroup pre-order iterator which
> means that css_id is no longer used by memcg. This brings some
> chalanges for the last visited group caching during the reclaim
> (mem_cgroup_per_zone::reclaim_iter). We have to use memcg pointers
> directly now which means that we have to keep a reference to those
> groups' css to keep them alive.
> 
> The next patch fixups an unbounded cgroup removal holdoff caused by
> the elevated css refcount and does the clean up on the group removal.
> Thanks to Ying who spotted this during testing of the previous version
> of the patchset.
> I could have folded it into the previous patch but I felt it would be
> too big to review but if people feel it would be better that way, I have
> no problems to squash them together.
> 
> The fourth and fifth patches are an attempt for simplification of the
> mem_cgroup_iter. css juggling is removed and the iteration logic is
> moved to a helper so that the reference counting and iteration are
> separated.
> 
> The last patch just removes css_get_next as there is no user for it any
> longer.
> 
> I am also thinking that leaf-to-root iteration makes more sense but this
> patch is not included in the series yet because I have to think some
> more about the justification.
> 
> Same as with the previous version I have tested with a quite simple
> hierarchy:
>         A (limit = 280M, use_hierarchy=true)
>       / | \
>      B  C  D (all have 100M limit)
> 
> And a separate kernel build in the each leaf group. This triggers
> both children only and hierarchical reclaim which is parallel so the
> iter_reclaim caching is active a lot. I will hammer it some more but the
> series should be in quite a good shape already. 
> 
> Michal Hocko (7):
>       memcg: synchronize per-zone iterator access by a spinlock
>       memcg: keep prev's css alive for the whole mem_cgroup_iter
>       memcg: rework mem_cgroup_iter to use cgroup iterators
>       memcg: remove memcg from the reclaim iterators
>       memcg: simplify mem_cgroup_iter
>       memcg: further simplify mem_cgroup_iter
>       cgroup: remove css_get_next
> 
> And the diffstat says:
>  include/linux/cgroup.h |    7 --
>  kernel/cgroup.c        |   49 ------------
>  mm/memcontrol.c        |  199 ++++++++++++++++++++++++++++++++++++++++++------
>  3 files changed, 175 insertions(+), 80 deletions(-)
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-01-03 17:54   ` Michal Hocko
@ 2013-02-08 19:33     ` Johannes Weiner
  -1 siblings, 0 replies; 78+ messages in thread
From: Johannes Weiner @ 2013-02-08 19:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Thu, Jan 03, 2013 at 06:54:18PM +0100, Michal Hocko wrote:
> Now that per-node-zone-priority iterator caches memory cgroups rather
> than their css ids we have to be careful and remove them from the
> iterator when they are on the way out otherwise they might hang for
> unbounded amount of time (until the global/targeted reclaim triggers the
> zone under priority to find out the group is dead and let it to find the
> final rest).
> 
> This is solved by hooking into mem_cgroup_css_offline and checking all
> per-node-zone-priority iterators up the way to the root cgroup. If the
> current memcg is found in the respective iter->last_visited then it is
> replaced by the previous one in the same sub-hierarchy.
> 
> This guarantees that no group gets more reclaiming than necessary and
> the next iteration will continue without noticing that the removed group
> has disappeared.
> 
> Spotted-by: Ying Han <yinghan@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/memcontrol.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 89 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e9f5c47..4f81abd 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6375,10 +6375,99 @@ free_out:
>  	return ERR_PTR(error);
>  }
>  
> +/*
> + * Helper to find memcg's previous group under the given root
> + * hierarchy.
> + */
> +struct mem_cgroup *__find_prev_memcg(struct mem_cgroup *root,
> +		struct mem_cgroup *memcg)
> +{
> +	struct cgroup *memcg_cgroup = memcg->css.cgroup;
> +	struct cgroup *root_cgroup = root->css.cgroup;
> +	struct cgroup *prev_cgroup = NULL;
> +	struct cgroup *iter;
> +
> +	cgroup_for_each_descendant_pre(iter, root_cgroup) {
> +		if (iter == memcg_cgroup)
> +			break;
> +		prev_cgroup = iter;
> +	}
> +
> +	return (prev_cgroup) ? mem_cgroup_from_cont(prev_cgroup) : NULL;
> +}
> +
> +/*
> + * Remove the given memcg under given root from all per-node per-zone
> + * per-priority chached iterators.
> + */
> +static void mem_cgroup_uncache_reclaim_iters(struct mem_cgroup *root,
> +		struct mem_cgroup *memcg)
> +{
> +	int node;
> +
> +	for_each_node(node) {
> +		struct mem_cgroup_per_node *pn = root->info.nodeinfo[node];
> +		int zone;
> +
> +		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> +			struct mem_cgroup_per_zone *mz;
> +			int prio;
> +
> +			mz = &pn->zoneinfo[zone];
> +			for (prio = 0; prio < DEF_PRIORITY + 1; prio++) {
> +				struct mem_cgroup_reclaim_iter *iter;
> +
> +				/*
> +				 * Just drop the reference on the removed memcg
> +				 * cached last_visited. No need to lock iter as
> +				 * the memcg is on the way out and cannot be
> +				 * reclaimed.
> +				 */
> +				iter = &mz->reclaim_iter[prio];
> +				if (root == memcg) {
> +					if (iter->last_visited)
> +						css_put(&iter->last_visited->css);
> +					continue;
> +				}
> +
> +				rcu_read_lock();
> +				spin_lock(&iter->iter_lock);
> +				if (iter->last_visited == memcg) {
> +					iter->last_visited = __find_prev_memcg(
> +							root, memcg);
> +					css_put(&memcg->css);
> +				}
> +				spin_unlock(&iter->iter_lock);
> +				rcu_read_unlock();
> +			}
> +		}
> +	}
> +}
> +
> +/*
> + * Remove the given memcg from all cached reclaim iterators.
> + */
> +static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup *parent = memcg;
> +
> +	do {
> +		mem_cgroup_uncache_reclaim_iters(parent, memcg);
> +	} while ((parent = parent_mem_cgroup(parent)));
> +
> +	/*
> +	 * if the root memcg is not hierarchical we have to check it
> +	 * explicitely.
> +	 */
> +	if (!root_mem_cgroup->use_hierarchy)
> +		mem_cgroup_uncache_reclaim_iters(root_mem_cgroup, memcg);
> +}

for each in hierarchy:
  for each node:
    for each zone:
      for each reclaim priority:

every time a cgroup is destroyed.  I don't think such a hammer is
justified in general, let alone for consolidating code a little.

Can we invalidate the position cache lazily?  Have a global "cgroup
destruction" counter and store a snapshot of that counter whenever we
put a cgroup pointer in the position cache.  We only use the cached
pointer if that counter has not changed in the meantime, so we know
that the cgroup still exists.

It is pretty imprecise and we invalidate the whole cache every
time a cgroup is destroyed, but I think that should be okay.  If not,
better ideas are welcome.
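
For illustration, the counter idea could look roughly like this (the
names memcg_destroy_count and iter->dead_count are made up here just to
sketch it, this is not code from the series; the rest of this thread
refines it with rcu and css_tryget on top):

	/* bumped once for every memory cgroup that gets destroyed */
	static atomic_t memcg_destroy_count;

	/* caching side: remember the position plus a counter snapshot */
	static void iter_cache_position(struct mem_cgroup_reclaim_iter *iter,
					struct mem_cgroup *pos)
	{
		iter->last_visited = pos;
		iter->dead_count = atomic_read(&memcg_destroy_count);
	}

	/* lookup side: trust the cache only if no cgroup died meanwhile */
	static struct mem_cgroup *
	iter_cached_position(struct mem_cgroup_reclaim_iter *iter)
	{
		if (iter->dead_count != atomic_read(&memcg_destroy_count))
			return NULL;	/* stale, restart from the root */
		return iter->last_visited;
	}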

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-08 19:33     ` Johannes Weiner
@ 2013-02-11 15:16       ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-11 15:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Fri 08-02-13 14:33:18, Johannes Weiner wrote:
[...]
> for each in hierarchy:
>   for each node:
>     for each zone:
>       for each reclaim priority:
> 
> every time a cgroup is destroyed.  I don't think such a hammer is
> justified in general, let alone for consolidating code a little.
> 
> Can we invalidate the position cache lazily?  Have a global "cgroup
> destruction" counter and store a snapshot of that counter whenever we
> put a cgroup pointer in the position cache.  We only use the cached
> pointer if that counter has not changed in the meantime, so we know
> that the cgroup still exists.

Currently we have:
rcu_read_lock()	// keeps cgroup links safe
	iter->iter_lock	// keeps selection exclusive for a specific iterator
	1) global_counter == iter_counter
	2) css_tryget(cached_memcg)  // check it is still alive
rcu_read_unlock()

What would protect us from races when css would disappear between 1 and
2?

css is invalidated from worker context scheduled from __css_put and it
is using dentry locking which we surely do not want to pull here. We
could hook into css_offline which is called with cgroup_mutex but we
cannot use this one here because it is no longer exported and Tejun
would kill us for that.
So we can add a new global memcg internal lock to do this atomically.
Ohh, this is getting uglier...
	
> It is pretty pretty imprecise and we invalidate the whole cache every
> time a cgroup is destroyed, but I think that should be okay. 

I am not sure this is OK because this gives an indirect way of
influencing reclaim in one hierarchy by another one which opens a door
for regressions (or malicious over-reclaim in the extreme case).
So I do not like this very much.

> If not, better ideas are welcome.

Maybe we could keep the counter per memcg but that would mean that we
would need to go up the hierarchy as well. We wouldn't have to go over
node-zone-priority cleanup so it would be much more lightweight.

I am not sure this is necessarily better than explicit cleanup because
it brings yet another kind of generation number to the game but I guess
I can live with it if people really think the relaxed way is much
better.
What do you think about the patch below (untested yet)?
---
>From 8169aa49649753822661b8fbbfba0852dcfedba6 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 11 Feb 2013 16:13:48 +0100
Subject: [PATCH] memcg: relax memcg iter caching

Now that the per-node-zone-priority iterator caches memory cgroups rather
than their css ids we have to be careful and remove them from the
iterator when they are on the way out, otherwise they might hang there
for an unbounded amount of time (until global/targeted reclaim visits the
zone at that priority, finds out the group is dead and lets it find its
final rest).

We can fix this issue by relaxing the rules for the last_visited memcg as
well.
Instead of taking a reference to the css we can just store its pointer
and track the number of removed groups for each memcg. This number is
stored in the iterator every time a memcg is cached. If the iterator's
count doesn't match the current walker root's count we start over from
the root again. The group counter is incremented up the hierarchy every
time a group is removed.

dead_count_lock makes sure that we do not race with memcg removal.

Spotted-by: Ying Han <yinghan@google.com>
Original-idea-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   68 ++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 57 insertions(+), 11 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e9f5c47..65bf2cb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
 };
 
 struct mem_cgroup_reclaim_iter {
-	/* last scanned hierarchy member with elevated css ref count */
+	/*
+	 * last scanned hierarchy member. Valid only if last_dead_count
+	 * matches memcg->dead_count of the hierarchy root group.
+	 */
 	struct mem_cgroup *last_visited;
+	unsigned int last_dead_count;
+
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
 	/* lock to protect the position and generation */
@@ -357,6 +362,8 @@ struct mem_cgroup {
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
 
+	spinlock_t		dead_count_lock;
+	unsigned int		dead_count;
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
 	struct tcp_memcontrol tcp_mem;
 #endif
@@ -1162,15 +1169,24 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			mz = mem_cgroup_zoneinfo(root, nid, zid);
 			iter = &mz->reclaim_iter[reclaim->priority];
 			spin_lock(&iter->iter_lock);
-			last_visited = iter->last_visited;
 			if (prev && reclaim->generation != iter->generation) {
-				if (last_visited) {
-					css_put(&last_visited->css);
-					iter->last_visited = NULL;
-				}
+				iter->last_visited = NULL;
 				spin_unlock(&iter->iter_lock);
 				goto out_unlock;
 			}
+
+			/*
+			 * last_visited might be invalid if some of the group
+			 * downwards was removed. As we do not know which one
+			 * disappeared we have to start all over again from the
+			 * root.
+			 */
+			spin_lock(&root->dead_count_lock);
+			last_visited = iter->last_visited;
+			if (last_visited && (root->dead_count !=
+						iter->last_dead_count)) {
+				last_visited = NULL;
+			}
 		}
 
 		/*
@@ -1204,16 +1220,21 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		if (reclaim) {
 			struct mem_cgroup *curr = memcg;
 
-			if (last_visited)
-				css_put(&last_visited->css);
+			/*
+			 * last_visited is no longer used so we can let
+			 * another thread run and update dead_count
+			 * because the current memcg remains valid
+			 * regardless of which other memcg was removed
+			 */
+			spin_unlock(&root->dead_count_lock);
 
 			if (css && !memcg)
 				curr = mem_cgroup_from_css(css);
 
-			/* make sure that the cached memcg is not removed */
-			if (curr)
-				css_get(&curr->css);
 			iter->last_visited = curr;
+			spin_lock(&root->dead_count_lock);
+			iter->last_dead_count = root->dead_count;
+			spin_unlock(&root->dead_count_lock);
 
 			if (!css)
 				iter->generation++;
@@ -6375,10 +6396,35 @@ free_out:
 	return ERR_PTR(error);
 }
 
+/*
+ * Announce all parents that a group from their hierarchy is gone.
+ */
+static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;
+
+	while ((parent = parent_mem_cgroup(parent))) {
+		spin_lock(&parent->dead_count_lock);
+		parent->dead_count++;
+		spin_unlock(&parent->dead_count_lock);
+	}
+
+	/*
+	 * if the root memcg is not hierarchical we have to check it
+	 * explicitly.
+	 */
+	if (!root_mem_cgroup->use_hierarchy) {
+		spin_lock(&root_mem_cgroup->dead_count_lock);
+		root_mem_cgroup->dead_count++;
+		spin_unlock(&root_mem_cgroup->dead_count_lock);
+	}
+}
+
 static void mem_cgroup_css_offline(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
+	mem_cgroup_uncache_from_reclaim(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 }
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-11 15:16       ` Michal Hocko
@ 2013-02-11 17:56         ` Johannes Weiner
  -1 siblings, 0 replies; 78+ messages in thread
From: Johannes Weiner @ 2013-02-11 17:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote:
> On Fri 08-02-13 14:33:18, Johannes Weiner wrote:
> [...]
> > for each in hierarchy:
> >   for each node:
> >     for each zone:
> >       for each reclaim priority:
> > 
> > every time a cgroup is destroyed.  I don't think such a hammer is
> > justified in general, let alone for consolidating code a little.
> > 
> > Can we invalidate the position cache lazily?  Have a global "cgroup
> > destruction" counter and store a snapshot of that counter whenever we
> > put a cgroup pointer in the position cache.  We only use the cached
> > pointer if that counter has not changed in the meantime, so we know
> > that the cgroup still exists.
> 
> Currently we have:
> rcu_read_lock()	// keeps cgroup links safe
> 	iter->iter_lock	// keeps selection exclusive for a specific iterator
> 	1) global_counter == iter_counter
> 	2) css_tryget(cached_memcg)  // check it is still alive
> rcu_read_unlock()
> 
> What would protect us from races when css would disappear between 1 and
> 2?

rcu

> css is invalidated from worker context scheduled from __css_put and it
> is using dentry locking which we surely do not want to pull here. We
> could hook into css_offline which is called with cgroup_mutex but we
> cannot use this one here because it is no longer exported and Tejun
> would kill us for that.
> So we can add a new global memcg internal lock to do this atomically.
> Ohh, this is getting uglier...

A racing final css_put() means that the tryget fails, but our RCU read
lock keeps the CSS allocated.  If the dead_count is uptodate, it means
that the rcu read lock was acquired before the synchronize_rcu() that
precedes freeing the css.
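
Or, spelled out as code (just a sketch of the argument, with the names
from the patch):

	rcu_read_lock();
	last_visited = iter->last_visited;
	/*
	 * A racing final css_put() can only make the tryget below fail;
	 * the css itself is not freed until after an RCU grace period,
	 * so dereferencing the cached pointer here cannot blow up.
	 */
	if (!last_visited ||
	    root->dead_count != iter->last_dead_count ||
	    !css_tryget(&last_visited->css))
		last_visited = NULL;	/* stale cache, restart from the root */
	rcu_read_unlock();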

> > It is pretty pretty imprecise and we invalidate the whole cache every
> > time a cgroup is destroyed, but I think that should be okay. 
> 
> I am not sure this is OK because this gives an indirect way of
> influencing reclaim in one hierarchy by another one which opens a door
> for regressions (or malicious over-reclaim in the extreme case).
> So I do not like this very much.
> 
> > If not, better ideas are welcome.
> 
> Maybe we could keep the counter per memcg but that would mean that we
> would need to go up the hierarchy as well. We wouldn't have to go over
> node-zone-priority cleanup so it would be much more lightweight.
> 
> I am not sure this is necessarily better than explicit cleanup because
> it brings yet another kind of generation number to the game but I guess
> I can live with it if people really thing the relaxed way is much
> better.
> What do you think about the patch below (untested yet)?

Better, but I think you can get rid of both locks:

mem_cgroup_iter:
rcu_read_lock()
if atomic_read(&root->dead_count) == iter->dead_count:
  smp_rmb()
  if tryget(iter->position):
    position = iter->position
memcg = find_next(position)
css_put(position)
iter->position = memcg
smp_wmb() /* Write position cache BEFORE marking it uptodate */
iter->dead_count = atomic_read(&root->dead_count)
rcu_read_unlock()
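
Fleshed out a little, the above could look like this (still only a
sketch -- find_next() stands in for the actual pre-order walk helper
and the !reclaim case is ignored):

	struct mem_cgroup *position = NULL;
	struct mem_cgroup *memcg;

	rcu_read_lock();
	if (iter->last_dead_count == atomic_read(&root->dead_count)) {
		/*
		 * Pairs with the smp_wmb() below: only trust the cached
		 * pointer if it was stored before the dead_count snapshot.
		 */
		smp_rmb();
		position = iter->last_visited;
		if (position && !css_tryget(&position->css))
			position = NULL;
	}
	memcg = find_next(root, position);	/* pre-order walk */
	if (position)
		css_put(&position->css);
	iter->last_visited = memcg;
	/* write the position cache BEFORE marking it uptodate */
	smp_wmb();
	iter->last_dead_count = atomic_read(&root->dead_count);
	rcu_read_unlock();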

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-11 17:56         ` Johannes Weiner
@ 2013-02-11 19:29           ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-11 19:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Mon 11-02-13 12:56:19, Johannes Weiner wrote:
> On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote:
> > On Fri 08-02-13 14:33:18, Johannes Weiner wrote:
> > [...]
> > > for each in hierarchy:
> > >   for each node:
> > >     for each zone:
> > >       for each reclaim priority:
> > > 
> > > every time a cgroup is destroyed.  I don't think such a hammer is
> > > justified in general, let alone for consolidating code a little.
> > > 
> > > Can we invalidate the position cache lazily?  Have a global "cgroup
> > > destruction" counter and store a snapshot of that counter whenever we
> > > put a cgroup pointer in the position cache.  We only use the cached
> > > pointer if that counter has not changed in the meantime, so we know
> > > that the cgroup still exists.
> > 
> > Currently we have:
> > rcu_read_lock()	// keeps cgroup links safe
> > 	iter->iter_lock	// keeps selection exclusive for a specific iterator
> > 	1) global_counter == iter_counter
> > 	2) css_tryget(cached_memcg)  // check it is still alive
> > rcu_read_unlock()
> > 
> > What would protect us from races when css would disappear between 1 and
> > 2?
> 
> rcu

That was my first attempt but then I convinced myself it might not be
sufficient. But now that I think about it more I guess you are right.
 
> > css is invalidated from worker context scheduled from __css_put and it
> > is using dentry locking which we surely do not want to pull here. We
> > could hook into css_offline which is called with cgroup_mutex but we
> > cannot use this one here because it is no longer exported and Tejun
> > would kill us for that.
> > So we can add a new global memcg internal lock to do this atomically.
> > Ohh, this is getting uglier...
> 
> A racing final css_put() means that the tryget fails, but our RCU read
> lock keeps the CSS allocated.  If the dead_count is uptodate, it means
> that the rcu read lock was acquired before the synchronize_rcu()
> before the css is freed.

yes.

> 
> > > It is pretty pretty imprecise and we invalidate the whole cache every
> > > time a cgroup is destroyed, but I think that should be okay. 
> > 
> > I am not sure this is OK because this gives an indirect way of
> > influencing reclaim in one hierarchy by another one which opens a door
> > for regressions (or malicious over-reclaim in the extreme case).
> > So I do not like this very much.
> > 
> > > If not, better ideas are welcome.
> > 
> > Maybe we could keep the counter per memcg but that would mean that we
> > would need to go up the hierarchy as well. We wouldn't have to go over
> > node-zone-priority cleanup so it would be much more lightweight.
> > 
> > I am not sure this is necessarily better than explicit cleanup because
> > it brings yet another kind of generation number to the game but I guess
> > I can live with it if people really thing the relaxed way is much
> > better.
> > What do you think about the patch below (untested yet)?
> 
> Better, but I think you can get rid of both locks:

What is the other lock you have in mind?

> mem_cgroup_iter:
> rcu_read_lock()
> if atomic_read(&root->dead_count) == iter->dead_count:
>   smp_rmb()
>   if tryget(iter->position):
>     position = iter->position
> memcg = find_next(postion)
> css_put(position)
> iter->position = memcg
> smp_wmb() /* Write position cache BEFORE marking it uptodate */
> iter->dead_count = atomic_read(&root->dead_count)
> rcu_read_unlock()

Updated patch below:
---
>From 756c4f0091d250bc5ff816f8e9d11840e8522b3a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 11 Feb 2013 20:23:51 +0100
Subject: [PATCH] memcg: relax memcg iter caching

Now that the per-node-zone-priority iterator caches memory cgroups rather
than their css ids we have to be careful and remove them from the
iterator when they are on the way out, otherwise they might hang there
for an unbounded amount of time (until global/targeted reclaim visits the
zone at that priority, finds out the group is dead and lets it find its
final rest).

We can fix this issue by relaxing the rules for the last_visited memcg as
well.
Instead of taking a reference to the css before it is stored into
iter->last_visited we can just store its pointer and track the number of
removed groups for each memcg. This number is stored in the iterator
every time a memcg is cached. If the iterator's count doesn't match the
current walker root's count we start over from the root again. The group
counter is incremented up the hierarchy every time a group is removed.

The locking rules are a bit complicated but we primarily rely on rcu,
which protects the css from disappearing while its validity is being
checked. The validity is checked in two steps: first, iter->last_dead_count
has to match root->dead_count, and second, css_tryget has to confirm that
the group is still alive; the tryget then pins it until we get the next
memcg.

Spotted-by: Ying Han <yinghan@google.com>
Original-idea-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   66 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 57 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e9f5c47..f9b5719 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
 };
 
 struct mem_cgroup_reclaim_iter {
-	/* last scanned hierarchy member with elevated css ref count */
+	/*
+	 * last scanned hierarchy member. Valid only if last_dead_count
+	 * matches memcg->dead_count of the hierarchy root group.
+	 */
 	struct mem_cgroup *last_visited;
+	unsigned int last_dead_count;
+
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
 	/* lock to protect the position and generation */
@@ -357,6 +362,7 @@ struct mem_cgroup {
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
 
+	atomic_t	dead_count;
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
 	struct tcp_memcontrol tcp_mem;
 #endif
@@ -1158,19 +1164,33 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			int nid = zone_to_nid(reclaim->zone);
 			int zid = zone_idx(reclaim->zone);
 			struct mem_cgroup_per_zone *mz;
+			unsigned int dead_count;
 
 			mz = mem_cgroup_zoneinfo(root, nid, zid);
 			iter = &mz->reclaim_iter[reclaim->priority];
 			spin_lock(&iter->iter_lock);
-			last_visited = iter->last_visited;
 			if (prev && reclaim->generation != iter->generation) {
-				if (last_visited) {
-					css_put(&last_visited->css);
-					iter->last_visited = NULL;
-				}
+				iter->last_visited = NULL;
 				spin_unlock(&iter->iter_lock);
 				goto out_unlock;
 			}
+
+			/*
+			 * last_visited might be invalid if some of the group
+			 * downwards was removed. As we do not know which one
+			 * disappeared we have to start all over again from the
+			 * root.
+			 * css ref count then makes sure that css won't
+			 * disappear while we iterate to the next memcg
+			 */
+			last_visited = iter->last_visited;
+			dead_count = atomic_read(&root->dead_count);
+			smp_rmb();
+			if (last_visited &&
+					((dead_count != iter->last_dead_count) ||
+					 !css_tryget(&last_visited->css))) {
+				last_visited = NULL;
+			}
 		}
 
 		/*
@@ -1210,10 +1230,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			if (css && !memcg)
 				curr = mem_cgroup_from_css(css);
 
-			/* make sure that the cached memcg is not removed */
-			if (curr)
-				css_get(&curr->css);
+			/*
+			 * No memory barrier is needed here because we are
+			 * protected by iter_lock
+			 */
 			iter->last_visited = curr;
+			iter->last_dead_count = atomic_read(&root->dead_count);
 
 			if (!css)
 				iter->generation++;
@@ -6375,10 +6397,36 @@ free_out:
 	return ERR_PTR(error);
 }
 
+/*
+ * Announce all parents that a group from their hierarchy is gone.
+ */
+static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;
+
+	while ((parent = parent_mem_cgroup(parent)))
+		atomic_inc(&parent->dead_count);
+
+	/*
+	 * if the root memcg is not hierarchical we have to check it
+	 * explicitly.
+	 */
+	if (!root_mem_cgroup->use_hierarchy)
+		atomic_inc(&root_mem_cgroup->dead_count);
+
+	/*
+	 * Make sure that dead_count updates are visible before other
+	 * cleanup from css_offline.
+	 * Pairs with smp_rmb in mem_cgroup_iter
+	 */
+	smp_wmb();
+}
+
 static void mem_cgroup_css_offline(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
+	mem_cgroup_uncache_from_reclaim(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 }
-- 
1.7.10.4


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-11 19:29           ` Michal Hocko
@ 2013-02-11 19:58             ` Johannes Weiner
  -1 siblings, 0 replies; 78+ messages in thread
From: Johannes Weiner @ 2013-02-11 19:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Mon, Feb 11, 2013 at 08:29:29PM +0100, Michal Hocko wrote:
> On Mon 11-02-13 12:56:19, Johannes Weiner wrote:
> > On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote:
> > > Maybe we could keep the counter per memcg but that would mean that we
> > > would need to go up the hierarchy as well. We wouldn't have to go over
> > > node-zone-priority cleanup so it would be much more lightweight.
> > > 
> > > I am not sure this is necessarily better than explicit cleanup because
> > > it brings yet another kind of generation number to the game but I guess
> > > I can live with it if people really think the relaxed way is much
> > > better.
> > > What do you think about the patch below (untested yet)?
> > 
> > Better, but I think you can get rid of both locks:
> 
> What is the other lock you have in mind.

The iter lock itself.  I mean, multiple reclaimers can still race but
there won't be any corruption (if you make iter->dead_count a long,
setting it happens atomically, we only need the memcg->dead_count to be
an atomic because of the inc) and the worst that could happen is that
a reclaim starts at the wrong point in hierarchy, right?  But as you
said in the changelog that introduced the lock, it's never actually
been a practical problem.  You just need to put the wmb back in place,
so that we never see the dead_count give the green light while the
cached position is stale, or we'll tryget random memory.

> > mem_cgroup_iter:
> > rcu_read_lock()
> > if atomic_read(&root->dead_count) == iter->dead_count:
> >   smp_rmb()
> >   if tryget(iter->position):
> >     position = iter->position
> > memcg = find_next(position)
> > css_put(position)
> > iter->position = memcg
> > smp_wmb() /* Write position cache BEFORE marking it uptodate */
> > iter->dead_count = atomic_read(&root->dead_count)
> > rcu_read_unlock()
> 
> Updated patch below:

Cool, thanks.  I hope you don't find it too ugly anymore :-)

> >From 756c4f0091d250bc5ff816f8e9d11840e8522b3a Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Mon, 11 Feb 2013 20:23:51 +0100
> Subject: [PATCH] memcg: relax memcg iter caching
> 
> Now that per-node-zone-priority iterator caches memory cgroups rather
> than their css ids we have to be careful and remove them from the
> iterator when they are on the way out, otherwise they might hang for
> an unbounded amount of time (until the global/targeted reclaim triggers the
> zone under priority to find out the group is dead and lets it find its
> final rest).
> 
> We can fix this issue by relaxing rules for the last_visited memcg as
> well.
> Instead of taking a reference to the css before it is stored into
> iter->last_visited we can just store its pointer and track the number of
> removed groups for each memcg. This number is stored into the iterator
> every time a memcg is cached. If the iter count doesn't match the
> current walker root's one we will start over from the root again. The
> group counter is incremented upwards the hierarchy every time a group is
> removed.
> 
> Locking rules are a bit complicated but we primarily rely on rcu which
> protects css from disappearing while it is proved to be still valid. The
> validity is checked in two steps. First the iter->last_dead_count has
> to match root->dead_count and second css_tryget has to confirm
> that the group is still alive and pins it until we get the next memcg.
> 
> Spotted-by: Ying Han <yinghan@google.com>
> Original-idea-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/memcontrol.c |   66 +++++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 57 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e9f5c47..f9b5719 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
>  };
>  
>  struct mem_cgroup_reclaim_iter {
> -	/* last scanned hierarchy member with elevated css ref count */
> +	/*
> +	 * last scanned hierarchy member. Valid only if last_dead_count
> +	 * matches memcg->dead_count of the hierarchy root group.
> +	 */
>  	struct mem_cgroup *last_visited;
> +	unsigned int last_dead_count;
> +
>  	/* scan generation, increased every round-trip */
>  	unsigned int generation;
>  	/* lock to protect the position and generation */
> @@ -357,6 +362,7 @@ struct mem_cgroup {
>  	struct mem_cgroup_stat_cpu nocpu_base;
>  	spinlock_t pcp_counter_lock;
>  
> +	atomic_t	dead_count;
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
>  	struct tcp_memcontrol tcp_mem;
>  #endif
> @@ -1158,19 +1164,33 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  			int nid = zone_to_nid(reclaim->zone);
>  			int zid = zone_idx(reclaim->zone);
>  			struct mem_cgroup_per_zone *mz;
> +			unsigned int dead_count;
>  
>  			mz = mem_cgroup_zoneinfo(root, nid, zid);
>  			iter = &mz->reclaim_iter[reclaim->priority];
>  			spin_lock(&iter->iter_lock);
> -			last_visited = iter->last_visited;
>  			if (prev && reclaim->generation != iter->generation) {
> -				if (last_visited) {
> -					css_put(&last_visited->css);
> -					iter->last_visited = NULL;
> -				}
> +				iter->last_visited = NULL;
>  				spin_unlock(&iter->iter_lock);
>  				goto out_unlock;
>  			}
> +
> +			/*
> +			 * last_visited might be invalid if some of the group
> +			 * downwards was removed. As we do not know which one
> +			 * disappeared we have to start all over again from the
> +			 * root.
> +			 * css ref count then makes sure that css won't
> +			 * disappear while we iterate to the next memcg
> +			 */
> +			last_visited = iter->last_visited;
> +			dead_count = atomic_read(&root->dead_count);
> +			smp_rmb();

Confused about this barrier, see below.

As per above, if you remove the iter lock, those lines are mixed up.
You need to read the dead count first because the writer updates the
dead count after it sets the new position.  That way, if the dead
count gives the go-ahead, you KNOW that the position cache is valid,
because it has been updated first.  If either the two reads or the two
writes get reordered, you risk seeing a matching dead count while the
position cache is stale.
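
Condensed, the ordering being asked for looks like this (a sketch only,
using the field names from the patch):

	/* writer: publish the new position, then stamp it as up to date */
	iter->last_visited = curr;
	smp_wmb();
	iter->last_dead_count = atomic_read(&root->dead_count);

	/* reader: check the stamp first, only then trust the position */
	dead_count = atomic_read(&root->dead_count);
	smp_rmb();
	last_visited = iter->last_visited;
	if (!last_visited ||
	    dead_count != iter->last_dead_count ||
	    !css_tryget(&last_visited->css))
		last_visited = NULL;	/* stale or dying: restart from root */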

> +			if (last_visited &&
> +					((dead_count != iter->last_dead_count) ||
> +					 !css_tryget(&last_visited->css))) {
> +				last_visited = NULL;
> +			}
>  		}
>  
>  		/*
> @@ -1210,10 +1230,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  			if (css && !memcg)
>  				curr = mem_cgroup_from_css(css);
>  
> -			/* make sure that the cached memcg is not removed */
> -			if (curr)
> -				css_get(&curr->css);
> +			/*
> +			 * No memory barrier is needed here because we are
> +			 * protected by iter_lock
> +			 */
>  			iter->last_visited = curr;
> +			iter->last_dead_count = atomic_read(&root->dead_count);
>  
>  			if (!css)
>  				iter->generation++;
> @@ -6375,10 +6397,36 @@ free_out:
>  	return ERR_PTR(error);
>  }
>  
> +/*
> + * Announce to all parents that a group from their hierarchy is gone.
> + */
> +static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg)

How about

static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)

?

> +{
> +	struct mem_cgroup *parent = memcg;
> +
> +	while ((parent = parent_mem_cgroup(parent)))
> +		atomic_inc(&parent->dead_count);
> +
> +	/*
> +	 * if the root memcg is not hierarchical we have to check it
> +	 * explicitly.
> +	 */
> +	if (!root_mem_cgroup->use_hierarchy)
> +		atomic_inc(&parent->dead_count);

Increase root_mem_cgroup->dead_count instead?
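
That is, presumably something like:

	if (!root_mem_cgroup->use_hierarchy)
		atomic_inc(&root_mem_cgroup->dead_count);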

> +	/*
> +	 * Make sure that dead_count updates are visible before other
> +	 * cleanup from css_offline.
> +	 * Pairs with smp_rmb in mem_cgroup_iter
> +	 */
> +	smp_wmb();

That's unexpected.  What other cleanups?  A race between this and
mem_cgroup_iter should be fine because of the RCU synchronization.

Thanks!
Johannes

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-11 19:58             ` Johannes Weiner
@ 2013-02-11 21:27               ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-11 21:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> On Mon, Feb 11, 2013 at 08:29:29PM +0100, Michal Hocko wrote:
> > On Mon 11-02-13 12:56:19, Johannes Weiner wrote:
> > > On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote:
> > > > Maybe we could keep the counter per memcg but that would mean that we
> > > > would need to go up the hierarchy as well. We wouldn't have to go over
> > > > node-zone-priority cleanup so it would be much more lightweight.
> > > > 
> > > > I am not sure this is necessarily better than explicit cleanup because
> > > > it brings yet another kind of generation number to the game but I guess
> > > > I can live with it if people really think the relaxed way is much
> > > > better.
> > > > What do you think about the patch below (untested yet)?
> > > 
> > > Better, but I think you can get rid of both locks:
> > 
> > What is the other lock you have in mind.
> 
> The iter lock itself.  I mean, multiple reclaimers can still race but
> there won't be any corruption (if you make iter->dead_count a long,
> setting it happens atomically, we only need the memcg->dead_count to be
> an atomic because of the inc) and the worst that could happen is that
> a reclaim starts at the wrong point in hierarchy, right?

The lack of synchronization basically means that 2 parallel reclaimers
can reclaim every group exactly once (ideally) or each group up to
twice in the worst case.
So the exclusion was quite comfortable.
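
A rough worst-case interleaving, assuming two reclaimers sharing one
reclaim_iter over a root with children A and B:

	reclaimer 1: reads last_visited, picks the next group A, reclaims it
	reclaimer 2: reads the same last_visited, also picks A, reclaims it
	reclaimer 1: stores A, picks B next, reclaims it
	reclaimer 2: stores A, picks B next, reclaims it

so each of A and B gets reclaimed twice instead of once. With the
iter_lock the two reclaimers share the walk and visit each group only
once per round trip.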

> But as you said in the changelog that introduced the lock, it's never
> actually been a practical problem.

That is true, but those bugs would be subtle, so I wouldn't be
opposed to preventing them before we get burnt. But if you think that
we should keep the previous semantics I can drop that patch.

> You just need to put the wmb back in place, so that we never see the
> dead_count give the green light while the cached position is stale, or
> we'll tryget random memory.
> 
> > > mem_cgroup_iter:
> > > rcu_read_lock()
> > > if atomic_read(&root->dead_count) == iter->dead_count:
> > >   smp_rmb()
> > >   if tryget(iter->position):
> > >     position = iter->position
> > > memcg = find_next(position)
> > > css_put(position)
> > > iter->position = memcg
> > > smp_wmb() /* Write position cache BEFORE marking it uptodate */
> > > iter->dead_count = atomic_read(&root->dead_count)
> > > rcu_read_unlock()
> > 
> > Updated patch below:
> 
> Cool, thanks.  I hope you don't find it too ugly anymore :-)

It's getting tricky and you know how people love it when you have to play
and rely on atomics with memory barriers...
 
[...]
> > +
> > +			/*
> > +			 * last_visited might be invalid if some of the group
> > +			 * downwards was removed. As we do not know which one
> > +			 * disappeared we have to start all over again from the
> > +			 * root.
> > +			 * css ref count then makes sure that css won't
> > +			 * disappear while we iterate to the next memcg
> > +			 */
> > +			last_visited = iter->last_visited;
> > +			dead_count = atomic_read(&root->dead_count);
> > +			smp_rmb();
> 
> Confused about this barrier, see below.
> 
> As per above, if you remove the iter lock, those lines are mixed up.
> You need to read the dead count first because the writer updates the
> dead count after it sets the new position. 

You are right, we need
+			dead_count = atomic_read(&root->dead_count);
+			smp_rmb();
+			last_visited = iter->last_visited;

> That way, if the dead count gives the go-ahead, you KNOW that the
> position cache is valid, because it has been updated first.

OK, you are right. We can live without css_tryget because dead_count is
either OK which means that css would be alive at least this rcu period
(and RCU walk would be safe as well) or it is incremented which means
that we have started css_offline already and then css is dead already.
So css_tryget can be dropped.

> If either the two reads or the two writes get reordered, you risk
> seeing a matching dead count while the position cache is stale.
> 
> > +			if (last_visited &&
> > +					((dead_count != iter->last_dead_count) ||
> > +					 !css_tryget(&last_visited->css))) {
> > +				last_visited = NULL;
> > +			}
> >  		}
> >  
> >  		/*
> > @@ -1210,10 +1230,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> >  			if (css && !memcg)
> >  				curr = mem_cgroup_from_css(css);
> >  
> > -			/* make sure that the cached memcg is not removed */
> > -			if (curr)
> > -				css_get(&curr->css);
> > +			/*
> > +			 * No memory barrier is needed here because we are
> > +			 * protected by iter_lock
> > +			 */
> >  			iter->last_visited = curr;

+			smp_wmb();

> > +			iter->last_dead_count = atomic_read(&root->dead_count);
> >  
> >  			if (!css)
> >  				iter->generation++;
> > @@ -6375,10 +6397,36 @@ free_out:
> >  	return ERR_PTR(error);
> >  }
> >  
> > +/*
> > + * Announce to all parents that a group from their hierarchy is gone.
> > + */
> > +static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg)
> 
> How about
> 
> static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)

OK

> ?
> 
> > +{
> > +	struct mem_cgroup *parent = memcg;
> > +
> > +	while ((parent = parent_mem_cgroup(parent)))
> > +		atomic_inc(&parent->dead_count);
> > +
> > +	/*
> > +	 * if the root memcg is not hierarchical we have to check it
> > +	 * explicitly.
> > +	 */
> > +	if (!root_mem_cgroup->use_hierarchy)
> > +		atomic_inc(&parent->dead_count);
> 
> Increase root_mem_cgroup->dead_count instead?

Sure. C&P
 
> > +	/*
> > +	 * Make sure that dead_count updates are visible before other
> > +	 * cleanup from css_offline.
> > +	 * Pairs with smp_rmb in mem_cgroup_iter
> > +	 */
> > +	smp_wmb();
> 
> That's unexpected.  What other cleanups?  A race between this and
> mem_cgroup_iter should be fine because of the RCU synchronization.

OK, I was probably too careful (memory barriers are always head
scratchers). I was worried that all dead_count updates should be committed
before we do other steps in the cleanup, like reparenting charges etc.
But as you say it will not make any difference.

I will get back to this tomorrow.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-11 21:27               ` Michal Hocko
@ 2013-02-11 22:07                 ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-11 22:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Mon 11-02-13 22:27:56, Michal Hocko wrote:
[...]
> I will get back to this tomorrow.

Maybe not a great idea as it is getting late here and my brain is turning
into cabbage, but here we go:
---
>From f927358fe620837081d7a7ec6bf27af378deb35d Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 11 Feb 2013 23:02:00 +0100
Subject: [PATCH] memcg: relax memcg iter caching

Now that per-node-zone-priority iterator caches memory cgroups rather
than their css ids we have to be careful and remove them from the
iterator when they are on the way out, otherwise they might hang for
an unbounded amount of time (until the global/targeted reclaim triggers the
zone under priority to find out the group is dead and lets it find its
final rest).

We can fix this issue by relaxing rules for the last_visited memcg as
well.
Instead of taking a reference to the css before it is stored into
iter->last_visited we can just store its pointer and track the number of
removed groups for each memcg. This number is stored into the iterator
every time a memcg is cached. If the iter count doesn't match the
current walker root's one we will start over from the root again. The
group counter is incremented upwards the hierarchy every time a group is
removed.

Locking rules got a bit complicated. We primarily rely on the rcu read
lock, which makes sure that once we see an up-to-date dead_count,
iter->last_visited is valid for an RCU walk. smp_rmb makes sure that
dead_count is read before last_visited and last_dead_count, while smp_wmb
makes sure that last_visited is updated before last_dead_count, so an
up-to-date last_dead_count cannot point to an outdated last_visited.
This also means that css reference counting is no longer needed because
RCU will keep last_visited alive.

Spotted-by: Ying Han <yinghan@google.com>
Original-idea-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   53 ++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e9f5c47..42f9d94 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
 };
 
 struct mem_cgroup_reclaim_iter {
-	/* last scanned hierarchy member with elevated css ref count */
+	/*
+	 * last scanned hierarchy member. Valid only if last_dead_count
+	 * matches memcg->dead_count of the hierarchy root group.
+	 */
 	struct mem_cgroup *last_visited;
+	unsigned int last_dead_count;
+
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
 	/* lock to protect the position and generation */
@@ -357,6 +362,7 @@ struct mem_cgroup {
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
 
+	atomic_t	dead_count;
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
 	struct tcp_memcontrol tcp_mem;
 #endif
@@ -1158,19 +1164,30 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			int nid = zone_to_nid(reclaim->zone);
 			int zid = zone_idx(reclaim->zone);
 			struct mem_cgroup_per_zone *mz;
+			unsigned int dead_count;
 
 			mz = mem_cgroup_zoneinfo(root, nid, zid);
 			iter = &mz->reclaim_iter[reclaim->priority];
 			spin_lock(&iter->iter_lock);
-			last_visited = iter->last_visited;
 			if (prev && reclaim->generation != iter->generation) {
-				if (last_visited) {
-					css_put(&last_visited->css);
-					iter->last_visited = NULL;
-				}
+				iter->last_visited = NULL;
 				spin_unlock(&iter->iter_lock);
 				goto out_unlock;
 			}
+
+			/*
+			 * last_visited might be invalid if some of the group
+			 * downwards was removed. As we do not know which one
+			 * disappeared we have to start all over again from the
+			 * root.
+			 */
+			dead_count = atomic_read(&root->dead_count);
+			smp_rmb();
+			last_visited = iter->last_visited;
+			if (last_visited &&
+					((dead_count != iter->last_dead_count))) {
+				last_visited = NULL;
+			}
 		}
 
 		/*
@@ -1210,10 +1227,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			if (css && !memcg)
 				curr = mem_cgroup_from_css(css);
 
-			/* make sure that the cached memcg is not removed */
-			if (curr)
-				css_get(&curr->css);
 			iter->last_visited = curr;
+			smp_wmb();
+			iter->last_dead_count = atomic_read(&root->dead_count);
 
 			if (!css)
 				iter->generation++;
@@ -6375,10 +6391,29 @@ free_out:
 	return ERR_PTR(error);
 }
 
+/*
+ * Announce to all parents that a group from their hierarchy is gone.
+ */
+static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;
+
+	while ((parent = parent_mem_cgroup(parent)))
+		atomic_inc(&parent->dead_count);
+
+	/*
+	 * if the root memcg is not hierarchical we have to check it
+	 * explicitly.
+	 */
+	if (!root_mem_cgroup->use_hierarchy)
+		atomic_inc(&root_mem_cgroup->dead_count);
+}
+
 static void mem_cgroup_css_offline(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
+	mem_cgroup_invalidate_reclaim_iterators(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 }
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs
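
For context, a rough sketch of how a reclaim path consumes this
iterator (shrink_zone_memcg() is a made-up stand-in for the per-group
work, and zone/priority/root are assumed to be in scope; the cookie
fields correspond to the reclaim->zone and reclaim->priority uses in
the patch):

	struct mem_cgroup_reclaim_cookie reclaim = {
		.zone = zone,
		.priority = priority,
	};
	struct mem_cgroup *memcg;

	memcg = mem_cgroup_iter(root, NULL, &reclaim);
	do {
		shrink_zone_memcg(zone, memcg);		/* hypothetical */
		memcg = mem_cgroup_iter(root, memcg, &reclaim);
	} while (memcg);

The cached position lives in the per-zone reclaim_iter[] slot selected
by reclaim->priority, so every reclaimer using the same zone/priority
cookie shares it; that shared slot is what the dead_count scheme above
protects.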

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-11 21:27               ` Michal Hocko
@ 2013-02-11 22:39                 ` Johannes Weiner
  -1 siblings, 0 replies; 78+ messages in thread
From: Johannes Weiner @ 2013-02-11 22:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote:
> On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> > On Mon, Feb 11, 2013 at 08:29:29PM +0100, Michal Hocko wrote:
> > > On Mon 11-02-13 12:56:19, Johannes Weiner wrote:
> > > > On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote:
> > > > > Maybe we could keep the counter per memcg but that would mean that we
> > > > > would need to go up the hierarchy as well. We wouldn't have to go over
> > > > > node-zone-priority cleanup so it would be much more lightweight.
> > > > > 
> > > > > I am not sure this is necessarily better than explicit cleanup because
> > > > > it brings yet another kind of generation number to the game but I guess
> > > > > I can live with it if people really think the relaxed way is much
> > > > > better.
> > > > > What do you think about the patch below (untested yet)?
> > > > 
> > > > Better, but I think you can get rid of both locks:
> > > 
> > > What is the other lock you have in mind.
> > 
> > The iter lock itself.  I mean, multiple reclaimers can still race but
> > there won't be any corruption (if you make iter->dead_count a long,
> > setting it happens atomically, we only need the memcg->dead_count to be
> > an atomic because of the inc) and the worst that could happen is that
> > a reclaim starts at the wrong point in hierarchy, right?
> 
> The lack of synchronization basically means that 2 parallel reclaimers
> can reclaim every group exactly once (ideally) or each group up to
> twice in the worst case.
> So the exclusion was quite comfortable.

It's quite unlikely, though.  Don't forget that they actually reclaim
in between, I just can't see them line up perfectly and race to the
iterator at the same time repeatedly.  It's more likely to happen at
the higher priority levels where less reclaim happens, and then it's
not a big deal anyway.  With lower priority levels, when the glitches
would be more problematic, they also become even less likely.

> > But as you said in the changelog that introduced the lock, it's never
> > actually been a practical problem.
> 
> That is true, but those bugs would be subtle, so I wouldn't be
> opposed to preventing them before we get burnt. But if you think that
> we should keep the previous semantics I can drop that patch.

I just think that the problem is unlikely and not that big of a deal.

> > You just need to put the wmb back in place, so that we never see the
> > dead_count give the green light while the cached position is stale, or
> > we'll tryget random memory.
> > 
> > > > mem_cgroup_iter:
> > > > rcu_read_lock()
> > > > if atomic_read(&root->dead_count) == iter->dead_count:
> > > >   smp_rmb()
> > > >   if tryget(iter->position):
> > > >     position = iter->position
> > > > memcg = find_next(position)
> > > > css_put(position)
> > > > iter->position = memcg
> > > > smp_wmb() /* Write position cache BEFORE marking it uptodate */
> > > > iter->dead_count = atomic_read(&root->dead_count)
> > > > rcu_read_unlock()
> > > 
> > > Updated patch below:
> > 
> > Cool, thanks.  I hope you don't find it too ugly anymore :-)
> 
> It's getting tricky and you know how people love it when you have to play
> and rely on atomics with memory barriers...

My bumper sticker reads "I don't believe in mutual exclusion" (the
kernel hacker's version of smile for the red light camera).

I mean, you were the one complaining about the lock...

> > That way, if the dead count gives the go-ahead, you KNOW that the
> > position cache is valid, because it has been updated first.
> 
> OK, you are right. We can live without css_tryget because dead_count is
> either OK which means that css would be alive at least this rcu period
> (and RCU walk would be safe as well) or it is incremented which means
> that we have started css_offline already and then css is dead already.
> So css_tryget can be dropped.

Not quite :)

The dead_count check is for completed destructions, but the try_get is
needed to detect a race with an ongoing destruction.

Basically, the dead_count verifies the iterator pointer is valid (and
the rcu reader lock keeps it that way), while the try_get verifies that the
object pointed to is still alive.
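
Spelled out as a sketch (just the two checks, with the names from the
patch; the surrounding rcu_read_lock() section is what keeps the
pointer itself safe to dereference):

	if (dead_count == iter->last_dead_count) {
		/* no destruction has completed since the pointer was cached */
		if (css_tryget(&last_visited->css)) {
			/* and none is tearing it down right now: use it */
			position = last_visited;
		}
	}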

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-11 22:39                 ` Johannes Weiner
@ 2013-02-12  9:54                   ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-12  9:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Mon 11-02-13 17:39:43, Johannes Weiner wrote:
> On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote:
> > On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> > > On Mon, Feb 11, 2013 at 08:29:29PM +0100, Michal Hocko wrote:
> > > > On Mon 11-02-13 12:56:19, Johannes Weiner wrote:
> > > > > On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote:
> > > > > > Maybe we could keep the counter per memcg but that would mean that we
> > > > > > would need to go up the hierarchy as well. We wouldn't have to go over
> > > > > > node-zone-priority cleanup so it would be much more lightweight.
> > > > > > 
> > > > > > I am not sure this is necessarily better than explicit cleanup because
> > > > > > it brings yet another kind of generation number to the game but I guess
> > > > > > I can live with it if people really think the relaxed way is much
> > > > > > better.
> > > > > > What do you think about the patch below (untested yet)?
> > > > > 
> > > > > Better, but I think you can get rid of both locks:
> > > > 
> > > > What is the other lock you have in mind.
> > > 
> > > The iter lock itself.  I mean, multiple reclaimers can still race but
> > > there won't be any corruption (if you make iter->dead_count a long,
> > > setting it happens atomically, we only need the memcg->dead_count to be
> > > an atomic because of the inc) and the worst that could happen is that
> > > a reclaim starts at the wrong point in hierarchy, right?
> > 
> > The lack of synchronization basically means that 2 parallel reclaimers
> > can reclaim every group exactly once (ideally) or each group up to
> > twice in the worst case.
> > So the exclusion was quite comfortable.
> 
> It's quite unlikely, though.  Don't forget that they actually reclaim
> in between, I just can't see them line up perfectly and race to the
> iterator at the same time repeatedly.  It's more likely to happen at
> the higher priority levels where less reclaim happens, and then it's
> not a big deal anyway.  With lower priority levels, when the glitches
> would be more problematic, they also become even less likely.

Fair enough, I will drop that patch in the next version.
 
> > > But as you said in the changelog that introduced the lock, it's never
> > > actually been a practical problem.
> > 
> > That is true, but those bugs would be subtle, so I wouldn't be
> > opposed to preventing them before we get burnt. But if you think that
> > we should keep the previous semantics I can drop that patch.
> 
> I just think that the problem is unlikely and not that big of a deal.
> 
> > > You just need to put the wmb back in place, so that we never see the
> > > dead_count give the green light while the cached position is stale, or
> > > we'll tryget random memory.
> > > 
> > > > > mem_cgroup_iter:
> > > > > rcu_read_lock()
> > > > > if atomic_read(&root->dead_count) == iter->dead_count:
> > > > >   smp_rmb()
> > > > >   if tryget(iter->position):
> > > > >     position = iter->position
> > > > > memcg = find_next(position)
> > > > > css_put(position)
> > > > > iter->position = memcg
> > > > > smp_wmb() /* Write position cache BEFORE marking it uptodate */
> > > > > iter->dead_count = atomic_read(&root->dead_count)
> > > > > rcu_read_unlock()
> > > > 
> > > > Updated patch below:
> > > 
> > > Cool, thanks.  I hope you don't find it too ugly anymore :-)
> > 
> > It's getting tricky and you know how people love it when you have to play
> > and rely on atomics with memory barriers...
> 
> My bumper sticker reads "I don't believe in mutual exclusion" (the
> kernel hacker's version of smile for the red light camera).

Ohh, those easy riders.
 
> I mean, you were the one complaining about the lock...
> 
> > > That way, if the dead count gives the go-ahead, you KNOW that the
> > > position cache is valid, because it has been updated first.
> > 
> > OK, you are right. We can live without css_tryget because dead_count is
> > either OK which means that css would be alive at least this rcu period
> > (and RCU walk would be safe as well) or it is incremented which means
> > that we have started css_offline already and then css is dead already.
> > So css_tryget can be dropped.
> 
> Not quite :)
> 
> The dead_count check is for completed destructions,

Not quite :P. dead_count is incremented in css_offline callback which is
called before the cgroup core releases its last reference and unlinks
the group from its siblings. css_tryget would already fail at this stage
because CSS_DEACT_BIAS is in place at that time but this doesn't break
RCU walk. So I think we are safe even without css_get.

Or am I missing something?
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
@ 2013-02-12  9:54                   ` Michal Hocko
  0 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-12  9:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Mon 11-02-13 17:39:43, Johannes Weiner wrote:
> On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote:
> > On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> > > On Mon, Feb 11, 2013 at 08:29:29PM +0100, Michal Hocko wrote:
> > > > On Mon 11-02-13 12:56:19, Johannes Weiner wrote:
> > > > > On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote:
> > > > > > Maybe we could keep the counter per memcg but that would mean that we
> > > > > > would need to go up the hierarchy as well. We wouldn't have to go over
> > > > > > node-zone-priority cleanup so it would be much more lightweight.
> > > > > > 
> > > > > > I am not sure this is necessarily better than explicit cleanup because
> > > > > > it brings yet another kind of generation number to the game but I guess
> > > > > > I can live with it if people really thing the relaxed way is much
> > > > > > better.
> > > > > > What do you think about the patch below (untested yet)?
> > > > > 
> > > > > Better, but I think you can get rid of both locks:
> > > > 
> > > > What is the other lock you have in mind.
> > > 
> > > The iter lock itself.  I mean, multiple reclaimers can still race but
> > > there won't be any corruption (if you make iter->dead_count a long,
> > > setting it happens atomically, we nly need the memcg->dead_count to be
> > > an atomic because of the inc) and the worst that could happen is that
> > > a reclaim starts at the wrong point in hierarchy, right?
> > 
> > The lack of synchronization basically means that 2 parallel reclaimers
> > can reclaim every group exactly once (ideally) or up to each group
> > twice in the worst case.
> > So the exclusion was quite comfortable.
> 
> It's quite unlikely, though.  Don't forget that they actually reclaim
> in between, I just can't see them line up perfectly and race to the
> iterator at the same time repeatedly.  It's more likely to happen at
> the higher priority levels where less reclaim happens, and then it's
> not a big deal anyway.  With lower priority levels, when the glitches
> would be more problematic, they also become even less likely.

Fair enough, I will drop that patch in the next version.
 
> > > But as you said in the changelog that introduced the lock, it's never
> > > actually been a practical problem.
> > 
> > That is true but those bugs would be subtle though so I wouldn't be
> > opposed to prevent from them before we get burnt. But if you think that
> > we should keep the previous semantic I can drop that patch.
> 
> I just think that the problem is unlikely and not that big of a deal.
> 
> > > You just need to put the wmb back in place, so that we never see the
> > > dead_count give the green light while the cached position is stale, or
> > > we'll tryget random memory.
> > > 
> > > > > mem_cgroup_iter:
> > > > > rcu_read_lock()
> > > > > if atomic_read(&root->dead_count) == iter->dead_count:
> > > > >   smp_rmb()
> > > > >   if tryget(iter->position):
> > > > >     position = iter->position
> > > > > memcg = find_next(postion)
> > > > > css_put(position)
> > > > > iter->position = memcg
> > > > > smp_wmb() /* Write position cache BEFORE marking it uptodate */
> > > > > iter->dead_count = atomic_read(&root->dead_count)
> > > > > rcu_read_unlock()
> > > > 
> > > > Updated patch bellow:
> > > 
> > > Cool, thanks.  I hope you don't find it too ugly anymore :-)
> > 
> > It's getting tricky and you know how much people love it when you have
> > to play with and rely on atomics and memory barriers...
> 
> My bumper sticker reads "I don't believe in mutual exclusion" (the
> kernel hacker's version of smile for the red light camera).

Ohh, those easy riders.
 
> I mean, you were the one complaining about the lock...
> 
> > > That way, if the dead count gives the go-ahead, you KNOW that the
> > > position cache is valid, because it has been updated first.
> > 
> > OK, you are right. We can live without css_tryget because dead_count is
> > either OK which means that css would be alive at least this rcu period
> > (and RCU walk would be safe as well) or it is incremented which means
> > that we have started css_offline already and then css is dead already.
> > So css_tryget can be dropped.
> 
> Not quite :)
> 
> The dead_count check is for completed destructions,

Not quite :P. dead_count is incremented in the css_offline callback, which
is called before the cgroup core releases its last reference and unlinks
the group from its siblings. css_tryget would already fail at this stage
because CSS_DEACT_BIAS is in place at that time, but this doesn't break
the RCU walk. So I think we are safe even without css_get.

Or am I missing something?
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12  9:54                   ` Michal Hocko
@ 2013-02-12 15:10                     ` Johannes Weiner
  -1 siblings, 0 replies; 78+ messages in thread
From: Johannes Weiner @ 2013-02-12 15:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote:
> On Mon 11-02-13 17:39:43, Johannes Weiner wrote:
> > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote:
> > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> > > > That way, if the dead count gives the go-ahead, you KNOW that the
> > > > position cache is valid, because it has been updated first.
> > > 
> > > OK, you are right. We can live without css_tryget because dead_count is
> > > either OK which means that css would be alive at least this rcu period
> > > (and RCU walk would be safe as well) or it is incremented which means
> > > that we have started css_offline already and then css is dead already.
> > > So css_tryget can be dropped.
> > 
> > Not quite :)
> > 
> > The dead_count check is for completed destructions,
> 
> Not quite :P. dead_count is incremented in css_offline callback which is
> called before the cgroup core releases its last reference and unlinks
> the group from the siblinks. css_tryget would already fail at this stage
> because CSS_DEACT_BIAS is in place at that time but this doesn't break
> RCU walk. So I think we are safe even without css_get.

But you drop the RCU lock before you return.

dead_count IS incremented for every destruction, but it's not reliable
for concurrent ones, is what I meant.  Again, if there is a dead_count
mismatch, your pointer might be dangling, easy case.  However, even if
there is no mismatch, you could still race with a destruction that has
marked the object dead, and then frees it once you drop the RCU lock,
so you need try_get() to check if the object is dead, or you could
return a pointer to freed or soon to be freed memory.

/*
 * If the dead_count mismatches, a destruction has happened or is
 * happening concurrently.  If the dead_count matches, a destruction
 * might still happen concurrently, but since we checked under RCU,
 * that destruction won't free the object until we release the RCU
 * reader lock.  Thus, the dead_count check verifies the pointer is
 * still valid, css_tryget() verifies the cgroup pointed to is alive.
 */
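
For illustration, the check described above could look roughly like the
following on the lookup side (a sketch only, using the iter/dead_count
field names that appear in the patch posted later in this thread, not
the exact hunk):

	struct mem_cgroup *last_visited = NULL;
	unsigned int dead_count;

	rcu_read_lock();
	dead_count = atomic_read(&root->dead_count);
	smp_rmb();	/* read dead_count before the cached position */
	if (dead_count == iter->last_dead_count) {
		/* pointer was not invalidated, but the group may be dying */
		last_visited = iter->last_visited;
		if (last_visited && !css_tryget(&last_visited->css))
			last_visited = NULL;
	}
	/* ... walk from last_visited, then drop rcu_read_lock() ... */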

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 15:10                     ` Johannes Weiner
@ 2013-02-12 15:43                       ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-12 15:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Tue 12-02-13 10:10:02, Johannes Weiner wrote:
> On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote:
> > On Mon 11-02-13 17:39:43, Johannes Weiner wrote:
> > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote:
> > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> > > > > That way, if the dead count gives the go-ahead, you KNOW that the
> > > > > position cache is valid, because it has been updated first.
> > > > 
> > > > OK, you are right. We can live without css_tryget because dead_count is
> > > > either OK which means that css would be alive at least this rcu period
> > > > (and RCU walk would be safe as well) or it is incremented which means
> > > > that we have started css_offline already and then css is dead already.
> > > > So css_tryget can be dropped.
> > > 
> > > Not quite :)
> > > 
> > > The dead_count check is for completed destructions,
> > 
> > Not quite :P. dead_count is incremented in css_offline callback which is
> > called before the cgroup core releases its last reference and unlinks
> > the group from the siblinks. css_tryget would already fail at this stage
> > because CSS_DEACT_BIAS is in place at that time but this doesn't break
> > RCU walk. So I think we are safe even without css_get.
> 
> But you drop the RCU lock before you return.
>
> dead_count IS incremented for every destruction, but it's not reliable
> for concurrent ones, is what I meant.  Again, if there is a dead_count
> mismatch, your pointer might be dangling, easy case.  However, even if
> there is no mismatch, you could still race with a destruction that has
> marked the object dead, and then frees it once you drop the RCU lock,
> so you need try_get() to check if the object is dead, or you could
> return a pointer to freed or soon to be freed memory.

Wait a moment. But what prevents the following race?

rcu_read_lock()
						mem_cgroup_css_offline(memcg)
						root->dead_count++
iter->last_dead_count = root->dead_count
iter->last_visited = memcg
						// final
						css_put(memcg);
// last_visited is still valid
rcu_read_unlock()
[...]
// next iteration
rcu_read_lock()
iter->last_dead_count == root->dead_count
// KABOOM

The race window between dead_count++ and css_put is quite big but that
is not important because that css_put can happen anytime before we start
the next iteration and take rcu_read_lock.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 15:43                       ` Michal Hocko
@ 2013-02-12 16:10                         ` Paul E. McKenney
  -1 siblings, 0 replies; 78+ messages in thread
From: Paul E. McKenney @ 2013-02-12 16:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote:
> On Tue 12-02-13 10:10:02, Johannes Weiner wrote:
> > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote:
> > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote:
> > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote:
> > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> > > > > > That way, if the dead count gives the go-ahead, you KNOW that the
> > > > > > position cache is valid, because it has been updated first.
> > > > > 
> > > > > OK, you are right. We can live without css_tryget because dead_count is
> > > > > either OK which means that css would be alive at least this rcu period
> > > > > (and RCU walk would be safe as well) or it is incremented which means
> > > > > that we have started css_offline already and then css is dead already.
> > > > > So css_tryget can be dropped.
> > > > 
> > > > Not quite :)
> > > > 
> > > > The dead_count check is for completed destructions,
> > > 
> > > Not quite :P. dead_count is incremented in css_offline callback which is
> > > called before the cgroup core releases its last reference and unlinks
> > > the group from the siblinks. css_tryget would already fail at this stage
> > > because CSS_DEACT_BIAS is in place at that time but this doesn't break
> > > RCU walk. So I think we are safe even without css_get.
> > 
> > But you drop the RCU lock before you return.
> >
> > dead_count IS incremented for every destruction, but it's not reliable
> > for concurrent ones, is what I meant.  Again, if there is a dead_count
> > mismatch, your pointer might be dangling, easy case.  However, even if
> > there is no mismatch, you could still race with a destruction that has
> > marked the object dead, and then frees it once you drop the RCU lock,
> > so you need try_get() to check if the object is dead, or you could
> > return a pointer to freed or soon to be freed memory.
> 
> Wait a moment. But what prevents from the following race?
> 
> rcu_read_lock()
> 						mem_cgroup_css_offline(memcg)
> 						root->dead_count++
> iter->last_dead_count = root->dead_count
> iter->last_visited = memcg
> 						// final
> 						css_put(memcg);
> // last_visited is still valid
> rcu_read_unlock()
> [...]
> // next iteration
> rcu_read_lock()
> iter->last_dead_count == root->dead_count
> // KABOOM
> 
> The race window between dead_count++ and css_put is quite big but that
> is not important because that css_put can happen anytime before we start
> the next iteration and take rcu_read_lock.

The usual approach is to make sure that there is a grace period (either
synchronize_rcu() or call_rcu()) between the time that the data is
made inaccessible to readers (this would be mem_cgroup_css_offline()?)
and the time it is freed (css_put(), correct?).
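
As a generic illustration of that ordering (made-up 'obj' and 'list_lock'
names, not the memcg code), the usual RCU removal pattern is:

	spin_lock(&list_lock);
	list_del_rcu(&obj->node);	/* readers can no longer find obj */
	spin_unlock(&list_lock);

	synchronize_rcu();		/* all pre-existing readers have finished */

	kfree(obj);			/* safe: no reader can still hold a pointer */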

							Thanx, Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 15:43                       ` Michal Hocko
@ 2013-02-12 16:13                         ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-12 16:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Tue 12-02-13 16:43:30, Michal Hocko wrote:
[...]
The example was not complete:

> Wait a moment. But what prevents from the following race?
> 
> rcu_read_lock()

cgroup_next_descendant_pre
css_tryget(css);
memcg = mem_cgroup_from_css(css)		atomic_add(CSS_DEACT_BIAS, &css->refcnt)

> 						mem_cgroup_css_offline(memcg)

We should be safe if we did synchronize_rcu() before root->dead_count++,
no?
Because then we would have a guarantee that if css_tryget(memcg)
succeeded then we wouldn't race with the dead_count++ it triggered.

> 						root->dead_count++
> iter->last_dead_count = root->dead_count
> iter->last_visited = memcg
> 						// final
> 						css_put(memcg);
> // last_visited is still valid
> rcu_read_unlock()
> [...]
> // next iteration
> rcu_read_lock()
> iter->last_dead_count == root->dead_count
> // KABOOM
> 
> The race window between dead_count++ and css_put is quite big but that
> is not important because that css_put can happen anytime before we start
> the next iteration and take rcu_read_lock.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 16:13                         ` Michal Hocko
@ 2013-02-12 16:24                           ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-12 16:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Tue 12-02-13 17:13:32, Michal Hocko wrote:
> On Tue 12-02-13 16:43:30, Michal Hocko wrote:
> [...]
> The example was not complete:
> 
> > Wait a moment. But what prevents from the following race?
> > 
> > rcu_read_lock()
> 
> cgroup_next_descendant_pre
> css_tryget(css);
> memcg = mem_cgroup_from_css(css)		atomic_add(CSS_DEACT_BIAS, &css->refcnt)
> 
> > 						mem_cgroup_css_offline(memcg)
> 
> We should be safe if we did synchronize_rcu() before root->dead_count++,
> no?
> Because then we would have a guarantee that if css_tryget(memcg)
> suceeded then we wouldn't race with dead_count++ it triggered.
> 
> > 						root->dead_count++
> > iter->last_dead_count = root->dead_count
> > iter->last_visited = memcg
> > 						// final
> > 						css_put(memcg);
> > // last_visited is still valid
> > rcu_read_unlock()
> > [...]
> > // next iteration
> > rcu_read_lock()
> > iter->last_dead_count == root->dead_count
> > // KABOOM

Ohh, I have missed that we take a reference on the current memcg which
will be stored into last_visited. And then later, during the next
iteration, it will still be alive until we are done because the previous
patch moved css_put to the very end.
So this race is not possible. I still need to think about parallel
iteration and a race with removal.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 15:43                       ` Michal Hocko
@ 2013-02-12 16:33                         ` Johannes Weiner
  -1 siblings, 0 replies; 78+ messages in thread
From: Johannes Weiner @ 2013-02-12 16:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan



Michal Hocko <mhocko@suse.cz> wrote:

>On Tue 12-02-13 10:10:02, Johannes Weiner wrote:
>> On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote:
>> > On Mon 11-02-13 17:39:43, Johannes Weiner wrote:
>> > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote:
>> > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
>> > > > > That way, if the dead count gives the go-ahead, you KNOW that
>the
>> > > > > position cache is valid, because it has been updated first.
>> > > > 
>> > > > OK, you are right. We can live without css_tryget because
>dead_count is
>> > > > either OK which means that css would be alive at least this rcu
>period
>> > > > (and RCU walk would be safe as well) or it is incremented which
>means
>> > > > that we have started css_offline already and then css is dead
>already.
>> > > > So css_tryget can be dropped.
>> > > 
>> > > Not quite :)
>> > > 
>> > > The dead_count check is for completed destructions,
>> > 
>> > Not quite :P. dead_count is incremented in css_offline callback
>which is
>> > called before the cgroup core releases its last reference and
>unlinks
>> > the group from the siblinks. css_tryget would already fail at this
>stage
>> > because CSS_DEACT_BIAS is in place at that time but this doesn't
>break
>> > RCU walk. So I think we are safe even without css_get.
>> 
>> But you drop the RCU lock before you return.
>>
>> dead_count IS incremented for every destruction, but it's not
>reliable
>> for concurrent ones, is what I meant.  Again, if there is a
>dead_count
>> mismatch, your pointer might be dangling, easy case.  However, even
>if
>> there is no mismatch, you could still race with a destruction that
>has
>> marked the object dead, and then frees it once you drop the RCU lock,
>> so you need try_get() to check if the object is dead, or you could
>> return a pointer to freed or soon to be freed memory.
>
>Wait a moment. But what prevents from the following race?
>
>rcu_read_lock()
>						mem_cgroup_css_offline(memcg)
>						root->dead_count++
>iter->last_dead_count = root->dead_count

Use the dead count read the first time for comparison, i.e. only one atomic read in that function.  You are right, we would fail to account for that concurrent destruction otherwise.

>iter->last_visited = memcg
>						// final
>						css_put(memcg);
>// last_visited is still valid
>rcu_read_unlock()
>[...]
>// next iteration
>rcu_read_lock()
>iter->last_dead_count == root->dead_count
>// KABOOM
>
>The race window between dead_count++ and css_put is quite big but that
>is not important because that css_put can happen anytime before we
>start
>the next iteration and take rcu_read_lock.

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 16:24                           ` Michal Hocko
@ 2013-02-12 16:37                             ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-12 16:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Tue 12-02-13 17:24:42, Michal Hocko wrote:
> On Tue 12-02-13 17:13:32, Michal Hocko wrote:
> > On Tue 12-02-13 16:43:30, Michal Hocko wrote:
> > [...]
> > The example was not complete:
> > 
> > > Wait a moment. But what prevents from the following race?
> > > 
> > > rcu_read_lock()
> > 
> > cgroup_next_descendant_pre
> > css_tryget(css);
> > memcg = mem_cgroup_from_css(css)		atomic_add(CSS_DEACT_BIAS, &css->refcnt)
> > 
> > > 						mem_cgroup_css_offline(memcg)
> > 
> > We should be safe if we did synchronize_rcu() before root->dead_count++,
> > no?
> > Because then we would have a guarantee that if css_tryget(memcg)
> > suceeded then we wouldn't race with dead_count++ it triggered.
> > 
> > > 						root->dead_count++
> > > iter->last_dead_count = root->dead_count
> > > iter->last_visited = memcg
> > > 						// final
> > > 						css_put(memcg);
> > > // last_visited is still valid
> > > rcu_read_unlock()
> > > [...]
> > > // next iteration
> > > rcu_read_lock()
> > > iter->last_dead_count == root->dead_count
> > > // KABOOM
> 
> Ohh I have missed that we took a reference on the current memcg which
> will be stored into last_visited. And then later, during the next
> iteration it will be still alive until we are done because previous
> patch moved css_put to the very end.

And that wouldn't help because:
css_tryget(memcg) // OK
                                                CSS_DEACT_BIAS
                                                root->dead_count++
iter->last_visited = memcg
iter->last_dead_count = root->dead_count
prev = memcg                                    css_put(memcg)

memcg_iter_break
  css_put(memcg) // it will be released

//new iteration
iter->last_dead_count == root->dead_count //ok
css_tryget() // KABOOM because css is already gone

But I still might be missing something and need to get back to this with
a clean head.

Sorry about the spam
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 16:24                           ` Michal Hocko
@ 2013-02-12 16:41                             ` Johannes Weiner
  -1 siblings, 0 replies; 78+ messages in thread
From: Johannes Weiner @ 2013-02-12 16:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan



Michal Hocko <mhocko@suse.cz> wrote:

>On Tue 12-02-13 17:13:32, Michal Hocko wrote:
>> On Tue 12-02-13 16:43:30, Michal Hocko wrote:
>> [...]
>> The example was not complete:
>> 
>> > Wait a moment. But what prevents from the following race?
>> > 
>> > rcu_read_lock()
>> 
>> cgroup_next_descendant_pre
>> css_tryget(css);
>> memcg = mem_cgroup_from_css(css)		atomic_add(CSS_DEACT_BIAS,
>&css->refcnt)
>> 
>> > 						mem_cgroup_css_offline(memcg)
>> 
>> We should be safe if we did synchronize_rcu() before
>root->dead_count++,
>> no?
>> Because then we would have a guarantee that if css_tryget(memcg)
>> suceeded then we wouldn't race with dead_count++ it triggered.
>> 
>> > 						root->dead_count++
>> > iter->last_dead_count = root->dead_count
>> > iter->last_visited = memcg
>> > 						// final
>> > 						css_put(memcg);
>> > // last_visited is still valid
>> > rcu_read_unlock()
>> > [...]
>> > // next iteration
>> > rcu_read_lock()
>> > iter->last_dead_count == root->dead_count
>> > // KABOOM
>
>Ohh I have missed that we took a reference on the current memcg which
>will be stored into last_visited. And then later, during the next
>iteration it will be still alive until we are done because previous
>patch moved css_put to the very end.
>So this race is not possible. I still need to think about parallel
>iteration and a race with removal.

I thought the whole point was to not have a reference in last_visited because the iterator might be unused indefinitely :-)

We only store a pointer and validate it before use the next time around.  So I think the race is still possible, but we can deal with it by not losing concurrent dead count changes, i.e. one atomic read in the iterator function.
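
A rough sketch of that single read (illustrative only, following the field names of the patch posted later in this thread): sample root->dead_count once, use the sample both to validate the cached pointer and to stamp the new one, so a destruction that races with this walk shows up as a mismatch on the next iteration.

	dead_count = atomic_read(&root->dead_count);	/* the only read */
	smp_rmb();
	last_visited = iter->last_visited;
	if (last_visited &&
	    (dead_count != iter->last_dead_count ||
	     !css_tryget(&last_visited->css)))
		last_visited = NULL;

	/* ... find the next position, new_memcg, starting from last_visited ... */

	iter->last_visited = new_memcg;
	smp_wmb();
	iter->last_dead_count = dead_count;	/* reuse the sample, do not re-read */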

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 16:41                             ` Johannes Weiner
@ 2013-02-12 17:12                               ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-12 17:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Tue 12-02-13 11:41:03, Johannes Weiner wrote:
> 
> 
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> >On Tue 12-02-13 17:13:32, Michal Hocko wrote:
> >> On Tue 12-02-13 16:43:30, Michal Hocko wrote:
> >> [...]
> >> The example was not complete:
> >> 
> >> > Wait a moment. But what prevents from the following race?
> >> > 
> >> > rcu_read_lock()
> >> 
> >> cgroup_next_descendant_pre
> >> css_tryget(css);
> >> memcg = mem_cgroup_from_css(css)		atomic_add(CSS_DEACT_BIAS,
> >&css->refcnt)
> >> 
> >> > 						mem_cgroup_css_offline(memcg)
> >> 
> >> We should be safe if we did synchronize_rcu() before
> >root->dead_count++,
> >> no?
> >> Because then we would have a guarantee that if css_tryget(memcg)
> >> suceeded then we wouldn't race with dead_count++ it triggered.
> >> 
> >> > 						root->dead_count++
> >> > iter->last_dead_count = root->dead_count
> >> > iter->last_visited = memcg
> >> > 						// final
> >> > 						css_put(memcg);
> >> > // last_visited is still valid
> >> > rcu_read_unlock()
> >> > [...]
> >> > // next iteration
> >> > rcu_read_lock()
> >> > iter->last_dead_count == root->dead_count
> >> > // KABOOM
> >
> >Ohh I have missed that we took a reference on the current memcg which
> >will be stored into last_visited. And then later, during the next
> >iteration it will be still alive until we are done because previous
> >patch moved css_put to the very end.
> >So this race is not possible. I still need to think about parallel
> >iteration and a race with removal.
> 
> I thought the whole point was to not have a reference in last_visited
> because have the iterator might be unused indefinitely :-)

OK, it seems that I managed to confuse things ;)

> We only store a pointer and validate it before use the next time
> around.  So I think the race is still possible, but we can deal with
> it by not losing concurrent dead count changes, i.e. one atomic read
> in the iterator function.

All reads from root->dead_count are atomic already, so I am not sure
what you mean here. Anyway, I hope I won't make this even more confusing
if I post what I have right now:
---
From 52121928be61282dc19e32179056615ffdf128a9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Tue, 12 Feb 2013 18:08:26 +0100
Subject: [PATCH] memcg: relax memcg iter caching

Now that the per-node-zone-priority iterator caches memory cgroups
rather than their css ids we have to be careful and remove them from
the iterator when they are on the way out, otherwise they might hang
around for an unbounded amount of time (until the global/targeted
reclaim triggers the zone under priority to find out the group is dead
and lets it find its final rest).

We can fix this issue by relaxing the rules for the last_visited memcg
as well.
Instead of taking a reference to the css before it is stored into
iter->last_visited we can just store its pointer and track the number
of removed groups for each memcg. This number is stored into the
iterator every time a memcg is cached. If the count doesn't match the
current walker root's one we start over from the root again. The group
counter is incremented up the hierarchy every time a group is removed.

Locking rules got a bit complicated. We primarily rely on the RCU read
lock, which makes sure that once we see an up-to-date dead_count,
iter->last_visited is valid for an RCU walk. smp_rmb makes sure that
dead_count is read before last_visited and last_dead_count, while
smp_wmb makes sure that last_visited is updated before last_dead_count,
so an up-to-date last_dead_count cannot point to an outdated
last_visited. css_tryget then makes sure that last_visited is still
alive.

Spotted-by: Ying Han <yinghan@google.com>
Original-idea-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 60 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 727ec39..31bb9b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
 };
 
 struct mem_cgroup_reclaim_iter {
-	/* last scanned hierarchy member with elevated css ref count */
+	/*
+	 * last scanned hierarchy member. Valid only if last_dead_count
+	 * matches memcg->dead_count of the hierarchy root group.
+	 */
 	struct mem_cgroup *last_visited;
+	unsigned int last_dead_count;
+
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
 };
@@ -355,6 +360,7 @@ struct mem_cgroup {
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
 
+	atomic_t	dead_count;
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
 	struct tcp_memcontrol tcp_mem;
 #endif
@@ -1156,17 +1162,36 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			int nid = zone_to_nid(reclaim->zone);
 			int zid = zone_idx(reclaim->zone);
 			struct mem_cgroup_per_zone *mz;
+			unsigned int dead_count;
 
 			mz = mem_cgroup_zoneinfo(root, nid, zid);
 			iter = &mz->reclaim_iter[reclaim->priority];
-			last_visited = iter->last_visited;
 			if (prev && reclaim->generation != iter->generation) {
-				if (last_visited) {
-					css_put(&last_visited->css);
-					iter->last_visited = NULL;
-				}
+				iter->last_visited = NULL;
 				goto out_unlock;
 			}
+
+			/*
+                         * If the dead_count mismatches, a destruction
+                         * has happened or is happening concurrently.
+                         * If the dead_count matches, a destruction
+                         * might still happen concurrently, but since
+                         * we checked under RCU, that destruction
+                         * won't free the object until we release the
+                         * RCU reader lock.  Thus, the dead_count
+                         * check verifies the pointer is still valid,
+                         * css_tryget() verifies the cgroup pointed to
+                         * is alive.
+			 */
+			dead_count = atomic_read(&root->dead_count);
+			smp_rmb();
+			last_visited = iter->last_visited;
+			if (last_visited) {
+				if ((dead_count != iter->last_dead_count) ||
+					!css_tryget(&last_visited->css)) {
+					last_visited = NULL;
+				}
+			}
 		}
 
 		/*
@@ -1206,10 +1231,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			if (css && !memcg)
 				curr = mem_cgroup_from_css(css);
 
-			/* make sure that the cached memcg is not removed */
-			if (curr)
-				css_get(&curr->css);
 			iter->last_visited = curr;
+			smp_wmb();
+			iter->last_dead_count = atomic_read(&root->dead_count);
 
 			if (!css)
 				iter->generation++;
@@ -6366,10 +6390,37 @@ free_out:
 	return ERR_PTR(error);
 }
 
+/*
+ * Announce all parents that a group from their hierarchy is gone.
+ */
+static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;
+
+	/*
+	 * Make sure we are not racing with mem_cgroup_iter when it stores
+	 * a new iter->last_visited. Wait until that RCU finishes so that
+	 * it cannot see already incremented dead_count with memcg which
+	 * would be already dead next time but dead_count wouldn't tell
+	 * us about that.
+	 */
+	synchronize_rcu();
+	while ((parent = parent_mem_cgroup(parent)))
+		atomic_inc(&parent->dead_count);
+
+	/*
+	 * if the root memcg is not hierarchical we have to check it
+	 * explicitly.
+	 */
+	if (!root_mem_cgroup->use_hierarchy)
+		atomic_inc(&root_mem_cgroup->dead_count);
+}
+
 static void mem_cgroup_css_offline(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
+	mem_cgroup_invalidate_reclaim_iterators(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 }
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 16:10                         ` Paul E. McKenney
@ 2013-02-12 17:25                           ` Johannes Weiner
  -1 siblings, 0 replies; 78+ messages in thread
From: Johannes Weiner @ 2013-02-12 17:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michal Hocko, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Tue, Feb 12, 2013 at 08:10:51AM -0800, Paul E. McKenney wrote:
> On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote:
> > On Tue 12-02-13 10:10:02, Johannes Weiner wrote:
> > > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote:
> > > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote:
> > > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote:
> > > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> > > > > > > That way, if the dead count gives the go-ahead, you KNOW that the
> > > > > > > position cache is valid, because it has been updated first.
> > > > > > 
> > > > > > OK, you are right. We can live without css_tryget because dead_count is
> > > > > > either OK which means that css would be alive at least this rcu period
> > > > > > (and RCU walk would be safe as well) or it is incremented which means
> > > > > > that we have started css_offline already and then css is dead already.
> > > > > > So css_tryget can be dropped.
> > > > > 
> > > > > Not quite :)
> > > > > 
> > > > > The dead_count check is for completed destructions,
> > > > 
> > > > Not quite :P. dead_count is incremented in css_offline callback which is
> > > > called before the cgroup core releases its last reference and unlinks
> > > > the group from the siblinks. css_tryget would already fail at this stage
> > > > because CSS_DEACT_BIAS is in place at that time but this doesn't break
> > > > RCU walk. So I think we are safe even without css_get.
> > > 
> > > But you drop the RCU lock before you return.
> > >
> > > dead_count IS incremented for every destruction, but it's not reliable
> > > for concurrent ones, is what I meant.  Again, if there is a dead_count
> > > mismatch, your pointer might be dangling, easy case.  However, even if
> > > there is no mismatch, you could still race with a destruction that has
> > > marked the object dead, and then frees it once you drop the RCU lock,
> > > so you need try_get() to check if the object is dead, or you could
> > > return a pointer to freed or soon to be freed memory.
> > 
> > Wait a moment. But what prevents from the following race?
> > 
> > rcu_read_lock()
> > 						mem_cgroup_css_offline(memcg)
> > 						root->dead_count++
> > iter->last_dead_count = root->dead_count
> > iter->last_visited = memcg
> > 						// final
> > 						css_put(memcg);
> > // last_visited is still valid
> > rcu_read_unlock()
> > [...]
> > // next iteration
> > rcu_read_lock()
> > iter->last_dead_count == root->dead_count
> > // KABOOM
> > 
> > The race window between dead_count++ and css_put is quite big but that
> > is not important because that css_put can happen anytime before we start
> > the next iteration and take rcu_read_lock.
> 
> The usual approach is to make sure that there is a grace period (either
> synchronize_rcu() or call_rcu()) between the time that the data is
> made inaccessible to readers (this would be mem_cgroup_css_offline()?)
> and the time it is freed (css_put(), correct?).

Absolutely!  And there is a synchronize_rcu() in between those two
operations.

However, we want to keep a weak reference to the cgroup after we drop
the rcu read-side lock, so rcu alone is not enough for us to guarantee
object lifetime.  We still have to carefully detect any concurrent
offlinings in order to validate the weak reference next time around.
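
To make the two lifetimes explicit, here is a rough sketch of the
pattern being discussed; only rcu_read_lock()/rcu_read_unlock(),
css_tryget() and synchronize_rcu() are the real primitives, the other
names are made up for illustration:

  /* reader side */
  rcu_read_lock();
  memcg = iter->last_visited;            /* weak reference, may be stale */
  if (memcg && css_tryget(&memcg->css))  /* weak ref -> real ref, or bail */
          use(memcg);                    /* illustrative */
  rcu_read_unlock();                     /* without the tryget, memcg can be
                                            freed any time after this point */

  /* destruction side */
  deactivate(memcg);                     /* bias the refcount, tryget fails */
  atomic_inc(&root->dead_count);         /* invalidate cached positions */
  synchronize_rcu();                     /* wait out current readers */
  /* ... final css_put() frees the object ... */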

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 17:12                               ` Michal Hocko
@ 2013-02-12 17:37                                 ` Johannes Weiner
  -1 siblings, 0 replies; 78+ messages in thread
From: Johannes Weiner @ 2013-02-12 17:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Tue, Feb 12, 2013 at 06:12:16PM +0100, Michal Hocko wrote:
> On Tue 12-02-13 11:41:03, Johannes Weiner wrote:
> > 
> > 
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > >On Tue 12-02-13 17:13:32, Michal Hocko wrote:
> > >> On Tue 12-02-13 16:43:30, Michal Hocko wrote:
> > >> [...]
> > >> The example was not complete:
> > >> 
> > >> > Wait a moment. But what prevents the following race?
> > >> > 
> > >> > rcu_read_lock()
> > >> 
> > >> cgroup_next_descendant_pre
> > >> css_tryget(css);
> > >> memcg = mem_cgroup_from_css(css)		atomic_add(CSS_DEACT_BIAS, &css->refcnt)
> > >> 
> > >> > 						mem_cgroup_css_offline(memcg)
> > >> 
> > >> We should be safe if we did synchronize_rcu() before
> > >> root->dead_count++, no?
> > >> Because then we would have a guarantee that if css_tryget(memcg)
> > >> succeeded then we wouldn't race with the dead_count++ it triggered.
> > >> 
> > >> > 						root->dead_count++
> > >> > iter->last_dead_count = root->dead_count
> > >> > iter->last_visited = memcg
> > >> > 						// final
> > >> > 						css_put(memcg);
> > >> > // last_visited is still valid
> > >> > rcu_read_unlock()
> > >> > [...]
> > >> > // next iteration
> > >> > rcu_read_lock()
> > >> > iter->last_dead_count == root->dead_count
> > >> > // KABOOM
> > >
> > >Ohh I have missed that we took a reference on the current memcg which
> > >will be stored into last_visited. And then later, during the next
> > >iteration it will still be alive until we are done because the previous
> > >patch moved css_put to the very end.
> > >So this race is not possible. I still need to think about parallel
> > >iteration and a race with removal.
> > 
> > I thought the whole point was to not have a reference in last_visited
> > because the iterator might be unused indefinitely :-)
> 
> OK, it seems that I managed to confuse ;)
> 
> > We only store a pointer and validate it before use the next time
> > around.  So I think the race is still possible, but we can deal with
> > it by not losing concurrent dead count changes, i.e. one atomic read
> > in the iterator function.
> 
> All reads from root->dead_count are atomic already, so I am not sure
> what you mean here. Anyway, I hope I won't make this even more confusing
> if I post what I have right now:

Yes, but we are doing two reads.  Can't the memcg that we'll store in
last_visited be offlined during this and be freed after we drop the
rcu read lock?  If we had just one read, we would detect this
properly.
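
The window in question, spelled out against the patch below (a sketch,
not a claim about where exactly the offline can interleave):

  dead_count = atomic_read(&root->dead_count);    /* first read */
  /* ... validate last_visited, walk to the next memcg ... */

  /* a destruction can start here and bump root->dead_count */

  iter->last_visited = curr;
  smp_wmb();
  /* the second read picks up the new value, so the cached pointer will
   * later compare as valid even though the group is on its way out */
  iter->last_dead_count = atomic_read(&root->dead_count);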

> ---
> >From 52121928be61282dc19e32179056615ffdf128a9 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Tue, 12 Feb 2013 18:08:26 +0100
> Subject: [PATCH] memcg: relax memcg iter caching
> 
> Now that the per-node-zone-priority iterator caches memory cgroups rather
> than their css ids we have to be careful and remove them from the
> iterator when they are on their way out, otherwise they might hang around
> for an unbounded amount of time (until the global/targeted reclaim triggers
> the zone under that priority, finds out the group is dead and lets it find
> its final rest).
> 
> We can fix this issue by relaxing the rules for the last_visited memcg as
> well.
> Instead of taking a reference to the css before it is stored into
> iter->last_visited we can just store its pointer and track the number of
> removed groups for each memcg. This number is stored into the iterator
> every time a memcg is cached. If the iter count doesn't match the
> current walker root's one we start over from the root again. The
> group counter is incremented up the hierarchy every time a group is
> removed.
> 
> Locking rules got a bit complicated. We primarily rely on rcu read
> lock which makes sure that once we see an up-to-date dead_count then
> iter->last_visited is valid for RCU walk. smp_rmb makes sure that
> dead_count is read before last_visited and last_dead_count while smp_wmb
> makes sure that last_visited is updated before last_dead_count so the
> up-to-date last_dead_count cannot point to an outdated last_visited.
> css_tryget then makes sure that the last_visited is still alive.
> 
> Spotted-by: Ying Han <yinghan@google.com>
> Original-idea-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/memcontrol.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 60 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 727ec39..31bb9b0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
>  };
>  
>  struct mem_cgroup_reclaim_iter {
> -	/* last scanned hierarchy member with elevated css ref count */
> +	/*
> +	 * last scanned hierarchy member. Valid only if last_dead_count
> +	 * matches memcg->dead_count of the hierarchy root group.
> +	 */
>  	struct mem_cgroup *last_visited;
> +	unsigned int last_dead_count;

Since we read and write this without a lock, I would feel more
comfortable if this were a full word, i.e. unsigned long.  That
guarantees we don't see any partial states.
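
That is, something like this (a sketch of the suggestion, with the rest
of the struct elided):

  struct mem_cgroup_reclaim_iter {
          /*
           * last scanned hierarchy member. Valid only if last_dead_count
           * matches memcg->dead_count of the hierarchy root group.
           */
          struct mem_cgroup *last_visited;
          unsigned long last_dead_count;  /* full word, no partial updates */
          /* ... generation etc. unchanged ... */
  };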

> @@ -1156,17 +1162,36 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  			int nid = zone_to_nid(reclaim->zone);
>  			int zid = zone_idx(reclaim->zone);
>  			struct mem_cgroup_per_zone *mz;
> +			unsigned int dead_count;
>  
>  			mz = mem_cgroup_zoneinfo(root, nid, zid);
>  			iter = &mz->reclaim_iter[reclaim->priority];
> -			last_visited = iter->last_visited;
>  			if (prev && reclaim->generation != iter->generation) {
> -				if (last_visited) {
> -					css_put(&last_visited->css);
> -					iter->last_visited = NULL;
> -				}
> +				iter->last_visited = NULL;
>  				goto out_unlock;
>  			}
> +
> +			/*
> +                         * If the dead_count mismatches, a destruction
> +                         * has happened or is happening concurrently.
> +                         * If the dead_count matches, a destruction
> +                         * might still happen concurrently, but since
> +                         * we checked under RCU, that destruction
> +                         * won't free the object until we release the
> +                         * RCU reader lock.  Thus, the dead_count
> +                         * check verifies the pointer is still valid,
> +                         * css_tryget() verifies the cgroup pointed to
> +                         * is alive.
> +			 */
> +			dead_count = atomic_read(&root->dead_count);
> +			smp_rmb();
> +			last_visited = iter->last_visited;
> +			if (last_visited) {
> +				if ((dead_count != iter->last_dead_count) ||
> +					!css_tryget(&last_visited->css)) {
> +					last_visited = NULL;
> +				}
> +			}
>  		}
>  
>  		/*
> @@ -1206,10 +1231,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  			if (css && !memcg)
>  				curr = mem_cgroup_from_css(css);
>  
> -			/* make sure that the cached memcg is not removed */
> -			if (curr)
> -				css_get(&curr->css);
>  			iter->last_visited = curr;
> +			smp_wmb();
> +			iter->last_dead_count = atomic_read(&root->dead_count);

iter->last_dead_count = dead_count

This way, we detect if curr is offlined between the first read and
the second read.  Otherwise, it could get freed once the reference
is dropped, and then last_visited points to invalid memory while the
dead_count is up to date.
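
In other words, the update path would reuse the value sampled at the
top of the function (a sketch of the suggested change, not the final
code):

  /* dead_count was read, and smp_rmb()ed, before last_visited was
   * validated above; reuse that snapshot here */
  iter->last_visited = curr;
  smp_wmb();
  iter->last_dead_count = dead_count;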

> @@ -6366,10 +6390,37 @@ free_out:
>  	return ERR_PTR(error);
>  }
>  
> +/*
> + * Announce all parents that a group from their hierarchy is gone.
> + */
> +static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup *parent = memcg;
> +
> +	/*
> +	 * Make sure we are not racing with mem_cgroup_iter when it stores
> +	 * a new iter->last_visited. Wait until that RCU finishes so that
> +	 * it cannot see already incremented dead_count with memcg which
> +	 * would be already dead next time but dead_count wouldn't tell
> +	 * us about that.
> +	 */
> +	synchronize_rcu();

Ah, you are stabilizing the counter between the two reads.  It's
cheaper to just do one read instead.  Saves the atomic op and saves
the synchronization point :-)


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 16:10                         ` Paul E. McKenney
@ 2013-02-12 17:56                           ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-12 17:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Johannes Weiner, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Tue 12-02-13 08:10:51, Paul E. McKenney wrote:
> On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote:
> > On Tue 12-02-13 10:10:02, Johannes Weiner wrote:
> > > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote:
> > > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote:
> > > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote:
> > > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> > > > > > > That way, if the dead count gives the go-ahead, you KNOW that the
> > > > > > > position cache is valid, because it has been updated first.
> > > > > > 
> > > > > > OK, you are right. We can live without css_tryget because dead_count is
> > > > > > either OK which means that css would be alive at least this rcu period
> > > > > > (and RCU walk would be safe as well) or it is incremented which means
> > > > > > that we have started css_offline already and then css is dead already.
> > > > > > So css_tryget can be dropped.
> > > > > 
> > > > > Not quite :)
> > > > > 
> > > > > The dead_count check is for completed destructions,
> > > > 
> > > > Not quite :P. dead_count is incremented in css_offline callback which is
> > > > called before the cgroup core releases its last reference and unlinks
> > > > the group from the siblings. css_tryget would already fail at this stage
> > > > because CSS_DEACT_BIAS is in place at that time but this doesn't break
> > > > RCU walk. So I think we are safe even without css_get.
> > > 
> > > But you drop the RCU lock before you return.
> > >
> > > dead_count IS incremented for every destruction, but it's not reliable
> > > for concurrent ones, is what I meant.  Again, if there is a dead_count
> > > mismatch, your pointer might be dangling, easy case.  However, even if
> > > there is no mismatch, you could still race with a destruction that has
> > > marked the object dead, and then frees it once you drop the RCU lock,
> > > so you need try_get() to check if the object is dead, or you could
> > > return a pointer to freed or soon to be freed memory.
> > 
> > Wait a moment. But what prevents the following race?
> > 
> > rcu_read_lock()
> > 						mem_cgroup_css_offline(memcg)
> > 						root->dead_count++
> > iter->last_dead_count = root->dead_count
> > iter->last_visited = memcg
> > 						// final
> > 						css_put(memcg);
> > // last_visited is still valid
> > rcu_read_unlock()
> > [...]
> > // next iteration
> > rcu_read_lock()
> > iter->last_dead_count == root->dead_count
> > // KABOOM
> > 
> > The race window between dead_count++ and css_put is quite big but that
> > is not important because that css_put can happen anytime before we start
> > the next iteration and take rcu_read_lock.
> 
> The usual approach is to make sure that there is a grace period (either
> synchronize_rcu() or call_rcu()) between the time that the data is
> made inaccessible to readers (this would be mem_cgroup_css_offline()?)
> and the time it is freed (css_put(), correct?).

Yes, that was my suggestion and I put it before dead_count is
incremented down the mem_cgroup_css_offline road.

Johannes still thinks we can do without it if we reduce the number of
atomic_read(dead_count) calls, which sounds like the way to go, but I
would rather think about it with a fresh head tomorrow.

Anyway, thanks for jumping in. Earth is always a bit shaky when all the
barriers and rcu mix together.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 17:25                           ` Johannes Weiner
@ 2013-02-12 18:31                             ` Paul E. McKenney
  -1 siblings, 0 replies; 78+ messages in thread
From: Paul E. McKenney @ 2013-02-12 18:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Tue, Feb 12, 2013 at 12:25:26PM -0500, Johannes Weiner wrote:
> On Tue, Feb 12, 2013 at 08:10:51AM -0800, Paul E. McKenney wrote:
> > On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote:
> > > On Tue 12-02-13 10:10:02, Johannes Weiner wrote:
> > > > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote:
> > > > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote:
> > > > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote:
> > > > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> > > > > > > > That way, if the dead count gives the go-ahead, you KNOW that the
> > > > > > > > position cache is valid, because it has been updated first.
> > > > > > > 
> > > > > > > OK, you are right. We can live without css_tryget because dead_count is
> > > > > > > either OK which means that css would be alive at least this rcu period
> > > > > > > (and RCU walk would be safe as well) or it is incremented which means
> > > > > > > that we have started css_offline already and then css is dead already.
> > > > > > > So css_tryget can be dropped.
> > > > > > 
> > > > > > Not quite :)
> > > > > > 
> > > > > > The dead_count check is for completed destructions,
> > > > > 
> > > > > Not quite :P. dead_count is incremented in css_offline callback which is
> > > > > called before the cgroup core releases its last reference and unlinks
> > > > > the group from the siblings. css_tryget would already fail at this stage
> > > > > because CSS_DEACT_BIAS is in place at that time but this doesn't break
> > > > > RCU walk. So I think we are safe even without css_get.
> > > > 
> > > > But you drop the RCU lock before you return.
> > > >
> > > > dead_count IS incremented for every destruction, but it's not reliable
> > > > for concurrent ones, is what I meant.  Again, if there is a dead_count
> > > > mismatch, your pointer might be dangling, easy case.  However, even if
> > > > there is no mismatch, you could still race with a destruction that has
> > > > marked the object dead, and then frees it once you drop the RCU lock,
> > > > so you need try_get() to check if the object is dead, or you could
> > > > return a pointer to freed or soon to be freed memory.
> > > 
> > > Wait a moment. But what prevents the following race?
> > > 
> > > rcu_read_lock()
> > > 						mem_cgroup_css_offline(memcg)
> > > 						root->dead_count++
> > > iter->last_dead_count = root->dead_count
> > > iter->last_visited = memcg
> > > 						// final
> > > 						css_put(memcg);
> > > // last_visited is still valid
> > > rcu_read_unlock()
> > > [...]
> > > // next iteration
> > > rcu_read_lock()
> > > iter->last_dead_count == root->dead_count
> > > // KABOOM
> > > 
> > > The race window between dead_count++ and css_put is quite big but that
> > > is not important because that css_put can happen anytime before we start
> > > the next iteration and take rcu_read_lock.
> > 
> > The usual approach is to make sure that there is a grace period (either
> > synchronize_rcu() or call_rcu()) between the time that the data is
> > made inaccessible to readers (this would be mem_cgroup_css_offline()?)
> > and the time it is freed (css_put(), correct?).
> 
> Absolutely!  And there is a synchronize_rcu() in between those two
> operations.
> 
> However, we want to keep a weak reference to the cgroup after we drop
> the rcu read-side lock, so rcu alone is not enough for us to guarantee
> object life time.  We still have to carefully detect any concurrent
> offlinings in order to validate the weak reference next time around.

That would make things more interesting.  ;-)

Exactly who or what holds the weak reference?  And the idea is that if
you attempt to use the weak reference beforehand, the css_put() does not
actually free it, but if you attempt to use it afterwards, you get some
sort of failure indication?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 18:31                             ` Paul E. McKenney
@ 2013-02-12 19:53                               ` Johannes Weiner
  -1 siblings, 0 replies; 78+ messages in thread
From: Johannes Weiner @ 2013-02-12 19:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michal Hocko, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Tue, Feb 12, 2013 at 10:31:48AM -0800, Paul E. McKenney wrote:
> On Tue, Feb 12, 2013 at 12:25:26PM -0500, Johannes Weiner wrote:
> > On Tue, Feb 12, 2013 at 08:10:51AM -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote:
> > > > On Tue 12-02-13 10:10:02, Johannes Weiner wrote:
> > > > > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote:
> > > > > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote:
> > > > > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote:
> > > > > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> > > > > > > > > That way, if the dead count gives the go-ahead, you KNOW that the
> > > > > > > > > position cache is valid, because it has been updated first.
> > > > > > > > 
> > > > > > > > OK, you are right. We can live without css_tryget because dead_count is
> > > > > > > > either OK which means that css would be alive at least this rcu period
> > > > > > > > (and RCU walk would be safe as well) or it is incremented which means
> > > > > > > > that we have started css_offline already and then css is dead already.
> > > > > > > > So css_tryget can be dropped.
> > > > > > > 
> > > > > > > Not quite :)
> > > > > > > 
> > > > > > > The dead_count check is for completed destructions,
> > > > > > 
> > > > > > Not quite :P. dead_count is incremented in css_offline callback which is
> > > > > > called before the cgroup core releases its last reference and unlinks
> > > > > > the group from the siblings. css_tryget would already fail at this stage
> > > > > > because CSS_DEACT_BIAS is in place at that time but this doesn't break
> > > > > > RCU walk. So I think we are safe even without css_get.
> > > > > 
> > > > > But you drop the RCU lock before you return.
> > > > >
> > > > > dead_count IS incremented for every destruction, but it's not reliable
> > > > > for concurrent ones, is what I meant.  Again, if there is a dead_count
> > > > > mismatch, your pointer might be dangling, easy case.  However, even if
> > > > > there is no mismatch, you could still race with a destruction that has
> > > > > marked the object dead, and then frees it once you drop the RCU lock,
> > > > > so you need try_get() to check if the object is dead, or you could
> > > > > return a pointer to freed or soon to be freed memory.
> > > > 
> > > > Wait a moment. But what prevents the following race?
> > > > 
> > > > rcu_read_lock()
> > > > 						mem_cgroup_css_offline(memcg)
> > > > 						root->dead_count++
> > > > iter->last_dead_count = root->dead_count
> > > > iter->last_visited = memcg
> > > > 						// final
> > > > 						css_put(memcg);
> > > > // last_visited is still valid
> > > > rcu_read_unlock()
> > > > [...]
> > > > // next iteration
> > > > rcu_read_lock()
> > > > iter->last_dead_count == root->dead_count
> > > > // KABOOM
> > > > 
> > > > The race window between dead_count++ and css_put is quite big but that
> > > > is not important because that css_put can happen anytime before we start
> > > > the next iteration and take rcu_read_lock.
> > > 
> > > The usual approach is to make sure that there is a grace period (either
> > > synchronize_rcu() or call_rcu()) between the time that the data is
> > > made inaccessible to readers (this would be mem_cgroup_css_offline()?)
> > > and the time it is freed (css_put(), correct?).
> > 
> > Absolutely!  And there is a synchronize_rcu() in between those two
> > operations.
> > 
> > However, we want to keep a weak reference to the cgroup after we drop
> > the rcu read-side lock, so rcu alone is not enough for us to guarantee
> > object life time.  We still have to carefully detect any concurrent
> > offlinings in order to validate the weak reference next time around.
> 
> That would make things more interesting.  ;-)
> 
> Exactly who or what holds the weak reference?  And the idea is that if
> you attempt to use the weak reference beforehand, the css_put() does not
> actually free it, but if you attempt to use it afterwards, you get some
> sort of failure indication?

Yes, exactly.  We are using a seqlock-style cookie comparison to see
whether any of the objects in the pool that we may point to were
destroyed.  We are having trouble agreeing on how to safely read the
counter :-)

Long version:

It's an iterator over a hierarchy of cgroups, but page reclaim may
stop iteration at will and might not come back for an indefinite
amount of time (until memory pressure triggers reclaim again).  So we
want to allow a cgroup to be destroyed while one of the iterators may
still be pointing at it (we have iterators per node, per zone and per
reclaim priority level, which is why it's not feasible to invalidate
them pro-actively upon cgroup destruction).

The idea is that we have a counter that counts cgroup destructions in
each cgroup hierarchy and we remember a snapshot of that counter at
the time we remember the iterator position.  If any group in that
group's hierarchy gets killed before we come back to the iterator, the
counter mismatches.  Easy.  If any group is getting killed
concurrently, the counter might match our cookie, but the object could
be marked dead already, while rcu prevents it from being freed.  The
remaining worry is/was that we have two reads of the destruction
counter: one when validating the weak reference, another one when
updating the iterator.  If a destruction starts in between those two,
and modifies the counter, we would miss that destruction and the
object that is now weakly referenced could get freed while the
corresponding snapshot matches the latest value of the destruction
counter.  Michal's idea was to hold off the destruction counter inc
between those reads with synchronize_rcu().  My idea was to simply
read the counter only once and use that same value to both check and
update the iterator with.  That should catch this type of race
condition and save the atomic & the extra synchronize_rcu().  At least
I fail to see the downside of reading it only once:

iteration:
rcu_read_lock()
dead_count = atomic_read(&hierarchy->dead_count)
smp_rmb()
previous = iterator->position
if (iterator->dead_count != dead_count)
   /* A cgroup in our hierarchy was killed, pointer might be dangling */
   don't use iterator
if (!tryget(&previous))
   /* The cgroup is marked dead, don't use it */
   don't use iterator
next = find_next_and_tryget(hierarchy, &previous)
/* what happens if destruction of next starts NOW? */
css_put(previous)
iterator->position = next
smp_wmb()
iterator->dead_count = dead_count /* my suggestion, instead of a second atomic_read() */
rcu_read_unlock()
return next /* caller drops ref eventually, iterator->cgroup becomes weak */

destruction:
bias(cgroup->refcount) /* disables future tryget */
//synchronize_rcu() /* Michal's suggestion */
atomic_inc(&cgroup->hierarchy->dead_count)
synchronize_rcu()
free(cgroup)
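
A self-contained userspace analogue of the cookie scheme, in case it
helps to look at it outside of the memcg code; the names and the C11
atomics are obviously not the kernel implementation, the ordering is
simplified, and the tryget/free side is left out:

  #include <stdatomic.h>
  #include <stddef.h>

  struct node { int id; };

  struct iter {
          _Atomic(struct node *) last_visited;  /* weak reference */
          atomic_ulong last_dead_count;         /* cookie taken when cached */
  };

  static atomic_ulong dead_count;               /* bumped on every destruction */

  /* Cache a new position together with the dead_count value that was
   * sampled before the old position was validated (the "read once" idea). */
  static void cache_position(struct iter *it, struct node *pos,
                             unsigned long seen_dead_count)
  {
          atomic_store_explicit(&it->last_visited, pos, memory_order_relaxed);
          /* release pairs with the acquire below: a reader that sees this
           * cookie also sees the pointer stored above (smp_wmb/smp_rmb in
           * the kernel patch) */
          atomic_store_explicit(&it->last_dead_count, seen_dead_count,
                                memory_order_release);
  }

  /* Return the cached position only if no destruction has been recorded
   * since it was cached; otherwise the caller restarts from the root. */
  static struct node *resume_position(struct iter *it, unsigned long *seen)
  {
          struct node *pos;
          unsigned long cookie;

          *seen = atomic_load(&dead_count);
          cookie = atomic_load_explicit(&it->last_dead_count,
                                        memory_order_acquire);
          pos = atomic_load_explicit(&it->last_visited, memory_order_relaxed);

          if (!pos || cookie != *seen)
                  return NULL;    /* possibly dangling, do not touch */
          return pos;             /* still valid; a tryget would follow here */
  }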

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 17:37                                 ` Johannes Weiner
@ 2013-02-13  8:11                                   ` Glauber Costa
  -1 siblings, 0 replies; 78+ messages in thread
From: Glauber Costa @ 2013-02-13  8:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Ying Han, Tejun Heo, Li Zefan

On 02/12/2013 09:37 PM, Johannes Weiner wrote:
>> > All reads from root->dead_count are atomic already, so I am not sure
>> > what you mean here. Anyway, I hope I won't make this even more confusing
>> > if I post what I have right now:
> Yes, but we are doing two reads.  Can't the memcg that we'll store in
> last_visited be offlined during this and be freed after we drop the
> rcu read lock?  If we had just one read, we would detect this
> properly.
> 

I don't want to add any more confusion to an already fun discussion, but
IIUC, you are trying to avoid triggering a second round of reclaim in an
already dead memcg, right?

Can't you generalize the mechanism I use for kmemcg, where a very
similar problem exists? This is how it looks:


  /* this atomically sets a bit in the memcg. It does so
   * unconditionally, and it is (so far) okay if it is set
   * twice
   */
  memcg_kmem_mark_dead(memcg);

  /*
   * Then if kmem charges are not zero, we don't actually destroy the
   * memcg. The function where it lives will always be called when usage
   * reaches 0, so we guarantee that we will never miss the chance to
   * call the destruction function at least once.
   *
   * I suspect you could use a mechanism like this, or extend
   * this very same, to prevent the second reclaim to be even called
   */
  if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
          return;

  /*
   * this is how we guarantee that the destruction function is called at
   * most once. The second caller would see the bit unset.
   */
  if (memcg_kmem_test_and_clear_dead(memcg))
          mem_cgroup_put(memcg);
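
For reference, such a mark-dead/test-and-clear-dead pair can be built
from the atomic bitops; the flag word and the bit name below are
illustrative rather than the exact memcontrol.c code:

  /* assumed to live in struct mem_cgroup:
   *         unsigned long kmem_account_flags;
   */
  #define KMEM_DEAD_BIT   0       /* illustrative bit number */

  static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
  {
          set_bit(KMEM_DEAD_BIT, &memcg->kmem_account_flags);
  }

  static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
  {
          /* atomic: only one caller ever sees the bit set */
          return test_and_clear_bit(KMEM_DEAD_BIT,
                                    &memcg->kmem_account_flags);
  }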


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 19:53                               ` Johannes Weiner
@ 2013-02-13  9:51                                 ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-13  9:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Paul E. McKenney, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Tue 12-02-13 14:53:58, Johannes Weiner wrote:
[...]
> iteration:
> rcu_read_lock()
> dead_count = atomic_read(&hierarchy->dead_count)
> smp_rmb()
> previous = iterator->position
> if (iterator->dead_count != dead_count)
>    /* A cgroup in our hierarchy was killed, pointer might be dangling */
>    don't use iterator
> if (!tryget(&previous))
>    /* The cgroup is marked dead, don't use it */
>    don't use iterator
> next = find_next_and_tryget(hierarchy, &previous)
> /* what happens if destruction of next starts NOW? */

OK, I thought that this depends on the ordering of CSS_DEACT_BIAS and
dead_count writes, because there is no memory ordering enforced between
those two. But it shouldn't matter, because we are checking both: if the
increment is seen sooner, then we do not care about css_tryget, and if
the css is deactivated before dead_count++, then css_tryget would shout.

More interesting ordering, however, is dead_count++ vs. css_put from
cgroup core. Say we have the following:

	CPU0			CPU1			CPU2

iter->position = A;
iter->dead_count = dead_count;
rcu_read_unlock()
return A

mem_cgroup_iter_break
  css_put(A)					bias(A)
  						css_offline()
  						css_put(A) // in cgroup_destroy_locked
							   // last ref and A will be freed
  			rcu_read_lock()
			read parent->dead_count
						parent->dead_count++ // got reordered from css_offline
			css_tryget(A) // kaboom

The reordering would have to span a really huge window, so I think it is
impossible to trigger in real life. Moreover, mem_cgroup_reparent_charges
calls mem_cgroup_start_move unconditionally, which in turn calls
synchronize_rcu(), which is a full barrier AFAIU, so dead_count++ cannot
be reordered at the moment. But should we rely on that? Shouldn't we add
an smp_wmb after dead_count++, as I had in an earlier version of the
patch?

> css_put(previous)
> iterator->position = next
> smp_wmb()
> iterator->dead_count = dead_count /* my suggestion, instead of a second atomic_read() */
> rcu_read_unlock()
> return next /* caller drops ref eventually, iterator->cgroup becomes weak */
> 
> destruction:
> bias(cgroup->refcount) /* disables future tryget */
> //synchronize_rcu() /* Michal's suggestion */
> atomic_inc(&cgroup->hierarchy->dead_count)
> synchronize_rcu()
> free(cgroup)

Other than that this should work. I will update the patch accordingly.
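
For clarity, here is a rough sketch that pulls the pseudocode above
together with the smp_wmb in question; all the names are placeholders
for the real destruction and iteration code, not a final patch:

  /* destruction side (sketch): */
  bias(&cgroup->refcount);                     /* css_tryget() fails from now on */
  atomic_inc(&cgroup->hierarchy->dead_count);
  smp_wmb();                                   /* the barrier being asked about */
  synchronize_rcu();
  free(cgroup);

  /* iteration side (sketch), under rcu_read_lock(): */
  dead_count = atomic_read(&hierarchy->dead_count);
  smp_rmb();                                   /* read dead_count before the cached position */
  previous = iter->position;
  if (iter->dead_count != dead_count || !previous || !tryget(&previous))
          previous = NULL;                     /* cached pointer may be dangling */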

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-12 17:37                                 ` Johannes Weiner
@ 2013-02-13 10:34                                   ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-13 10:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Tue 12-02-13 12:37:41, Johannes Weiner wrote:
> On Tue, Feb 12, 2013 at 06:12:16PM +0100, Michal Hocko wrote:
[...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 727ec39..31bb9b0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
> >  };
> >  
> >  struct mem_cgroup_reclaim_iter {
> > -	/* last scanned hierarchy member with elevated css ref count */
> > +	/*
> > +	 * last scanned hierarchy member. Valid only if last_dead_count
> > +	 * matches memcg->dead_count of the hierarchy root group.
> > +	 */
> >  	struct mem_cgroup *last_visited;
> > +	unsigned int last_dead_count;
> 
> Since we read and write this without a lock, I would feel more
> comfortable if this were a full word, i.e. unsigned long.  That
> guarantees we don't see any partial states.

OK, changed. Although I thought that an int is read/modified atomically
as well, as long as it is aligned to its size.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-13  8:11                                   ` Glauber Costa
@ 2013-02-13 10:38                                     ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-13 10:38 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Johannes Weiner, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Ying Han, Tejun Heo, Li Zefan

On Wed 13-02-13 12:11:59, Glauber Costa wrote:
> On 02/12/2013 09:37 PM, Johannes Weiner wrote:
> >> > All reads from root->dead_count are atomic already, so I am not sure
> >> > what you mean here. Anyway, I hope I won't make this even more confusing
> >> > if I post what I have right now:
> > Yes, but we are doing two reads.  Can't the memcg that we'll store in
> > last_visited be offlined during this and be freed after we drop the
> > rcu read lock?  If we had just one read, we would detect this
> > properly.
> > 
> 
> I don't want to add any more confusion to an already fun discussion, but
> IIUC, you are trying to avoid triggering a second round of reclaim in an
> already dead memcg, right?

No, this is not about a second round of reclaim but rather about the
iteration racing with removal. And we want to handle that as lightweight
as possible. We cannot work with the memcg directly because it might
have disappeared in the meantime, and we do not want to hold a reference
on it because there would be no guarantee that somebody releases it
later on. So mark_dead && test_and_clear_dead would not work in this
context.
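
To illustrate the kind of lightweight scheme this is about, here is a
sketch only; the field and helper names are made up, and the memory
barriers from the other subthread are elided for brevity:

  /*
   * Sketch of the weak caching scheme: the iterator remembers a pointer
   * plus a snapshot of the hierarchy root's dead_count, but takes no
   * reference on the cached group.
   */
  struct reclaim_iter_sketch {
          struct mem_cgroup *last_visited;        /* weak pointer, may be stale */
          unsigned long last_dead_count;          /* snapshot of root's dead_count */
  };

  /* called under rcu_read_lock() */
  static struct mem_cgroup *iter_load_sketch(struct reclaim_iter_sketch *iter,
                                             atomic_t *root_dead_count)
  {
          struct mem_cgroup *pos = NULL;

          if (iter->last_dead_count == atomic_read(root_dead_count)) {
                  /* no group in this hierarchy was removed since we cached it */
                  pos = iter->last_visited;
                  if (pos && !css_tryget(&pos->css))
                          pos = NULL;             /* it is on its way out, skip it */
          }
          return pos;
  }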
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
  2013-02-13 10:34                                   ` Michal Hocko
@ 2013-02-13 12:56                                     ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2013-02-13 12:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo,
	Glauber Costa, Li Zefan

On Wed 13-02-13 11:34:59, Michal Hocko wrote:
> On Tue 12-02-13 12:37:41, Johannes Weiner wrote:
> > On Tue, Feb 12, 2013 at 06:12:16PM +0100, Michal Hocko wrote:
> [...]
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 727ec39..31bb9b0 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
> > >  };
> > >  
> > >  struct mem_cgroup_reclaim_iter {
> > > -	/* last scanned hierarchy member with elevated css ref count */
> > > +	/*
> > > +	 * last scanned hierarchy member. Valid only if last_dead_count
> > > +	 * matches memcg->dead_count of the hierarchy root group.
> > > +	 */
> > >  	struct mem_cgroup *last_visited;
> > > +	unsigned int last_dead_count;
> > 
> > Since we read and write this without a lock, I would feel more
> > comfortable if this were a full word, i.e. unsigned long.  That
> > guarantees we don't see any partial states.
> 
> OK. Changed. Although I though that int is read/modified atomically as
> well if it is aligned to its size.

Ohh, I guess I see what your concern was. If last_dead_count were an
int, it would fit into the same full-word slot as generation, and so a
parallel read-modify-update cycle could be an issue.
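
For illustration, the two layouts being compared; the struct and field
order here are only a sketch, assuming the existing generation counter
sits right after the new field:

  struct reclaim_iter_packed {            /* unsigned int variant */
          struct mem_cgroup *last_visited;
          unsigned int last_dead_count;   /* shares one 64-bit word ... */
          unsigned int generation;        /* ... with this existing field */
  };

  struct reclaim_iter_padded {            /* unsigned long variant */
          struct mem_cgroup *last_visited;
          unsigned long last_dead_count;  /* gets a full word of its own */
          unsigned int generation;
  };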

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2013-02-13 12:56 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-03 17:54 [PATCH v3 0/7] rework mem_cgroup iterator Michal Hocko
2013-01-03 17:54 ` Michal Hocko
2013-01-03 17:54 ` [PATCH v3 1/7] memcg: synchronize per-zone iterator access by a spinlock Michal Hocko
2013-01-03 17:54   ` Michal Hocko
2013-01-03 17:54 ` [PATCH v3 2/7] memcg: keep prev's css alive for the whole mem_cgroup_iter Michal Hocko
2013-01-03 17:54   ` Michal Hocko
2013-01-03 17:54 ` [PATCH v3 3/7] memcg: rework mem_cgroup_iter to use cgroup iterators Michal Hocko
2013-01-03 17:54   ` Michal Hocko
2013-01-03 17:54 ` [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Michal Hocko
2013-01-03 17:54   ` Michal Hocko
2013-01-07  6:18   ` Kamezawa Hiroyuki
2013-01-07  6:18     ` Kamezawa Hiroyuki
2013-02-08 19:33   ` Johannes Weiner
2013-02-08 19:33     ` Johannes Weiner
2013-02-11 15:16     ` Michal Hocko
2013-02-11 15:16       ` Michal Hocko
2013-02-11 17:56       ` Johannes Weiner
2013-02-11 17:56         ` Johannes Weiner
2013-02-11 19:29         ` Michal Hocko
2013-02-11 19:29           ` Michal Hocko
2013-02-11 19:58           ` Johannes Weiner
2013-02-11 19:58             ` Johannes Weiner
2013-02-11 21:27             ` Michal Hocko
2013-02-11 21:27               ` Michal Hocko
2013-02-11 22:07               ` Michal Hocko
2013-02-11 22:07                 ` Michal Hocko
2013-02-11 22:39               ` Johannes Weiner
2013-02-11 22:39                 ` Johannes Weiner
2013-02-12  9:54                 ` Michal Hocko
2013-02-12  9:54                   ` Michal Hocko
2013-02-12 15:10                   ` Johannes Weiner
2013-02-12 15:10                     ` Johannes Weiner
2013-02-12 15:43                     ` Michal Hocko
2013-02-12 15:43                       ` Michal Hocko
2013-02-12 16:10                       ` Paul E. McKenney
2013-02-12 16:10                         ` Paul E. McKenney
2013-02-12 17:25                         ` Johannes Weiner
2013-02-12 17:25                           ` Johannes Weiner
2013-02-12 18:31                           ` Paul E. McKenney
2013-02-12 18:31                             ` Paul E. McKenney
2013-02-12 19:53                             ` Johannes Weiner
2013-02-12 19:53                               ` Johannes Weiner
2013-02-13  9:51                               ` Michal Hocko
2013-02-13  9:51                                 ` Michal Hocko
2013-02-12 17:56                         ` Michal Hocko
2013-02-12 17:56                           ` Michal Hocko
2013-02-12 16:13                       ` Michal Hocko
2013-02-12 16:13                         ` Michal Hocko
2013-02-12 16:24                         ` Michal Hocko
2013-02-12 16:24                           ` Michal Hocko
2013-02-12 16:37                           ` Michal Hocko
2013-02-12 16:37                             ` Michal Hocko
2013-02-12 16:41                           ` Johannes Weiner
2013-02-12 16:41                             ` Johannes Weiner
2013-02-12 17:12                             ` Michal Hocko
2013-02-12 17:12                               ` Michal Hocko
2013-02-12 17:37                               ` Johannes Weiner
2013-02-12 17:37                                 ` Johannes Weiner
2013-02-13  8:11                                 ` Glauber Costa
2013-02-13  8:11                                   ` Glauber Costa
2013-02-13 10:38                                   ` Michal Hocko
2013-02-13 10:38                                     ` Michal Hocko
2013-02-13 10:34                                 ` Michal Hocko
2013-02-13 10:34                                   ` Michal Hocko
2013-02-13 12:56                                   ` Michal Hocko
2013-02-13 12:56                                     ` Michal Hocko
2013-02-12 16:33                       ` Johannes Weiner
2013-02-12 16:33                         ` Johannes Weiner
2013-01-03 17:54 ` [PATCH v3 5/7] memcg: simplify mem_cgroup_iter Michal Hocko
2013-01-03 17:54   ` Michal Hocko
2013-01-03 17:54 ` [PATCH v3 6/7] memcg: further " Michal Hocko
2013-01-03 17:54   ` Michal Hocko
2013-01-03 17:54 ` [PATCH v3 7/7] cgroup: remove css_get_next Michal Hocko
2013-01-03 17:54   ` Michal Hocko
2013-01-04  3:42   ` Li Zefan
2013-01-04  3:42     ` Li Zefan
2013-01-23 12:52 ` [PATCH v3 0/7] rework mem_cgroup iterator Michal Hocko
2013-01-23 12:52   ` Michal Hocko
