linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/3] mm/memcg: fix last_dead_count memory wastage
@ 2014-01-14  1:50 Hugh Dickins
  2014-01-14  1:52 ` [PATCH 2/3] mm/memcg: fix endless iteration in reclaim Hugh Dickins
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Hugh Dickins @ 2014-01-14  1:50 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

Shorten mem_cgroup_reclaim_iter.last_dead_count from unsigned long to
int: it's assigned from an int and compared with an int, and adjacent
to an unsigned int: so there's no point to it being unsigned long,
which wasted 104 bytes in every mem_cgroup_per_zone.
    
Signed-off-by: Hugh Dickins <hughd@google.com>
---
Putting this one first as it should be nicely uncontroversial.
I'm assuming much too late for v3.13, so all 3 diffed against mmotm.

 mm/memcontrol.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- mmotm/mm/memcontrol.c	2014-01-10 18:25:02.236448954 -0800
+++ linux/mm/memcontrol.c	2014-01-12 22:21:10.700570471 -0800
@@ -149,7 +149,7 @@ struct mem_cgroup_reclaim_iter {
 	 * matches memcg->dead_count of the hierarchy root group.
 	 */
 	struct mem_cgroup *last_visited;
-	unsigned long last_dead_count;
+	int last_dead_count;
 
 	/* scan generation, increased every round-trip */
 	unsigned int generation;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
  2014-01-14  1:50 [PATCH 1/3] mm/memcg: fix last_dead_count memory wastage Hugh Dickins
@ 2014-01-14  1:52 ` Hugh Dickins
  2014-01-14 13:27   ` Michal Hocko
  2014-01-14  1:54 ` [PATCH 3/3] mm/memcg: iteration skip memcgs not yet fully initialized Hugh Dickins
  2014-01-14 13:03 ` [PATCH 1/3] mm/memcg: fix last_dead_count memory wastage Michal Hocko
  2 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2014-01-14  1:52 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On one home machine I can easily reproduce (by rmdir of memcgdir during
reclaim) multiple processes stuck looping forever in mem_cgroup_iter():
__mem_cgroup_iter_next() keeps selecting the memcg being destroyed, fails
to tryget it, returns NULL to mem_cgroup_iter(), which goes around again.

It's better to err on the side of leaving the loop too soon than never
when such races occur: once we've served prev (using root if none),
get out the next time __mem_cgroup_iter_next() cannot deliver.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
Securing the tree iterator against such races is difficult, I've
certainly got it wrong myself before.  Although the bug is real, and
deserves a Cc stable, you may want to play around with other solutions
before committing to this one.  The current iterator goes back to v3.12:
I'm really not sure if v3.11 was good or not - I never saw the problem
in the vanilla kernel, but with Google mods in we also had to make an
adjustment, there to stop __mem_cgroup_iter() being called endlessly
from the reclaim level.

 mm/memcontrol.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- mmotm/mm/memcontrol.c	2014-01-10 18:25:02.236448954 -0800
+++ linux/mm/memcontrol.c	2014-01-12 22:21:10.700570471 -0800
@@ -1254,8 +1252,11 @@ struct mem_cgroup *mem_cgroup_iter(struc
 				reclaim->generation = iter->generation;
 		}
 
-		if (prev && !memcg)
+		if (!memcg) {
+			if (!prev)
+				memcg = root;
 			goto out_unlock;
+		}
 	}
 out_unlock:
 	rcu_read_unlock();

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 3/3] mm/memcg: iteration skip memcgs not yet fully initialized
  2014-01-14  1:50 [PATCH 1/3] mm/memcg: fix last_dead_count memory wastage Hugh Dickins
  2014-01-14  1:52 ` [PATCH 2/3] mm/memcg: fix endless iteration in reclaim Hugh Dickins
@ 2014-01-14  1:54 ` Hugh Dickins
  2014-01-14 13:30   ` Michal Hocko
  2014-01-15  8:21   ` Michal Hocko
  2014-01-14 13:03 ` [PATCH 1/3] mm/memcg: fix last_dead_count memory wastage Michal Hocko
  2 siblings, 2 replies; 30+ messages in thread
From: Hugh Dickins @ 2014-01-14  1:54 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

It is surprising that the mem_cgroup iterator can return memcgs which
have not yet been fully initialized.  By accident (or trial and error?)
this appears not to present an actual problem; but it may be better to
prevent such surprises, by skipping memcgs not yet online.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
Decide for yourself whether to take this or not.  I spent quite a while
digging into a mysterious "trying to register non-static key" issue from
lockdep, which originated from the iterator returning a vmalloc'ed memcg
a moment before the res_counter_init()s had done their spin_lock_init()s.
But the backtrace was an odd one of our own mis-devising, not a charge or
reclaim or stats trace, so probably it's never been a problem for vanilla.

 mm/memcontrol.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- mmotm/mm/memcontrol.c	2014-01-10 18:25:02.236448954 -0800
+++ linux/mm/memcontrol.c	2014-01-12 22:21:10.700570471 -0800
@@ -1119,10 +1119,8 @@ skip_node:
 	 * protected by css_get and the tree walk is rcu safe.
 	 */
 	if (next_css) {
-		struct mem_cgroup *mem = mem_cgroup_from_css(next_css);
-
-		if (css_tryget(&mem->css))
-			return mem;
+		if ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))
+			return mem_cgroup_from_css(next_css);
 		else {
 			prev_css = next_css;
 			goto skip_node;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/3] mm/memcg: fix last_dead_count memory wastage
  2014-01-14  1:50 [PATCH 1/3] mm/memcg: fix last_dead_count memory wastage Hugh Dickins
  2014-01-14  1:52 ` [PATCH 2/3] mm/memcg: fix endless iteration in reclaim Hugh Dickins
  2014-01-14  1:54 ` [PATCH 3/3] mm/memcg: iteration skip memcgs not yet fully initialized Hugh Dickins
@ 2014-01-14 13:03 ` Michal Hocko
  2 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2014-01-14 13:03 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Mon 13-01-14 17:50:49, Hugh Dickins wrote:
> Shorten mem_cgroup_reclaim_iter.last_dead_count from unsigned long to
> int: it's assigned from an int and compared with an int, and adjacent
> to an unsigned int: so there's no point to it being unsigned long,
> which wasted 104 bytes in every mem_cgroup_per_zone.
>     
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
> Putting this one first as it should be nicely uncontroversial.
> I'm assuming much too late for v3.13, so all 3 diffed against mmotm.
> 
>  mm/memcontrol.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- mmotm/mm/memcontrol.c	2014-01-10 18:25:02.236448954 -0800
> +++ linux/mm/memcontrol.c	2014-01-12 22:21:10.700570471 -0800
> @@ -149,7 +149,7 @@ struct mem_cgroup_reclaim_iter {
>  	 * matches memcg->dead_count of the hierarchy root group.
>  	 */
>  	struct mem_cgroup *last_visited;
> -	unsigned long last_dead_count;
> +	int last_dead_count;
>  
>  	/* scan generation, increased every round-trip */
>  	unsigned int generation;

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
  2014-01-14  1:52 ` [PATCH 2/3] mm/memcg: fix endless iteration in reclaim Hugh Dickins
@ 2014-01-14 13:27   ` Michal Hocko
  2014-01-14 13:34     ` Michal Hocko
  2014-01-14 14:26     ` Michal Hocko
  0 siblings, 2 replies; 30+ messages in thread
From: Michal Hocko @ 2014-01-14 13:27 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Mon 13-01-14 17:52:30, Hugh Dickins wrote:
> On one home machine I can easily reproduce (by rmdir of memcgdir during
> reclaim) multiple processes stuck looping forever in mem_cgroup_iter():
> __mem_cgroup_iter_next() keeps selecting the memcg being destroyed, fails
> to tryget it, returns NULL to mem_cgroup_iter(), which goes around again.

So you had a single memcg (without any children) and a limit-reclaim
on it when you removed it, right? This is nasty because
__mem_cgroup_iter_next will try to skip it but there is nothing else so
it returns NULL. We update iter->generation++ but that doesn't help us
as prev = NULL as this is the first iteration so
		if (prev && reclaim->generation != iter->generation)

break out will not help us. You patch will surely help I am just not
sure it is the right thing to do. Let me think about this.

Anyway very well spotted!

> It's better to err on the side of leaving the loop too soon than never
> when such races occur: once we've served prev (using root if none),
> get out the next time __mem_cgroup_iter_next() cannot deliver.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> Securing the tree iterator against such races is difficult, I've
> certainly got it wrong myself before.  Although the bug is real, and
> deserves a Cc stable, you may want to play around with other solutions
> before committing to this one.  The current iterator goes back to v3.12:
> I'm really not sure if v3.11 was good or not - I never saw the problem
> in the vanilla kernel, but with Google mods in we also had to make an
> adjustment, there to stop __mem_cgroup_iter() being called endlessly
> from the reclaim level.
> 
>  mm/memcontrol.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> --- mmotm/mm/memcontrol.c	2014-01-10 18:25:02.236448954 -0800
> +++ linux/mm/memcontrol.c	2014-01-12 22:21:10.700570471 -0800
> @@ -1254,8 +1252,11 @@ struct mem_cgroup *mem_cgroup_iter(struc
>  				reclaim->generation = iter->generation;
>  		}
>  
> -		if (prev && !memcg)
> +		if (!memcg) {
> +			if (!prev)
> +				memcg = root;
>  			goto out_unlock;
> +		}
>  	}
>  out_unlock:
>  	rcu_read_unlock();

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/3] mm/memcg: iteration skip memcgs not yet fully initialized
  2014-01-14  1:54 ` [PATCH 3/3] mm/memcg: iteration skip memcgs not yet fully initialized Hugh Dickins
@ 2014-01-14 13:30   ` Michal Hocko
  2014-01-14 14:29     ` Tejun Heo
  2014-01-15  8:21   ` Michal Hocko
  1 sibling, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2014-01-14 13:30 UTC (permalink / raw)
  To: Hugh Dickins, Tejun Heo
  Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Mon 13-01-14 17:54:04, Hugh Dickins wrote:
> It is surprising that the mem_cgroup iterator can return memcgs which
> have not yet been fully initialized.  By accident (or trial and error?)
> this appears not to present an actual problem; but it may be better to
> prevent such surprises, by skipping memcgs not yet online.

My understanding was that !online cgroups are not visible for the
iterator. it is css_online that has to be called before they are made
visible.

Tejun?
 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> Decide for yourself whether to take this or not.  I spent quite a while
> digging into a mysterious "trying to register non-static key" issue from
> lockdep, which originated from the iterator returning a vmalloc'ed memcg
> a moment before the res_counter_init()s had done their spin_lock_init()s.
> But the backtrace was an odd one of our own mis-devising, not a charge or
> reclaim or stats trace, so probably it's never been a problem for vanilla.
> 
>  mm/memcontrol.c |    6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> --- mmotm/mm/memcontrol.c	2014-01-10 18:25:02.236448954 -0800
> +++ linux/mm/memcontrol.c	2014-01-12 22:21:10.700570471 -0800
> @@ -1119,10 +1119,8 @@ skip_node:
>  	 * protected by css_get and the tree walk is rcu safe.
>  	 */
>  	if (next_css) {
> -		struct mem_cgroup *mem = mem_cgroup_from_css(next_css);
> -
> -		if (css_tryget(&mem->css))
> -			return mem;
> +		if ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))
> +			return mem_cgroup_from_css(next_css);
>  		else {
>  			prev_css = next_css;
>  			goto skip_node;

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
  2014-01-14 13:27   ` Michal Hocko
@ 2014-01-14 13:34     ` Michal Hocko
  2014-01-14 14:26     ` Michal Hocko
  1 sibling, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2014-01-14 13:34 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Tue 14-01-14 14:27:27, Michal Hocko wrote:
> On Mon 13-01-14 17:52:30, Hugh Dickins wrote:
> > On one home machine I can easily reproduce (by rmdir of memcgdir during
> > reclaim) multiple processes stuck looping forever in mem_cgroup_iter():
> > __mem_cgroup_iter_next() keeps selecting the memcg being destroyed, fails
> > to tryget it, returns NULL to mem_cgroup_iter(), which goes around again.
> 
> So you had a single memcg (without any children) and a limit-reclaim
> on it when you removed it, right? This is nasty because
> __mem_cgroup_iter_next will try to skip it but there is nothing else so
> it returns NULL. We update iter->generation++ but that doesn't help us
> as prev = NULL as this is the first iteration so
> 		if (prev && reclaim->generation != iter->generation)
> 
> break out will not help us.

And looking closer at it we even do no need to be in reclaim
for_each_mem_cgroup_tree is used from more places and that might race as
well.

> You patch will surely help I am just not
> sure it is the right thing to do. Let me think about this.
> 
> Anyway very well spotted!
> 
> > It's better to err on the side of leaving the loop too soon than never
> > when such races occur: once we've served prev (using root if none),
> > get out the next time __mem_cgroup_iter_next() cannot deliver.
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> > Securing the tree iterator against such races is difficult, I've
> > certainly got it wrong myself before.  Although the bug is real, and
> > deserves a Cc stable, you may want to play around with other solutions
> > before committing to this one.  The current iterator goes back to v3.12:
> > I'm really not sure if v3.11 was good or not - I never saw the problem
> > in the vanilla kernel, but with Google mods in we also had to make an
> > adjustment, there to stop __mem_cgroup_iter() being called endlessly
> > from the reclaim level.
> > 
> >  mm/memcontrol.c |    5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > --- mmotm/mm/memcontrol.c	2014-01-10 18:25:02.236448954 -0800
> > +++ linux/mm/memcontrol.c	2014-01-12 22:21:10.700570471 -0800
> > @@ -1254,8 +1252,11 @@ struct mem_cgroup *mem_cgroup_iter(struc
> >  				reclaim->generation = iter->generation;
> >  		}
> >  
> > -		if (prev && !memcg)
> > +		if (!memcg) {
> > +			if (!prev)
> > +				memcg = root;
> >  			goto out_unlock;
> > +		}
> >  	}
> >  out_unlock:
> >  	rcu_read_unlock();
> 
> -- 
> Michal Hocko
> SUSE Labs
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
  2014-01-14 13:27   ` Michal Hocko
  2014-01-14 13:34     ` Michal Hocko
@ 2014-01-14 14:26     ` Michal Hocko
  2014-01-14 20:42       ` Hugh Dickins
  1 sibling, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2014-01-14 14:26 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Tue 14-01-14 14:27:27, Michal Hocko wrote:
> On Mon 13-01-14 17:52:30, Hugh Dickins wrote:
> > On one home machine I can easily reproduce (by rmdir of memcgdir during
> > reclaim) multiple processes stuck looping forever in mem_cgroup_iter():
> > __mem_cgroup_iter_next() keeps selecting the memcg being destroyed, fails
> > to tryget it, returns NULL to mem_cgroup_iter(), which goes around again.
> 
> So you had a single memcg (without any children) and a limit-reclaim
> on it when you removed it, right?

Hmm, thinking about this once more how can this happen? There must be a
task to trigger the limit reclaim so the cgroup cannot go away (or is
this somehow related to kmem accounting?). Only if the taks was migrated
after the reclaim was initiated but before we started iterating?

I am confused now and have to rush shortly so I will think about it
tomorrow some more.

> This is nasty because __mem_cgroup_iter_next will try to skip it but
> there is nothing else so it returns NULL. We update iter->generation++
> but that doesn't help us as prev = NULL as this is the first iteration
> so
> 		if (prev && reclaim->generation != iter->generation)
> 
> break out will not help us.

> You patch will surely help I am just not sure it is the right thing to
> do. Let me think about this.

The patch is actually not correct after all. You are returning root
memcg without taking a reference. So there is a risk that memcg will
disappear. Although, it is true that the race with removal is not that
probable because mem_cgroup_css_offline (resp. css_free) will see some
pages on LRUs and they will reclaim as well.

Ouch. And thinking about this shows that out_css_put is broken as well
for subtree walks (those that do not start at root_mem_cgroup level). We
need something like the the snippet bellow.
I really hate this code, especially when I tried to de-obfuscate it and
that introduced other subtle issues.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1f9d14e2f8de..f75277b0bf82 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1080,7 +1080,7 @@ skip_node:
 	if (next_css) {
 		struct mem_cgroup *mem = mem_cgroup_from_css(next_css);
 
-		if (css_tryget(&mem->css))
+		if (mem == root_mem_cgroup || css_tryget(&mem->css))
 			return mem;
 		else {
 			prev_css = next_css;
@@ -1219,7 +1219,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 out_unlock:
 	rcu_read_unlock();
 out_css_put:
-	if (prev && prev != root)
+	if (prev && prev != root_mem_cgroup)
 		css_put(&prev->css);
 
 	return memcg;

> Anyway very well spotted!
> 
> > It's better to err on the side of leaving the loop too soon than never
> > when such races occur: once we've served prev (using root if none),
> > get out the next time __mem_cgroup_iter_next() cannot deliver.
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> > Securing the tree iterator against such races is difficult, I've
> > certainly got it wrong myself before.  Although the bug is real, and
> > deserves a Cc stable, you may want to play around with other solutions
> > before committing to this one.  The current iterator goes back to v3.12:
> > I'm really not sure if v3.11 was good or not - I never saw the problem
> > in the vanilla kernel, but with Google mods in we also had to make an
> > adjustment, there to stop __mem_cgroup_iter() being called endlessly
> > from the reclaim level.
> > 
> >  mm/memcontrol.c |    5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > --- mmotm/mm/memcontrol.c	2014-01-10 18:25:02.236448954 -0800
> > +++ linux/mm/memcontrol.c	2014-01-12 22:21:10.700570471 -0800
> > @@ -1254,8 +1252,11 @@ struct mem_cgroup *mem_cgroup_iter(struc
> >  				reclaim->generation = iter->generation;
> >  		}
> >  
> > -		if (prev && !memcg)
> > +		if (!memcg) {
> > +			if (!prev)
> > +				memcg = root;
> >  			goto out_unlock;
> > +		}
> >  	}
> >  out_unlock:
> >  	rcu_read_unlock();
> 
> -- 
> Michal Hocko
> SUSE Labs
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/3] mm/memcg: iteration skip memcgs not yet fully initialized
  2014-01-14 13:30   ` Michal Hocko
@ 2014-01-14 14:29     ` Tejun Heo
  2014-01-15  8:20       ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2014-01-14 14:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

Hey, Michal.

On Tue, Jan 14, 2014 at 02:30:05PM +0100, Michal Hocko wrote:
> On Mon 13-01-14 17:54:04, Hugh Dickins wrote:
> > It is surprising that the mem_cgroup iterator can return memcgs which
> > have not yet been fully initialized.  By accident (or trial and error?)
> > this appears not to present an actual problem; but it may be better to
> > prevent such surprises, by skipping memcgs not yet online.
> 
> My understanding was that !online cgroups are not visible for the
> iterator. it is css_online that has to be called before they are made
> visible.
> 
> Tejun?

>From the comment above css_for_each_descendant_pre()

 * Walk @root's descendants.  @root is included in the iteration and the
 * first node to be visited.  Must be called under rcu_read_lock().  A
 * descendant css which hasn't finished ->css_online() or already has
 * finished ->css_offline() may show up during traversal and it's each
 * subsystem's responsibility to verify that each @pos is alive.
     
What it guarantees is that an online css would *always* show up in the
iteration.  It's kinda difficult to guarantee both directions just
with RCU locking.  You gotta make at least one end loose to make it
work with RCU.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
  2014-01-14 14:26     ` Michal Hocko
@ 2014-01-14 20:42       ` Hugh Dickins
  2014-01-15  9:58         ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2014-01-14 20:42 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Tue, 14 Jan 2014, Michal Hocko wrote:
> On Tue 14-01-14 14:27:27, Michal Hocko wrote:
> > On Mon 13-01-14 17:52:30, Hugh Dickins wrote:
> > > On one home machine I can easily reproduce (by rmdir of memcgdir during
> > > reclaim) multiple processes stuck looping forever in mem_cgroup_iter():
> > > __mem_cgroup_iter_next() keeps selecting the memcg being destroyed, fails
> > > to tryget it, returns NULL to mem_cgroup_iter(), which goes around again.
> > 
> > So you had a single memcg (without any children) and a limit-reclaim
> > on it when you removed it, right?
> 
> Hmm, thinking about this once more how can this happen? There must be a
> task to trigger the limit reclaim so the cgroup cannot go away (or is
> this somehow related to kmem accounting?). Only if the taks was migrated
> after the reclaim was initiated but before we started iterating?

Yes, I believe that's how it comes about (but no kmem accounting:
it's configured in but I'm not setting limits).

The "cg" script I run for testing appended below.  Normally I just run
it as "cg 2" to set up two memcgs, then my dual-tmpfs-kbuild script runs
one kbuild on tmpfs in cg 1, and another kbuild on ext4 on loop on tmpfs
in cg 2, mainly to test swapping.  But for this bug I run it as "cg m",
to repeatedly create new memcg, move running tasks from old to new, and
remove old.

Sometimes I'm doing swapoff and swapon in the background too, but
that's not needed to see this bug.  And although we're accustomed to
move_charge_at_immigrate being a beast, for this bug it's much quicker
to have that turned off.

(Until a couple of months ago, I was working in /cg/1 and /cg/2; but
have now pushed down a level to /cg/cg/1 and /cg/cg/2 after realizing
that working at the root would miss some important issues - in particular
the mem_cgroup_reparent_charges wrong-usage hang; but in fact I have
*never* caught that here, just know that it still exists from some
Google watchdog dumps, but we've still not identified the cause -
seen even without MEMCG_SWAP and with Hannes's extra reparent_charges.)

> 
> I am confused now and have to rush shortly so I will think about it
> tomorrow some more.

Thanks, yes, I knew it's one you'd want to think about first: no rush.

> 
> > This is nasty because __mem_cgroup_iter_next will try to skip it but
> > there is nothing else so it returns NULL. We update iter->generation++
> > but that doesn't help us as prev = NULL as this is the first iteration
> > so
> > 		if (prev && reclaim->generation != iter->generation)
> > 
> > break out will not help us.
> 
> > You patch will surely help I am just not sure it is the right thing to
> > do. Let me think about this.
> 
> The patch is actually not correct after all. You are returning root
> memcg without taking a reference. So there is a risk that memcg will
> disappear. Although, it is true that the race with removal is not that
> probable because mem_cgroup_css_offline (resp. css_free) will see some
> pages on LRUs and they will reclaim as well.
> 
> Ouch. And thinking about this shows that out_css_put is broken as well
> for subtree walks (those that do not start at root_mem_cgroup level). We
> need something like the the snippet bellow.

It's the out_css_put precedent that I was following in not incrementing
for the root.  I think that's been discussed in the past, and rightly or
wrongly we've concluded that the caller of mem_cgroup_iter() always has
some hold on the root, which makes it safe to skip get/put on it here.
No doubt one of those many short cuts to avoid memcg overhead when
there's no memcg other than the root_mem_cgroup.

I've not given enough thought to whether that is still a good assumption.
The try_charge route does a css_tryget, and that will be the hold on the
root in the reclaim case, won't it?  And its css_tryget succeeding does
not guarantee that a css_tryget a moment later will also succeed, which
is what happens in this bug.

But I have not attempted to audit other uses of mem_cgroup_iter() and
for_each_mem_cgroup_tree().  I've not hit any problems from them, but
may not have exercised those paths at all.  And the question of
whether there's a good hold on the root is a separate issue, really.

Hugh

> I really hate this code, especially when I tried to de-obfuscate it and
> that introduced other subtle issues.
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1f9d14e2f8de..f75277b0bf82 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1080,7 +1080,7 @@ skip_node:
>  	if (next_css) {
>  		struct mem_cgroup *mem = mem_cgroup_from_css(next_css);
>  
> -		if (css_tryget(&mem->css))
> +		if (mem == root_mem_cgroup || css_tryget(&mem->css))
>  			return mem;
>  		else {
>  			prev_css = next_css;
> @@ -1219,7 +1219,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  out_unlock:
>  	rcu_read_unlock();
>  out_css_put:
> -	if (prev && prev != root)
> +	if (prev && prev != root_mem_cgroup)
>  		css_put(&prev->css);
>  
>  	return memcg;
> 
> > Anyway very well spotted!
> > 
> > > It's better to err on the side of leaving the loop too soon than never
> > > when such races occur: once we've served prev (using root if none),
> > > get out the next time __mem_cgroup_iter_next() cannot deliver.
> > > 
> > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > ---
> > > Securing the tree iterator against such races is difficult, I've
> > > certainly got it wrong myself before.  Although the bug is real, and
> > > deserves a Cc stable, you may want to play around with other solutions
> > > before committing to this one.  The current iterator goes back to v3.12:
> > > I'm really not sure if v3.11 was good or not - I never saw the problem
> > > in the vanilla kernel, but with Google mods in we also had to make an
> > > adjustment, there to stop __mem_cgroup_iter() being called endlessly
> > > from the reclaim level.
> > > 
> > >  mm/memcontrol.c |    5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > --- mmotm/mm/memcontrol.c	2014-01-10 18:25:02.236448954 -0800
> > > +++ linux/mm/memcontrol.c	2014-01-12 22:21:10.700570471 -0800
> > > @@ -1254,8 +1252,11 @@ struct mem_cgroup *mem_cgroup_iter(struc
> > >  				reclaim->generation = iter->generation;
> > >  		}
> > >  
> > > -		if (prev && !memcg)
> > > +		if (!memcg) {
> > > +			if (!prev)
> > > +				memcg = root;
> > >  			goto out_unlock;
> > > +		}
> > >  	}
> > >  out_unlock:
> > >  	rcu_read_unlock();
> > 
> > -- 
> > Michal Hocko
> > SUSE Labs
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> -- 
> Michal Hocko
> SUSE Labs

#!/bin/sh
seconds=60

move() {
	s=$1
	d=$2
	mkdir $cg/$d || exit $?
	echo $mem >$cg/$d/memory.limit_in_bytes || exit $?
	[ -z "$soft" ] || echo $soft >$cg/$d/memory.soft_limit_in_bytes || exit $?
	[ -z "$swam" ] || echo $swam >$cg/$d/memory.memsw.limit_in_bytes || exit $?
#	echo 3 >$cg/$d/memory.move_charge_at_immigrate || exit $?
	tasks=0
	while [ $tasks -lt 4 ]
	do	sleep 1
		[ -f /tmp/swapswap ] || exit 0
		set -- `wc -l $cg/$s/tasks`
		tasks=$1
	done
	while :
	do
		while :
		do	read task <$cg/$s/tasks  || break
			echo $task >$cg/$d/tasks # but might have gone already
			[ -f /tmp/swapswap ] || exit 0
		done	2>/dev/null
		rmdir $cg/$s 2>/dev/null && break
		sleep 1
		[ -f /tmp/swapswap ] || exit 0
	done
	sleep $seconds
	[ -f /tmp/swapswap ] || exit 0
}

case "x$1" in
x)	mem=700M ; soft=""  ; memcgs=1 ;;
x1)	mem=700M ; soft=""  ; memcgs=1 ;;
x2)	mem=300M ; soft=250M; memcgs=2 ;;
xm)	mem=300M ; soft=250M; memcgs=2 ;;
*)	mem=$1   ; soft=""  ; memcgs=1 ;;
esac

> /tmp/swapswap
mkdir -p /cg || exit $?
[ -f /cg/memory.usage_in_bytes ] ||
	mount -t cgroup -o memory cg /cg || exit $?
[ -f /cg/memory.memsw.usage_in_bytes ] && swam=1000M || swam=""
echo 1 >/cg/memory.use_hierarchy || exit $?

cg=/cg/cg
mkdir -p $cg || exit $?
echo 1000M >$cg/memory.limit_in_bytes
[ -z "$soft" ] || echo  800M >$cg/memory.soft_limit_in_bytes
[ -z "$swam" ] || echo 2000M >$cg/memory.memsw.limit_in_bytes
echo $$ >$cg/tasks

i=0
while [ $i -lt $memcgs ]
do	let i=$i+1
	mkdir -p $cg/$i || exit $?
	echo $mem >$cg/$i/memory.limit_in_bytes || exit $?
	[ -z "$soft" ] || echo $soft >$cg/$i/memory.soft_limit_in_bytes || exit $?
	[ -z "$swam" ] || echo $swam >$cg/$i/memory.memsw.limit_in_bytes || swam=""

	chmod a+w $cg/$i/tasks || exit $?
done
while [ $i -lt 4 ]
do	let i=$i+1
	[ ! -d $cg/$i ] || rmdir $cg/$i || exit $?
done
[ "x$1" = xm ] || exit 0

while :
do	move 1 3
	move 2 4
	move 3 1
	move 4 2
done &

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/3] mm/memcg: iteration skip memcgs not yet fully initialized
  2014-01-14 14:29     ` Tejun Heo
@ 2014-01-15  8:20       ` Michal Hocko
  0 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2014-01-15  8:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Tue 14-01-14 09:29:04, Tejun Heo wrote:
> Hey, Michal.
> 
> On Tue, Jan 14, 2014 at 02:30:05PM +0100, Michal Hocko wrote:
> > On Mon 13-01-14 17:54:04, Hugh Dickins wrote:
> > > It is surprising that the mem_cgroup iterator can return memcgs which
> > > have not yet been fully initialized.  By accident (or trial and error?)
> > > this appears not to present an actual problem; but it may be better to
> > > prevent such surprises, by skipping memcgs not yet online.
> > 
> > My understanding was that !online cgroups are not visible for the
> > iterator. it is css_online that has to be called before they are made
> > visible.
> > 
> > Tejun?
> 
> From the comment above css_for_each_descendant_pre()
> 
>  * Walk @root's descendants.  @root is included in the iteration and the
>  * first node to be visited.  Must be called under rcu_read_lock().  A
>  * descendant css which hasn't finished ->css_online() or already has
>  * finished ->css_offline() may show up during traversal and it's each
>  * subsystem's responsibility to verify that each @pos is alive.

/me slaps self. I was even reviewing patches which introduced that.
But still I managed to convince myself that online means before online
rather than right after again and again.

Sorry about the confusion.

> What it guarantees is that an online css would *always* show up in the
> iteration.  It's kinda difficult to guarantee both directions just
> with RCU locking.  You gotta make at least one end loose to make it
> work with RCU.
> 
> Thanks.
> 
> -- 
> tejun

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/3] mm/memcg: iteration skip memcgs not yet fully initialized
  2014-01-14  1:54 ` [PATCH 3/3] mm/memcg: iteration skip memcgs not yet fully initialized Hugh Dickins
  2014-01-14 13:30   ` Michal Hocko
@ 2014-01-15  8:21   ` Michal Hocko
  1 sibling, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2014-01-15  8:21 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Mon 13-01-14 17:54:04, Hugh Dickins wrote:
> It is surprising that the mem_cgroup iterator can return memcgs which
> have not yet been fully initialized.  By accident (or trial and error?)
> this appears not to present an actual problem; but it may be better to
> prevent such surprises, by skipping memcgs not yet online.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

This makes a perfect sense now after Tejun pointed out that I was wrong
assuming css_online is called before cgroup is made visible.

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
> Decide for yourself whether to take this or not.  I spent quite a while
> digging into a mysterious "trying to register non-static key" issue from
> lockdep, which originated from the iterator returning a vmalloc'ed memcg
> a moment before the res_counter_init()s had done their spin_lock_init()s.
> But the backtrace was an odd one of our own mis-devising, not a charge or
> reclaim or stats trace, so probably it's never been a problem for vanilla.
> 
>  mm/memcontrol.c |    6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> --- mmotm/mm/memcontrol.c	2014-01-10 18:25:02.236448954 -0800
> +++ linux/mm/memcontrol.c	2014-01-12 22:21:10.700570471 -0800
> @@ -1119,10 +1119,8 @@ skip_node:
>  	 * protected by css_get and the tree walk is rcu safe.
>  	 */
>  	if (next_css) {
> -		struct mem_cgroup *mem = mem_cgroup_from_css(next_css);
> -
> -		if (css_tryget(&mem->css))
> -			return mem;
> +		if ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))
> +			return mem_cgroup_from_css(next_css);
>  		else {
>  			prev_css = next_css;
>  			goto skip_node;

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
  2014-01-14 20:42       ` Hugh Dickins
@ 2014-01-15  9:58         ` Michal Hocko
  2014-01-15 12:17           ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2014-01-15  9:58 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Tue 14-01-14 12:42:28, Hugh Dickins wrote:
> On Tue, 14 Jan 2014, Michal Hocko wrote:
> > On Tue 14-01-14 14:27:27, Michal Hocko wrote:
> > > On Mon 13-01-14 17:52:30, Hugh Dickins wrote:
> > > > On one home machine I can easily reproduce (by rmdir of memcgdir during
> > > > reclaim) multiple processes stuck looping forever in mem_cgroup_iter():
> > > > __mem_cgroup_iter_next() keeps selecting the memcg being destroyed, fails
> > > > to tryget it, returns NULL to mem_cgroup_iter(), which goes around again.
> > > 
> > > So you had a single memcg (without any children) and a limit-reclaim
> > > on it when you removed it, right?
> > 
> > Hmm, thinking about this once more how can this happen? There must be a
> > task to trigger the limit reclaim so the cgroup cannot go away (or is
> > this somehow related to kmem accounting?). Only if the taks was migrated
> > after the reclaim was initiated but before we started iterating?
> 
> Yes, I believe that's how it comes about (but no kmem accounting:
> it's configured in but I'm not setting limits).

OK, then it makes more sense now.

> The "cg" script I run for testing appended below.  Normally I just run
> it as "cg 2" to set up two memcgs, then my dual-tmpfs-kbuild script runs
> one kbuild on tmpfs in cg 1, and another kbuild on ext4 on loop on tmpfs
> in cg 2, mainly to test swapping.  But for this bug I run it as "cg m",
> to repeatedly create new memcg, move running tasks from old to new, and
> remove old.

thanks I will play with it.

> Sometimes I'm doing swapoff and swapon in the background too, but
> that's not needed to see this bug.  And although we're accustomed to
> move_charge_at_immigrate being a beast, for this bug it's much quicker
> to have that turned off.

yes, that makes sense becuase the task doesn't take its memory with it
and the reclaim contains even without any task in the group.

> (Until a couple of months ago, I was working in /cg/1 and /cg/2; but
> have now pushed down a level to /cg/cg/1 and /cg/cg/2 after realizing
> that working at the root would miss some important issues - in particular
> the mem_cgroup_reparent_charges wrong-usage hang; but in fact I have
> *never* caught that here, just know that it still exists from some
> Google watchdog dumps, but we've still not identified the cause -
> seen even without MEMCG_SWAP and with Hannes's extra reparent_charges.)

Interesting. There is another user seeing hangs with the reparent
workaround as well. I still didn't get to his test case because of other
interal work that is due.

> > I am confused now and have to rush shortly so I will think about it
> > tomorrow some more.
> 
> Thanks, yes, I knew it's one you'd want to think about first: no rush.
> 
> > 
> > > This is nasty because __mem_cgroup_iter_next will try to skip it but
> > > there is nothing else so it returns NULL. We update iter->generation++
> > > but that doesn't help us as prev = NULL as this is the first iteration
> > > so
> > > 		if (prev && reclaim->generation != iter->generation)
> > > 
> > > break out will not help us.
> > 
> > > You patch will surely help I am just not sure it is the right thing to
> > > do. Let me think about this.
> > 
> > The patch is actually not correct after all. You are returning root
> > memcg without taking a reference. So there is a risk that memcg will
> > disappear. Although, it is true that the race with removal is not that
> > probable because mem_cgroup_css_offline (resp. css_free) will see some
> > pages on LRUs and they will reclaim as well.
> > 
> > Ouch. And thinking about this shows that out_css_put is broken as well
> > for subtree walks (those that do not start at root_mem_cgroup level). We
> > need something like the the snippet bellow.
> 
> It's the out_css_put precedent that I was following in not incrementing
> for the root.  I think that's been discussed in the past, and rightly or
> wrongly we've concluded that the caller of mem_cgroup_iter() always has
> some hold on the root, which makes it safe to skip get/put on it here.
> No doubt one of those many short cuts to avoid memcg overhead when
> there's no memcg other than the root_mem_cgroup.

That might be true but I guess it makes sense to get rid of some subtle
assumptions. Especially now that we have an effective per-cpu ref.
counting for css.

> I've not given enough thought to whether that is still a good assumption.
> The try_charge route does a css_tryget, and that will be the hold on the
> root in the reclaim case, won't it?

Yes, that is what I realized when looked at the code yesterday and tried
to handle by the snippet. I will post it as a separate patch.

> And its css_tryget succeeding does not guarantee that a css_tryget a
> moment later will also succeed, which is what happens in this bug.

Once the tryget succeeds then we get the memcg so another tryget won't
be needed. Or am I missing something? What I tried to tell is that you
should do css_get(root) in your escape path.

> But I have not attempted to audit other uses of mem_cgroup_iter() and
> for_each_mem_cgroup_tree().  I've not hit any problems from them, but
> may not have exercised those paths at all.  And the question of
> whether there's a good hold on the root is a separate issue, really.
> 
> Hugh
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
  2014-01-15  9:58         ` Michal Hocko
@ 2014-01-15 12:17           ` Michal Hocko
  2014-01-15 21:24             ` Hugh Dickins
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2014-01-15 12:17 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Wed 15-01-14 10:58:29, Michal Hocko wrote:
> On Tue 14-01-14 12:42:28, Hugh Dickins wrote:
> > On Tue, 14 Jan 2014, Michal Hocko wrote:
[...]
> > > Ouch. And thinking about this shows that out_css_put is broken as well
> > > for subtree walks (those that do not start at root_mem_cgroup level). We
> > > need something like the the snippet bellow.
> > 
> > It's the out_css_put precedent that I was following in not incrementing
> > for the root.  I think that's been discussed in the past, and rightly or
> > wrongly we've concluded that the caller of mem_cgroup_iter() always has
> > some hold on the root, which makes it safe to skip get/put on it here.
> > No doubt one of those many short cuts to avoid memcg overhead when
> > there's no memcg other than the root_mem_cgroup.
> 
> That might be true but I guess it makes sense to get rid of some subtle
> assumptions. Especially now that we have an effective per-cpu ref.
> counting for css.

OK, I finally found some time to think about this some more and it seems
that the issue you have reported and the above issue are in fact
identical. css reference counting optimization in fact also prevent from
the endless loop you are seeing here because we simply didn't call
css_tryget on the root...

Therefore I guess we should reintroduce the optimization. What do you
think about the following? This is on top of the current mmotm but it
certainly needs backporting to the stable kernels.
---
>From 560924e86059947ab9418732cb329ad149dd8f6a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 15 Jan 2014 11:52:09 +0100
Subject: [PATCH] memcg: fix css reference leak from mem_cgroup_iter

19f39402864e (memcg: simplify mem_cgroup_iter) has introduced a css
refrence leak (thus memory leak) because mem_cgroup_iter makes sure it
doesn't put a css reference on the root of the tree walk. The mentioned
commit however dropped the root check when the css reference is taken
while it keept the css_put optimization fora the root in place.

This means that css_put is not called and so css along with mem_cgroup
and other cgroup internal object tied by css lifetime are never freed.

Fix the issue by reintroducing root check in __mem_cgroup_iter_next.

This patch also fixes issue reported by Hugh Dickins when
mem_cgroup_iter might end up in an endless loop because a group which is
under hard limit reclaim is removed in parallel with iteration.
__mem_cgroup_iter_next would always return NULL because css_tryget on
the root (reclaimed memcg) would fail and there are no other memcg in
the hierarchy. prev == NULL in mem_cgroup_iter would prevent break out
from the root and so the while (!memcg) loop would never terminate.
as css_tryget is no longer called for the root of the tree walk this
doesn't happen anymore.

Cc: stable@vger.kernel.org # 3.10+
Reported-and-debugged-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f016d26adfd3..dd3974c9f08d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1078,7 +1078,8 @@ skip_node:
 	 * protected by css_get and the tree walk is rcu safe.
 	 */
 	if (next_css) {
-		if ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))
+		if ((next_css->flags & CSS_ONLINE) &&
+				(next_css == root || css_tryget(next_css)))
 			return mem_cgroup_from_css(next_css);
 		else {
 			prev_css = next_css;
-- 
1.8.5.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
  2014-01-15 12:17           ` Michal Hocko
@ 2014-01-15 21:24             ` Hugh Dickins
  2014-01-16  8:17               ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2014-01-15 21:24 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Wed, 15 Jan 2014, Michal Hocko wrote:
> On Wed 15-01-14 10:58:29, Michal Hocko wrote:
> > On Tue 14-01-14 12:42:28, Hugh Dickins wrote:
> > > On Tue, 14 Jan 2014, Michal Hocko wrote:
> [...]
> > > > Ouch. And thinking about this shows that out_css_put is broken as well
> > > > for subtree walks (those that do not start at root_mem_cgroup level). We
> > > > need something like the the snippet bellow.
> > > 
> > > It's the out_css_put precedent that I was following in not incrementing
> > > for the root.  I think that's been discussed in the past, and rightly or
> > > wrongly we've concluded that the caller of mem_cgroup_iter() always has
> > > some hold on the root, which makes it safe to skip get/put on it here.
> > > No doubt one of those many short cuts to avoid memcg overhead when
> > > there's no memcg other than the root_mem_cgroup.
> > 
> > That might be true but I guess it makes sense to get rid of some subtle
> > assumptions. Especially now that we have an effective per-cpu ref.
> > counting for css.
> 
> OK, I finally found some time to think about this some more and it seems
> that the issue you have reported and the above issue are in fact
> identical. css reference counting optimization in fact also prevent from
> the endless loop you are seeing here because we simply didn't call
> css_tryget on the root...

Wow.  I don't see them as the same issue, but yes, one fix for both.
I completely missed that: so we've been leaking a "struct mem_cgroup"
and its attached stuff, for any memcg on which reclaim or other iteration
had been done, from v3.10 onwards?

Or am I confused and overstating it?  I'd have sworn I've checked for
memcg leaks myself after doing tests, and not realized this; but now I
put in a count of those allocated, yes, I see it going up and up (with
my half-fix in to proceed beyond the hang) without your fix, but staying
steady with your fix in (which also gets around my hang).

> 
> Therefore I guess we should reintroduce the optimization. What do you

It's not a question of reintroducing an optimization, but of either
fixing the broken end of the optimization, or ripping the other end out.
At this point I'm for simply fixing what we know is broken; then later
one of us audit the other iterators to check the original assumption
behind the optimization is still valid (that callers of mem_cgroup_iter
have some kind of hold on the root they call it for).

Ah.  But perhaps the unfreeable memcg has been protecting us from
nasties which can now emerge.  Hmm: I like your fix, but it's not
something to rush into mainline and stable immediately.  We'll have
to give it some exposure first.

And given the confusion, and progressive little modifications in
just these few lines of code, I wonder if it won't be easier for
stable to combine the CSS_ONLINE one into this to make a single
patch.  It is a separate issue, but similar, and now it's clear
that it's what you intended originally, I don't think it would be
inappropriate to make a single patch correcting those few lines,
with log comment listing the three issues.


> think about the following? This is on top of the current mmotm but it
> certainly needs backporting to the stable kernels.
> ---
> From 560924e86059947ab9418732cb329ad149dd8f6a Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Wed, 15 Jan 2014 11:52:09 +0100
> Subject: [PATCH] memcg: fix css reference leak from mem_cgroup_iter
> 
> 19f39402864e (memcg: simplify mem_cgroup_iter) has introduced a css
> refrence leak (thus memory leak) because mem_cgroup_iter makes sure it
> doesn't put a css reference on the root of the tree walk. The mentioned
> commit however dropped the root check when the css reference is taken
> while it keept the css_put optimization fora the root in place.
> 
> This means that css_put is not called and so css along with mem_cgroup
> and other cgroup internal object tied by css lifetime are never freed.
> 
> Fix the issue by reintroducing root check in __mem_cgroup_iter_next.
> 
> This patch also fixes issue reported by Hugh Dickins when
> mem_cgroup_iter might end up in an endless loop because a group which is
> under hard limit reclaim is removed in parallel with iteration.
> __mem_cgroup_iter_next would always return NULL because css_tryget on
> the root (reclaimed memcg) would fail and there are no other memcg in
> the hierarchy. prev == NULL in mem_cgroup_iter would prevent break out
> from the root and so the while (!memcg) loop would never terminate.
> as css_tryget is no longer called for the root of the tree walk this
> doesn't happen anymore.
> 
> Cc: stable@vger.kernel.org # 3.10+
> Reported-and-debugged-by: Hugh Dickins <hughd@google.com>

Definitely not debugged by me!  Debugged and understood by you.

> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/memcontrol.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f016d26adfd3..dd3974c9f08d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1078,7 +1078,8 @@ skip_node:
>  	 * protected by css_get and the tree walk is rcu safe.
>  	 */
>  	if (next_css) {
> -		if ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))
> +		if ((next_css->flags & CSS_ONLINE) &&
> +				(next_css == root || css_tryget(next_css)))

Not quite: next_css points to one thing and root to another.
>  			return mem_cgroup_from_css(next_css);
>  		else {
>  			prev_css = next_css;
> -- 

This is how I've re-written that block, and started testing on it;
the unnecessary "else {" part was looking increasingly ugly to me
(though let loose on it, I might change it all around more...)

	if (next_css) {
		if ((next_css->flags & CSS_ONLINE) &&
		    (next_css == &root->css || css_tryget(next_css)))
			return mem_cgroup_from_css(next_css);
		prev_css = next_css;
		goto skip_node;
	}

Sorry for being so slow to respond, by the way: for a couple of hours
I couldn't test at all, and thought I was going mad - one day I send
you that "cg" script, the next day it starts to break, it couldn't
"mkdir -p /cg/cg", claiming it already existed, wha???  Turns out the
fix for that has gone into yesterday's mmotm (though I've not had time
to move on to that yet): uninitialized ret in memcg_propagate_kmem().

Hugh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
  2014-01-15 21:24             ` Hugh Dickins
@ 2014-01-16  8:17               ` Michal Hocko
  2014-01-16 15:22                 ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2014-01-16  8:17 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Wed 15-01-14 13:24:34, Hugh Dickins wrote:
> On Wed, 15 Jan 2014, Michal Hocko wrote:
> > On Wed 15-01-14 10:58:29, Michal Hocko wrote:
> > > On Tue 14-01-14 12:42:28, Hugh Dickins wrote:
> > > > On Tue, 14 Jan 2014, Michal Hocko wrote:
> > [...]
> > > > > Ouch. And thinking about this shows that out_css_put is broken as well
> > > > > for subtree walks (those that do not start at root_mem_cgroup level). We
> > > > > need something like the the snippet bellow.
> > > > 
> > > > It's the out_css_put precedent that I was following in not incrementing
> > > > for the root.  I think that's been discussed in the past, and rightly or
> > > > wrongly we've concluded that the caller of mem_cgroup_iter() always has
> > > > some hold on the root, which makes it safe to skip get/put on it here.
> > > > No doubt one of those many short cuts to avoid memcg overhead when
> > > > there's no memcg other than the root_mem_cgroup.
> > > 
> > > That might be true but I guess it makes sense to get rid of some subtle
> > > assumptions. Especially now that we have an effective per-cpu ref.
> > > counting for css.
> > 
> > OK, I finally found some time to think about this some more and it seems
> > that the issue you have reported and the above issue are in fact
> > identical. css reference counting optimization in fact also prevent from
> > the endless loop you are seeing here because we simply didn't call
> > css_tryget on the root...
> 
> Wow.  I don't see them as the same issue, but yes, one fix for both.

Yeah, they have an identical culprit, that's what I wanted to write but
then little dwarfs have have changed that to make me sound funny.

> I completely missed that: so we've been leaking a "struct mem_cgroup"
> and its attached stuff, for any memcg on which reclaim or other iteration
> had been done, from v3.10 onwards?

I am afraid so :/
I've added a simple trace_printk into mem_cgroup_css_free just to be
sure and it didn't show up after rmdir THAT_GROUP

> Or am I confused and overstating it?  I'd have sworn I've checked for
> memcg leaks myself after doing tests, and not realized this; but now I
> put in a count of those allocated, yes, I see it going up and up (with
> my half-fix in to proceed beyond the hang) without your fix, but staying
> steady with your fix in (which also gets around my hang).
> 
> > 
> > Therefore I guess we should reintroduce the optimization. What do you
> 
> It's not a question of reintroducing an optimization, but of either
> fixing the broken end of the optimization, or ripping the other end out.

Heh, I didn't have a better name for the root css exclusion so I kept
calling it optimization.

> At this point I'm for simply fixing what we know is broken; then later
> one of us audit the other iterators to check the original assumption
> behind the optimization is still valid (that callers of mem_cgroup_iter
> have some kind of hold on the root they call it for).

Agreed.

> Ah.  But perhaps the unfreeable memcg has been protecting us from
> nasties which can now emerge.  Hmm: I like your fix, but it's not
> something to rush into mainline and stable immediately.  We'll have
> to give it some exposure first.

I would like to get rid of the root css refcounting optimization as well
but when I was thinking about it I figured it would be safer to go back
what we had in 3.10 before my patch oversimplified that code. Later
fixes can be built on top.

Stable backport will be safer as well IMO.

> And given the confusion, and progressive little modifications in
> just these few lines of code, I wonder if it won't be easier for
> stable to combine the CSS_ONLINE one into this to make a single
> patch.  It is a separate issue, but similar, and now it's clear
> that it's what you intended originally, I don't think it would be
> inappropriate to make a single patch correcting those few lines,
> with log comment listing the three issues.

I am not sure how much combining the two would be helpful. This one
already fixes 2 issues. But if you think it is worthwhile then I won't
block it. 

> > think about the following? This is on top of the current mmotm but it
> > certainly needs backporting to the stable kernels.
> > ---
> > From 560924e86059947ab9418732cb329ad149dd8f6a Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.cz>
> > Date: Wed, 15 Jan 2014 11:52:09 +0100
> > Subject: [PATCH] memcg: fix css reference leak from mem_cgroup_iter
> > 
> > 19f39402864e (memcg: simplify mem_cgroup_iter) has introduced a css
> > refrence leak (thus memory leak) because mem_cgroup_iter makes sure it
> > doesn't put a css reference on the root of the tree walk. The mentioned
> > commit however dropped the root check when the css reference is taken
> > while it keept the css_put optimization fora the root in place.
> > 
> > This means that css_put is not called and so css along with mem_cgroup
> > and other cgroup internal object tied by css lifetime are never freed.
> > 
> > Fix the issue by reintroducing root check in __mem_cgroup_iter_next.
> > 
> > This patch also fixes issue reported by Hugh Dickins when
> > mem_cgroup_iter might end up in an endless loop because a group which is
> > under hard limit reclaim is removed in parallel with iteration.
> > __mem_cgroup_iter_next would always return NULL because css_tryget on
> > the root (reclaimed memcg) would fail and there are no other memcg in
> > the hierarchy. prev == NULL in mem_cgroup_iter would prevent break out
> > from the root and so the while (!memcg) loop would never terminate.
> > as css_tryget is no longer called for the root of the tree walk this
> > doesn't happen anymore.
> > 
> > Cc: stable@vger.kernel.org # 3.10+
> > Reported-and-debugged-by: Hugh Dickins <hughd@google.com>
> 
> Definitely not debugged by me!  Debugged and understood by you.

You still have debugged the second part of the problem (endless loop).
But I will go with whatever tag you like.

> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > ---
> >  mm/memcontrol.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index f016d26adfd3..dd3974c9f08d 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1078,7 +1078,8 @@ skip_node:
> >  	 * protected by css_get and the tree walk is rcu safe.
> >  	 */
> >  	if (next_css) {
> > -		if ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))
> > +		if ((next_css->flags & CSS_ONLINE) &&
> > +				(next_css == root || css_tryget(next_css)))
> 
> Not quite: next_css points to one thing and root to another.

Dohh, right you are next_css == root->css. I was wondering how I was
able to see the leak being fixed and then realized that root->css has
the same address as root...
Anyway very well spotted.

> >  			return mem_cgroup_from_css(next_css);
> >  		else {
> >  			prev_css = next_css;
> > -- 
> 
> This is how I've re-written that block, and started testing on it;
> the unnecessary "else {" part was looking increasingly ugly to me
> (though let loose on it, I might change it all around more...)
> 
> 	if (next_css) {
> 		if ((next_css->flags & CSS_ONLINE) &&
> 		    (next_css == &root->css || css_tryget(next_css)))
> 			return mem_cgroup_from_css(next_css);
> 		prev_css = next_css;
> 		goto skip_node;
> 	}

Yes, that looks better. Maybe put a blank line before prev_css = next_css?

> Sorry for being so slow to respond, by the way: for a couple of hours
> I couldn't test at all, and thought I was going mad - one day I send
> you that "cg" script, the next day it starts to break, it couldn't
> "mkdir -p /cg/cg", claiming it already existed, wha???  Turns out the
> fix for that has gone into yesterday's mmotm (though I've not had time
> to move on to that yet): uninitialized ret in memcg_propagate_kmem().

Thanks a lot!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
  2014-01-16  8:17               ` Michal Hocko
@ 2014-01-16 15:22                 ` Michal Hocko
  2014-01-16 19:15                   ` Hugh Dickins
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2014-01-16 15:22 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Thu 16-01-14 09:17:38, Michal Hocko wrote:
> On Wed 15-01-14 13:24:34, Hugh Dickins wrote:
> > On Wed, 15 Jan 2014, Michal Hocko wrote:
[...]
> > > From 560924e86059947ab9418732cb329ad149dd8f6a Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mhocko@suse.cz>
> > > Date: Wed, 15 Jan 2014 11:52:09 +0100
> > > Subject: [PATCH] memcg: fix css reference leak from mem_cgroup_iter
> > > 
> > > 19f39402864e (memcg: simplify mem_cgroup_iter) has introduced a css
> > > refrence leak (thus memory leak) because mem_cgroup_iter makes sure it
> > > doesn't put a css reference on the root of the tree walk. The mentioned
> > > commit however dropped the root check when the css reference is taken
> > > while it keept the css_put optimization fora the root in place.
> > > 
> > > This means that css_put is not called and so css along with mem_cgroup
> > > and other cgroup internal object tied by css lifetime are never freed.
> > > 
> > > Fix the issue by reintroducing root check in __mem_cgroup_iter_next.
> > > 
> > > This patch also fixes issue reported by Hugh Dickins when
> > > mem_cgroup_iter might end up in an endless loop because a group which is
> > > under hard limit reclaim is removed in parallel with iteration.
> > > __mem_cgroup_iter_next would always return NULL because css_tryget on
> > > the root (reclaimed memcg) would fail and there are no other memcg in
> > > the hierarchy. prev == NULL in mem_cgroup_iter would prevent break out
> > > from the root and so the while (!memcg) loop would never terminate.
> > > as css_tryget is no longer called for the root of the tree walk this
> > > doesn't happen anymore.
> > > 
> > > Cc: stable@vger.kernel.org # 3.10+
> > > Reported-and-debugged-by: Hugh Dickins <hughd@google.com>
> > 
> > Definitely not debugged by me!  Debugged and understood by you.
> 
> You still have debugged the second part of the problem (endless loop).
> But I will go with whatever tag you like.
> 
> > > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > > ---
> > >  mm/memcontrol.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index f016d26adfd3..dd3974c9f08d 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1078,7 +1078,8 @@ skip_node:
> > >  	 * protected by css_get and the tree walk is rcu safe.
> > >  	 */
> > >  	if (next_css) {
> > > -		if ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))
> > > +		if ((next_css->flags & CSS_ONLINE) &&
> > > +				(next_css == root || css_tryget(next_css)))
> > 
> > Not quite: next_css points to one thing and root to another.
> 
> Dohh, right you are next_css == root->css. I was wondering how I was
> able to see the leak being fixed and then realized that root->css has
> the same address as root...
> Anyway very well spotted.
> 
> > >  			return mem_cgroup_from_css(next_css);
> > >  		else {
> > >  			prev_css = next_css;
> > > -- 
> > 
> > This is how I've re-written that block, and started testing on it;
> > the unnecessary "else {" part was looking increasingly ugly to me
> > (though let loose on it, I might change it all around more...)
> > 
> > 	if (next_css) {
> > 		if ((next_css->flags & CSS_ONLINE) &&
> > 		    (next_css == &root->css || css_tryget(next_css)))
> > 			return mem_cgroup_from_css(next_css);
> > 		prev_css = next_css;
> > 		goto skip_node;
> > 	}
> 
> Yes, that looks better. Maybe put a blank line before prev_css = next_css?
> 
> > Sorry for being so slow to respond, by the way: for a couple of hours
> > I couldn't test at all, and thought I was going mad - one day I send
> > you that "cg" script, the next day it starts to break, it couldn't
> > "mkdir -p /cg/cg", claiming it already existed, wha???  Turns out the
> > fix for that has gone into yesterday's mmotm (though I've not had time
> > to move on to that yet): uninitialized ret in memcg_propagate_kmem().
> 
> Thanks a lot!

What about this? I have incorporated your root->css fix and else removal
+ added a comment to clarify importance of root css refcount exclusion.
I still prefer this going to the stable separately from CSS_ONLINE fix.
---
>From 543df5c82f6eec622f669ea322ba6ff03924fded Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 16 Jan 2014 16:17:13 +0100
Subject: [PATCH] memcg: fix css reference leak from mem_cgroup_iter

19f39402864e (memcg: simplify mem_cgroup_iter) has introduced a css
refrence leak (thus memory leak) because mem_cgroup_iter makes sure it
doesn't put a css reference on the root of the tree walk. The mentioned
commit however dropped the root check when the css reference is taken
while it keept the css_put optimization fora the root in place.

This means that css_put is not called and so css along with mem_cgroup
and other cgroup internal object tied by css lifetime are never freed.

Fix the issue by reintroducing root check in __mem_cgroup_iter_next.

This patch also fixes issue reported by Hugh Dickins when
mem_cgroup_iter might end up in an endless loop because a group which is
under hard limit reclaim is removed in parallel with iteration.
__mem_cgroup_iter_next would always return NULL because css_tryget on
the root (reclaimed memcg) would fail and there are no other memcg in
the hierarchy. prev == NULL in mem_cgroup_iter would prevent break out
from the root and so the while (!memcg) loop would never terminate.
as css_tryget is no longer called for the root of the tree walk this
doesn't happen anymore.

[hughd@google.com: Fixed root vs. root->css fix]
[hughd@google.com: Get rid of else branch because it is ugly]
Cc: stable@vger.kernel.org # 3.10+
<Hugh's-selection>-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f016d26adfd3..969f14d32b30 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1076,14 +1076,22 @@ skip_node:
 	 * skipped and we should continue the tree walk.
 	 * last_visited css is safe to use because it is
 	 * protected by css_get and the tree walk is rcu safe.
+	 *
+	 * We do not take a reference on the root of the tree walk
+	 * because we might race with the root removal when it would
+	 * be the only node in the iterated hierarchy and mem_cgroup_iter
+	 * would end up in an endless loop because it expects that at
+	 * least one valid node will be returned. Root cannot disappear
+	 * because caller of the iterator should hold it already so
+	 * skipping css reference should be safe.
 	 */
 	if (next_css) {
-		if ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))
+		if ((next_css->flags & CSS_ONLINE) &&
+				(next_css == root->css || css_tryget(next_css)))
 			return mem_cgroup_from_css(next_css);
-		else {
-			prev_css = next_css;
-			goto skip_node;
-		}
+
+		prev_css = next_css;
+		goto skip_node;
 	}
 
 	return NULL;
-- 
1.8.5.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
  2014-01-16 15:22                 ` Michal Hocko
@ 2014-01-16 19:15                   ` Hugh Dickins
  2014-01-17 15:41                     ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2014-01-16 19:15 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Thu, 16 Jan 2014, Michal Hocko wrote:
> From 543df5c82f6eec622f669ea322ba6ff03924fded Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Thu, 16 Jan 2014 16:17:13 +0100
> Subject: [PATCH] memcg: fix css reference leak from mem_cgroup_iter
> 
> 19f39402864e (memcg: simplify mem_cgroup_iter) has introduced a css
> refrence leak (thus memory leak) because mem_cgroup_iter makes sure it
> doesn't put a css reference on the root of the tree walk. The mentioned
> commit however dropped the root check when the css reference is taken
> while it keept the css_put optimization fora the root in place.

I don't think that's quite right, actually - and I think it's all
so confusing that we do need to be pedantic and set it down right.

I spent quite a while yesterday trying out my "cg m" on 3.10, 3.11,
3.12 and 3.13-rc8 on this laptop: first just counting mem_cgroup_allocs
and frees (if I could get that far without hanging or crashing), then
also with your patch in (on 3.12 and 3.13-rc8) or the completely
different patch appended at the bottom (on 3.10 and 3.11), checking
for leftover mem_cgroups afterwards.

I saw no evidence of mem_cgroup leakage on 3.10 and 3.11, which had
	/*
	 * Root is not visited by cgroup iterators so it needs an
	 * explicit visit.
	 */
	if (!last_visited)
		return root;
at the head of __mem_cgroup_iter_next(), removed around the same
time as changeover from prev_cgroup etc to prev_css etc in 3.12.

I don't believe 19f39402864e was responsible for a reference leak,
that came later.  But I think it was responsible for the original
endless iteration (shrink_zone going around and around getting root
again and again from mem_cgroup_iter).

But beware of my conclusion, please check for yourself: with my
separate kbuilds in separate /cg/cg/? memcgs, what "cg m" is doing
is very simple and segregated, can hardly be called testing reclaim
iteration, so I hope you have something better to check it.  Plus
I was testing on 3.10 and 3.11 vanilla, not latest stable versions.

(If I'm very honest, I'll admit that I still did not see that hang
on 3.11 vanilla: what I hit was a crash in kfree instead, but the
same patch got rid of that too.  Of course I ought to investigate
further, but at some point I just have to give up and move on,
there's just too much breakage to chase all over the kernel...)

> 
> This means that css_put is not called and so css along with mem_cgroup
> and other cgroup internal object tied by css lifetime are never freed.
> 
> Fix the issue by reintroducing root check in __mem_cgroup_iter_next.
> 
> This patch also fixes issue reported by Hugh Dickins when
> mem_cgroup_iter might end up in an endless loop because a group which is
> under hard limit reclaim is removed in parallel with iteration.
> __mem_cgroup_iter_next would always return NULL because css_tryget on
> the root (reclaimed memcg) would fail and there are no other memcg in
> the hierarchy. prev == NULL in mem_cgroup_iter would prevent break out
> from the root and so the while (!memcg) loop would never terminate.
> as css_tryget is no longer called for the root of the tree walk this
> doesn't happen anymore.
> 
> [hughd@google.com: Fixed root vs. root->css fix]
> [hughd@google.com: Get rid of else branch because it is ugly]

Thanks for your courtesy!  But let's not clutter it with those two.

> <Hugh's-selection>-by: Hugh Dickins <hughd@google.com>

You already credited me above, but "Reported-by:" here if you insist.

> Cc: stable@vger.kernel.org # 3.10+

Well, I'm okay with that, if we use that as a way to shoehorn in the
patch at the bottom instead for 3.10 and 3.11 stables.  Whether that's
an abuse of the stable system... I think not, the patch at the bottom
(though it could be written in a variety of other ways) is what we're
relying on for 3.11 at Google, and the iteration hang it fixes is
equivalent to the one you're fixing here (but a hang repeatedly
calling mem_cgroup_iter morphed into a tighter hang repeatedly
calling __mem_cgroup_iter_next with the 3.12 rewrite, plus leakage).

Or, if you're uncomfortable with the misrepresentation, you could
just say 3.12+; but I think we serve other users of 3.10 and 3.11
best by saying 3.10+ there as you have it.

> Signed-off-by: Michal Hocko <mhocko@suse.cz>

No quarrels with that!

> ---
>  mm/memcontrol.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f016d26adfd3..969f14d32b30 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1076,14 +1076,22 @@ skip_node:
>  	 * skipped and we should continue the tree walk.
>  	 * last_visited css is safe to use because it is
>  	 * protected by css_get and the tree walk is rcu safe.
> +	 *
> +	 * We do not take a reference on the root of the tree walk
> +	 * because we might race with the root removal when it would
> +	 * be the only node in the iterated hierarchy and mem_cgroup_iter
> +	 * would end up in an endless loop because it expects that at
> +	 * least one valid node will be returned. Root cannot disappear
> +	 * because caller of the iterator should hold it already so
> +	 * skipping css reference should be safe.
>  	 */
>  	if (next_css) {
> -		if ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))
> +		if ((next_css->flags & CSS_ONLINE) &&

Well, okay.  It's fine by me to keep the CSS_ONLINE one separate if you
prefer, but since we're not intending that one for -stable (or are we?),
basing this on top of that means that this patch will not apply to stable
and gregkh will ask us to craft a separate version for each release.
That's okay, I just preferred not to revisit this later (any more than
will anyway be necessary for 3.10 and 3.11).

> +				(next_css == root->css || css_tryget(next_css)))
>  			return mem_cgroup_from_css(next_css);
> -		else {
> -			prev_css = next_css;
> -			goto skip_node;
> -		}
> +

Yes, thanks, it's better with the blank line.

> +		prev_css = next_css;
> +		goto skip_node;
>  	}
>  
>  	return NULL;
> -- 
> 1.8.5.2
> 
> -- 
> Michal Hocko
> SUSE Labs

"Equivalent" patch for 3.10 or 3.11: fixing similar hangs but no leakage.

Signed-off-by: Hugh Dickins <hughd@google.com>

--- v3.10/mm/memcontrol.c	2013-06-30 15:13:29.000000000 -0700
+++ linux/mm/memcontrol.c	2014-01-15 18:18:24.476566659 -0800
@@ -1226,7 +1226,8 @@ struct mem_cgroup *mem_cgroup_iter(struc
 			}
 		}
 
-		memcg = __mem_cgroup_iter_next(root, last_visited);
+		if (!prev || last_visited)
+			memcg = __mem_cgroup_iter_next(root, last_visited);
 
 		if (reclaim) {
 			if (last_visited)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
  2014-01-16 19:15                   ` Hugh Dickins
@ 2014-01-17 15:41                     ` Michal Hocko
  2014-01-21  5:16                       ` Hugh Dickins
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2014-01-17 15:41 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Thu 16-01-14 11:15:36, Hugh Dickins wrote:
> On Thu, 16 Jan 2014, Michal Hocko wrote:
> > From 543df5c82f6eec622f669ea322ba6ff03924fded Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.cz>
> > Date: Thu, 16 Jan 2014 16:17:13 +0100
> > Subject: [PATCH] memcg: fix css reference leak from mem_cgroup_iter
> > 
> > 19f39402864e (memcg: simplify mem_cgroup_iter) has introduced a css
> > refrence leak (thus memory leak) because mem_cgroup_iter makes sure it
> > doesn't put a css reference on the root of the tree walk. The mentioned
> > commit however dropped the root check when the css reference is taken
> > while it keept the css_put optimization fora the root in place.
> 
> I don't think that's quite right, actually - and I think it's all
> so confusing that we do need to be pedantic and set it down right.

You are right!

> I spent quite a while yesterday trying out my "cg m" on 3.10, 3.11,

I have done the same now (with a different test - simple mem_eater
with hard_limit really low to trigger reclaim and trace_printk in both
mem_cgroup_css_{alloc,free}) and you are right that 3.10 and 3.11 were
OK regarding the leak. Which is a relief...
3.12 resp. mmotm which I was testing on previously has the leak though.
So there must have been some other escape part which didn't allow
css_tryget on the root.

> 3.12 and 3.13-rc8 on this laptop: first just counting mem_cgroup_allocs
> and frees (if I could get that far without hanging or crashing), then
> also with your patch in (on 3.12 and 3.13-rc8) or the completely
> different patch appended at the bottom (on 3.10 and 3.11), checking
> for leftover mem_cgroups afterwards.
> 
> I saw no evidence of mem_cgroup leakage on 3.10 and 3.11, which had
> 	/*
> 	 * Root is not visited by cgroup iterators so it needs an
> 	 * explicit visit.
> 	 */
> 	if (!last_visited)
> 		return root;
> at the head of __mem_cgroup_iter_next(), removed around the same
> time as changeover from prev_cgroup etc to prev_css etc in 3.12.

Ohh, now I get it. Cgroup iterators originally didn't visit the root and
all the callers had to special case it. Then Tejun changed them to visit
root as well by bd8815a6d802 (cgroup: make css_for_each_descendant() and
friends include the origin css in the iteration) which was a good change
but I didn't realize it would be a problem when I reviewed it. Now it
makes sense again.

> I don't believe 19f39402864e was responsible for a reference leak,
> that came later.  But I think it was responsible for the original
> endless iteration (shrink_zone going around and around getting root
> again and again from mem_cgroup_iter).

So your hang is not within mem_cgroup_iter but you are getting root all
the time without any way out?

[3.10 code base]
shrink_zone
						[rmdir root]
  mem_cgroup_iter(root, NULL, reclaim)
    // prev = NULL
    rcu_read_lock()
    last_visited = iter->last_visited	// gets root || NULL
    css_tryget(last_visited) 		// failed
    last_visited = NULL			[1]
    memcg = root = __mem_cgroup_iter_next(root, NULL)
    iter->last_visited = root;
    reclaim->generation = iter->generation

 mem_cgroup_iter(root, root, reclaim)
   // prev = root
   rcu_read_lock
    last_visited = iter->last_visited	// gets root
    css_tryget(last_visited) 		// failed
    [1]

So we indeed can loop here without any progress. I just fail
to see how my patch could help. We even do not get down to
cgroup_next_descendant_pre.

Or am I missing something?

The following should fix this kind of endless loop:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 194721839cf5..168e5abcca92 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1221,7 +1221,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				smp_rmb();
 				last_visited = iter->last_visited;
 				if (last_visited &&
-				    !css_tryget(&last_visited->css))
+				    last_visited != root &&
+				     !css_tryget(&last_visited->css))
 					last_visited = NULL;
 			}
 		}
@@ -1229,7 +1230,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		memcg = __mem_cgroup_iter_next(root, last_visited);
 
 		if (reclaim) {
-			if (last_visited)
+			if (last_visited && last_visited != root)
 				css_put(&last_visited->css);
 
 			iter->last_visited = memcg;

Not that I like it much :/

> But beware of my conclusion, please check for yourself: with my
> separate kbuilds in separate /cg/cg/? memcgs, what "cg m" is doing
> is very simple and segregated, can hardly be called testing reclaim
> iteration, so I hope you have something better to check it.  Plus
> I was testing on 3.10 and 3.11 vanilla, not latest stable versions.
> 
> (If I'm very honest, I'll admit that I still did not see that hang
> on 3.11 vanilla:

But I assume you can still reproduce it with 3.10, right?
I am sorry but I didn't get to run your script yet.

> what I hit was a crash in kfree instead, but the
> same patch got rid of that too. 

Care to post an oops?

> Of course I ought to investigate
> further, but at some point I just have to give up and move on,
> there's just too much breakage to chase all over the kernel...)
> 
> > This means that css_put is not called and so css along with mem_cgroup
> > and other cgroup internal object tied by css lifetime are never freed.
> > 
> > Fix the issue by reintroducing root check in __mem_cgroup_iter_next.
> > 
> > This patch also fixes issue reported by Hugh Dickins when
> > mem_cgroup_iter might end up in an endless loop because a group which is
> > under hard limit reclaim is removed in parallel with iteration.
> > __mem_cgroup_iter_next would always return NULL because css_tryget on
> > the root (reclaimed memcg) would fail and there are no other memcg in
> > the hierarchy. prev == NULL in mem_cgroup_iter would prevent break out
> > from the root and so the while (!memcg) loop would never terminate.
> > as css_tryget is no longer called for the root of the tree walk this
> > doesn't happen anymore.
> > 
> > [hughd@google.com: Fixed root vs. root->css fix]
> > [hughd@google.com: Get rid of else branch because it is ugly]
> 
> Thanks for your courtesy!  But let's not clutter it with those two.
> 
> > <Hugh's-selection>-by: Hugh Dickins <hughd@google.com>
> 
> You already credited me above, but "Reported-by:" here if you insist.
> 
> > Cc: stable@vger.kernel.org # 3.10+
> 
> Well, I'm okay with that, if we use that as a way to shoehorn in the
> patch at the bottom instead for 3.10 and 3.11 stables.

So far I do not see how it would make a change for those two kernels as
they have the special handling for root.

[...]
> "Equivalent" patch for 3.10 or 3.11: fixing similar hangs but no leakage.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> 
> --- v3.10/mm/memcontrol.c	2013-06-30 15:13:29.000000000 -0700
> +++ linux/mm/memcontrol.c	2014-01-15 18:18:24.476566659 -0800
> @@ -1226,7 +1226,8 @@ struct mem_cgroup *mem_cgroup_iter(struc
>  			}
>  		}
>  
> -		memcg = __mem_cgroup_iter_next(root, last_visited);
> +		if (!prev || last_visited)
> +			memcg = __mem_cgroup_iter_next(root, last_visited);

I am confused. What would change between those two calls to change the
outcome? The function doesn't have any internal state.

>  
>  		if (reclaim) {
>  			if (last_visited)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
  2014-01-17 15:41                     ` Michal Hocko
@ 2014-01-21  5:16                       ` Hugh Dickins
  2014-01-21  8:34                         ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2014-01-21  5:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Greg Thelen, linux-mm, linux-kernel

On Fri, 17 Jan 2014, Michal Hocko wrote:
> On Thu 16-01-14 11:15:36, Hugh Dickins wrote:
> 
> > I don't believe 19f39402864e was responsible for a reference leak,
> > that came later.  But I think it was responsible for the original
> > endless iteration (shrink_zone going around and around getting root
> > again and again from mem_cgroup_iter).
> 
> So your hang is not within mem_cgroup_iter but you are getting root all
> the time without any way out?

In the 3.10 and 3.11 cases, yes.

> 
> [3.10 code base]
> shrink_zone
> 						[rmdir root]
>   mem_cgroup_iter(root, NULL, reclaim)
>     // prev = NULL
>     rcu_read_lock()
>     last_visited = iter->last_visited	// gets root || NULL
>     css_tryget(last_visited) 		// failed
>     last_visited = NULL			[1]
>     memcg = root = __mem_cgroup_iter_next(root, NULL)
>     iter->last_visited = root;
>     reclaim->generation = iter->generation
> 
>  mem_cgroup_iter(root, root, reclaim)
>    // prev = root
>    rcu_read_lock
>     last_visited = iter->last_visited	// gets root
>     css_tryget(last_visited) 		// failed
>     [1]
> 
> So we indeed can loop here without any progress. I just fail
> to see how my patch could help. We even do not get down to
> cgroup_next_descendant_pre.
> 
> Or am I missing something?

Your patch to 3.12 and 3.13 mem_cgroup_iter_next() doesn't help
in 3.10 and 3.11, correct.  That's why I appended a different patch,
to mem_cgroup_iter(), for the 3.10 and 3.11 versions of the hang.

> 
> The following should fix this kind of endless loop:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 194721839cf5..168e5abcca92 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1221,7 +1221,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  				smp_rmb();
>  				last_visited = iter->last_visited;
>  				if (last_visited &&
> -				    !css_tryget(&last_visited->css))
> +				    last_visited != root &&
> +				     !css_tryget(&last_visited->css))
>  					last_visited = NULL;
>  			}
>  		}
> @@ -1229,7 +1230,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  		memcg = __mem_cgroup_iter_next(root, last_visited);
>  
>  		if (reclaim) {
> -			if (last_visited)
> +			if (last_visited && last_visited != root)
>  				css_put(&last_visited->css);
>  
>  			iter->last_visited = memcg;

Right, that appears to fix 3.10, and seems a better alternative to the
patch I suggested.  I say "appears" because my success in reproducing
the hang is variable, so when I see that it's "fixed" I cannot be
quite sure.  I say "seems" because I think yours respects the intention
of the iterator better than mine, but I've never been convinced that
the iterator is as sensible as it intends in the face of races.

At the bottom I've appended the version of yours that I've been trying
on 3.11.  I did succeed in reproducing the hang twice on 3.11.10.3
(which I don't think differs in any essential from 3.11.0 for this issue,
but after my lack of success with 3.11.0 I tried harder with that.)

More so than in the 3.10 case, I haven't really given it long enough
with the patch to really assert that it's good; and Greg Thelen came
across a different reproduction case that I've yet to remind myself
of and try, I'll have to report back to you later in the week when
I've run that with your fix.

> 
> Not that I like it much :/

Well, I'm not in love with it, but I do think it's more appropriate
than mine, if it really does fix the issues.  It was only under
questioning from you that we arrived at the belief that the problem
is with the css_tryget of a root being removed: my patch was vaguer
than that, not identifying the root cause.

I suspect that the underlying problem is actually the "do {} while ()"
nature of the iteration loops, instead of "while () {}"s.  That places
us (not for the first time) in the awkward position of having to supply
something once (and once only) even when it doesn't really fit.

(I have wondered whether making mem_cgroup_invalidate_reclaim_iterators
visit the memcg as well as its parents, might provide another fix; nice
if it did, but I doubt it, and have spent so much time fiddling around
here that I've lost the will to try anything else.)

> 
> > But beware of my conclusion, please check for yourself: with my
> > separate kbuilds in separate /cg/cg/? memcgs, what "cg m" is doing
> > is very simple and segregated, can hardly be called testing reclaim
> > iteration, so I hope you have something better to check it.  Plus
> > I was testing on 3.10 and 3.11 vanilla, not latest stable versions.
> > 
> > (If I'm very honest, I'll admit that I still did not see that hang
> > on 3.11 vanilla:
> 
> But I assume you can still reproduce it with 3.10, right?

Yes, and given subsequent "success" with 3.11.10.3 (after several hours),
I expect I would manage to reproduce it with 3.11 if allowed enough time.

> I am sorry but I didn't get to run your script yet.
> 
> > what I hit was a crash in kfree instead, but the
> > same patch got rid of that too. 
> 
> Care to post an oops?

I didn't collect one at the time, and haven't seen it since.
It was an oops within __kfree() called from __mem_cgroup_free(),
but whether of the "struct mem_cgroup" or one of its ancillary
structures I cannot say.  I've chosen to believe that it wasn't
actually a memcg problem, but something caused by unrelated code
sharing the same kmalloc slab.

...

> > 
> > > Cc: stable@vger.kernel.org # 3.10+
> > 
> > Well, I'm okay with that, if we use that as a way to shoehorn in the
> > patch at the bottom instead for 3.10 and 3.11 stables.
> 
> So far I do not see how it would make a change for those two kernels as
> they have the special handling for root.

That was my point: that patch does not fix 3.10 and 3.11 at all,
but they suffer from the same problem (manifesting in a slightly
different way, the hang revisiting mem_cgroup_iter repeatedly instead
of being trapped inside it); so it may not be inappropriate to say 3.10+
even though that particular patch will not apply and would not fix them.

> 
> [...]
> > "Equivalent" patch for 3.10 or 3.11: fixing similar hangs but no leakage.
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > 
> > --- v3.10/mm/memcontrol.c	2013-06-30 15:13:29.000000000 -0700
> > +++ linux/mm/memcontrol.c	2014-01-15 18:18:24.476566659 -0800
> > @@ -1226,7 +1226,8 @@ struct mem_cgroup *mem_cgroup_iter(struc
> >  			}
> >  		}
> >  
> > -		memcg = __mem_cgroup_iter_next(root, last_visited);
> > +		if (!prev || last_visited)
> > +			memcg = __mem_cgroup_iter_next(root, last_visited);
> 
> I am confused. What would change between those two calls to change the
> outcome? The function doesn't have any internal state.

I don't understand your question (what two calls?).  The 3.10 or 3.11
__mem_cgroup_iter_next() begins with "if (!last_visited) return root;",
which was problematic because again and again it would return root.
Originally I passed in prev, and returned NULL instead of root if prev
but !last_visited; but I've an aversion to passing a function an extra
argument to say it shouldn't have been called, so in this version I'm
testing !prev || last_visited before calling it.  Perhaps your "two
calls" are the first with prev == NULL and the second with prev == root.

But I say I prefer your fix because mine above says nothing about root,
which we now believe is the only problematic case.  Mine would leave
memcg NULL whenever a change resets last_visited to NULL (once one memcg
has been delivered): which is simple, but not what the iterator intends
(if I read it right, it wants to start again from the beginning, whereas
I'm hastening it to the end).  In practice mine works well, and I haven't
seen the premature OOMs that you might suppose it leads to; but let's go
for yours as more in keeping with the spirit of the iterator.

"The spirit of the iterator", now that's a fine phrase.

Here's my 3.11 version of your 3.10, in case you spot something silly.
I'll give it a try on Greg's testcase in coming days and report back.
(Greg did suggest a different fix from mine back when he hit the issue,
I'll also look that one out again in case it offers something better.)

--- v3.11/mm/memcontrol.c	2014-01-19 14:16:38.656701990 -0800
+++ linux/mm/memcontrol.c	2014-01-20 19:04:50.635637615 -0800
@@ -1148,19 +1148,17 @@ mem_cgroup_iter_load(struct mem_cgroup_r
 	if (iter->last_dead_count == *sequence) {
 		smp_rmb();
 		position = iter->last_visited;
-		if (position && !css_tryget(&position->css))
+		if (position && position != root &&
+		    !css_tryget(&position->css))
 			position = NULL;
 	}
 	return position;
 }
 
 static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter,
-				   struct mem_cgroup *last_visited,
 				   struct mem_cgroup *new_position,
 				   int sequence)
 {
-	if (last_visited)
-		css_put(&last_visited->css);
 	/*
 	 * We store the sequence count from the time @last_visited was
 	 * loaded successfully instead of rereading it here so that we
@@ -1234,7 +1232,10 @@ struct mem_cgroup *mem_cgroup_iter(struc
 		memcg = __mem_cgroup_iter_next(root, last_visited);
 
 		if (reclaim) {
-			mem_cgroup_iter_update(iter, last_visited, memcg, seq);
+			if (last_visited && last_visited != root)
+				css_put(&last_visited->css);
+
+			mem_cgroup_iter_update(iter, memcg, seq);
 
 			if (!memcg)
 				iter->generation++;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] mm/memcg: fix endless iteration in reclaim
  2014-01-21  5:16                       ` Hugh Dickins
@ 2014-01-21  8:34                         ` Michal Hocko
  2014-01-21 10:45                           ` [PATCH -mm 1/2] memcg: fix endless loop caused by mem_cgroup_iter Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2014-01-21  8:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, Andrew Morton, Greg Thelen, linux-mm, linux-kernel

On Mon 20-01-14 21:16:36, Hugh Dickins wrote:
> On Fri, 17 Jan 2014, Michal Hocko wrote:
> > On Thu 16-01-14 11:15:36, Hugh Dickins wrote:
> > 
> > > I don't believe 19f39402864e was responsible for a reference leak,
> > > that came later.  But I think it was responsible for the original
> > > endless iteration (shrink_zone going around and around getting root
> > > again and again from mem_cgroup_iter).
> > 
> > So your hang is not within mem_cgroup_iter but you are getting root all
> > the time without any way out?
> 
> In the 3.10 and 3.11 cases, yes.

OK, that makes sense.
 
> > [3.10 code base]
> > shrink_zone
> > 						[rmdir root]
> >   mem_cgroup_iter(root, NULL, reclaim)
> >     // prev = NULL
> >     rcu_read_lock()
> >     last_visited = iter->last_visited	// gets root || NULL
> >     css_tryget(last_visited) 		// failed
> >     last_visited = NULL			[1]
> >     memcg = root = __mem_cgroup_iter_next(root, NULL)
> >     iter->last_visited = root;
> >     reclaim->generation = iter->generation
> > 
> >  mem_cgroup_iter(root, root, reclaim)
> >    // prev = root
> >    rcu_read_lock
> >     last_visited = iter->last_visited	// gets root
> >     css_tryget(last_visited) 		// failed
> >     [1]
> > 
> > So we indeed can loop here without any progress. I just fail
> > to see how my patch could help. We even do not get down to
> > cgroup_next_descendant_pre.
> > 
> > Or am I missing something?
> 
> Your patch to 3.12 and 3.13 mem_cgroup_iter_next() doesn't help
> in 3.10 and 3.11, correct.  That's why I appended a different patch,
> to mem_cgroup_iter(), for the 3.10 and 3.11 versions of the hang.
> 
> > 
> > The following should fix this kind of endless loop:
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 194721839cf5..168e5abcca92 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1221,7 +1221,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> >  				smp_rmb();
> >  				last_visited = iter->last_visited;
> >  				if (last_visited &&
> > -				    !css_tryget(&last_visited->css))
> > +				    last_visited != root &&
> > +				     !css_tryget(&last_visited->css))
> >  					last_visited = NULL;
> >  			}
> >  		}
> > @@ -1229,7 +1230,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> >  		memcg = __mem_cgroup_iter_next(root, last_visited);
> >  
> >  		if (reclaim) {
> > -			if (last_visited)
> > +			if (last_visited && last_visited != root)
> >  				css_put(&last_visited->css);
> >  
> >  			iter->last_visited = memcg;
> 
> Right, that appears to fix 3.10, and seems a better alternative to the
> patch I suggested.  I say "appears" because my success in reproducing
> the hang is variable, so when I see that it's "fixed" I cannot be
> quite sure. 

Understood.

> I say "seems" because I think yours respects the intention
> of the iterator better than mine, but I've never been convinced that
> the iterator is as sensible as it intends in the face of races.
> 
> At the bottom I've appended the version of yours that I've been trying
> on 3.11.  I did succeed in reproducing the hang twice on 3.11.10.3
> (which I don't think differs in any essential from 3.11.0 for this issue,
> but after my lack of success with 3.11.0 I tried harder with that.)

git log points only at 3 patches in mm/memcontrol.c and all of them seem
unrelated.

> More so than in the 3.10 case, I haven't really given it long enough
> with the patch to really assert that it's good; and Greg Thelen came
> across a different reproduction case that I've yet to remind myself
> of and try, I'll have to report back to you later in the week when
> I've run that with your fix.

Great, thanks a lot for your testing. It is really appreciated
especially now that I am quite busy with other internal stuff.

> > Not that I like it much :/
> 
> Well, I'm not in love with it, but I do think it's more appropriate
> than mine, if it really does fix the issues.

It fixes a potential endless loop. It is a question it is the one you
are seeing.

> It was only under questioning from you that we arrived at the belief
> that the problem is with the css_tryget of a root being removed: my
> patch was vaguer than that, not identifying the root cause.
> 
> I suspect that the underlying problem is actually the "do {} while ()"
> nature of the iteration loops, instead of "while () {}"s. 

I think the outside caller shouldn't care much. The iterator code has to
make sure that it doesn't loop itself. Doing while () {} has some issues
as well. Having a reason to reclaim but hen do not reclaim anything
might pop out as an issue upper in the calling stack.

> That places us (not for the first time) in the awkward position of
> having to supply something once (and once only) even when it doesn't
> really fit.
>
> (I have wondered whether making mem_cgroup_invalidate_reclaim_iterators
> visit the memcg as well as its parents, might provide another fix; nice
> if it did, but I doubt it, and have spent so much time fiddling around
> here that I've lost the will to try anything else.)

I do not see it as an easier alternative.

[...]
> > > > Cc: stable@vger.kernel.org # 3.10+
> > > 
> > > Well, I'm okay with that, if we use that as a way to shoehorn in the
> > > patch at the bottom instead for 3.10 and 3.11 stables.
> > 
> > So far I do not see how it would make a change for those two kernels as
> > they have the special handling for root.
> 
> That was my point: that patch does not fix 3.10 and 3.11 at all,
> but they suffer from the same problem (manifesting in a slightly
> different way, the hang revisiting mem_cgroup_iter repeatedly instead
> of being trapped inside it); so it may not be inappropriate to say 3.10+
> even though that particular patch will not apply and would not fix them.

OK, understood now. I will repost that patch with updated changelog
later.
 
> > [...]
> > > "Equivalent" patch for 3.10 or 3.11: fixing similar hangs but no leakage.
> > > 
> > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > 
> > > --- v3.10/mm/memcontrol.c	2013-06-30 15:13:29.000000000 -0700
> > > +++ linux/mm/memcontrol.c	2014-01-15 18:18:24.476566659 -0800
> > > @@ -1226,7 +1226,8 @@ struct mem_cgroup *mem_cgroup_iter(struc
> > >  			}
> > >  		}
> > >  
> > > -		memcg = __mem_cgroup_iter_next(root, last_visited);
> > > +		if (!prev || last_visited)
> > > +			memcg = __mem_cgroup_iter_next(root, last_visited);
> > 
> > I am confused. What would change between those two calls to change the
> > outcome? The function doesn't have any internal state.
> 
> I don't understand your question (what two calls?).

OK, it was my selective blindness that stroke again here. Sorry about
the confusion.

With fresh eyes. Yes it would work as well.

> The 3.10 or 3.11
> __mem_cgroup_iter_next() begins with "if (!last_visited) return root;",
> which was problematic because again and again it would return root.
> Originally I passed in prev, and returned NULL instead of root if prev
> but !last_visited; but I've an aversion to passing a function an extra
> argument to say it shouldn't have been called, so in this version I'm
> testing !prev || last_visited before calling it.  Perhaps your "two
> calls" are the first with prev == NULL and the second with prev == root.
> 
> But I say I prefer your fix because mine above says nothing about root,
> which we now believe is the only problematic case.  Mine would leave
> memcg NULL whenever a change resets last_visited to NULL (once one memcg
> has been delivered): which is simple, but not what the iterator intends
> (if I read it right, it wants to start again from the beginning, whereas
> I'm hastening it to the end).  In practice mine works well, and I haven't
> seen the premature OOMs that you might suppose it leads to; but let's go
> for yours as more in keeping with the spirit of the iterator.

OK, let's keep it consistently ugly.

> "The spirit of the iterator", now that's a fine phrase.

:)

> Here's my 3.11 version of your 3.10, in case you spot something silly.
> I'll give it a try on Greg's testcase in coming days and report back.
> (Greg did suggest a different fix from mine back when he hit the issue,
> I'll also look that one out again in case it offers something better.)
> 
> --- v3.11/mm/memcontrol.c	2014-01-19 14:16:38.656701990 -0800
> +++ linux/mm/memcontrol.c	2014-01-20 19:04:50.635637615 -0800
> @@ -1148,19 +1148,17 @@ mem_cgroup_iter_load(struct mem_cgroup_r
>  	if (iter->last_dead_count == *sequence) {
>  		smp_rmb();
>  		position = iter->last_visited;
> -		if (position && !css_tryget(&position->css))
> +		if (position && position != root &&
> +		    !css_tryget(&position->css))
>  			position = NULL;
>  	}
>  	return position;
>  }
>  
>  static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter,
> -				   struct mem_cgroup *last_visited,
>  				   struct mem_cgroup *new_position,
>  				   int sequence)
>  {
> -	if (last_visited)
> -		css_put(&last_visited->css);
>  	/*
>  	 * We store the sequence count from the time @last_visited was
>  	 * loaded successfully instead of rereading it here so that we
> @@ -1234,7 +1232,10 @@ struct mem_cgroup *mem_cgroup_iter(struc
>  		memcg = __mem_cgroup_iter_next(root, last_visited);
>  
>  		if (reclaim) {
> -			mem_cgroup_iter_update(iter, last_visited, memcg, seq);
> +			if (last_visited && last_visited != root)
> +				css_put(&last_visited->css);
> +
> +			mem_cgroup_iter_update(iter, memcg, seq);
>  
>  			if (!memcg)
>  				iter->generation++;

Yes it looks good. Although I would probably go and add root into
mem_cgroup_iter_update and do the check and css_put there to have
it symmetric with mem_cgroup_iter_load. I will cook up a changelog for
this one as well (for both 3.10 and 3.11 because they share fail on root
case).

Thanks a lot!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH -mm 1/2] memcg: fix endless loop caused by mem_cgroup_iter
  2014-01-21  8:34                         ` Michal Hocko
@ 2014-01-21 10:45                           ` Michal Hocko
  2014-01-21 10:45                             ` [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2014-01-21 10:45 UTC (permalink / raw)
  To: Hugh Dickins, Johannes Weiner; +Cc: Andrew Morton, Greg Thelen, linux-mm, LKML

Hugh has reported an endless loop when the hardlimit reclaim sees the
same group all the time. This might happen when the reclaim races with
the memcg removal.

shrink_zone
                                                [rmdir root]
  mem_cgroup_iter(root, NULL, reclaim)
    // prev = NULL
    rcu_read_lock()
    mem_cgroup_iter_load
      last_visited = iter->last_visited   // gets root || NULL
      css_tryget(last_visited)            // failed
      last_visited = NULL                 [1]
    memcg = root = __mem_cgroup_iter_next(root, NULL)
    mem_cgroup_iter_update
      iter->last_visited = root;
    reclaim->generation = iter->generation

 mem_cgroup_iter(root, root, reclaim)
   // prev = root
   rcu_read_lock
    mem_cgroup_iter_load
      last_visited = iter->last_visited   // gets root
      css_tryget(last_visited)            // failed
    [1]

The issue seemed to be introduced by 5f5781619718 (memcg: relax memcg
iter caching) which has replaced unconditional css_get/css_put by
css_tryget/css_put for the cached iterator.

This patch fixes the issue by skipping css_tryget on the root of the
tree walk in mem_cgroup_iter_load and symmetrically doesn't release it
in mem_cgroup_iter_update.

Stable: stable@vger.kernel.org # 3.10+
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f016d26adfd3..45786dc129dc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1117,7 +1117,15 @@ mem_cgroup_iter_load(struct mem_cgroup_reclaim_iter *iter,
 	if (iter->last_dead_count == *sequence) {
 		smp_rmb();
 		position = iter->last_visited;
-		if (position && !css_tryget(&position->css))
+
+		/*
+		 * We cannot take a reference to root because we might race
+		 * with root removal and returning NULL would end up in
+		 * an endless loop on the iterator user level when root
+		 * would be returned all the time.
+		 */
+		if (position && position != root &&
+				!css_tryget(&position->css))
 			position = NULL;
 	}
 	return position;
@@ -1126,9 +1134,11 @@ mem_cgroup_iter_load(struct mem_cgroup_reclaim_iter *iter,
 static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter,
 				   struct mem_cgroup *last_visited,
 				   struct mem_cgroup *new_position,
+				   struct mem_cgroup *root,
 				   int sequence)
 {
-	if (last_visited)
+	/* root reference counting symmetric to mem_cgroup_iter_load */
+	if (last_visited && last_visited != root)
 		css_put(&last_visited->css);
 	/*
 	 * We store the sequence count from the time @last_visited was
@@ -1203,7 +1213,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		memcg = __mem_cgroup_iter_next(root, last_visited);
 
 		if (reclaim) {
-			mem_cgroup_iter_update(iter, last_visited, memcg, seq);
+			mem_cgroup_iter_update(iter, last_visited, memcg, root,
+					seq);
 
 			if (!memcg)
 				iter->generation++;
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
  2014-01-21 10:45                           ` [PATCH -mm 1/2] memcg: fix endless loop caused by mem_cgroup_iter Michal Hocko
@ 2014-01-21 10:45                             ` Michal Hocko
  2014-01-21 19:42                               ` Andrew Morton
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2014-01-21 10:45 UTC (permalink / raw)
  To: Hugh Dickins, Johannes Weiner; +Cc: Andrew Morton, Greg Thelen, linux-mm, LKML

19f39402864e (memcg: simplify mem_cgroup_iter) has reorganized
mem_cgroup_iter code in order to simplify it. A part of that change was
dropping an optimization which didn't call css_tryget on the root of
the walked tree. The patch however didn't change the css_put part in
mem_cgroup_iter which excludes root.
This wasn't an issue at the time because __mem_cgroup_iter_next bailed
out for root early without taking a reference as cgroup iterators
(css_next_descendant_pre) didn't visit root themselves.

Nevertheless cgroup iterators have been reworked to visit root by
bd8815a6d802 (cgroup: make css_for_each_descendant() and friends include
the origin css in the iteration) when the root bypass have been dropped
in __mem_cgroup_iter_next. This means that css_put is not called for
root and so css along with mem_cgroup and other cgroup internal object
tied by css lifetime are never freed.

Fix the issue by reintroducing root check in __mem_cgroup_iter_next
and do not take css reference for it.

This reference counting magic protects us also from another issue, an
endless loop reported by Hugh Dickins when reclaim races with root
removal and css_tryget called by iterator internally would fail. There
would be no other nodes to visit so __mem_cgroup_iter_next would return
NULL and mem_cgroup_iter would interpret it as "start looping from root
again" and so mem_cgroup_iter would loop forever internally.

Cc: stable@vger.kernel.org # mem_leak part 3.12+
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 45786dc129dc..55bb1f8c6907 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1076,14 +1076,22 @@ skip_node:
 	 * skipped and we should continue the tree walk.
 	 * last_visited css is safe to use because it is
 	 * protected by css_get and the tree walk is rcu safe.
+	 *
+	 * We do not take a reference on the root of the tree walk
+	 * because we might race with the root removal when it would
+	 * be the only node in the iterated hierarchy and mem_cgroup_iter
+	 * would end up in an endless loop because it expects that at
+	 * least one valid node will be returned. Root cannot disappear
+	 * because caller of the iterator should hold it already so
+	 * skipping css reference should be safe.
 	 */
 	if (next_css) {
-		if ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))
+		if ((next_css->flags & CSS_ONLINE) &&
+				(next_css == &root->css || css_tryget(next_css)))
 			return mem_cgroup_from_css(next_css);
-		else {
-			prev_css = next_css;
-			goto skip_node;
-		}
+
+		prev_css = next_css;
+		goto skip_node;
 	}
 
 	return NULL;
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
  2014-01-21 10:45                             ` [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter Michal Hocko
@ 2014-01-21 19:42                               ` Andrew Morton
  2014-01-21 21:18                                 ` Hugh Dickins
  2014-01-22  8:12                                 ` Michal Hocko
  0 siblings, 2 replies; 30+ messages in thread
From: Andrew Morton @ 2014-01-21 19:42 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Hugh Dickins, Johannes Weiner, Greg Thelen, linux-mm, LKML

On Tue, 21 Jan 2014 11:45:43 +0100 Michal Hocko <mhocko@suse.cz> wrote:

> 19f39402864e (memcg: simplify mem_cgroup_iter) has reorganized
> mem_cgroup_iter code in order to simplify it. A part of that change was
> dropping an optimization which didn't call css_tryget on the root of
> the walked tree. The patch however didn't change the css_put part in
> mem_cgroup_iter which excludes root.
> This wasn't an issue at the time because __mem_cgroup_iter_next bailed
> out for root early without taking a reference as cgroup iterators
> (css_next_descendant_pre) didn't visit root themselves.
> 
> Nevertheless cgroup iterators have been reworked to visit root by
> bd8815a6d802 (cgroup: make css_for_each_descendant() and friends include
> the origin css in the iteration) when the root bypass have been dropped
> in __mem_cgroup_iter_next. This means that css_put is not called for
> root and so css along with mem_cgroup and other cgroup internal object
> tied by css lifetime are never freed.
> 
> Fix the issue by reintroducing root check in __mem_cgroup_iter_next
> and do not take css reference for it.
> 
> This reference counting magic protects us also from another issue, an
> endless loop reported by Hugh Dickins when reclaim races with root
> removal and css_tryget called by iterator internally would fail. There
> would be no other nodes to visit so __mem_cgroup_iter_next would return
> NULL and mem_cgroup_iter would interpret it as "start looping from root
> again" and so mem_cgroup_iter would loop forever internally.

I grabbed these two patches but I will sit on them for a week or so,
pending review-n-test.

> Cc: stable@vger.kernel.org # mem_leak part 3.12+

What does this mean?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
  2014-01-21 19:42                               ` Andrew Morton
@ 2014-01-21 21:18                                 ` Hugh Dickins
  2014-01-22  8:27                                   ` Michal Hocko
  2014-01-22  8:12                                 ` Michal Hocko
  1 sibling, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2014-01-21 21:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, Johannes Weiner, Greg Thelen, linux-mm, LKML

On Tue, 21 Jan 2014, Andrew Morton wrote:
> On Tue, 21 Jan 2014 11:45:43 +0100 Michal Hocko <mhocko@suse.cz> wrote:
> 
> > 19f39402864e (memcg: simplify mem_cgroup_iter) has reorganized
> > mem_cgroup_iter code in order to simplify it. A part of that change was
> > dropping an optimization which didn't call css_tryget on the root of
> > the walked tree. The patch however didn't change the css_put part in
> > mem_cgroup_iter which excludes root.
> > This wasn't an issue at the time because __mem_cgroup_iter_next bailed
> > out for root early without taking a reference as cgroup iterators
> > (css_next_descendant_pre) didn't visit root themselves.
> > 
> > Nevertheless cgroup iterators have been reworked to visit root by
> > bd8815a6d802 (cgroup: make css_for_each_descendant() and friends include
> > the origin css in the iteration) when the root bypass have been dropped
> > in __mem_cgroup_iter_next. This means that css_put is not called for
> > root and so css along with mem_cgroup and other cgroup internal object
> > tied by css lifetime are never freed.
> > 
> > Fix the issue by reintroducing root check in __mem_cgroup_iter_next
> > and do not take css reference for it.
> > 
> > This reference counting magic protects us also from another issue, an
> > endless loop reported by Hugh Dickins when reclaim races with root
> > removal and css_tryget called by iterator internally would fail. There
> > would be no other nodes to visit so __mem_cgroup_iter_next would return
> > NULL and mem_cgroup_iter would interpret it as "start looping from root
> > again" and so mem_cgroup_iter would loop forever internally.
> 
> I grabbed these two patches but I will sit on them for a week or so,
> pending review-n-test.

Thank you, yes, I'm about to give them more testing.

> 
> > Cc: stable@vger.kernel.org # mem_leak part 3.12+
> 
> What does this mean?

It's certainly a confusing comment.

I suggest just deleting the "mem_leak part ": Michal isn't referring to
any two parts of the patch itself, but to parts of his commit comment;
but it's still unclear what he's claiming.

We do have a confusing situation.  The hang goes back to 3.10 but takes
two different forms, because of intervening changes: in 3.10 and 3.11
mem_cgroup_iter repeatedly returns root memcg to its caller, in 3.12 and
3.13 mem_cgroup_iter repeatedly gets NULL memcg from mem_cgroup_iter_next
and cannot return to its caller.

Patch 1/2 is what's needed to fix 3.10 and 3.11 (and applies correctly
to 3.11, but will have to be rediffed for 3.10 because of rearrangement
in between).  Patch 2/2 is what's needed to fix 3.12 and 3.13 (but applies
correctly to neither of them because it's diffed on top of my CSS_ONLINE
fix).  Patch 1/2 is correct but unnecessary in 3.12 and 3.13: I'm unclear
whether Michal is claiming that it would also fix the hang in 3.12 and
3.13 if we didn't have 2/2: I doubt that, and haven't tested that.

Given how Michal has diffed this patch on top of my CSS_ONLINE one
(mm-memcg-iteration-skip-memcgs-not-yet-fully-initialized.patch),
it would be helpful if you could mark that one also for stable 3.12+,
to save us from having to rediff this one for stable.  We don't have
a concrete example of a problem it solves in the vanilla kernel, but
it makes more sense to include it than to exclude it.

(You would be right to point out that the CSS_ONLINE one fixes
something that goes back to 3.10: I'm saying 3.12+ because I'm not
motivated to rediff it for 3.10 and 3.11 when there's nothing to
go on top; but that's not a very good reason to lie - overrule me.)

Hugh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
  2014-01-21 19:42                               ` Andrew Morton
  2014-01-21 21:18                                 ` Hugh Dickins
@ 2014-01-22  8:12                                 ` Michal Hocko
  1 sibling, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2014-01-22  8:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, Johannes Weiner, Greg Thelen, linux-mm, LKML

On Tue 21-01-14 11:42:19, Andrew Morton wrote:
> On Tue, 21 Jan 2014 11:45:43 +0100 Michal Hocko <mhocko@suse.cz> wrote:
> 
> > 19f39402864e (memcg: simplify mem_cgroup_iter) has reorganized
> > mem_cgroup_iter code in order to simplify it. A part of that change was
> > dropping an optimization which didn't call css_tryget on the root of
> > the walked tree. The patch however didn't change the css_put part in
> > mem_cgroup_iter which excludes root.
> > This wasn't an issue at the time because __mem_cgroup_iter_next bailed
> > out for root early without taking a reference as cgroup iterators
> > (css_next_descendant_pre) didn't visit root themselves.
> > 
> > Nevertheless cgroup iterators have been reworked to visit root by
> > bd8815a6d802 (cgroup: make css_for_each_descendant() and friends include
> > the origin css in the iteration) when the root bypass have been dropped
> > in __mem_cgroup_iter_next. This means that css_put is not called for
> > root and so css along with mem_cgroup and other cgroup internal object
> > tied by css lifetime are never freed.
> > 
> > Fix the issue by reintroducing root check in __mem_cgroup_iter_next
> > and do not take css reference for it.
> > 
> > This reference counting magic protects us also from another issue, an
> > endless loop reported by Hugh Dickins when reclaim races with root
> > removal and css_tryget called by iterator internally would fail. There
> > would be no other nodes to visit so __mem_cgroup_iter_next would return
> > NULL and mem_cgroup_iter would interpret it as "start looping from root
> > again" and so mem_cgroup_iter would loop forever internally.
> 
> I grabbed these two patches but I will sit on them for a week or so,
> pending review-n-test.

Yes, there is no rush and this needs a proper review.

> > Cc: stable@vger.kernel.org # mem_leak part 3.12+
> 
> What does this mean?

Dohh. I had both patches in one but then I decided to split it. This is
just left over. Should be 3.12+.

Sorry about the confusion.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
  2014-01-21 21:18                                 ` Hugh Dickins
@ 2014-01-22  8:27                                   ` Michal Hocko
  2014-01-23 10:42                                     ` Hugh Dickins
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2014-01-22  8:27 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Johannes Weiner, Greg Thelen, linux-mm, LKML

On Tue 21-01-14 13:18:42, Hugh Dickins wrote:
[...]
> We do have a confusing situation.  The hang goes back to 3.10 but takes
> two different forms, because of intervening changes: in 3.10 and 3.11
> mem_cgroup_iter repeatedly returns root memcg to its caller, in 3.12 and
> 3.13 mem_cgroup_iter repeatedly gets NULL memcg from mem_cgroup_iter_next
> and cannot return to its caller.
> 
> Patch 1/2 is what's needed to fix 3.10 and 3.11 (and applies correctly
> to 3.11, but will have to be rediffed for 3.10 because of rearrangement
> in between). 

I will backport it when it reaches stable queue.

> Patch 2/2 is what's needed to fix 3.12 and 3.13 (but applies
> correctly to neither of them because it's diffed on top of my CSS_ONLINE
> fix).  Patch 1/2 is correct but unnecessary in 3.12 and 3.13: I'm unclear
> whether Michal is claiming that it would also fix the hang in 3.12 and
> 3.13 if we didn't have 2/2: I doubt that, and haven't tested that.

Actually both patches are needed. If we had only 2/2 then we wouldn't
endless loop inside mem_cgroup_iter but we could still return root to
caller all the time because mem_cgroup_iter_load would return NULL on
css_tryget failure on the cached root. Or am I missing something that
would prevent that?

> Given how Michal has diffed this patch on top of my CSS_ONLINE one
> (mm-memcg-iteration-skip-memcgs-not-yet-fully-initialized.patch),
> it would be helpful if you could mark that one also for stable 3.12+,
> to save us from having to rediff this one for stable.  We don't have
> a concrete example of a problem it solves in the vanilla kernel, but
> it makes more sense to include it than to exclude it.

Yes, I think it makes sense to queue it for 3.12+ as well because it is
non intrusive and potential issues would be really subtle.

> (You would be right to point out that the CSS_ONLINE one fixes
> something that goes back to 3.10: I'm saying 3.12+ because I'm not
> motivated to rediff it for 3.10 and 3.11 when there's nothing to
> go on top; but that's not a very good reason to lie - overrule me.)
> 
> Hugh

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
  2014-01-22  8:27                                   ` Michal Hocko
@ 2014-01-23 10:42                                     ` Hugh Dickins
  2014-01-23 11:09                                       ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2014-01-23 10:42 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Johannes Weiner, Greg Thelen, linux-mm, LKML

On Wed, 22 Jan 2014, Michal Hocko wrote:
> On Tue 21-01-14 13:18:42, Hugh Dickins wrote:
> [...]
> > We do have a confusing situation.  The hang goes back to 3.10 but takes
> > two different forms, because of intervening changes: in 3.10 and 3.11
> > mem_cgroup_iter repeatedly returns root memcg to its caller, in 3.12 and
> > 3.13 mem_cgroup_iter repeatedly gets NULL memcg from mem_cgroup_iter_next
> > and cannot return to its caller.
> > 
> > Patch 1/2 is what's needed to fix 3.10 and 3.11 (and applies correctly
> > to 3.11, but will have to be rediffed for 3.10 because of rearrangement
> > in between). 
> 
> I will backport it when it reaches stable queue.

Thank you.

Testing, at home and at work, has confirmed your two patches are good.
Greg's particular test on 3.11 gave convincing results, and I was helped
by Linus's tree of last night (df32e43a54d0) happening to be quite quick
to reproduce the issue on my laptop.

> 
> > Patch 2/2 is what's needed to fix 3.12 and 3.13 (but applies
> > correctly to neither of them because it's diffed on top of my CSS_ONLINE
> > fix).  Patch 1/2 is correct but unnecessary in 3.12 and 3.13: I'm unclear
> > whether Michal is claiming that it would also fix the hang in 3.12 and
> > 3.13 if we didn't have 2/2: I doubt that, and haven't tested that.
> 
> Actually both patches are needed. If we had only 2/2 then we wouldn't
> endless loop inside mem_cgroup_iter but we could still return root to
> caller all the time because mem_cgroup_iter_load would return NULL on
> css_tryget failure on the cached root. Or am I missing something that
> would prevent that?

In theory I agree with you; and if you're missing something, I can't see
it either.  But in practice, all my earlier testing of 3.12 and 3.13 was
just with 2/2, and I've tried without your 1/2 since: whereas I have hung
on 3.12 and 3.13 a convincing number of times without 2/2, I have never
hung on them with 2/2 without 1/2.  In practice 1/2 appears essential
for 3.10 and 3.11, but irrelevant for 3.12 and 3.13.

That could be easy to explain if there were a difference at the calling
end, shrink_zone(), between those releases: but I don't see that.  Odd.
Either we're both missing something, or my testing is even less reliable
than I'd thought.  But since I certainly don't dispute 1/2, it is merely
academic.  Though still bothersome.

> 
> > Given how Michal has diffed this patch on top of my CSS_ONLINE one
> > (mm-memcg-iteration-skip-memcgs-not-yet-fully-initialized.patch),
> > it would be helpful if you could mark that one also for stable 3.12+,
> > to save us from having to rediff this one for stable.  We don't have
> > a concrete example of a problem it solves in the vanilla kernel, but
> > it makes more sense to include it than to exclude it.
> 
> Yes, I think it makes sense to queue it for 3.12+ as well because it is
> non intrusive and potential issues would be really subtle.

Before Andrew sends these all off to Linus, I should admit that there's
probably a refinement still to come to the CSS_ONLINE one.  I'm ashamed
to admit that I overlooked a much earlier comment from Greg Thelen, who
suggested that a memory barrier might be required.  I think he's right,
and I'd have liked to say exactly what and where before answering you
now; but barriers are tricky elusive things, I've not yet decided,
and better report back to you now on the testing result.

Hugh

> 
> > (You would be right to point out that the CSS_ONLINE one fixes
> > something that goes back to 3.10: I'm saying 3.12+ because I'm not
> > motivated to rediff it for 3.10 and 3.11 when there's nothing to
> > go on top; but that's not a very good reason to lie - overrule me.)
> > 
> > Hugh
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
  2014-01-23 10:42                                     ` Hugh Dickins
@ 2014-01-23 11:09                                       ` Michal Hocko
  2014-01-23 12:53                                         ` Hugh Dickins
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2014-01-23 11:09 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Johannes Weiner, Greg Thelen, linux-mm, LKML

On Thu 23-01-14 02:42:58, Hugh Dickins wrote:
> On Wed, 22 Jan 2014, Michal Hocko wrote:
> > On Tue 21-01-14 13:18:42, Hugh Dickins wrote:
> > [...]
> > > We do have a confusing situation.  The hang goes back to 3.10 but takes
> > > two different forms, because of intervening changes: in 3.10 and 3.11
> > > mem_cgroup_iter repeatedly returns root memcg to its caller, in 3.12 and
> > > 3.13 mem_cgroup_iter repeatedly gets NULL memcg from mem_cgroup_iter_next
> > > and cannot return to its caller.
> > > 
> > > Patch 1/2 is what's needed to fix 3.10 and 3.11 (and applies correctly
> > > to 3.11, but will have to be rediffed for 3.10 because of rearrangement
> > > in between). 
> > 
> > I will backport it when it reaches stable queue.
> 
> Thank you.
> 
> Testing, at home and at work, has confirmed your two patches are good.

Great. Thanks a lot!

> Greg's particular test on 3.11 gave convincing results, and I was helped
> by Linus's tree of last night (df32e43a54d0) happening to be quite quick
> to reproduce the issue on my laptop.
> 
> > 
> > > Patch 2/2 is what's needed to fix 3.12 and 3.13 (but applies
> > > correctly to neither of them because it's diffed on top of my CSS_ONLINE
> > > fix).  Patch 1/2 is correct but unnecessary in 3.12 and 3.13: I'm unclear
> > > whether Michal is claiming that it would also fix the hang in 3.12 and
> > > 3.13 if we didn't have 2/2: I doubt that, and haven't tested that.
> > 
> > Actually both patches are needed. If we had only 2/2 then we wouldn't
> > endless loop inside mem_cgroup_iter but we could still return root to
> > caller all the time because mem_cgroup_iter_load would return NULL on
> > css_tryget failure on the cached root. Or am I missing something that
> > would prevent that?
> 
> In theory I agree with you; and if you're missing something, I can't see
> it either.  But in practice, all my earlier testing of 3.12 and 3.13 was
> just with 2/2, and I've tried without your 1/2 since: whereas I have hung
> on 3.12 and 3.13 a convincing number of times without 2/2, I have never
> hung on them with 2/2 without 1/2.  In practice 1/2 appears essential
> for 3.10 and 3.11, but irrelevant for 3.12 and 3.13.
> 
> That could be easy to explain if there were a difference at the calling
> end, shrink_zone(), between those releases: but I don't see that.  Odd.
> Either we're both missing something, or my testing is even less reliable
> than I'd thought.  But since I certainly don't dispute 1/2, it is merely
> academic.  Though still bothersome.

I would assume that it is (sc->nr_reclaimed >= sc->nr_to_reclaim) that
helps us to back off. SWAP_CLUSTER_MAX shouldn't be that hard to get to
before css_offline racing part reparents all the memory.

Anyway, I would feel safer if this was pushed fixed although you haven't
reporoduced it.
 
> > > Given how Michal has diffed this patch on top of my CSS_ONLINE one
> > > (mm-memcg-iteration-skip-memcgs-not-yet-fully-initialized.patch),
> > > it would be helpful if you could mark that one also for stable 3.12+,
> > > to save us from having to rediff this one for stable.  We don't have
> > > a concrete example of a problem it solves in the vanilla kernel, but
> > > it makes more sense to include it than to exclude it.
> > 
> > Yes, I think it makes sense to queue it for 3.12+ as well because it is
> > non intrusive and potential issues would be really subtle.
> 
> Before Andrew sends these all off to Linus, I should admit that there's
> probably a refinement still to come to the CSS_ONLINE one.  I'm ashamed
> to admit that I overlooked a much earlier comment from Greg Thelen, who
> suggested that a memory barrier might be required.

I was thinking about mem barrier while reviewing your patch but then I
convinced myself that we should be safe also without using one when
checking CSS_ONLINE.
We have basically two situations.
	- online_css when we can miss it being set which is OK because
	  we would miss a new empty group.
	- offline_css when we could still see the flag being set but
	  then css_tryget would be already failing.

So while all this is subtle and relies on cgroup core heavily I think we
should be safe wrt. memory barriers.

Or did you mean something else here?

> I think he's right, and I'd have liked to say exactly what and where
> before answering you now; but barriers are tricky elusive things, I've
> not yet decided, and better report back to you now on the testing
> result.
> 
> Hugh

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
  2014-01-23 11:09                                       ` Michal Hocko
@ 2014-01-23 12:53                                         ` Hugh Dickins
  0 siblings, 0 replies; 30+ messages in thread
From: Hugh Dickins @ 2014-01-23 12:53 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Johannes Weiner, Greg Thelen, linux-mm, LKML

On Thu, 23 Jan 2014, Michal Hocko wrote:
> On Thu 23-01-14 02:42:58, Hugh Dickins wrote:
> > > 
> > > Actually both patches are needed. If we had only 2/2 then we wouldn't
> > > endless loop inside mem_cgroup_iter but we could still return root to
> > > caller all the time because mem_cgroup_iter_load would return NULL on
> > > css_tryget failure on the cached root. Or am I missing something that
> > > would prevent that?
> > 
> > In theory I agree with you; and if you're missing something, I can't see
> > it either.  But in practice, all my earlier testing of 3.12 and 3.13 was
> > just with 2/2, and I've tried without your 1/2 since: whereas I have hung
> > on 3.12 and 3.13 a convincing number of times without 2/2, I have never
> > hung on them with 2/2 without 1/2.  In practice 1/2 appears essential
> > for 3.10 and 3.11, but irrelevant for 3.12 and 3.13.
> > 
> > That could be easy to explain if there were a difference at the calling
> > end, shrink_zone(), between those releases: but I don't see that.  Odd.
> > Either we're both missing something, or my testing is even less reliable
> > than I'd thought.  But since I certainly don't dispute 1/2, it is merely
> > academic.  Though still bothersome.
> 
> I would assume that it is (sc->nr_reclaimed >= sc->nr_to_reclaim) that
> helps us to back off. SWAP_CLUSTER_MAX shouldn't be that hard to get to
> before css_offline racing part reparents all the memory.

But wouldn't explain why I could see it on 3.10,11 but not on 3.12,13.

Perhaps the 2/2 problem is a lot easier to hit than the 1/2 problem,
and I mistakenly expected to see the 1/2 problem in the timescale that
I saw the 2/2 problem; but I don't really think either is the case.

> 
> Anyway, I would feel safer if this was pushed fixed although you haven't
> reporoduced it.

Absolutely.

> > Before Andrew sends these all off to Linus, I should admit that there's
> > probably a refinement still to come to the CSS_ONLINE one.  I'm ashamed
> > to admit that I overlooked a much earlier comment from Greg Thelen, who
> > suggested that a memory barrier might be required.
> 
> I was thinking about mem barrier while reviewing your patch but then I
> convinced myself that we should be safe also without using one when
> checking CSS_ONLINE.
> We have basically two situations.
> 	- online_css when we can miss it being set which is OK because
> 	  we would miss a new empty group.
> 	- offline_css when we could still see the flag being set but
> 	  then css_tryget would be already failing.
> 
> So while all this is subtle and relies on cgroup core heavily I think we
> should be safe wrt. memory barriers.
> 
> Or did you mean something else here?

Something else.  My CSS_OFFLINE patch claims to prevent the iterator
from returning an uninitialized struct mem_cgroup: if that is to be
relied upon, then it needs to make sure that the initialization of
the mem_cgroup is visible to the caller before the CSS_OFFLINE flag.

kernel/cgroup.c online_css() does nowadays have an smp_wmb() buried
in its rcu_assign_pointer(); but it's not in the right place to
make this particular guarantee.  And an smp_rmb() needed somewhere
too, if it doesn't already come for free somehow.

Hugh

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2014-01-23 12:53 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-14  1:50 [PATCH 1/3] mm/memcg: fix last_dead_count memory wastage Hugh Dickins
2014-01-14  1:52 ` [PATCH 2/3] mm/memcg: fix endless iteration in reclaim Hugh Dickins
2014-01-14 13:27   ` Michal Hocko
2014-01-14 13:34     ` Michal Hocko
2014-01-14 14:26     ` Michal Hocko
2014-01-14 20:42       ` Hugh Dickins
2014-01-15  9:58         ` Michal Hocko
2014-01-15 12:17           ` Michal Hocko
2014-01-15 21:24             ` Hugh Dickins
2014-01-16  8:17               ` Michal Hocko
2014-01-16 15:22                 ` Michal Hocko
2014-01-16 19:15                   ` Hugh Dickins
2014-01-17 15:41                     ` Michal Hocko
2014-01-21  5:16                       ` Hugh Dickins
2014-01-21  8:34                         ` Michal Hocko
2014-01-21 10:45                           ` [PATCH -mm 1/2] memcg: fix endless loop caused by mem_cgroup_iter Michal Hocko
2014-01-21 10:45                             ` [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter Michal Hocko
2014-01-21 19:42                               ` Andrew Morton
2014-01-21 21:18                                 ` Hugh Dickins
2014-01-22  8:27                                   ` Michal Hocko
2014-01-23 10:42                                     ` Hugh Dickins
2014-01-23 11:09                                       ` Michal Hocko
2014-01-23 12:53                                         ` Hugh Dickins
2014-01-22  8:12                                 ` Michal Hocko
2014-01-14  1:54 ` [PATCH 3/3] mm/memcg: iteration skip memcgs not yet fully initialized Hugh Dickins
2014-01-14 13:30   ` Michal Hocko
2014-01-14 14:29     ` Tejun Heo
2014-01-15  8:20       ` Michal Hocko
2014-01-15  8:21   ` Michal Hocko
2014-01-14 13:03 ` [PATCH 1/3] mm/memcg: fix last_dead_count memory wastage Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).