All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] memcg: fix endless loop in __mem_cgroup_iter_next
@ 2014-02-13  1:26 ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2014-02-13  1:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Greg Thelen, Tejun Heo, linux-mm,
	linux-kernel

Commit 0eef615665ed ("memcg: fix css reference leak and endless loop in
mem_cgroup_iter") got the interaction with the commit a few before it
d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully initialized")
slightly wrong, and we didn't notice at the time.

It's elusive, and harder to get than the original, but for a couple of
days before rc1, I several times saw a endless loop similar to that
supposedly being fixed.

This time it was a tighter loop in __mem_cgroup_iter_next(): because we
can get here when our root has already been offlined, and the ordering
of conditions was such that we then just cycled around forever.

Fixes: 0eef615665ed ("memcg: fix css reference leak and endless loop in mem_cgroup_iter")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # 3.12+
---
Of course I'd have preferred to send this before that commit went through
to -stable, but priorities kept preempting; I did wonder whether to ask
GregKH to delay it, but decided it's not serious enough to trouble him,
just go with the flow of stable fixing stable.

 mm/memcontrol.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- 3.14-rc2/mm/memcontrol.c	2014-02-02 18:49:07.897302115 -0800
+++ linux/mm/memcontrol.c	2014-02-12 11:55:02.836035004 -0800
@@ -1127,8 +1127,8 @@ skip_node:
 	 * skipping css reference should be safe.
 	 */
 	if (next_css) {
-		if ((next_css->flags & CSS_ONLINE) &&
-				(next_css == &root->css || css_tryget(next_css)))
+		if ((next_css == &root->css) ||
+		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css)))
 			return mem_cgroup_from_css(next_css);
 
 		prev_css = next_css;

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] memcg: fix endless loop in __mem_cgroup_iter_next
@ 2014-02-13  1:26 ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2014-02-13  1:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Greg Thelen, Tejun Heo, linux-mm,
	linux-kernel

Commit 0eef615665ed ("memcg: fix css reference leak and endless loop in
mem_cgroup_iter") got the interaction with the commit a few before it
d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully initialized")
slightly wrong, and we didn't notice at the time.

It's elusive, and harder to get than the original, but for a couple of
days before rc1, I several times saw a endless loop similar to that
supposedly being fixed.

This time it was a tighter loop in __mem_cgroup_iter_next(): because we
can get here when our root has already been offlined, and the ordering
of conditions was such that we then just cycled around forever.

Fixes: 0eef615665ed ("memcg: fix css reference leak and endless loop in mem_cgroup_iter")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # 3.12+
---
Of course I'd have preferred to send this before that commit went through
to -stable, but priorities kept preempting; I did wonder whether to ask
GregKH to delay it, but decided it's not serious enough to trouble him,
just go with the flow of stable fixing stable.

 mm/memcontrol.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- 3.14-rc2/mm/memcontrol.c	2014-02-02 18:49:07.897302115 -0800
+++ linux/mm/memcontrol.c	2014-02-12 11:55:02.836035004 -0800
@@ -1127,8 +1127,8 @@ skip_node:
 	 * skipping css reference should be safe.
 	 */
 	if (next_css) {
-		if ((next_css->flags & CSS_ONLINE) &&
-				(next_css == &root->css || css_tryget(next_css)))
+		if ((next_css == &root->css) ||
+		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css)))
 			return mem_cgroup_from_css(next_css);
 
 		prev_css = next_css;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 2/2] memcg: barriers to see memcgs as fully initialized
  2014-02-13  1:26 ` Hugh Dickins
@ 2014-02-13  1:29   ` Hugh Dickins
  -1 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2014-02-13  1:29 UTC (permalink / raw)
  To: Michal Hocko, Tejun Heo
  Cc: Andrew Morton, Johannes Weiner, Greg Thelen, linux-mm, linux-kernel

Commit d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully
initialized") is not bad, but Greg Thelen asks "Are barriers needed?"

Yes, I'm afraid so: this makes it a little heavier than the original,
but there's no point in guaranteeing that mem_cgroup_iter() returns only
fully initialized memcgs, if we don't guarantee that the initialization
is visible.

If we move online_css()'s setting CSS_ONLINE after rcu_assign_pointer()
(I don't see why not), we can reasonably rely on the smp_wmb() in that.
But I can't find a pre-existing barrier at the mem_cgroup_iter() end,
so add an smp_rmb() where __mem_cgroup_iter_next() returns non-NULL.

Fixes: d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully initialized")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # 3.12+
---
I'd have been happier not to have to add this patch: maybe you can see
a better placement, or a way we can avoid this altogether.

 kernel/cgroup.c |    8 +++++++-
 mm/memcontrol.c |   11 +++++++++--
 2 files changed, 16 insertions(+), 3 deletions(-)

--- 3.14-rc2+/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
+++ linux/kernel/cgroup.c	2014-02-12 11:59:52.804041895 -0800
@@ -4063,9 +4063,15 @@ static int online_css(struct cgroup_subs
 	if (ss->css_online)
 		ret = ss->css_online(css);
 	if (!ret) {
-		css->flags |= CSS_ONLINE;
 		css->cgroup->nr_css++;
 		rcu_assign_pointer(css->cgroup->subsys[ss->subsys_id], css);
+		/*
+		 * Set CSS_ONLINE after rcu_assign_pointer(), so that its
+		 * smp_wmb() will guarantee that those seeing CSS_ONLINE
+		 * can see the initialization done in ss->css_online() - if
+		 * they provide an smp_rmb(), as in __mem_cgroup_iter_next().
+		 */
+		css->flags |= CSS_ONLINE;
 	}
 	return ret;
 }
--- 3.14-rc2+/mm/memcontrol.c	2014-02-12 11:55:02.836035004 -0800
+++ linux/mm/memcontrol.c	2014-02-12 11:59:52.804041895 -0800
@@ -1128,9 +1128,16 @@ skip_node:
 	 */
 	if (next_css) {
 		if ((next_css == &root->css) ||
-		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css)))
+		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))) {
+			/*
+			 * Ensure that all memcg initialization, done before
+			 * CSS_ONLINE was set, will be visible to our caller.
+			 * This matches the smp_wmb() in online_css()'s
+			 * rcu_assign_pointer(), before it set CSS_ONLINE.
+			 */
+			smp_rmb();
 			return mem_cgroup_from_css(next_css);
-
+		}
 		prev_css = next_css;
 		goto skip_node;
 	}

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 2/2] memcg: barriers to see memcgs as fully initialized
@ 2014-02-13  1:29   ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2014-02-13  1:29 UTC (permalink / raw)
  To: Michal Hocko, Tejun Heo
  Cc: Andrew Morton, Johannes Weiner, Greg Thelen, linux-mm, linux-kernel

Commit d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully
initialized") is not bad, but Greg Thelen asks "Are barriers needed?"

Yes, I'm afraid so: this makes it a little heavier than the original,
but there's no point in guaranteeing that mem_cgroup_iter() returns only
fully initialized memcgs, if we don't guarantee that the initialization
is visible.

If we move online_css()'s setting CSS_ONLINE after rcu_assign_pointer()
(I don't see why not), we can reasonably rely on the smp_wmb() in that.
But I can't find a pre-existing barrier at the mem_cgroup_iter() end,
so add an smp_rmb() where __mem_cgroup_iter_next() returns non-NULL.

Fixes: d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully initialized")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # 3.12+
---
I'd have been happier not to have to add this patch: maybe you can see
a better placement, or a way we can avoid this altogether.

 kernel/cgroup.c |    8 +++++++-
 mm/memcontrol.c |   11 +++++++++--
 2 files changed, 16 insertions(+), 3 deletions(-)

--- 3.14-rc2+/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
+++ linux/kernel/cgroup.c	2014-02-12 11:59:52.804041895 -0800
@@ -4063,9 +4063,15 @@ static int online_css(struct cgroup_subs
 	if (ss->css_online)
 		ret = ss->css_online(css);
 	if (!ret) {
-		css->flags |= CSS_ONLINE;
 		css->cgroup->nr_css++;
 		rcu_assign_pointer(css->cgroup->subsys[ss->subsys_id], css);
+		/*
+		 * Set CSS_ONLINE after rcu_assign_pointer(), so that its
+		 * smp_wmb() will guarantee that those seeing CSS_ONLINE
+		 * can see the initialization done in ss->css_online() - if
+		 * they provide an smp_rmb(), as in __mem_cgroup_iter_next().
+		 */
+		css->flags |= CSS_ONLINE;
 	}
 	return ret;
 }
--- 3.14-rc2+/mm/memcontrol.c	2014-02-12 11:55:02.836035004 -0800
+++ linux/mm/memcontrol.c	2014-02-12 11:59:52.804041895 -0800
@@ -1128,9 +1128,16 @@ skip_node:
 	 */
 	if (next_css) {
 		if ((next_css == &root->css) ||
-		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css)))
+		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))) {
+			/*
+			 * Ensure that all memcg initialization, done before
+			 * CSS_ONLINE was set, will be visible to our caller.
+			 * This matches the smp_wmb() in online_css()'s
+			 * rcu_assign_pointer(), before it set CSS_ONLINE.
+			 */
+			smp_rmb();
 			return mem_cgroup_from_css(next_css);
-
+		}
 		prev_css = next_css;
 		goto skip_node;
 	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] memcg: fix endless loop in __mem_cgroup_iter_next
  2014-02-13  1:26 ` Hugh Dickins
@ 2014-02-13 14:23   ` Michal Hocko
  -1 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2014-02-13 14:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Johannes Weiner, Greg Thelen, Tejun Heo, linux-mm,
	linux-kernel

On Wed 12-02-14 17:26:46, Hugh Dickins wrote:
> Commit 0eef615665ed ("memcg: fix css reference leak and endless loop in
> mem_cgroup_iter") got the interaction with the commit a few before it
> d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully initialized")
> slightly wrong, and we didn't notice at the time.
> 
> It's elusive, and harder to get than the original, but for a couple of
> days before rc1, I several times saw a endless loop similar to that
> supposedly being fixed.
> 
> This time it was a tighter loop in __mem_cgroup_iter_next(): because we
> can get here when our root has already been offlined, and the ordering
> of conditions was such that we then just cycled around forever.
> 
> Fixes: 0eef615665ed ("memcg: fix css reference leak and endless loop in mem_cgroup_iter")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # 3.12+

You are right of course. This is really embarrassing. I should have
noticed this when porting my original patch on top of yours.

Acked-by: Michal Hocko <mhocko@suse.cz>

Thanks!

> ---
> Of course I'd have preferred to send this before that commit went through
> to -stable, but priorities kept preempting; I did wonder whether to ask
> GregKH to delay it, but decided it's not serious enough to trouble him,
> just go with the flow of stable fixing stable.
> 
>  mm/memcontrol.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- 3.14-rc2/mm/memcontrol.c	2014-02-02 18:49:07.897302115 -0800
> +++ linux/mm/memcontrol.c	2014-02-12 11:55:02.836035004 -0800
> @@ -1127,8 +1127,8 @@ skip_node:
>  	 * skipping css reference should be safe.
>  	 */
>  	if (next_css) {
> -		if ((next_css->flags & CSS_ONLINE) &&
> -				(next_css == &root->css || css_tryget(next_css)))
> +		if ((next_css == &root->css) ||
> +		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css)))
>  			return mem_cgroup_from_css(next_css);
>  
>  		prev_css = next_css;

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] memcg: fix endless loop in __mem_cgroup_iter_next
@ 2014-02-13 14:23   ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2014-02-13 14:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Johannes Weiner, Greg Thelen, Tejun Heo, linux-mm,
	linux-kernel

On Wed 12-02-14 17:26:46, Hugh Dickins wrote:
> Commit 0eef615665ed ("memcg: fix css reference leak and endless loop in
> mem_cgroup_iter") got the interaction with the commit a few before it
> d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully initialized")
> slightly wrong, and we didn't notice at the time.
> 
> It's elusive, and harder to get than the original, but for a couple of
> days before rc1, I several times saw a endless loop similar to that
> supposedly being fixed.
> 
> This time it was a tighter loop in __mem_cgroup_iter_next(): because we
> can get here when our root has already been offlined, and the ordering
> of conditions was such that we then just cycled around forever.
> 
> Fixes: 0eef615665ed ("memcg: fix css reference leak and endless loop in mem_cgroup_iter")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # 3.12+

You are right of course. This is really embarrassing. I should have
noticed this when porting my original patch on top of yours.

Acked-by: Michal Hocko <mhocko@suse.cz>

Thanks!

> ---
> Of course I'd have preferred to send this before that commit went through
> to -stable, but priorities kept preempting; I did wonder whether to ask
> GregKH to delay it, but decided it's not serious enough to trouble him,
> just go with the flow of stable fixing stable.
> 
>  mm/memcontrol.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- 3.14-rc2/mm/memcontrol.c	2014-02-02 18:49:07.897302115 -0800
> +++ linux/mm/memcontrol.c	2014-02-12 11:55:02.836035004 -0800
> @@ -1127,8 +1127,8 @@ skip_node:
>  	 * skipping css reference should be safe.
>  	 */
>  	if (next_css) {
> -		if ((next_css->flags & CSS_ONLINE) &&
> -				(next_css == &root->css || css_tryget(next_css)))
> +		if ((next_css == &root->css) ||
> +		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css)))
>  			return mem_cgroup_from_css(next_css);
>  
>  		prev_css = next_css;

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] memcg: barriers to see memcgs as fully initialized
  2014-02-13  1:29   ` Hugh Dickins
@ 2014-02-13 14:53     ` Michal Hocko
  -1 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2014-02-13 14:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tejun Heo, Andrew Morton, Johannes Weiner, Greg Thelen, linux-mm,
	linux-kernel

On Wed 12-02-14 17:29:09, Hugh Dickins wrote:
> Commit d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully
> initialized") is not bad, but Greg Thelen asks "Are barriers needed?"
> 
> Yes, I'm afraid so: this makes it a little heavier than the original,
> but there's no point in guaranteeing that mem_cgroup_iter() returns only
> fully initialized memcgs, if we don't guarantee that the initialization
> is visible.
> 
> If we move online_css()'s setting CSS_ONLINE after rcu_assign_pointer()
> (I don't see why not), we can reasonably rely on the smp_wmb() in that.
> But I can't find a pre-existing barrier at the mem_cgroup_iter() end,
> so add an smp_rmb() where __mem_cgroup_iter_next() returns non-NULL.
> 
> Fixes: d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully initialized")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # 3.12+
> ---
> I'd have been happier not to have to add this patch: maybe you can see
> a better placement, or a way we can avoid this altogether.

I don't know. I have thought about this again and I really do not see
why we have to provide such a guarantee, to be honest.

Such a half initialized memcg wouldn't see its hierarchical parent
properly (including inheritted attributes) and it wouldn't have kmem
fully initialized. But it also wouldn't have any tasks in it IIRC so it
shouldn't matter much.

So I really don't know whether this all is worth all the troubles. 
I am not saying your patch is wrong (although I am not sure whether
css->flags vs. subsystem css association ordering is relevant and
ae7f164a09408 changelog didn't help me much) and it made sense when
you proposed it back then but the additional ordering requirements
complicates the thing.

I will keep thinking about that.

>  kernel/cgroup.c |    8 +++++++-
>  mm/memcontrol.c |   11 +++++++++--
>  2 files changed, 16 insertions(+), 3 deletions(-)
> 
> --- 3.14-rc2+/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
> +++ linux/kernel/cgroup.c	2014-02-12 11:59:52.804041895 -0800
> @@ -4063,9 +4063,15 @@ static int online_css(struct cgroup_subs
>  	if (ss->css_online)
>  		ret = ss->css_online(css);
>  	if (!ret) {
> -		css->flags |= CSS_ONLINE;
>  		css->cgroup->nr_css++;
>  		rcu_assign_pointer(css->cgroup->subsys[ss->subsys_id], css);
> +		/*
> +		 * Set CSS_ONLINE after rcu_assign_pointer(), so that its
> +		 * smp_wmb() will guarantee that those seeing CSS_ONLINE
> +		 * can see the initialization done in ss->css_online() - if
> +		 * they provide an smp_rmb(), as in __mem_cgroup_iter_next().
> +		 */
> +		css->flags |= CSS_ONLINE;
>  	}
>  	return ret;
>  }
> --- 3.14-rc2+/mm/memcontrol.c	2014-02-12 11:55:02.836035004 -0800
> +++ linux/mm/memcontrol.c	2014-02-12 11:59:52.804041895 -0800
> @@ -1128,9 +1128,16 @@ skip_node:
>  	 */
>  	if (next_css) {
>  		if ((next_css == &root->css) ||
> -		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css)))
> +		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))) {
> +			/*
> +			 * Ensure that all memcg initialization, done before
> +			 * CSS_ONLINE was set, will be visible to our caller.
> +			 * This matches the smp_wmb() in online_css()'s
> +			 * rcu_assign_pointer(), before it set CSS_ONLINE.
> +			 */
> +			smp_rmb();
>  			return mem_cgroup_from_css(next_css);
> -
> +		}
>  		prev_css = next_css;
>  		goto skip_node;
>  	}
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] memcg: barriers to see memcgs as fully initialized
@ 2014-02-13 14:53     ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2014-02-13 14:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tejun Heo, Andrew Morton, Johannes Weiner, Greg Thelen, linux-mm,
	linux-kernel

On Wed 12-02-14 17:29:09, Hugh Dickins wrote:
> Commit d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully
> initialized") is not bad, but Greg Thelen asks "Are barriers needed?"
> 
> Yes, I'm afraid so: this makes it a little heavier than the original,
> but there's no point in guaranteeing that mem_cgroup_iter() returns only
> fully initialized memcgs, if we don't guarantee that the initialization
> is visible.
> 
> If we move online_css()'s setting CSS_ONLINE after rcu_assign_pointer()
> (I don't see why not), we can reasonably rely on the smp_wmb() in that.
> But I can't find a pre-existing barrier at the mem_cgroup_iter() end,
> so add an smp_rmb() where __mem_cgroup_iter_next() returns non-NULL.
> 
> Fixes: d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully initialized")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # 3.12+
> ---
> I'd have been happier not to have to add this patch: maybe you can see
> a better placement, or a way we can avoid this altogether.

I don't know. I have thought about this again and I really do not see
why we have to provide such a guarantee, to be honest.

Such a half initialized memcg wouldn't see its hierarchical parent
properly (including inheritted attributes) and it wouldn't have kmem
fully initialized. But it also wouldn't have any tasks in it IIRC so it
shouldn't matter much.

So I really don't know whether this all is worth all the troubles. 
I am not saying your patch is wrong (although I am not sure whether
css->flags vs. subsystem css association ordering is relevant and
ae7f164a09408 changelog didn't help me much) and it made sense when
you proposed it back then but the additional ordering requirements
complicates the thing.

I will keep thinking about that.

>  kernel/cgroup.c |    8 +++++++-
>  mm/memcontrol.c |   11 +++++++++--
>  2 files changed, 16 insertions(+), 3 deletions(-)
> 
> --- 3.14-rc2+/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
> +++ linux/kernel/cgroup.c	2014-02-12 11:59:52.804041895 -0800
> @@ -4063,9 +4063,15 @@ static int online_css(struct cgroup_subs
>  	if (ss->css_online)
>  		ret = ss->css_online(css);
>  	if (!ret) {
> -		css->flags |= CSS_ONLINE;
>  		css->cgroup->nr_css++;
>  		rcu_assign_pointer(css->cgroup->subsys[ss->subsys_id], css);
> +		/*
> +		 * Set CSS_ONLINE after rcu_assign_pointer(), so that its
> +		 * smp_wmb() will guarantee that those seeing CSS_ONLINE
> +		 * can see the initialization done in ss->css_online() - if
> +		 * they provide an smp_rmb(), as in __mem_cgroup_iter_next().
> +		 */
> +		css->flags |= CSS_ONLINE;
>  	}
>  	return ret;
>  }
> --- 3.14-rc2+/mm/memcontrol.c	2014-02-12 11:55:02.836035004 -0800
> +++ linux/mm/memcontrol.c	2014-02-12 11:59:52.804041895 -0800
> @@ -1128,9 +1128,16 @@ skip_node:
>  	 */
>  	if (next_css) {
>  		if ((next_css == &root->css) ||
> -		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css)))
> +		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))) {
> +			/*
> +			 * Ensure that all memcg initialization, done before
> +			 * CSS_ONLINE was set, will be visible to our caller.
> +			 * This matches the smp_wmb() in online_css()'s
> +			 * rcu_assign_pointer(), before it set CSS_ONLINE.
> +			 */
> +			smp_rmb();
>  			return mem_cgroup_from_css(next_css);
> -
> +		}
>  		prev_css = next_css;
>  		goto skip_node;
>  	}
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] memcg: barriers to see memcgs as fully initialized
  2014-02-13  1:29   ` Hugh Dickins
@ 2014-02-13 21:07     ` Tejun Heo
  -1 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2014-02-13 21:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Greg Thelen,
	linux-mm, linux-kernel

Hello, Hugh.

On Wed, Feb 12, 2014 at 05:29:09PM -0800, Hugh Dickins wrote:
> Commit d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully
> initialized") is not bad, but Greg Thelen asks "Are barriers needed?"
> 
> Yes, I'm afraid so: this makes it a little heavier than the original,
> but there's no point in guaranteeing that mem_cgroup_iter() returns only
> fully initialized memcgs, if we don't guarantee that the initialization
> is visible.
> 
> If we move online_css()'s setting CSS_ONLINE after rcu_assign_pointer()
> (I don't see why not), we can reasonably rely on the smp_wmb() in that.
> But I can't find a pre-existing barrier at the mem_cgroup_iter() end,
> so add an smp_rmb() where __mem_cgroup_iter_next() returns non-NULL.

Hmmm.... so, CSS_ONLINE was never meant to be used outside cgroup
proper.  The only guarantee that the css iterators make is that a css
which has finished its ->css_online() will be included in the
iteration, which implies that css's which haven't finished
->css_online() or already went past ->css_offline() may be included in
the iteration.  In fact, it's impossible to achieve the guarantee
without such implications if we want to avoid synchronizing everything
using common locking, which we apparently can't do across different
controllers.

The expectation is that if a controller needs to distinguish fully
online css's, it will perform its own synchronization among its
online, offline and iterations, which can usually be achieved through
per-css synchronization. There is asymmetry here due to the way
css_tryget() behaves.  Unfortuantely, I don't think it can be expanded
to become symmetrical for online testing without adding, say,
->css_post_online() callback.

So, the only thing that memcg can depend on while iterating is that it
will include all css's which finished ->css_online() and if memcg
wants to filter out the ones which haven't yet, it should do its own
marking in ->css_online() rather than depending on what cgroup core
does with the flags.  That way, locking rules are a lot more evident
in each subsystem and we don't end up depending on cgroup internal
details which aren't immediately obvious.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] memcg: barriers to see memcgs as fully initialized
@ 2014-02-13 21:07     ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2014-02-13 21:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Greg Thelen,
	linux-mm, linux-kernel

Hello, Hugh.

On Wed, Feb 12, 2014 at 05:29:09PM -0800, Hugh Dickins wrote:
> Commit d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully
> initialized") is not bad, but Greg Thelen asks "Are barriers needed?"
> 
> Yes, I'm afraid so: this makes it a little heavier than the original,
> but there's no point in guaranteeing that mem_cgroup_iter() returns only
> fully initialized memcgs, if we don't guarantee that the initialization
> is visible.
> 
> If we move online_css()'s setting CSS_ONLINE after rcu_assign_pointer()
> (I don't see why not), we can reasonably rely on the smp_wmb() in that.
> But I can't find a pre-existing barrier at the mem_cgroup_iter() end,
> so add an smp_rmb() where __mem_cgroup_iter_next() returns non-NULL.

Hmmm.... so, CSS_ONLINE was never meant to be used outside cgroup
proper.  The only guarantee that the css iterators make is that a css
which has finished its ->css_online() will be included in the
iteration, which implies that css's which haven't finished
->css_online() or already went past ->css_offline() may be included in
the iteration.  In fact, it's impossible to achieve the guarantee
without such implications if we want to avoid synchronizing everything
using common locking, which we apparently can't do across different
controllers.

The expectation is that if a controller needs to distinguish fully
online css's, it will perform its own synchronization among its
online, offline and iterations, which can usually be achieved through
per-css synchronization. There is asymmetry here due to the way
css_tryget() behaves.  Unfortuantely, I don't think it can be expanded
to become symmetrical for online testing without adding, say,
->css_post_online() callback.

So, the only thing that memcg can depend on while iterating is that it
will include all css's which finished ->css_online() and if memcg
wants to filter out the ones which haven't yet, it should do its own
marking in ->css_online() rather than depending on what cgroup core
does with the flags.  That way, locking rules are a lot more evident
in each subsystem and we don't end up depending on cgroup internal
details which aren't immediately obvious.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] memcg: barriers to see memcgs as fully initialized
  2014-02-13 14:53     ` Michal Hocko
@ 2014-02-16  2:52       ` Hugh Dickins
  -1 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2014-02-16  2:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Tejun Heo, Andrew Morton, Johannes Weiner,
	Greg Thelen, linux-mm, linux-kernel

On Thu, 13 Feb 2014, Michal Hocko wrote:
> On Wed 12-02-14 17:29:09, Hugh Dickins wrote:
> > Commit d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully
> > initialized") is not bad, but Greg Thelen asks "Are barriers needed?"
> > 
> > Yes, I'm afraid so: this makes it a little heavier than the original,
> > but there's no point in guaranteeing that mem_cgroup_iter() returns only
> > fully initialized memcgs, if we don't guarantee that the initialization
> > is visible.
> > 
> > If we move online_css()'s setting CSS_ONLINE after rcu_assign_pointer()
> > (I don't see why not), we can reasonably rely on the smp_wmb() in that.
> > But I can't find a pre-existing barrier at the mem_cgroup_iter() end,
> > so add an smp_rmb() where __mem_cgroup_iter_next() returns non-NULL.
> > 
> > Fixes: d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully initialized")
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: stable@vger.kernel.org # 3.12+
> > ---
> > I'd have been happier not to have to add this patch: maybe you can see
> > a better placement, or a way we can avoid this altogether.
> 
> I don't know. I have thought about this again and I really do not see
> why we have to provide such a guarantee, to be honest.
> 
> Such a half initialized memcg wouldn't see its hierarchical parent
> properly (including inheritted attributes) and it wouldn't have kmem
> fully initialized. But it also wouldn't have any tasks in it IIRC so it
> shouldn't matter much.
> 
> So I really don't know whether this all is worth all the troubles. 
> I am not saying your patch is wrong (although I am not sure whether
> css->flags vs. subsystem css association ordering is relevant and
> ae7f164a09408 changelog didn't help me much) and it made sense when
> you proposed it back then but the additional ordering requirements
> complicates the thing.

Your feelings match mine exactly: nice enough when it was just a matter
of testing a flag, but rather a bore to have to go adding barriers.
And Tejun didn't like it either, would prefer the barriers to be
internal to memcg, if we really need them.

No surprise: it's why I made it an easily skippable 2/2.
Let's forget this patch - but I still don't want to remove the
CSS_ONLINE check, not yet anyway.

At the time I added that check, I only had out-of-tree changes
and lockdep weirdness in support of it.  I did spend a little time
yesterday looking to see if there's a stronger case, thought I'd
found one, but looking again don't see it - I think I was muddling
stats with RES_USAGE.

The kind of case I was looking for was stats gathering doing a
res_counter_read_u64() inside a for_each_mem_cgroup() loop.  On
a 32-bit kernel, res_counter_read_u64() has to use the spinlock
which is not initialized until mem_cgroup_css_online().  Which
it should manage with unadorned ticket lock, but then the unlock
might race with its initialization (I'm not sure how that will
then behave).  But actually I don't think stats gathering ever
does iterative res_counter_reads (and there may be good design
reasons why that would never make sense).

Now I do see the existing mem_cgroup_soft_reclaim() doing a 
res_counter_soft_limit_excess() in a mem_cgroup_iter() loop:
I guess that is vulnerable, even on 64-bit, and a lot safer
with the CSS_ONLINE check, even lacking barriers.

I'm thinking that we'd be safer if those res_counters
initialized in mem_cgroup_css_online() could be initialized in
mem_cgroup_css_alloc(), then updated in mem_cgroup_css_online();
but I don't think res_counter.c offers that option today,
not to change parent (or could parent be set from the start?
maybe that gets into races with setting use_hierarchy).

I haven't looked into what memcg_init_kmem() gets up to,
and whether that's safe before it's initialized.

Not something urgent I'm intending to rush into, and please don't
feel you need rush to respond, these are just thoughts for later:
let's move away from the CSS_ONLINE check and barriers, and
towards having the struct mem_cgroup sensibly initialized earlier.

Hugh

> 
> I will keep thinking about that.
> 
> >  kernel/cgroup.c |    8 +++++++-
> >  mm/memcontrol.c |   11 +++++++++--
> >  2 files changed, 16 insertions(+), 3 deletions(-)
> > 
> > --- 3.14-rc2+/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
> > +++ linux/kernel/cgroup.c	2014-02-12 11:59:52.804041895 -0800
> > @@ -4063,9 +4063,15 @@ static int online_css(struct cgroup_subs
> >  	if (ss->css_online)
> >  		ret = ss->css_online(css);
> >  	if (!ret) {
> > -		css->flags |= CSS_ONLINE;
> >  		css->cgroup->nr_css++;
> >  		rcu_assign_pointer(css->cgroup->subsys[ss->subsys_id], css);
> > +		/*
> > +		 * Set CSS_ONLINE after rcu_assign_pointer(), so that its
> > +		 * smp_wmb() will guarantee that those seeing CSS_ONLINE
> > +		 * can see the initialization done in ss->css_online() - if
> > +		 * they provide an smp_rmb(), as in __mem_cgroup_iter_next().
> > +		 */
> > +		css->flags |= CSS_ONLINE;
> >  	}
> >  	return ret;
> >  }
> > --- 3.14-rc2+/mm/memcontrol.c	2014-02-12 11:55:02.836035004 -0800
> > +++ linux/mm/memcontrol.c	2014-02-12 11:59:52.804041895 -0800
> > @@ -1128,9 +1128,16 @@ skip_node:
> >  	 */
> >  	if (next_css) {
> >  		if ((next_css == &root->css) ||
> > -		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css)))
> > +		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))) {
> > +			/*
> > +			 * Ensure that all memcg initialization, done before
> > +			 * CSS_ONLINE was set, will be visible to our caller.
> > +			 * This matches the smp_wmb() in online_css()'s
> > +			 * rcu_assign_pointer(), before it set CSS_ONLINE.
> > +			 */
> > +			smp_rmb();
> >  			return mem_cgroup_from_css(next_css);
> > -
> > +		}
> >  		prev_css = next_css;
> >  		goto skip_node;
> >  	}
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] memcg: barriers to see memcgs as fully initialized
@ 2014-02-16  2:52       ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2014-02-16  2:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Tejun Heo, Andrew Morton, Johannes Weiner,
	Greg Thelen, linux-mm, linux-kernel

On Thu, 13 Feb 2014, Michal Hocko wrote:
> On Wed 12-02-14 17:29:09, Hugh Dickins wrote:
> > Commit d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully
> > initialized") is not bad, but Greg Thelen asks "Are barriers needed?"
> > 
> > Yes, I'm afraid so: this makes it a little heavier than the original,
> > but there's no point in guaranteeing that mem_cgroup_iter() returns only
> > fully initialized memcgs, if we don't guarantee that the initialization
> > is visible.
> > 
> > If we move online_css()'s setting CSS_ONLINE after rcu_assign_pointer()
> > (I don't see why not), we can reasonably rely on the smp_wmb() in that.
> > But I can't find a pre-existing barrier at the mem_cgroup_iter() end,
> > so add an smp_rmb() where __mem_cgroup_iter_next() returns non-NULL.
> > 
> > Fixes: d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully initialized")
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: stable@vger.kernel.org # 3.12+
> > ---
> > I'd have been happier not to have to add this patch: maybe you can see
> > a better placement, or a way we can avoid this altogether.
> 
> I don't know. I have thought about this again and I really do not see
> why we have to provide such a guarantee, to be honest.
> 
> Such a half initialized memcg wouldn't see its hierarchical parent
> properly (including inheritted attributes) and it wouldn't have kmem
> fully initialized. But it also wouldn't have any tasks in it IIRC so it
> shouldn't matter much.
> 
> So I really don't know whether this all is worth all the troubles. 
> I am not saying your patch is wrong (although I am not sure whether
> css->flags vs. subsystem css association ordering is relevant and
> ae7f164a09408 changelog didn't help me much) and it made sense when
> you proposed it back then but the additional ordering requirements
> complicates the thing.

Your feelings match mine exactly: nice enough when it was just a matter
of testing a flag, but rather a bore to have to go adding barriers.
And Tejun didn't like it either, would prefer the barriers to be
internal to memcg, if we really need them.

No surprise: it's why I made it an easily skippable 2/2.
Let's forget this patch - but I still don't want to remove the
CSS_ONLINE check, not yet anyway.

At the time I added that check, I only had out-of-tree changes
and lockdep weirdness in support of it.  I did spend a little time
yesterday looking to see if there's a stronger case, thought I'd
found one, but looking again don't see it - I think I was muddling
stats with RES_USAGE.

The kind of case I was looking for was stats gathering doing a
res_counter_read_u64() inside a for_each_mem_cgroup() loop.  On
a 32-bit kernel, res_counter_read_u64() has to use the spinlock
which is not initialized until mem_cgroup_css_online().  Which
it should manage with unadorned ticket lock, but then the unlock
might race with its initialization (I'm not sure how that will
then behave).  But actually I don't think stats gathering ever
does iterative res_counter_reads (and there may be good design
reasons why that would never make sense).

Now I do see the existing mem_cgroup_soft_reclaim() doing a 
res_counter_soft_limit_excess() in a mem_cgroup_iter() loop:
I guess that is vulnerable, even on 64-bit, and a lot safer
with the CSS_ONLINE check, even lacking barriers.

I'm thinking that we'd be safer if those res_counters
initialized in mem_cgroup_css_online() could be initialized in
mem_cgroup_css_alloc(), then updated in mem_cgroup_css_online();
but I don't think res_counter.c offers that option today,
not to change parent (or could parent be set from the start?
maybe that gets into races with setting use_hierarchy).

I haven't looked into what memcg_init_kmem() gets up to,
and whether that's safe before it's initialized.

Not something urgent I'm intending to rush into, and please don't
feel you need rush to respond, these are just thoughts for later:
let's move away from the CSS_ONLINE check and barriers, and
towards having the struct mem_cgroup sensibly initialized earlier.

Hugh

> 
> I will keep thinking about that.
> 
> >  kernel/cgroup.c |    8 +++++++-
> >  mm/memcontrol.c |   11 +++++++++--
> >  2 files changed, 16 insertions(+), 3 deletions(-)
> > 
> > --- 3.14-rc2+/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
> > +++ linux/kernel/cgroup.c	2014-02-12 11:59:52.804041895 -0800
> > @@ -4063,9 +4063,15 @@ static int online_css(struct cgroup_subs
> >  	if (ss->css_online)
> >  		ret = ss->css_online(css);
> >  	if (!ret) {
> > -		css->flags |= CSS_ONLINE;
> >  		css->cgroup->nr_css++;
> >  		rcu_assign_pointer(css->cgroup->subsys[ss->subsys_id], css);
> > +		/*
> > +		 * Set CSS_ONLINE after rcu_assign_pointer(), so that its
> > +		 * smp_wmb() will guarantee that those seeing CSS_ONLINE
> > +		 * can see the initialization done in ss->css_online() - if
> > +		 * they provide an smp_rmb(), as in __mem_cgroup_iter_next().
> > +		 */
> > +		css->flags |= CSS_ONLINE;
> >  	}
> >  	return ret;
> >  }
> > --- 3.14-rc2+/mm/memcontrol.c	2014-02-12 11:55:02.836035004 -0800
> > +++ linux/mm/memcontrol.c	2014-02-12 11:59:52.804041895 -0800
> > @@ -1128,9 +1128,16 @@ skip_node:
> >  	 */
> >  	if (next_css) {
> >  		if ((next_css == &root->css) ||
> > -		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css)))
> > +		    ((next_css->flags & CSS_ONLINE) && css_tryget(next_css))) {
> > +			/*
> > +			 * Ensure that all memcg initialization, done before
> > +			 * CSS_ONLINE was set, will be visible to our caller.
> > +			 * This matches the smp_wmb() in online_css()'s
> > +			 * rcu_assign_pointer(), before it set CSS_ONLINE.
> > +			 */
> > +			smp_rmb();
> >  			return mem_cgroup_from_css(next_css);
> > -
> > +		}
> >  		prev_css = next_css;
> >  		goto skip_node;
> >  	}
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> -- 
> Michal Hocko
> SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-02-16  2:53 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-13  1:26 [PATCH 1/2] memcg: fix endless loop in __mem_cgroup_iter_next Hugh Dickins
2014-02-13  1:26 ` Hugh Dickins
2014-02-13  1:29 ` [PATCH 2/2] memcg: barriers to see memcgs as fully initialized Hugh Dickins
2014-02-13  1:29   ` Hugh Dickins
2014-02-13 14:53   ` Michal Hocko
2014-02-13 14:53     ` Michal Hocko
2014-02-16  2:52     ` Hugh Dickins
2014-02-16  2:52       ` Hugh Dickins
2014-02-13 21:07   ` Tejun Heo
2014-02-13 21:07     ` Tejun Heo
2014-02-13 14:23 ` [PATCH 1/2] memcg: fix endless loop in __mem_cgroup_iter_next Michal Hocko
2014-02-13 14:23   ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.