[PATCH] cgroup: use an ordered workqueue for cgroup destruction

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-06 23:56 ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-06 23:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Filipe Brandenburger, Li Zefan, Andrew Morton, Michal Hocko,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

Sometimes the cleanup after memcg hierarchy testing gets stuck in
mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

There may turn out to be several causes, but a major cause is this: the
workitem to offline parent can get run before workitem to offline child;
parent's mem_cgroup_reparent_charges() circles around waiting for the
child's pages to be reparented to its lrus, but it's holding cgroup_mutex
which prevents the child from reaching its mem_cgroup_reparent_charges().

Just use an ordered workqueue for cgroup_destroy_wq.

Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
Suggested-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # 3.10+
---

 kernel/cgroup.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- 3.14-rc1/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
+++ linux/kernel/cgroup.c	2014-02-06 15:20:35.548904965 -0800
@@ -4845,12 +4845,12 @@ static int __init cgroup_wq_init(void)
 	/*
 	 * There isn't much point in executing destruction path in
 	 * parallel.  Good chunk is serialized with cgroup_mutex anyway.
-	 * Use 1 for @max_active.
+	 * Must be ordered to make sure parent is offlined after children.
 	 *
 	 * We would prefer to do this in cgroup_init() above, but that
 	 * is called before init_workqueues(): so leave this until after.
 	 */
-	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
+	cgroup_destroy_wq = alloc_ordered_workqueue("cgroup_destroy", 0);
 	BUG_ON(!cgroup_destroy_wq);
 
 	/*

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-06 23:56 ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-06 23:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Filipe Brandenburger, Li Zefan, Andrew Morton, Michal Hocko,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

Sometimes the cleanup after memcg hierarchy testing gets stuck in
mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

There may turn out to be several causes, but a major cause is this: the
workitem to offline parent can get run before workitem to offline child;
parent's mem_cgroup_reparent_charges() circles around waiting for the
child's pages to be reparented to its lrus, but it's holding cgroup_mutex
which prevents the child from reaching its mem_cgroup_reparent_charges().

Just use an ordered workqueue for cgroup_destroy_wq.

Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
Suggested-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # 3.10+
---

 kernel/cgroup.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- 3.14-rc1/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
+++ linux/kernel/cgroup.c	2014-02-06 15:20:35.548904965 -0800
@@ -4845,12 +4845,12 @@ static int __init cgroup_wq_init(void)
 	/*
 	 * There isn't much point in executing destruction path in
 	 * parallel.  Good chunk is serialized with cgroup_mutex anyway.
-	 * Use 1 for @max_active.
+	 * Must be ordered to make sure parent is offlined after children.
 	 *
 	 * We would prefer to do this in cgroup_init() above, but that
 	 * is called before init_workqueues(): so leave this until after.
 	 */
-	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
+	cgroup_destroy_wq = alloc_ordered_workqueue("cgroup_destroy", 0);
 	BUG_ON(!cgroup_destroy_wq);
 
 	/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-06 23:56 ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-06 23:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Filipe Brandenburger, Li Zefan, Andrew Morton, Michal Hocko,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

Sometimes the cleanup after memcg hierarchy testing gets stuck in
mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

There may turn out to be several causes, but a major cause is this: the
workitem to offline parent can get run before workitem to offline child;
parent's mem_cgroup_reparent_charges() circles around waiting for the
child's pages to be reparented to its lrus, but it's holding cgroup_mutex
which prevents the child from reaching its mem_cgroup_reparent_charges().

Just use an ordered workqueue for cgroup_destroy_wq.

Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
Suggested-by: Filipe Brandenburger <filbranden-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org # 3.10+
---

 kernel/cgroup.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- 3.14-rc1/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
+++ linux/kernel/cgroup.c	2014-02-06 15:20:35.548904965 -0800
@@ -4845,12 +4845,12 @@ static int __init cgroup_wq_init(void)
 	/*
 	 * There isn't much point in executing destruction path in
 	 * parallel.  Good chunk is serialized with cgroup_mutex anyway.
-	 * Use 1 for @max_active.
+	 * Must be ordered to make sure parent is offlined after children.
 	 *
 	 * We would prefer to do this in cgroup_init() above, but that
 	 * is called before init_workqueues(): so leave this until after.
 	 */
-	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
+	cgroup_destroy_wq = alloc_ordered_workqueue("cgroup_destroy", 0);
 	BUG_ON(!cgroup_destroy_wq);
 
 	/*

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
  2014-02-06 23:56 ` Hugh Dickins
  (?)
@ 2014-02-07 13:45   ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2014-02-07 13:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tejun Heo, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Thu 06-02-14 15:56:01, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> 
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
> 
> Just use an ordered workqueue for cgroup_destroy_wq.

Hmm, interesting. Markus has seen hangs even with mem_cgroup_css_offline
and the referenced cgroup fixes, maybe this is the the right one
finally.

> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Suggested-by: Filipe Brandenburger <filbranden@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # 3.10+

Reviewed-by: Michal Hocko <mhocko@suse.cz>

e5fca243abae was marked for 3.9 stable but I do not see it in the Greg's
3.9 stable branch so 3.10+ seems to be sufficient.

> ---
> 
>  kernel/cgroup.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- 3.14-rc1/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
> +++ linux/kernel/cgroup.c	2014-02-06 15:20:35.548904965 -0800
> @@ -4845,12 +4845,12 @@ static int __init cgroup_wq_init(void)
>  	/*
>  	 * There isn't much point in executing destruction path in
>  	 * parallel.  Good chunk is serialized with cgroup_mutex anyway.
> -	 * Use 1 for @max_active.
> +	 * Must be ordered to make sure parent is offlined after children.
>  	 *
>  	 * We would prefer to do this in cgroup_init() above, but that
>  	 * is called before init_workqueues(): so leave this until after.
>  	 */
> -	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
> +	cgroup_destroy_wq = alloc_ordered_workqueue("cgroup_destroy", 0);
>  	BUG_ON(!cgroup_destroy_wq);
>  
>  	/*

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-07 13:45   ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2014-02-07 13:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tejun Heo, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Thu 06-02-14 15:56:01, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> 
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
> 
> Just use an ordered workqueue for cgroup_destroy_wq.

Hmm, interesting. Markus has seen hangs even with mem_cgroup_css_offline
and the referenced cgroup fixes, maybe this is the the right one
finally.

> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Suggested-by: Filipe Brandenburger <filbranden@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # 3.10+

Reviewed-by: Michal Hocko <mhocko@suse.cz>

e5fca243abae was marked for 3.9 stable but I do not see it in the Greg's
3.9 stable branch so 3.10+ seems to be sufficient.

> ---
> 
>  kernel/cgroup.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- 3.14-rc1/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
> +++ linux/kernel/cgroup.c	2014-02-06 15:20:35.548904965 -0800
> @@ -4845,12 +4845,12 @@ static int __init cgroup_wq_init(void)
>  	/*
>  	 * There isn't much point in executing destruction path in
>  	 * parallel.  Good chunk is serialized with cgroup_mutex anyway.
> -	 * Use 1 for @max_active.
> +	 * Must be ordered to make sure parent is offlined after children.
>  	 *
>  	 * We would prefer to do this in cgroup_init() above, but that
>  	 * is called before init_workqueues(): so leave this until after.
>  	 */
> -	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
> +	cgroup_destroy_wq = alloc_ordered_workqueue("cgroup_destroy", 0);
>  	BUG_ON(!cgroup_destroy_wq);
>  
>  	/*

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-07 13:45   ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2014-02-07 13:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tejun Heo, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Thu 06-02-14 15:56:01, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> 
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
> 
> Just use an ordered workqueue for cgroup_destroy_wq.

Hmm, interesting. Markus has seen hangs even with mem_cgroup_css_offline
and the referenced cgroup fixes, maybe this is the the right one
finally.

> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Suggested-by: Filipe Brandenburger <filbranden-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org # 3.10+

Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>

e5fca243abae was marked for 3.9 stable but I do not see it in the Greg's
3.9 stable branch so 3.10+ seems to be sufficient.

> ---
> 
>  kernel/cgroup.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- 3.14-rc1/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
> +++ linux/kernel/cgroup.c	2014-02-06 15:20:35.548904965 -0800
> @@ -4845,12 +4845,12 @@ static int __init cgroup_wq_init(void)
>  	/*
>  	 * There isn't much point in executing destruction path in
>  	 * parallel.  Good chunk is serialized with cgroup_mutex anyway.
> -	 * Use 1 for @max_active.
> +	 * Must be ordered to make sure parent is offlined after children.
>  	 *
>  	 * We would prefer to do this in cgroup_init() above, but that
>  	 * is called before init_workqueues(): so leave this until after.
>  	 */
> -	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
> +	cgroup_destroy_wq = alloc_ordered_workqueue("cgroup_destroy", 0);
>  	BUG_ON(!cgroup_destroy_wq);
>  
>  	/*

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
  2014-02-06 23:56 ` Hugh Dickins
@ 2014-02-07 14:04   ` Tejun Heo
  -1 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-02-07 14:04 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Filipe Brandenburger, Li Zefan, Andrew Morton, Michal Hocko,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

Hello, Hugh.

On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> 
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
> 
> Just use an ordered workqueue for cgroup_destroy_wq.

Hmmm... I'm not really comfortable with this.  This would seal shut
any possiblity of increasing concurrency in that path, which is okay
now but I find the combination of such long term commitment and the
non-obviousness (it's not apparent from looking at memcg code why it
wouldn't deadlock) very unappealing.  Besides, the only reason
offline() is currently called under cgroup_mutex is history.  We can
move it out of cgroup_mutex right now.

But even with offline being called outside cgroup_mutex, IIRC, the
described problem would still be able to deadlock as long as the tree
depth is deeper than max concurrency level of the destruction
workqueue.  Sure, we can give it large enough number but it's
generally nasty.

One thing I don't get is why memcg has such reverse dependency at all.
Why does the parent wait for its descendants to do something during
offline?  Shouldn't it be able to just bail and let whatever
descendant which is stil busy propagate things upwards?  That's a
usual pattern we use to tree shutdowns anyway.  Would that be nasty to
implement in memcg?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-07 14:04   ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-02-07 14:04 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Filipe Brandenburger, Li Zefan, Andrew Morton, Michal Hocko,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

Hello, Hugh.

On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> 
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
> 
> Just use an ordered workqueue for cgroup_destroy_wq.

Hmmm... I'm not really comfortable with this.  This would seal shut
any possiblity of increasing concurrency in that path, which is okay
now but I find the combination of such long term commitment and the
non-obviousness (it's not apparent from looking at memcg code why it
wouldn't deadlock) very unappealing.  Besides, the only reason
offline() is currently called under cgroup_mutex is history.  We can
move it out of cgroup_mutex right now.

But even with offline being called outside cgroup_mutex, IIRC, the
described problem would still be able to deadlock as long as the tree
depth is deeper than max concurrency level of the destruction
workqueue.  Sure, we can give it large enough number but it's
generally nasty.

One thing I don't get is why memcg has such reverse dependency at all.
Why does the parent wait for its descendants to do something during
offline?  Shouldn't it be able to just bail and let whatever
descendant which is stil busy propagate things upwards?  That's a
usual pattern we use to tree shutdowns anyway.  Would that be nasty to
implement in memcg?

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
  2014-02-07 14:04   ` Tejun Heo
@ 2014-02-07 14:37     ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2014-02-07 14:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Fri 07-02-14 09:04:02, Tejun Heo wrote:
> Hello, Hugh.
> 
> On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> > Sometimes the cleanup after memcg hierarchy testing gets stuck in
> > mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> > 
> > There may turn out to be several causes, but a major cause is this: the
> > workitem to offline parent can get run before workitem to offline child;
> > parent's mem_cgroup_reparent_charges() circles around waiting for the
> > child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> > which prevents the child from reaching its mem_cgroup_reparent_charges().
> > 
> > Just use an ordered workqueue for cgroup_destroy_wq.
> 
> Hmmm... I'm not really comfortable with this.  This would seal shut
> any possiblity of increasing concurrency in that path, which is okay
> now but I find the combination of such long term commitment and the
> non-obviousness (it's not apparent from looking at memcg code why it
> wouldn't deadlock) very unappealing.  Besides, the only reason
> offline() is currently called under cgroup_mutex is history.  We can
> move it out of cgroup_mutex right now.
> 
> But even with offline being called outside cgroup_mutex, IIRC, the
> described problem would still be able to deadlock as long as the tree
> depth is deeper than max concurrency level of the destruction
> workqueue.  Sure, we can give it large enough number but it's
> generally nasty.
> 
> One thing I don't get is why memcg has such reverse dependency at all.
> Why does the parent wait for its descendants to do something during
> offline?

Because the parent sees charges of its children but it doesn't see pages
as they are on the LRU of those children. So it cannot reach 0 charges.
We are are assuming that the offlining memcg doesn't have any children
which sounds like a reasonable expectation to me.

> Shouldn't it be able to just bail and let whatever
> descendant which is stil busy propagate things upwards?  That's a
> usual pattern we use to tree shutdowns anyway.  Would that be nasty to
> implement in memcg?

Hmm, this is a bit tricky. We cannot use memcg iterators to reach
children because css_tryget would fail on them. We can use cgroup
iterators instead, alright, and reparent pages from leafs but this all
sounds like a lot of complications.

Another option would be weakening css_offline reparenting and do not
insist on having 0 charges. We want to get rid of as many charges as
possible but do not need to have all of them gone
(http://marc.info/?l=linux-kernel&m=139161412932193&w=2). The last part
would be reparenting to the upmost parent which is still online.

I guess this is implementable but I would prefer Hugh's fix for now and
for stable.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-07 14:37     ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2014-02-07 14:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Fri 07-02-14 09:04:02, Tejun Heo wrote:
> Hello, Hugh.
> 
> On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> > Sometimes the cleanup after memcg hierarchy testing gets stuck in
> > mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> > 
> > There may turn out to be several causes, but a major cause is this: the
> > workitem to offline parent can get run before workitem to offline child;
> > parent's mem_cgroup_reparent_charges() circles around waiting for the
> > child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> > which prevents the child from reaching its mem_cgroup_reparent_charges().
> > 
> > Just use an ordered workqueue for cgroup_destroy_wq.
> 
> Hmmm... I'm not really comfortable with this.  This would seal shut
> any possiblity of increasing concurrency in that path, which is okay
> now but I find the combination of such long term commitment and the
> non-obviousness (it's not apparent from looking at memcg code why it
> wouldn't deadlock) very unappealing.  Besides, the only reason
> offline() is currently called under cgroup_mutex is history.  We can
> move it out of cgroup_mutex right now.
> 
> But even with offline being called outside cgroup_mutex, IIRC, the
> described problem would still be able to deadlock as long as the tree
> depth is deeper than max concurrency level of the destruction
> workqueue.  Sure, we can give it large enough number but it's
> generally nasty.
> 
> One thing I don't get is why memcg has such reverse dependency at all.
> Why does the parent wait for its descendants to do something during
> offline?

Because the parent sees charges of its children but it doesn't see pages
as they are on the LRU of those children. So it cannot reach 0 charges.
We are are assuming that the offlining memcg doesn't have any children
which sounds like a reasonable expectation to me.

> Shouldn't it be able to just bail and let whatever
> descendant which is stil busy propagate things upwards?  That's a
> usual pattern we use to tree shutdowns anyway.  Would that be nasty to
> implement in memcg?

Hmm, this is a bit tricky. We cannot use memcg iterators to reach
children because css_tryget would fail on them. We can use cgroup
iterators instead, alright, and reparent pages from leafs but this all
sounds like a lot of complications.

Another option would be weakening css_offline reparenting and do not
insist on having 0 charges. We want to get rid of as many charges as
possible but do not need to have all of them gone
(http://marc.info/?l=linux-kernel&m=139161412932193&w=2). The last part
would be reparenting to the upmost parent which is still online.

I guess this is implementable but I would prefer Hugh's fix for now and
for stable.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
  2014-02-07 14:37     ` Michal Hocko
  (?)
@ 2014-02-07 15:13       ` Tejun Heo
  -1 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-02-07 15:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

Hello, Michal.

On Fri, Feb 07, 2014 at 03:37:40PM +0100, Michal Hocko wrote:
> Hmm, this is a bit tricky. We cannot use memcg iterators to reach
> children because css_tryget would fail on them. We can use cgroup
> iterators instead, alright, and reparent pages from leafs but this all
> sounds like a lot of complications.

Hmmm... I think we're talking past each other here.  Why would the
parent need to reach down to the children?  Just bail out if it can't
make things down to zero and let the child when it finishes its own
cleaning walk up the tree propagating changes.  ->parent is always
accessible.  Would that be complicated too?

> Another option would be weakening css_offline reparenting and do not
> insist on having 0 charges. We want to get rid of as many charges as
> possible but do not need to have all of them gone
> (http://marc.info/?l=linux-kernel&m=139161412932193&w=2). The last part
> would be reparenting to the upmost parent which is still online.
> 
> I guess this is implementable but I would prefer Hugh's fix for now and
> for stable.

Yeah, for -stable, I think Hugh's patch is good but I really don't
want to keep it long term.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-07 15:13       ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-02-07 15:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

Hello, Michal.

On Fri, Feb 07, 2014 at 03:37:40PM +0100, Michal Hocko wrote:
> Hmm, this is a bit tricky. We cannot use memcg iterators to reach
> children because css_tryget would fail on them. We can use cgroup
> iterators instead, alright, and reparent pages from leafs but this all
> sounds like a lot of complications.

Hmmm... I think we're talking past each other here.  Why would the
parent need to reach down to the children?  Just bail out if it can't
make things down to zero and let the child when it finishes its own
cleaning walk up the tree propagating changes.  ->parent is always
accessible.  Would that be complicated too?

> Another option would be weakening css_offline reparenting and do not
> insist on having 0 charges. We want to get rid of as many charges as
> possible but do not need to have all of them gone
> (http://marc.info/?l=linux-kernel&m=139161412932193&w=2). The last part
> would be reparenting to the upmost parent which is still online.
> 
> I guess this is implementable but I would prefer Hugh's fix for now and
> for stable.

Yeah, for -stable, I think Hugh's patch is good but I really don't
want to keep it long term.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-07 15:13       ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-02-07 15:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

Hello, Michal.

On Fri, Feb 07, 2014 at 03:37:40PM +0100, Michal Hocko wrote:
> Hmm, this is a bit tricky. We cannot use memcg iterators to reach
> children because css_tryget would fail on them. We can use cgroup
> iterators instead, alright, and reparent pages from leafs but this all
> sounds like a lot of complications.

Hmmm... I think we're talking past each other here.  Why would the
parent need to reach down to the children?  Just bail out if it can't
make things down to zero and let the child when it finishes its own
cleaning walk up the tree propagating changes.  ->parent is always
accessible.  Would that be complicated too?

> Another option would be weakening css_offline reparenting and do not
> insist on having 0 charges. We want to get rid of as many charges as
> possible but do not need to have all of them gone
> (http://marc.info/?l=linux-kernel&m=139161412932193&w=2). The last part
> would be reparenting to the upmost parent which is still online.
> 
> I guess this is implementable but I would prefer Hugh's fix for now and
> for stable.

Yeah, for -stable, I think Hugh's patch is good but I really don't
want to keep it long term.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
  2014-02-06 23:56 ` Hugh Dickins
@ 2014-02-07 15:21   ` Tejun Heo
  -1 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-02-07 15:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Filipe Brandenburger, Li Zefan, Andrew Morton, Michal Hocko,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> 
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
> 
> Just use an ordered workqueue for cgroup_destroy_wq.
> 
> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Suggested-by: Filipe Brandenburger <filbranden@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # 3.10+

Applied to cgroup/for-3.14-fixes with comment updated to indicate that
this is temporary.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-07 15:21   ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-02-07 15:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Filipe Brandenburger, Li Zefan, Andrew Morton, Michal Hocko,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> 
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
> 
> Just use an ordered workqueue for cgroup_destroy_wq.
> 
> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Suggested-by: Filipe Brandenburger <filbranden@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # 3.10+

Applied to cgroup/for-3.14-fixes with comment updated to indicate that
this is temporary.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
  2014-02-07 15:13       ` Tejun Heo
@ 2014-02-07 15:28         ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2014-02-07 15:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Fri 07-02-14 10:13:41, Tejun Heo wrote:
> Hello, Michal.
> 
> On Fri, Feb 07, 2014 at 03:37:40PM +0100, Michal Hocko wrote:
> > Hmm, this is a bit tricky. We cannot use memcg iterators to reach
> > children because css_tryget would fail on them. We can use cgroup
> > iterators instead, alright, and reparent pages from leafs but this all
> > sounds like a lot of complications.
> 
> Hmmm... I think we're talking past each other here.  Why would the
> parent need to reach down to the children?  Just bail out if it can't
> make things down to zero and let the child when it finishes its own
> cleaning walk up the tree propagating changes.  ->parent is always
> accessible.  Would that be complicated too?

This would be basically the option #2 bellow.

> > Another option would be weakening css_offline reparenting and do not
> > insist on having 0 charges. We want to get rid of as many charges as
> > possible but do not need to have all of them gone
> > (http://marc.info/?l=linux-kernel&m=139161412932193&w=2). The last part
> > would be reparenting to the upmost parent which is still online.
> > 
> > I guess this is implementable but I would prefer Hugh's fix for now and
> > for stable.
> 
> Yeah, for -stable, I think Hugh's patch is good but I really don't
> want to keep it long term.

Based on our recent discussion regarding css_offline semantic we want to
do some changes in that area. I thought we would simply update comments
but considering this report css_offline needs some changes as well. I
will look at it. The idea is to split mem_cgroup_reparent_charges into
two parts. The core one which drains LRUs and would be called from
mem_cgroup_css_offline and one which loops until all charges are gone
for mem_cgroup_css_free. mem_cgroup_move_parent will need an update as
well. It would have to go up the hierarchy to the first alive parent.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-07 15:28         ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2014-02-07 15:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Fri 07-02-14 10:13:41, Tejun Heo wrote:
> Hello, Michal.
> 
> On Fri, Feb 07, 2014 at 03:37:40PM +0100, Michal Hocko wrote:
> > Hmm, this is a bit tricky. We cannot use memcg iterators to reach
> > children because css_tryget would fail on them. We can use cgroup
> > iterators instead, alright, and reparent pages from leafs but this all
> > sounds like a lot of complications.
> 
> Hmmm... I think we're talking past each other here.  Why would the
> parent need to reach down to the children?  Just bail out if it can't
> make things down to zero and let the child when it finishes its own
> cleaning walk up the tree propagating changes.  ->parent is always
> accessible.  Would that be complicated too?

This would be basically the option #2 bellow.

> > Another option would be weakening css_offline reparenting and do not
> > insist on having 0 charges. We want to get rid of as many charges as
> > possible but do not need to have all of them gone
> > (http://marc.info/?l=linux-kernel&m=139161412932193&w=2). The last part
> > would be reparenting to the upmost parent which is still online.
> > 
> > I guess this is implementable but I would prefer Hugh's fix for now and
> > for stable.
> 
> Yeah, for -stable, I think Hugh's patch is good but I really don't
> want to keep it long term.

Based on our recent discussion regarding css_offline semantic we want to
do some changes in that area. I thought we would simply update comments
but considering this report css_offline needs some changes as well. I
will look at it. The idea is to split mem_cgroup_reparent_charges into
two parts. The core one which drains LRUs and would be called from
mem_cgroup_css_offline and one which loops until all charges are gone
for mem_cgroup_css_free. mem_cgroup_move_parent will need an update as
well. It would have to go up the hierarchy to the first alive parent.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
  2014-02-06 23:56 ` Hugh Dickins
  (?)
@ 2014-02-07 16:43   ` Johannes Weiner
  -1 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2014-02-07 16:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tejun Heo, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Michal Hocko, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> 
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
> 
> Just use an ordered workqueue for cgroup_destroy_wq.
> 
> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Suggested-by: Filipe Brandenburger <filbranden@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # 3.10+

I think this is a good idea for now and -stable:
Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Long-term, I think we may want to get rid of the reparenting in
css_offline() entirely and only do it in the naturally ordered
css_free() callback.  We only reparent in css_offline() because
swapout records pin the css and we don't want to hang on to
potentially gigabytes of unreclaimable (css_tryget() disabled) cache
indefinitely until the last swapout records disappear.

So I'm currently mucking around with the following patch, which drops
the css pin from swapout records and reparents them in css_free().
It's lightly tested and there might well be dragons but I don't see
any fundamental problems with it.

What do you think?

---
 include/linux/page_cgroup.h |   8 ++++
 mm/memcontrol.c             | 101 +++++++++++++-------------------------------
 mm/page_cgroup.c            |  41 ++++++++++++++++++
 3 files changed, 78 insertions(+), 72 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 777a524716db..3ededda8934f 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -111,6 +111,8 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 					unsigned short old, unsigned short new);
 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
+extern unsigned long swap_cgroup_migrate(unsigned short old,
+					 unsigned short new);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
 #else
@@ -127,6 +129,12 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
 	return 0;
 }
 
+static inline unsigned long swap_cgroup_migrate(unsigned short old,
+						unsigned short new)
+{
+	return 0;
+}
+
 static inline int
 swap_cgroup_swapon(int type, unsigned long max_pages)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53385cd4e6f0..e2a0d3986c74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -892,11 +892,9 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 	return val;
 }
 
-static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
+static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg, long nr_pages)
 {
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
+	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], nr_pages);
 }
 
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
@@ -4234,15 +4232,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
 	 */
 
 	unlock_page_cgroup(pc);
-	/*
-	 * even after unlock, we have memcg->res.usage here and this memcg
-	 * will never be freed, so it's safe to call css_get().
-	 */
+
 	memcg_check_events(memcg, page);
-	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
-		mem_cgroup_swap_statistics(memcg, true);
-		css_get(&memcg->css);
-	}
+
+	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+		mem_cgroup_swap_statistics(memcg, 1);
+
 	/*
 	 * Migration does not charge the res_counter for the
 	 * replacement page, so leave it alone when phasing out the
@@ -4351,10 +4346,6 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
 
 	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
 
-	/*
-	 * record memcg information,  if swapout && memcg != NULL,
-	 * css_get() was called in uncharge().
-	 */
 	if (do_swap_account && swapout && memcg)
 		swap_cgroup_record(ent, mem_cgroup_id(memcg));
 }
@@ -4383,8 +4374,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
 		 */
 		if (!mem_cgroup_is_root(memcg))
 			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
-		mem_cgroup_swap_statistics(memcg, false);
-		css_put(&memcg->css);
+		mem_cgroup_swap_statistics(memcg, -1);
 	}
 	rcu_read_unlock();
 }
@@ -4412,20 +4402,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
 	new_id = mem_cgroup_id(to);
 
 	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
-		mem_cgroup_swap_statistics(from, false);
-		mem_cgroup_swap_statistics(to, true);
-		/*
-		 * This function is only called from task migration context now.
-		 * It postpones res_counter and refcount handling till the end
-		 * of task migration(mem_cgroup_clear_mc()) for performance
-		 * improvement. But we cannot postpone css_get(to)  because if
-		 * the process that has been moved to @to does swap-in, the
-		 * refcount of @to might be decreased to 0.
-		 *
-		 * We are in attach() phase, so the cgroup is guaranteed to be
-		 * alive, so we can just call css_get().
-		 */
-		css_get(&to->css);
+		mem_cgroup_swap_statistics(from, -1);
+		mem_cgroup_swap_statistics(to, 1);
 		return 0;
 	}
 	return -EINVAL;
@@ -6611,7 +6589,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	kmem_cgroup_css_offline(memcg);
 
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
-	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
 }
@@ -6619,41 +6596,26 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	unsigned long swaps_moved;
+	struct mem_cgroup *parent;
+
 	/*
-	 * XXX: css_offline() would be where we should reparent all
-	 * memory to prepare the cgroup for destruction.  However,
-	 * memcg does not do css_tryget() and res_counter charging
-	 * under the same RCU lock region, which means that charging
-	 * could race with offlining.  Offlining only happens to
-	 * cgroups with no tasks in them but charges can show up
-	 * without any tasks from the swapin path when the target
-	 * memcg is looked up from the swapout record and not from the
-	 * current task as it usually is.  A race like this can leak
-	 * charges and put pages with stale cgroup pointers into
-	 * circulation:
-	 *
-	 * #0                        #1
-	 *                           lookup_swap_cgroup_id()
-	 *                           rcu_read_lock()
-	 *                           mem_cgroup_lookup()
-	 *                           css_tryget()
-	 *                           rcu_read_unlock()
-	 * disable css_tryget()
-	 * call_rcu()
-	 *   offline_css()
-	 *     reparent_charges()
-	 *                           res_counter_charge()
-	 *                           css_put()
-	 *                             css_free()
-	 *                           pc->mem_cgroup = dead memcg
-	 *                           add page to lru
-	 *
-	 * The bulk of the charges are still moved in offline_css() to
-	 * avoid pinning a lot of pages in case a long-term reference
-	 * like a swapout record is deferring the css_free() to long
-	 * after offlining.  But this makes sure we catch any charges
-	 * made after offlining:
+	 * Migrate all swap entries to the parent.  There are no more
+	 * references to the css, so no new swap out records can show
+	 * up.  Any concurrently faulting pages will either get this
+	 * offline cgroup and charge the faulting task, or get the
+	 * migrated parent id and charge the parent for the in-memory
+	 * page.  uncharge_swap() will balance the res_counter in the
+	 * parent either way, whether it still fixes this group's
+	 * res_counter is irrelevant at this point.
 	 */
+	parent = parent_mem_cgroup(memcg);
+	if (!parent)
+		parent = root_mem_cgroup;
+	swaps_moved = swap_cgroup_migrate(mem_cgroup_id(memcg),
+					  mem_cgroup_id(parent));
+	mem_cgroup_swap_statistics(parent, swaps_moved);
+
 	mem_cgroup_reparent_charges(memcg);
 
 	memcg_destroy_kmem(memcg);
@@ -6966,7 +6928,6 @@ static void __mem_cgroup_clear_mc(void)
 {
 	struct mem_cgroup *from = mc.from;
 	struct mem_cgroup *to = mc.to;
-	int i;
 
 	/* we must uncharge all the leftover precharges from mc.to */
 	if (mc.precharge) {
@@ -6981,16 +6942,13 @@ static void __mem_cgroup_clear_mc(void)
 		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
 		mc.moved_charge = 0;
 	}
-	/* we must fixup refcnts and charges */
+	/* we must fixup charges */
 	if (mc.moved_swap) {
 		/* uncharge swap account from the old cgroup */
 		if (!mem_cgroup_is_root(mc.from))
 			res_counter_uncharge(&mc.from->memsw,
 						PAGE_SIZE * mc.moved_swap);
 
-		for (i = 0; i < mc.moved_swap; i++)
-			css_put(&mc.from->css);
-
 		if (!mem_cgroup_is_root(mc.to)) {
 			/*
 			 * we charged both to->res and to->memsw, so we should
@@ -6999,7 +6957,6 @@ static void __mem_cgroup_clear_mc(void)
 			res_counter_uncharge(&mc.to->res,
 						PAGE_SIZE * mc.moved_swap);
 		}
-		/* we've already done css_get(mc.to) */
 		mc.moved_swap = 0;
 	}
 	memcg_oom_recover(from);
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index cfd162882c00..ca04feaae7fe 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -459,6 +459,47 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
 	return lookup_swap_cgroup(ent, NULL)->id;
 }
 
+/**
+ * swap_cgroup_migrate - migrate all entries of one id to another
+ * @old: old id
+ * @new: new id
+ *
+ * Caller has to be able to deal with swapon/swapoff racing.
+ *
+ * Returns number of migrated entries.
+ */
+unsigned long swap_cgroup_migrate(unsigned short old, unsigned short new)
+{
+	unsigned long migrated = 0;
+	unsigned int type;
+
+	for (type = 0; type < MAX_SWAPFILES; type++) {
+		struct swap_cgroup_ctrl *ctrl;
+		unsigned long flags;
+		unsigned int page;
+
+		ctrl = &swap_cgroup_ctrl[type];
+		spin_lock_irqsave(&ctrl->lock, flags);
+		for (page = 0; page < ctrl->length; page++) {
+			struct swap_cgroup *base;
+			pgoff_t offset;
+
+			base = page_address(ctrl->map[page]);
+			for (offset = 0; offset < SC_PER_PAGE; offset++) {
+				struct swap_cgroup *sc;
+
+				sc = base + offset;
+				if (sc->id == old) {
+					sc->id = new;
+					migrated++;
+				}
+			}
+		}
+		spin_unlock_irqrestore(&ctrl->lock, flags);
+	}
+	return migrated;
+}
+
 int swap_cgroup_swapon(int type, unsigned long max_pages)
 {
 	void *array;
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-07 16:43   ` Johannes Weiner
  0 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2014-02-07 16:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tejun Heo, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Michal Hocko, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> 
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
> 
> Just use an ordered workqueue for cgroup_destroy_wq.
> 
> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Suggested-by: Filipe Brandenburger <filbranden@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org # 3.10+

I think this is a good idea for now and -stable:
Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Long-term, I think we may want to get rid of the reparenting in
css_offline() entirely and only do it in the naturally ordered
css_free() callback.  We only reparent in css_offline() because
swapout records pin the css and we don't want to hang on to
potentially gigabytes of unreclaimable (css_tryget() disabled) cache
indefinitely until the last swapout records disappear.

So I'm currently mucking around with the following patch, which drops
the css pin from swapout records and reparents them in css_free().
It's lightly tested and there might well be dragons but I don't see
any fundamental problems with it.

What do you think?

---
 include/linux/page_cgroup.h |   8 ++++
 mm/memcontrol.c             | 101 +++++++++++++-------------------------------
 mm/page_cgroup.c            |  41 ++++++++++++++++++
 3 files changed, 78 insertions(+), 72 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 777a524716db..3ededda8934f 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -111,6 +111,8 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 					unsigned short old, unsigned short new);
 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
+extern unsigned long swap_cgroup_migrate(unsigned short old,
+					 unsigned short new);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
 #else
@@ -127,6 +129,12 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
 	return 0;
 }
 
+static inline unsigned long swap_cgroup_migrate(unsigned short old,
+						unsigned short new)
+{
+	return 0;
+}
+
 static inline int
 swap_cgroup_swapon(int type, unsigned long max_pages)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53385cd4e6f0..e2a0d3986c74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -892,11 +892,9 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 	return val;
 }
 
-static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
+static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg, long nr_pages)
 {
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
+	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], nr_pages);
 }
 
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
@@ -4234,15 +4232,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
 	 */
 
 	unlock_page_cgroup(pc);
-	/*
-	 * even after unlock, we have memcg->res.usage here and this memcg
-	 * will never be freed, so it's safe to call css_get().
-	 */
+
 	memcg_check_events(memcg, page);
-	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
-		mem_cgroup_swap_statistics(memcg, true);
-		css_get(&memcg->css);
-	}
+
+	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+		mem_cgroup_swap_statistics(memcg, 1);
+
 	/*
 	 * Migration does not charge the res_counter for the
 	 * replacement page, so leave it alone when phasing out the
@@ -4351,10 +4346,6 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
 
 	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
 
-	/*
-	 * record memcg information,  if swapout && memcg != NULL,
-	 * css_get() was called in uncharge().
-	 */
 	if (do_swap_account && swapout && memcg)
 		swap_cgroup_record(ent, mem_cgroup_id(memcg));
 }
@@ -4383,8 +4374,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
 		 */
 		if (!mem_cgroup_is_root(memcg))
 			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
-		mem_cgroup_swap_statistics(memcg, false);
-		css_put(&memcg->css);
+		mem_cgroup_swap_statistics(memcg, -1);
 	}
 	rcu_read_unlock();
 }
@@ -4412,20 +4402,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
 	new_id = mem_cgroup_id(to);
 
 	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
-		mem_cgroup_swap_statistics(from, false);
-		mem_cgroup_swap_statistics(to, true);
-		/*
-		 * This function is only called from task migration context now.
-		 * It postpones res_counter and refcount handling till the end
-		 * of task migration(mem_cgroup_clear_mc()) for performance
-		 * improvement. But we cannot postpone css_get(to)  because if
-		 * the process that has been moved to @to does swap-in, the
-		 * refcount of @to might be decreased to 0.
-		 *
-		 * We are in attach() phase, so the cgroup is guaranteed to be
-		 * alive, so we can just call css_get().
-		 */
-		css_get(&to->css);
+		mem_cgroup_swap_statistics(from, -1);
+		mem_cgroup_swap_statistics(to, 1);
 		return 0;
 	}
 	return -EINVAL;
@@ -6611,7 +6589,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	kmem_cgroup_css_offline(memcg);
 
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
-	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
 }
@@ -6619,41 +6596,26 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	unsigned long swaps_moved;
+	struct mem_cgroup *parent;
+
 	/*
-	 * XXX: css_offline() would be where we should reparent all
-	 * memory to prepare the cgroup for destruction.  However,
-	 * memcg does not do css_tryget() and res_counter charging
-	 * under the same RCU lock region, which means that charging
-	 * could race with offlining.  Offlining only happens to
-	 * cgroups with no tasks in them but charges can show up
-	 * without any tasks from the swapin path when the target
-	 * memcg is looked up from the swapout record and not from the
-	 * current task as it usually is.  A race like this can leak
-	 * charges and put pages with stale cgroup pointers into
-	 * circulation:
-	 *
-	 * #0                        #1
-	 *                           lookup_swap_cgroup_id()
-	 *                           rcu_read_lock()
-	 *                           mem_cgroup_lookup()
-	 *                           css_tryget()
-	 *                           rcu_read_unlock()
-	 * disable css_tryget()
-	 * call_rcu()
-	 *   offline_css()
-	 *     reparent_charges()
-	 *                           res_counter_charge()
-	 *                           css_put()
-	 *                             css_free()
-	 *                           pc->mem_cgroup = dead memcg
-	 *                           add page to lru
-	 *
-	 * The bulk of the charges are still moved in offline_css() to
-	 * avoid pinning a lot of pages in case a long-term reference
-	 * like a swapout record is deferring the css_free() to long
-	 * after offlining.  But this makes sure we catch any charges
-	 * made after offlining:
+	 * Migrate all swap entries to the parent.  There are no more
+	 * references to the css, so no new swap out records can show
+	 * up.  Any concurrently faulting pages will either get this
+	 * offline cgroup and charge the faulting task, or get the
+	 * migrated parent id and charge the parent for the in-memory
+	 * page.  uncharge_swap() will balance the res_counter in the
+	 * parent either way, whether it still fixes this group's
+	 * res_counter is irrelevant at this point.
 	 */
+	parent = parent_mem_cgroup(memcg);
+	if (!parent)
+		parent = root_mem_cgroup;
+	swaps_moved = swap_cgroup_migrate(mem_cgroup_id(memcg),
+					  mem_cgroup_id(parent));
+	mem_cgroup_swap_statistics(parent, swaps_moved);
+
 	mem_cgroup_reparent_charges(memcg);
 
 	memcg_destroy_kmem(memcg);
@@ -6966,7 +6928,6 @@ static void __mem_cgroup_clear_mc(void)
 {
 	struct mem_cgroup *from = mc.from;
 	struct mem_cgroup *to = mc.to;
-	int i;
 
 	/* we must uncharge all the leftover precharges from mc.to */
 	if (mc.precharge) {
@@ -6981,16 +6942,13 @@ static void __mem_cgroup_clear_mc(void)
 		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
 		mc.moved_charge = 0;
 	}
-	/* we must fixup refcnts and charges */
+	/* we must fixup charges */
 	if (mc.moved_swap) {
 		/* uncharge swap account from the old cgroup */
 		if (!mem_cgroup_is_root(mc.from))
 			res_counter_uncharge(&mc.from->memsw,
 						PAGE_SIZE * mc.moved_swap);
 
-		for (i = 0; i < mc.moved_swap; i++)
-			css_put(&mc.from->css);
-
 		if (!mem_cgroup_is_root(mc.to)) {
 			/*
 			 * we charged both to->res and to->memsw, so we should
@@ -6999,7 +6957,6 @@ static void __mem_cgroup_clear_mc(void)
 			res_counter_uncharge(&mc.to->res,
 						PAGE_SIZE * mc.moved_swap);
 		}
-		/* we've already done css_get(mc.to) */
 		mc.moved_swap = 0;
 	}
 	memcg_oom_recover(from);
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index cfd162882c00..ca04feaae7fe 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -459,6 +459,47 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
 	return lookup_swap_cgroup(ent, NULL)->id;
 }
 
+/**
+ * swap_cgroup_migrate - migrate all entries of one id to another
+ * @old: old id
+ * @new: new id
+ *
+ * Caller has to be able to deal with swapon/swapoff racing.
+ *
+ * Returns number of migrated entries.
+ */
+unsigned long swap_cgroup_migrate(unsigned short old, unsigned short new)
+{
+	unsigned long migrated = 0;
+	unsigned int type;
+
+	for (type = 0; type < MAX_SWAPFILES; type++) {
+		struct swap_cgroup_ctrl *ctrl;
+		unsigned long flags;
+		unsigned int page;
+
+		ctrl = &swap_cgroup_ctrl[type];
+		spin_lock_irqsave(&ctrl->lock, flags);
+		for (page = 0; page < ctrl->length; page++) {
+			struct swap_cgroup *base;
+			pgoff_t offset;
+
+			base = page_address(ctrl->map[page]);
+			for (offset = 0; offset < SC_PER_PAGE; offset++) {
+				struct swap_cgroup *sc;
+
+				sc = base + offset;
+				if (sc->id == old) {
+					sc->id = new;
+					migrated++;
+				}
+			}
+		}
+		spin_unlock_irqrestore(&ctrl->lock, flags);
+	}
+	return migrated;
+}
+
 int swap_cgroup_swapon(int type, unsigned long max_pages)
 {
 	void *array;
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-07 16:43   ` Johannes Weiner
  0 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2014-02-07 16:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tejun Heo, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Michal Hocko, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> 
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
> 
> Just use an ordered workqueue for cgroup_destroy_wq.
> 
> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Suggested-by: Filipe Brandenburger <filbranden-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org # 3.10+

I think this is a good idea for now and -stable:
Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>

Long-term, I think we may want to get rid of the reparenting in
css_offline() entirely and only do it in the naturally ordered
css_free() callback.  We only reparent in css_offline() because
swapout records pin the css and we don't want to hang on to
potentially gigabytes of unreclaimable (css_tryget() disabled) cache
indefinitely until the last swapout records disappear.

So I'm currently mucking around with the following patch, which drops
the css pin from swapout records and reparents them in css_free().
It's lightly tested and there might well be dragons but I don't see
any fundamental problems with it.

What do you think?

---
 include/linux/page_cgroup.h |   8 ++++
 mm/memcontrol.c             | 101 +++++++++++++-------------------------------
 mm/page_cgroup.c            |  41 ++++++++++++++++++
 3 files changed, 78 insertions(+), 72 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 777a524716db..3ededda8934f 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -111,6 +111,8 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 					unsigned short old, unsigned short new);
 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
+extern unsigned long swap_cgroup_migrate(unsigned short old,
+					 unsigned short new);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
 #else
@@ -127,6 +129,12 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
 	return 0;
 }
 
+static inline unsigned long swap_cgroup_migrate(unsigned short old,
+						unsigned short new)
+{
+	return 0;
+}
+
 static inline int
 swap_cgroup_swapon(int type, unsigned long max_pages)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53385cd4e6f0..e2a0d3986c74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -892,11 +892,9 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 	return val;
 }
 
-static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
+static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg, long nr_pages)
 {
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
+	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], nr_pages);
 }
 
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
@@ -4234,15 +4232,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
 	 */
 
 	unlock_page_cgroup(pc);
-	/*
-	 * even after unlock, we have memcg->res.usage here and this memcg
-	 * will never be freed, so it's safe to call css_get().
-	 */
+
 	memcg_check_events(memcg, page);
-	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
-		mem_cgroup_swap_statistics(memcg, true);
-		css_get(&memcg->css);
-	}
+
+	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+		mem_cgroup_swap_statistics(memcg, 1);
+
 	/*
 	 * Migration does not charge the res_counter for the
 	 * replacement page, so leave it alone when phasing out the
@@ -4351,10 +4346,6 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
 
 	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
 
-	/*
-	 * record memcg information,  if swapout && memcg != NULL,
-	 * css_get() was called in uncharge().
-	 */
 	if (do_swap_account && swapout && memcg)
 		swap_cgroup_record(ent, mem_cgroup_id(memcg));
 }
@@ -4383,8 +4374,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
 		 */
 		if (!mem_cgroup_is_root(memcg))
 			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
-		mem_cgroup_swap_statistics(memcg, false);
-		css_put(&memcg->css);
+		mem_cgroup_swap_statistics(memcg, -1);
 	}
 	rcu_read_unlock();
 }
@@ -4412,20 +4402,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
 	new_id = mem_cgroup_id(to);
 
 	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
-		mem_cgroup_swap_statistics(from, false);
-		mem_cgroup_swap_statistics(to, true);
-		/*
-		 * This function is only called from task migration context now.
-		 * It postpones res_counter and refcount handling till the end
-		 * of task migration(mem_cgroup_clear_mc()) for performance
-		 * improvement. But we cannot postpone css_get(to)  because if
-		 * the process that has been moved to @to does swap-in, the
-		 * refcount of @to might be decreased to 0.
-		 *
-		 * We are in attach() phase, so the cgroup is guaranteed to be
-		 * alive, so we can just call css_get().
-		 */
-		css_get(&to->css);
+		mem_cgroup_swap_statistics(from, -1);
+		mem_cgroup_swap_statistics(to, 1);
 		return 0;
 	}
 	return -EINVAL;
@@ -6611,7 +6589,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	kmem_cgroup_css_offline(memcg);
 
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
-	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
 }
@@ -6619,41 +6596,26 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	unsigned long swaps_moved;
+	struct mem_cgroup *parent;
+
 	/*
-	 * XXX: css_offline() would be where we should reparent all
-	 * memory to prepare the cgroup for destruction.  However,
-	 * memcg does not do css_tryget() and res_counter charging
-	 * under the same RCU lock region, which means that charging
-	 * could race with offlining.  Offlining only happens to
-	 * cgroups with no tasks in them but charges can show up
-	 * without any tasks from the swapin path when the target
-	 * memcg is looked up from the swapout record and not from the
-	 * current task as it usually is.  A race like this can leak
-	 * charges and put pages with stale cgroup pointers into
-	 * circulation:
-	 *
-	 * #0                        #1
-	 *                           lookup_swap_cgroup_id()
-	 *                           rcu_read_lock()
-	 *                           mem_cgroup_lookup()
-	 *                           css_tryget()
-	 *                           rcu_read_unlock()
-	 * disable css_tryget()
-	 * call_rcu()
-	 *   offline_css()
-	 *     reparent_charges()
-	 *                           res_counter_charge()
-	 *                           css_put()
-	 *                             css_free()
-	 *                           pc->mem_cgroup = dead memcg
-	 *                           add page to lru
-	 *
-	 * The bulk of the charges are still moved in offline_css() to
-	 * avoid pinning a lot of pages in case a long-term reference
-	 * like a swapout record is deferring the css_free() to long
-	 * after offlining.  But this makes sure we catch any charges
-	 * made after offlining:
+	 * Migrate all swap entries to the parent.  There are no more
+	 * references to the css, so no new swap out records can show
+	 * up.  Any concurrently faulting pages will either get this
+	 * offline cgroup and charge the faulting task, or get the
+	 * migrated parent id and charge the parent for the in-memory
+	 * page.  uncharge_swap() will balance the res_counter in the
+	 * parent either way, whether it still fixes this group's
+	 * res_counter is irrelevant at this point.
 	 */
+	parent = parent_mem_cgroup(memcg);
+	if (!parent)
+		parent = root_mem_cgroup;
+	swaps_moved = swap_cgroup_migrate(mem_cgroup_id(memcg),
+					  mem_cgroup_id(parent));
+	mem_cgroup_swap_statistics(parent, swaps_moved);
+
 	mem_cgroup_reparent_charges(memcg);
 
 	memcg_destroy_kmem(memcg);
@@ -6966,7 +6928,6 @@ static void __mem_cgroup_clear_mc(void)
 {
 	struct mem_cgroup *from = mc.from;
 	struct mem_cgroup *to = mc.to;
-	int i;
 
 	/* we must uncharge all the leftover precharges from mc.to */
 	if (mc.precharge) {
@@ -6981,16 +6942,13 @@ static void __mem_cgroup_clear_mc(void)
 		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
 		mc.moved_charge = 0;
 	}
-	/* we must fixup refcnts and charges */
+	/* we must fixup charges */
 	if (mc.moved_swap) {
 		/* uncharge swap account from the old cgroup */
 		if (!mem_cgroup_is_root(mc.from))
 			res_counter_uncharge(&mc.from->memsw,
 						PAGE_SIZE * mc.moved_swap);
 
-		for (i = 0; i < mc.moved_swap; i++)
-			css_put(&mc.from->css);
-
 		if (!mem_cgroup_is_root(mc.to)) {
 			/*
 			 * we charged both to->res and to->memsw, so we should
@@ -6999,7 +6957,6 @@ static void __mem_cgroup_clear_mc(void)
 			res_counter_uncharge(&mc.to->res,
 						PAGE_SIZE * mc.moved_swap);
 		}
-		/* we've already done css_get(mc.to) */
 		mc.moved_swap = 0;
 	}
 	memcg_oom_recover(from);
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index cfd162882c00..ca04feaae7fe 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -459,6 +459,47 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
 	return lookup_swap_cgroup(ent, NULL)->id;
 }
 
+/**
+ * swap_cgroup_migrate - migrate all entries of one id to another
+ * @old: old id
+ * @new: new id
+ *
+ * Caller has to be able to deal with swapon/swapoff racing.
+ *
+ * Returns number of migrated entries.
+ */
+unsigned long swap_cgroup_migrate(unsigned short old, unsigned short new)
+{
+	unsigned long migrated = 0;
+	unsigned int type;
+
+	for (type = 0; type < MAX_SWAPFILES; type++) {
+		struct swap_cgroup_ctrl *ctrl;
+		unsigned long flags;
+		unsigned int page;
+
+		ctrl = &swap_cgroup_ctrl[type];
+		spin_lock_irqsave(&ctrl->lock, flags);
+		for (page = 0; page < ctrl->length; page++) {
+			struct swap_cgroup *base;
+			pgoff_t offset;
+
+			base = page_address(ctrl->map[page]);
+			for (offset = 0; offset < SC_PER_PAGE; offset++) {
+				struct swap_cgroup *sc;
+
+				sc = base + offset;
+				if (sc->id == old) {
+					sc->id = new;
+					migrated++;
+				}
+			}
+		}
+		spin_unlock_irqrestore(&ctrl->lock, flags);
+	}
+	return migrated;
+}
+
 int swap_cgroup_swapon(int type, unsigned long max_pages)
 {
 	void *array;
-- 
1.8.5.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
  2014-02-07 14:04   ` Tejun Heo
  (?)
@ 2014-02-07 20:20     ` Hugh Dickins
  -1 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-07 20:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Michal Hocko, Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

Hi Tejun,

On Fri, 7 Feb 2014, Tejun Heo wrote:
> On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> > Sometimes the cleanup after memcg hierarchy testing gets stuck in
> > mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> > 
> > There may turn out to be several causes, but a major cause is this: the
> > workitem to offline parent can get run before workitem to offline child;
> > parent's mem_cgroup_reparent_charges() circles around waiting for the
> > child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> > which prevents the child from reaching its mem_cgroup_reparent_charges().
> > 
> > Just use an ordered workqueue for cgroup_destroy_wq.
> 
> Hmmm... I'm not really comfortable with this.  This would seal shut
> any possiblity of increasing concurrency in that path, which is okay
> now but I find the combination of such long term commitment and the
> non-obviousness (it's not apparent from looking at memcg code why it
> wouldn't deadlock) very unappealing.  Besides, the only reason
> offline() is currently called under cgroup_mutex is history.  We can
> move it out of cgroup_mutex right now.

Thanks for taking the patch into your tree for now,
and thanks to Michal and Hannes for supporting it.

Yes, we're not sealing a door shut with this one-liner.  My first
reaction to the deadlock was indeed, what's the cgroup_mutex for here?
and I've seen enough deadlocks on cgroup_mutex (though most from this
issue, I now believe) to welcome the idea of reducing its blanket use.

But I think there are likely to be bumps along that road (just as
there have been along the workqueue-ification road), so this ordered
workqueue appears much the safer option for now.  Please rip it out
again when the cgroup_mutex is safely removed from this path.

(I've certainly written memcg code myself that "knows" it's already
serialized by cgroup_mutex at the outer level: I think code that
never reached anyone else's tree, but I'm not certain of that.)

> 
> But even with offline being called outside cgroup_mutex, IIRC, the
> described problem would still be able to deadlock as long as the tree
> depth is deeper than max concurrency level of the destruction
> workqueue.  Sure, we can give it large enough number but it's
> generally nasty.

You worry me there: I certainly don't want to be introducing new
deadlocks.  You understand workqueues much better than most of us: I'm
not sure what "max concurrency level of the destruction workqueue" is,
but it sounds uncomfortably like an ordered workqueue's max_active 1.

You don't return to this concern in the following mails of the thread:
did you later decide that it actually won't be a problem?  I'll assume
so for the moment, since you took the patch, but please reassure me.

> 
> One thing I don't get is why memcg has such reverse dependency at all.
> Why does the parent wait for its descendants to do something during
> offline?  Shouldn't it be able to just bail and let whatever
> descendant which is stil busy propagate things upwards?  That's a
> usual pattern we use to tree shutdowns anyway.  Would that be nasty to
> implement in memcg?

I've no idea how nasty it would be to change memcg around, but Michal
and Hannes appear very open to doing so.  I do think that memcg's current
expectation is very reasonable: it's perfectly normal that a rmdir cannot
succeed until the directory is empty, and to depend upon that fact; but
the use of workqueue made some things asynchronous which were not before,
which has led to some surprises.

Hugh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-07 20:20     ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-07 20:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Michal Hocko, Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

Hi Tejun,

On Fri, 7 Feb 2014, Tejun Heo wrote:
> On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> > Sometimes the cleanup after memcg hierarchy testing gets stuck in
> > mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> > 
> > There may turn out to be several causes, but a major cause is this: the
> > workitem to offline parent can get run before workitem to offline child;
> > parent's mem_cgroup_reparent_charges() circles around waiting for the
> > child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> > which prevents the child from reaching its mem_cgroup_reparent_charges().
> > 
> > Just use an ordered workqueue for cgroup_destroy_wq.
> 
> Hmmm... I'm not really comfortable with this.  This would seal shut
> any possiblity of increasing concurrency in that path, which is okay
> now but I find the combination of such long term commitment and the
> non-obviousness (it's not apparent from looking at memcg code why it
> wouldn't deadlock) very unappealing.  Besides, the only reason
> offline() is currently called under cgroup_mutex is history.  We can
> move it out of cgroup_mutex right now.

Thanks for taking the patch into your tree for now,
and thanks to Michal and Hannes for supporting it.

Yes, we're not sealing a door shut with this one-liner.  My first
reaction to the deadlock was indeed, what's the cgroup_mutex for here?
and I've seen enough deadlocks on cgroup_mutex (though most from this
issue, I now believe) to welcome the idea of reducing its blanket use.

But I think there are likely to be bumps along that road (just as
there have been along the workqueue-ification road), so this ordered
workqueue appears much the safer option for now.  Please rip it out
again when the cgroup_mutex is safely removed from this path.

(I've certainly written memcg code myself that "knows" it's already
serialized by cgroup_mutex at the outer level: I think code that
never reached anyone else's tree, but I'm not certain of that.)

> 
> But even with offline being called outside cgroup_mutex, IIRC, the
> described problem would still be able to deadlock as long as the tree
> depth is deeper than max concurrency level of the destruction
> workqueue.  Sure, we can give it large enough number but it's
> generally nasty.

You worry me there: I certainly don't want to be introducing new
deadlocks.  You understand workqueues much better than most of us: I'm
not sure what "max concurrency level of the destruction workqueue" is,
but it sounds uncomfortably like an ordered workqueue's max_active 1.

You don't return to this concern in the following mails of the thread:
did you later decide that it actually won't be a problem?  I'll assume
so for the moment, since you took the patch, but please reassure me.

> 
> One thing I don't get is why memcg has such reverse dependency at all.
> Why does the parent wait for its descendants to do something during
> offline?  Shouldn't it be able to just bail and let whatever
> descendant which is stil busy propagate things upwards?  That's a
> usual pattern we use to tree shutdowns anyway.  Would that be nasty to
> implement in memcg?

I've no idea how nasty it would be to change memcg around, but Michal
and Hannes appear very open to doing so.  I do think that memcg's current
expectation is very reasonable: it's perfectly normal that a rmdir cannot
succeed until the directory is empty, and to depend upon that fact; but
the use of workqueue made some things asynchronous which were not before,
which has led to some surprises.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-07 20:20     ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-07 20:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Michal Hocko, Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

Hi Tejun,

On Fri, 7 Feb 2014, Tejun Heo wrote:
> On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> > Sometimes the cleanup after memcg hierarchy testing gets stuck in
> > mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> > 
> > There may turn out to be several causes, but a major cause is this: the
> > workitem to offline parent can get run before workitem to offline child;
> > parent's mem_cgroup_reparent_charges() circles around waiting for the
> > child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> > which prevents the child from reaching its mem_cgroup_reparent_charges().
> > 
> > Just use an ordered workqueue for cgroup_destroy_wq.
> 
> Hmmm... I'm not really comfortable with this.  This would seal shut
> any possiblity of increasing concurrency in that path, which is okay
> now but I find the combination of such long term commitment and the
> non-obviousness (it's not apparent from looking at memcg code why it
> wouldn't deadlock) very unappealing.  Besides, the only reason
> offline() is currently called under cgroup_mutex is history.  We can
> move it out of cgroup_mutex right now.

Thanks for taking the patch into your tree for now,
and thanks to Michal and Hannes for supporting it.

Yes, we're not sealing a door shut with this one-liner.  My first
reaction to the deadlock was indeed, what's the cgroup_mutex for here?
and I've seen enough deadlocks on cgroup_mutex (though most from this
issue, I now believe) to welcome the idea of reducing its blanket use.

But I think there are likely to be bumps along that road (just as
there have been along the workqueue-ification road), so this ordered
workqueue appears much the safer option for now.  Please rip it out
again when the cgroup_mutex is safely removed from this path.

(I've certainly written memcg code myself that "knows" it's already
serialized by cgroup_mutex at the outer level: I think code that
never reached anyone else's tree, but I'm not certain of that.)

> 
> But even with offline being called outside cgroup_mutex, IIRC, the
> described problem would still be able to deadlock as long as the tree
> depth is deeper than max concurrency level of the destruction
> workqueue.  Sure, we can give it large enough number but it's
> generally nasty.

You worry me there: I certainly don't want to be introducing new
deadlocks.  You understand workqueues much better than most of us: I'm
not sure what "max concurrency level of the destruction workqueue" is,
but it sounds uncomfortably like an ordered workqueue's max_active 1.

You don't return to this concern in the following mails of the thread:
did you later decide that it actually won't be a problem?  I'll assume
so for the moment, since you took the patch, but please reassure me.

> 
> One thing I don't get is why memcg has such reverse dependency at all.
> Why does the parent wait for its descendants to do something during
> offline?  Shouldn't it be able to just bail and let whatever
> descendant which is stil busy propagate things upwards?  That's a
> usual pattern we use to tree shutdowns anyway.  Would that be nasty to
> implement in memcg?

I've no idea how nasty it would be to change memcg around, but Michal
and Hannes appear very open to doing so.  I do think that memcg's current
expectation is very reasonable: it's perfectly normal that a rmdir cannot
succeed until the directory is empty, and to depend upon that fact; but
the use of workqueue made some things asynchronous which were not before,
which has led to some surprises.

Hugh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
  2014-02-07 20:20     ` Hugh Dickins
@ 2014-02-07 20:35       ` Tejun Heo
  -1 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-02-07 20:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Filipe Brandenburger, Li Zefan, Andrew Morton, Michal Hocko,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

Hello, Hugh.

On Fri, Feb 07, 2014 at 12:20:44PM -0800, Hugh Dickins wrote:
> > But even with offline being called outside cgroup_mutex, IIRC, the
> > described problem would still be able to deadlock as long as the tree
> > depth is deeper than max concurrency level of the destruction
> > workqueue.  Sure, we can give it large enough number but it's
> > generally nasty.
> 
> You worry me there: I certainly don't want to be introducing new
> deadlocks.  You understand workqueues much better than most of us: I'm
> not sure what "max concurrency level of the destruction workqueue" is,
> but it sounds uncomfortably like an ordered workqueue's max_active 1.

Ooh, max_active is always a finite number.  The only reason we usually
don't worry about it is because they are large enough for the existing
dependency chains to cause deadlocks.  The theoretical problem with
cgroup is that the dependency chain can grow arbitrarily long and
multiple removals along different subhierarchies can overlap which
means that there can be multiple long dependency chains among work
items.  The probability would be extremely low but deadlock might be
possible even with relatively high max_active.

Besides, the reason we reduced max_active in the first place was
because destruction work items tend to just stack up without any
actual concurrency benefits, so increasing concurrncy level seems a
bit nasty to me (but probably a lot of those traffic jam was from
cgroup_mutex and once we take that out of the picture, it could become
fine).

> You don't return to this concern in the following mails of the thread:
> did you later decide that it actually won't be a problem?  I'll assume
> so for the moment, since you took the patch, but please reassure me.

I was just worrying about a different solution where we take
css_offline invocation outside of cgroup_mutex and bumping up
max_active.  There's nothing to worry about your patch.  Sorry about
not being clear.  :)

> > One thing I don't get is why memcg has such reverse dependency at all.
> > Why does the parent wait for its descendants to do something during
> > offline?  Shouldn't it be able to just bail and let whatever
> > descendant which is stil busy propagate things upwards?  That's a
> > usual pattern we use to tree shutdowns anyway.  Would that be nasty to
> > implement in memcg?
> 
> I've no idea how nasty it would be to change memcg around, but Michal
> and Hannes appear very open to doing so.  I do think that memcg's current
> expectation is very reasonable: it's perfectly normal that a rmdir cannot
> succeed until the directory is empty, and to depend upon that fact; but
> the use of workqueue made some things asynchronous which were not before,
> which has led to some surprises.

Maybe.  The thing is that ->css_offline() isn't really comparable to
rmdir.  ->css_free() is and is fully ordered through refcnts as one
would expect.  Whether ->css_offline() should be ordered similarly so
that the parent's offline is called iff all its children finished
offlining, I'm not sure.  Maybe it'd be something nice to have but I
kinda wanna keep the offline hook and its usages simple and limited.
It's not where the actual destruction should happen.  It's just a
notification to get ready.

Looks like Johannes's patch is headed towards that direction - moving
destruction from ->css_offline to ->css_free(), so if that works out,
I think we should be good for the time being.

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-07 20:35       ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-02-07 20:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Filipe Brandenburger, Li Zefan, Andrew Morton, Michal Hocko,
	Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

Hello, Hugh.

On Fri, Feb 07, 2014 at 12:20:44PM -0800, Hugh Dickins wrote:
> > But even with offline being called outside cgroup_mutex, IIRC, the
> > described problem would still be able to deadlock as long as the tree
> > depth is deeper than max concurrency level of the destruction
> > workqueue.  Sure, we can give it large enough number but it's
> > generally nasty.
> 
> You worry me there: I certainly don't want to be introducing new
> deadlocks.  You understand workqueues much better than most of us: I'm
> not sure what "max concurrency level of the destruction workqueue" is,
> but it sounds uncomfortably like an ordered workqueue's max_active 1.

Ooh, max_active is always a finite number.  The only reason we usually
don't worry about it is because they are large enough for the existing
dependency chains to cause deadlocks.  The theoretical problem with
cgroup is that the dependency chain can grow arbitrarily long and
multiple removals along different subhierarchies can overlap which
means that there can be multiple long dependency chains among work
items.  The probability would be extremely low but deadlock might be
possible even with relatively high max_active.

Besides, the reason we reduced max_active in the first place was
because destruction work items tend to just stack up without any
actual concurrency benefits, so increasing concurrncy level seems a
bit nasty to me (but probably a lot of those traffic jam was from
cgroup_mutex and once we take that out of the picture, it could become
fine).

> You don't return to this concern in the following mails of the thread:
> did you later decide that it actually won't be a problem?  I'll assume
> so for the moment, since you took the patch, but please reassure me.

I was just worrying about a different solution where we take
css_offline invocation outside of cgroup_mutex and bumping up
max_active.  There's nothing to worry about your patch.  Sorry about
not being clear.  :)

> > One thing I don't get is why memcg has such reverse dependency at all.
> > Why does the parent wait for its descendants to do something during
> > offline?  Shouldn't it be able to just bail and let whatever
> > descendant which is stil busy propagate things upwards?  That's a
> > usual pattern we use to tree shutdowns anyway.  Would that be nasty to
> > implement in memcg?
> 
> I've no idea how nasty it would be to change memcg around, but Michal
> and Hannes appear very open to doing so.  I do think that memcg's current
> expectation is very reasonable: it's perfectly normal that a rmdir cannot
> succeed until the directory is empty, and to depend upon that fact; but
> the use of workqueue made some things asynchronous which were not before,
> which has led to some surprises.

Maybe.  The thing is that ->css_offline() isn't really comparable to
rmdir.  ->css_free() is and is fully ordered through refcnts as one
would expect.  Whether ->css_offline() should be ordered similarly so
that the parent's offline is called iff all its children finished
offlining, I'm not sure.  Maybe it'd be something nice to have but I
kinda wanna keep the offline hook and its usages simple and limited.
It's not where the actual destruction should happen.  It's just a
notification to get ready.

Looks like Johannes's patch is headed towards that direction - moving
destruction from ->css_offline to ->css_free(), so if that works out,
I think we should be good for the time being.

Thanks!

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
  2014-02-07 20:35       ` Tejun Heo
@ 2014-02-07 21:06         ` Hugh Dickins
  -1 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-07 21:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Michal Hocko, Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Fri, 7 Feb 2014, Tejun Heo wrote:
> On Fri, Feb 07, 2014 at 12:20:44PM -0800, Hugh Dickins wrote:
> 
> > You don't return to this concern in the following mails of the thread:
> > did you later decide that it actually won't be a problem?  I'll assume
> > so for the moment, since you took the patch, but please reassure me.
> 
> I was just worrying about a different solution where we take
> css_offline invocation outside of cgroup_mutex and bumping up
> max_active.  There's nothing to worry about your patch.  Sorry about
> not being clear.  :)

Thanks a lot for your detailed and admirably responsive explanations.
You were looking ahead, I see that now, and I'm gratefully reassured :)

Hugh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-07 21:06         ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-07 21:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hugh Dickins, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Michal Hocko, Johannes Weiner, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Fri, 7 Feb 2014, Tejun Heo wrote:
> On Fri, Feb 07, 2014 at 12:20:44PM -0800, Hugh Dickins wrote:
> 
> > You don't return to this concern in the following mails of the thread:
> > did you later decide that it actually won't be a problem?  I'll assume
> > so for the moment, since you took the patch, but please reassure me.
> 
> I was just worrying about a different solution where we take
> css_offline invocation outside of cgroup_mutex and bumping up
> max_active.  There's nothing to worry about your patch.  Sorry about
> not being clear.  :)

Thanks a lot for your detailed and admirably responsive explanations.
You were looking ahead, I see that now, and I'm gratefully reassured :)

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
  2014-02-07 16:43   ` Johannes Weiner
  (?)
@ 2014-02-10 15:46     ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2014-02-10 15:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Tejun Heo, Filipe Brandenburger, Li Zefan,
	Andrew Morton, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Fri 07-02-14 11:43:21, Johannes Weiner wrote:
[...]
> Long-term, I think we may want to get rid of the reparenting in
> css_offline() entirely and only do it in the naturally ordered
> css_free() callback.  We only reparent in css_offline() because
> swapout records pin the css and we don't want to hang on to
> potentially gigabytes of unreclaimable (css_tryget() disabled) cache
> indefinitely until the last swapout records disappear.
> 
> So I'm currently mucking around with the following patch, which drops
> the css pin from swapout records and reparents them in css_free().
> It's lightly tested and there might well be dragons but I don't see
> any fundamental problems with it.
> 
> What do you think?

Hugh has posted something like this back in December
(http://marc.info/?l=linux-mm&m=138718299304670&w=2).
I am not entirely happy about scanning potentially huge amount of 
swap records.

A trivial optimization would check the memsw counter and break out early
but it would still leave a potentially full scan possible. I guess we
shouldn't care much about when css_free is called. I think we should
split the reparenting into two parts. LRU reparent without any hard
requirements and charges reparent which guarantees that nothing is left
behind.
The first one called css_offline and the second one from css_free.

kmem accounting pins memcg as well btw so taking care of swap is not
sufficient to guarantee an early css_free.

> ---
>  include/linux/page_cgroup.h |   8 ++++
>  mm/memcontrol.c             | 101 +++++++++++++-------------------------------
>  mm/page_cgroup.c            |  41 ++++++++++++++++++
>  3 files changed, 78 insertions(+), 72 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 777a524716db..3ededda8934f 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -111,6 +111,8 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
>  					unsigned short old, unsigned short new);
>  extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
>  extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
> +extern unsigned long swap_cgroup_migrate(unsigned short old,
> +					 unsigned short new);
>  extern int swap_cgroup_swapon(int type, unsigned long max_pages);
>  extern void swap_cgroup_swapoff(int type);
>  #else
> @@ -127,6 +129,12 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
>  	return 0;
>  }
>  
> +static inline unsigned long swap_cgroup_migrate(unsigned short old,
> +						unsigned short new)
> +{
> +	return 0;
> +}
> +
>  static inline int
>  swap_cgroup_swapon(int type, unsigned long max_pages)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 53385cd4e6f0..e2a0d3986c74 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -892,11 +892,9 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
>  	return val;
>  }
>  
> -static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
> -					 bool charge)
> +static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg, long nr_pages)
>  {
> -	int val = (charge) ? 1 : -1;
> -	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
> +	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], nr_pages);
>  }
>  
>  static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
> @@ -4234,15 +4232,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
>  	 */
>  
>  	unlock_page_cgroup(pc);
> -	/*
> -	 * even after unlock, we have memcg->res.usage here and this memcg
> -	 * will never be freed, so it's safe to call css_get().
> -	 */
> +
>  	memcg_check_events(memcg, page);
> -	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
> -		mem_cgroup_swap_statistics(memcg, true);
> -		css_get(&memcg->css);
> -	}
> +
> +	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> +		mem_cgroup_swap_statistics(memcg, 1);
> +
>  	/*
>  	 * Migration does not charge the res_counter for the
>  	 * replacement page, so leave it alone when phasing out the
> @@ -4351,10 +4346,6 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
>  
>  	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
>  
> -	/*
> -	 * record memcg information,  if swapout && memcg != NULL,
> -	 * css_get() was called in uncharge().
> -	 */
>  	if (do_swap_account && swapout && memcg)
>  		swap_cgroup_record(ent, mem_cgroup_id(memcg));
>  }
> @@ -4383,8 +4374,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
>  		 */
>  		if (!mem_cgroup_is_root(memcg))
>  			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> -		mem_cgroup_swap_statistics(memcg, false);
> -		css_put(&memcg->css);
> +		mem_cgroup_swap_statistics(memcg, -1);
>  	}
>  	rcu_read_unlock();
>  }
> @@ -4412,20 +4402,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
>  	new_id = mem_cgroup_id(to);
>  
>  	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
> -		mem_cgroup_swap_statistics(from, false);
> -		mem_cgroup_swap_statistics(to, true);
> -		/*
> -		 * This function is only called from task migration context now.
> -		 * It postpones res_counter and refcount handling till the end
> -		 * of task migration(mem_cgroup_clear_mc()) for performance
> -		 * improvement. But we cannot postpone css_get(to)  because if
> -		 * the process that has been moved to @to does swap-in, the
> -		 * refcount of @to might be decreased to 0.
> -		 *
> -		 * We are in attach() phase, so the cgroup is guaranteed to be
> -		 * alive, so we can just call css_get().
> -		 */
> -		css_get(&to->css);
> +		mem_cgroup_swap_statistics(from, -1);
> +		mem_cgroup_swap_statistics(to, 1);
>  		return 0;
>  	}
>  	return -EINVAL;
> @@ -6611,7 +6589,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  	kmem_cgroup_css_offline(memcg);
>  
>  	mem_cgroup_invalidate_reclaim_iterators(memcg);
> -	mem_cgroup_reparent_charges(memcg);
>  	mem_cgroup_destroy_all_caches(memcg);
>  	vmpressure_cleanup(&memcg->vmpressure);
>  }
> @@ -6619,41 +6596,26 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +	unsigned long swaps_moved;
> +	struct mem_cgroup *parent;
> +
>  	/*
> -	 * XXX: css_offline() would be where we should reparent all
> -	 * memory to prepare the cgroup for destruction.  However,
> -	 * memcg does not do css_tryget() and res_counter charging
> -	 * under the same RCU lock region, which means that charging
> -	 * could race with offlining.  Offlining only happens to
> -	 * cgroups with no tasks in them but charges can show up
> -	 * without any tasks from the swapin path when the target
> -	 * memcg is looked up from the swapout record and not from the
> -	 * current task as it usually is.  A race like this can leak
> -	 * charges and put pages with stale cgroup pointers into
> -	 * circulation:
> -	 *
> -	 * #0                        #1
> -	 *                           lookup_swap_cgroup_id()
> -	 *                           rcu_read_lock()
> -	 *                           mem_cgroup_lookup()
> -	 *                           css_tryget()
> -	 *                           rcu_read_unlock()
> -	 * disable css_tryget()
> -	 * call_rcu()
> -	 *   offline_css()
> -	 *     reparent_charges()
> -	 *                           res_counter_charge()
> -	 *                           css_put()
> -	 *                             css_free()
> -	 *                           pc->mem_cgroup = dead memcg
> -	 *                           add page to lru
> -	 *
> -	 * The bulk of the charges are still moved in offline_css() to
> -	 * avoid pinning a lot of pages in case a long-term reference
> -	 * like a swapout record is deferring the css_free() to long
> -	 * after offlining.  But this makes sure we catch any charges
> -	 * made after offlining:
> +	 * Migrate all swap entries to the parent.  There are no more
> +	 * references to the css, so no new swap out records can show
> +	 * up.  Any concurrently faulting pages will either get this
> +	 * offline cgroup and charge the faulting task, or get the
> +	 * migrated parent id and charge the parent for the in-memory
> +	 * page.  uncharge_swap() will balance the res_counter in the
> +	 * parent either way, whether it still fixes this group's
> +	 * res_counter is irrelevant at this point.
>  	 */
> +	parent = parent_mem_cgroup(memcg);
> +	if (!parent)
> +		parent = root_mem_cgroup;
> +	swaps_moved = swap_cgroup_migrate(mem_cgroup_id(memcg),
> +					  mem_cgroup_id(parent));
> +	mem_cgroup_swap_statistics(parent, swaps_moved);
> +
>  	mem_cgroup_reparent_charges(memcg);
>  
>  	memcg_destroy_kmem(memcg);
> @@ -6966,7 +6928,6 @@ static void __mem_cgroup_clear_mc(void)
>  {
>  	struct mem_cgroup *from = mc.from;
>  	struct mem_cgroup *to = mc.to;
> -	int i;
>  
>  	/* we must uncharge all the leftover precharges from mc.to */
>  	if (mc.precharge) {
> @@ -6981,16 +6942,13 @@ static void __mem_cgroup_clear_mc(void)
>  		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
>  		mc.moved_charge = 0;
>  	}
> -	/* we must fixup refcnts and charges */
> +	/* we must fixup charges */
>  	if (mc.moved_swap) {
>  		/* uncharge swap account from the old cgroup */
>  		if (!mem_cgroup_is_root(mc.from))
>  			res_counter_uncharge(&mc.from->memsw,
>  						PAGE_SIZE * mc.moved_swap);
>  
> -		for (i = 0; i < mc.moved_swap; i++)
> -			css_put(&mc.from->css);
> -
>  		if (!mem_cgroup_is_root(mc.to)) {
>  			/*
>  			 * we charged both to->res and to->memsw, so we should
> @@ -6999,7 +6957,6 @@ static void __mem_cgroup_clear_mc(void)
>  			res_counter_uncharge(&mc.to->res,
>  						PAGE_SIZE * mc.moved_swap);
>  		}
> -		/* we've already done css_get(mc.to) */
>  		mc.moved_swap = 0;
>  	}
>  	memcg_oom_recover(from);
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index cfd162882c00..ca04feaae7fe 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -459,6 +459,47 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
>  	return lookup_swap_cgroup(ent, NULL)->id;
>  }
>  
> +/**
> + * swap_cgroup_migrate - migrate all entries of one id to another
> + * @old: old id
> + * @new: new id
> + *
> + * Caller has to be able to deal with swapon/swapoff racing.
> + *
> + * Returns number of migrated entries.
> + */
> +unsigned long swap_cgroup_migrate(unsigned short old, unsigned short new)
> +{
> +	unsigned long migrated = 0;
> +	unsigned int type;
> +
> +	for (type = 0; type < MAX_SWAPFILES; type++) {
> +		struct swap_cgroup_ctrl *ctrl;
> +		unsigned long flags;
> +		unsigned int page;
> +
> +		ctrl = &swap_cgroup_ctrl[type];
> +		spin_lock_irqsave(&ctrl->lock, flags);
> +		for (page = 0; page < ctrl->length; page++) {
> +			struct swap_cgroup *base;
> +			pgoff_t offset;
> +
> +			base = page_address(ctrl->map[page]);
> +			for (offset = 0; offset < SC_PER_PAGE; offset++) {
> +				struct swap_cgroup *sc;
> +
> +				sc = base + offset;
> +				if (sc->id == old) {
> +					sc->id = new;
> +					migrated++;
> +				}
> +			}
> +		}
> +		spin_unlock_irqrestore(&ctrl->lock, flags);
> +	}
> +	return migrated;
> +}
> +
>  int swap_cgroup_swapon(int type, unsigned long max_pages)
>  {
>  	void *array;
> -- 
> 1.8.5.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-10 15:46     ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2014-02-10 15:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Tejun Heo, Filipe Brandenburger, Li Zefan,
	Andrew Morton, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Fri 07-02-14 11:43:21, Johannes Weiner wrote:
[...]
> Long-term, I think we may want to get rid of the reparenting in
> css_offline() entirely and only do it in the naturally ordered
> css_free() callback.  We only reparent in css_offline() because
> swapout records pin the css and we don't want to hang on to
> potentially gigabytes of unreclaimable (css_tryget() disabled) cache
> indefinitely until the last swapout records disappear.
> 
> So I'm currently mucking around with the following patch, which drops
> the css pin from swapout records and reparents them in css_free().
> It's lightly tested and there might well be dragons but I don't see
> any fundamental problems with it.
> 
> What do you think?

Hugh has posted something like this back in December
(http://marc.info/?l=linux-mm&m=138718299304670&w=2).
I am not entirely happy about scanning potentially huge amount of 
swap records.

A trivial optimization would check the memsw counter and break out early
but it would still leave a potentially full scan possible. I guess we
shouldn't care much about when css_free is called. I think we should
split the reparenting into two parts. LRU reparent without any hard
requirements and charges reparent which guarantees that nothing is left
behind.
The first one called css_offline and the second one from css_free.

kmem accounting pins memcg as well btw so taking care of swap is not
sufficient to guarantee an early css_free.

> ---
>  include/linux/page_cgroup.h |   8 ++++
>  mm/memcontrol.c             | 101 +++++++++++++-------------------------------
>  mm/page_cgroup.c            |  41 ++++++++++++++++++
>  3 files changed, 78 insertions(+), 72 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 777a524716db..3ededda8934f 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -111,6 +111,8 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
>  					unsigned short old, unsigned short new);
>  extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
>  extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
> +extern unsigned long swap_cgroup_migrate(unsigned short old,
> +					 unsigned short new);
>  extern int swap_cgroup_swapon(int type, unsigned long max_pages);
>  extern void swap_cgroup_swapoff(int type);
>  #else
> @@ -127,6 +129,12 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
>  	return 0;
>  }
>  
> +static inline unsigned long swap_cgroup_migrate(unsigned short old,
> +						unsigned short new)
> +{
> +	return 0;
> +}
> +
>  static inline int
>  swap_cgroup_swapon(int type, unsigned long max_pages)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 53385cd4e6f0..e2a0d3986c74 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -892,11 +892,9 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
>  	return val;
>  }
>  
> -static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
> -					 bool charge)
> +static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg, long nr_pages)
>  {
> -	int val = (charge) ? 1 : -1;
> -	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
> +	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], nr_pages);
>  }
>  
>  static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
> @@ -4234,15 +4232,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
>  	 */
>  
>  	unlock_page_cgroup(pc);
> -	/*
> -	 * even after unlock, we have memcg->res.usage here and this memcg
> -	 * will never be freed, so it's safe to call css_get().
> -	 */
> +
>  	memcg_check_events(memcg, page);
> -	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
> -		mem_cgroup_swap_statistics(memcg, true);
> -		css_get(&memcg->css);
> -	}
> +
> +	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> +		mem_cgroup_swap_statistics(memcg, 1);
> +
>  	/*
>  	 * Migration does not charge the res_counter for the
>  	 * replacement page, so leave it alone when phasing out the
> @@ -4351,10 +4346,6 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
>  
>  	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
>  
> -	/*
> -	 * record memcg information,  if swapout && memcg != NULL,
> -	 * css_get() was called in uncharge().
> -	 */
>  	if (do_swap_account && swapout && memcg)
>  		swap_cgroup_record(ent, mem_cgroup_id(memcg));
>  }
> @@ -4383,8 +4374,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
>  		 */
>  		if (!mem_cgroup_is_root(memcg))
>  			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> -		mem_cgroup_swap_statistics(memcg, false);
> -		css_put(&memcg->css);
> +		mem_cgroup_swap_statistics(memcg, -1);
>  	}
>  	rcu_read_unlock();
>  }
> @@ -4412,20 +4402,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
>  	new_id = mem_cgroup_id(to);
>  
>  	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
> -		mem_cgroup_swap_statistics(from, false);
> -		mem_cgroup_swap_statistics(to, true);
> -		/*
> -		 * This function is only called from task migration context now.
> -		 * It postpones res_counter and refcount handling till the end
> -		 * of task migration(mem_cgroup_clear_mc()) for performance
> -		 * improvement. But we cannot postpone css_get(to)  because if
> -		 * the process that has been moved to @to does swap-in, the
> -		 * refcount of @to might be decreased to 0.
> -		 *
> -		 * We are in attach() phase, so the cgroup is guaranteed to be
> -		 * alive, so we can just call css_get().
> -		 */
> -		css_get(&to->css);
> +		mem_cgroup_swap_statistics(from, -1);
> +		mem_cgroup_swap_statistics(to, 1);
>  		return 0;
>  	}
>  	return -EINVAL;
> @@ -6611,7 +6589,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  	kmem_cgroup_css_offline(memcg);
>  
>  	mem_cgroup_invalidate_reclaim_iterators(memcg);
> -	mem_cgroup_reparent_charges(memcg);
>  	mem_cgroup_destroy_all_caches(memcg);
>  	vmpressure_cleanup(&memcg->vmpressure);
>  }
> @@ -6619,41 +6596,26 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +	unsigned long swaps_moved;
> +	struct mem_cgroup *parent;
> +
>  	/*
> -	 * XXX: css_offline() would be where we should reparent all
> -	 * memory to prepare the cgroup for destruction.  However,
> -	 * memcg does not do css_tryget() and res_counter charging
> -	 * under the same RCU lock region, which means that charging
> -	 * could race with offlining.  Offlining only happens to
> -	 * cgroups with no tasks in them but charges can show up
> -	 * without any tasks from the swapin path when the target
> -	 * memcg is looked up from the swapout record and not from the
> -	 * current task as it usually is.  A race like this can leak
> -	 * charges and put pages with stale cgroup pointers into
> -	 * circulation:
> -	 *
> -	 * #0                        #1
> -	 *                           lookup_swap_cgroup_id()
> -	 *                           rcu_read_lock()
> -	 *                           mem_cgroup_lookup()
> -	 *                           css_tryget()
> -	 *                           rcu_read_unlock()
> -	 * disable css_tryget()
> -	 * call_rcu()
> -	 *   offline_css()
> -	 *     reparent_charges()
> -	 *                           res_counter_charge()
> -	 *                           css_put()
> -	 *                             css_free()
> -	 *                           pc->mem_cgroup = dead memcg
> -	 *                           add page to lru
> -	 *
> -	 * The bulk of the charges are still moved in offline_css() to
> -	 * avoid pinning a lot of pages in case a long-term reference
> -	 * like a swapout record is deferring the css_free() to long
> -	 * after offlining.  But this makes sure we catch any charges
> -	 * made after offlining:
> +	 * Migrate all swap entries to the parent.  There are no more
> +	 * references to the css, so no new swap out records can show
> +	 * up.  Any concurrently faulting pages will either get this
> +	 * offline cgroup and charge the faulting task, or get the
> +	 * migrated parent id and charge the parent for the in-memory
> +	 * page.  uncharge_swap() will balance the res_counter in the
> +	 * parent either way, whether it still fixes this group's
> +	 * res_counter is irrelevant at this point.
>  	 */
> +	parent = parent_mem_cgroup(memcg);
> +	if (!parent)
> +		parent = root_mem_cgroup;
> +	swaps_moved = swap_cgroup_migrate(mem_cgroup_id(memcg),
> +					  mem_cgroup_id(parent));
> +	mem_cgroup_swap_statistics(parent, swaps_moved);
> +
>  	mem_cgroup_reparent_charges(memcg);
>  
>  	memcg_destroy_kmem(memcg);
> @@ -6966,7 +6928,6 @@ static void __mem_cgroup_clear_mc(void)
>  {
>  	struct mem_cgroup *from = mc.from;
>  	struct mem_cgroup *to = mc.to;
> -	int i;
>  
>  	/* we must uncharge all the leftover precharges from mc.to */
>  	if (mc.precharge) {
> @@ -6981,16 +6942,13 @@ static void __mem_cgroup_clear_mc(void)
>  		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
>  		mc.moved_charge = 0;
>  	}
> -	/* we must fixup refcnts and charges */
> +	/* we must fixup charges */
>  	if (mc.moved_swap) {
>  		/* uncharge swap account from the old cgroup */
>  		if (!mem_cgroup_is_root(mc.from))
>  			res_counter_uncharge(&mc.from->memsw,
>  						PAGE_SIZE * mc.moved_swap);
>  
> -		for (i = 0; i < mc.moved_swap; i++)
> -			css_put(&mc.from->css);
> -
>  		if (!mem_cgroup_is_root(mc.to)) {
>  			/*
>  			 * we charged both to->res and to->memsw, so we should
> @@ -6999,7 +6957,6 @@ static void __mem_cgroup_clear_mc(void)
>  			res_counter_uncharge(&mc.to->res,
>  						PAGE_SIZE * mc.moved_swap);
>  		}
> -		/* we've already done css_get(mc.to) */
>  		mc.moved_swap = 0;
>  	}
>  	memcg_oom_recover(from);
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index cfd162882c00..ca04feaae7fe 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -459,6 +459,47 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
>  	return lookup_swap_cgroup(ent, NULL)->id;
>  }
>  
> +/**
> + * swap_cgroup_migrate - migrate all entries of one id to another
> + * @old: old id
> + * @new: new id
> + *
> + * Caller has to be able to deal with swapon/swapoff racing.
> + *
> + * Returns number of migrated entries.
> + */
> +unsigned long swap_cgroup_migrate(unsigned short old, unsigned short new)
> +{
> +	unsigned long migrated = 0;
> +	unsigned int type;
> +
> +	for (type = 0; type < MAX_SWAPFILES; type++) {
> +		struct swap_cgroup_ctrl *ctrl;
> +		unsigned long flags;
> +		unsigned int page;
> +
> +		ctrl = &swap_cgroup_ctrl[type];
> +		spin_lock_irqsave(&ctrl->lock, flags);
> +		for (page = 0; page < ctrl->length; page++) {
> +			struct swap_cgroup *base;
> +			pgoff_t offset;
> +
> +			base = page_address(ctrl->map[page]);
> +			for (offset = 0; offset < SC_PER_PAGE; offset++) {
> +				struct swap_cgroup *sc;
> +
> +				sc = base + offset;
> +				if (sc->id == old) {
> +					sc->id = new;
> +					migrated++;
> +				}
> +			}
> +		}
> +		spin_unlock_irqrestore(&ctrl->lock, flags);
> +	}
> +	return migrated;
> +}
> +
>  int swap_cgroup_swapon(int type, unsigned long max_pages)
>  {
>  	void *array;
> -- 
> 1.8.5.3
> 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-10 15:46     ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2014-02-10 15:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Tejun Heo, Filipe Brandenburger, Li Zefan,
	Andrew Morton, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Fri 07-02-14 11:43:21, Johannes Weiner wrote:
[...]
> Long-term, I think we may want to get rid of the reparenting in
> css_offline() entirely and only do it in the naturally ordered
> css_free() callback.  We only reparent in css_offline() because
> swapout records pin the css and we don't want to hang on to
> potentially gigabytes of unreclaimable (css_tryget() disabled) cache
> indefinitely until the last swapout records disappear.
> 
> So I'm currently mucking around with the following patch, which drops
> the css pin from swapout records and reparents them in css_free().
> It's lightly tested and there might well be dragons but I don't see
> any fundamental problems with it.
> 
> What do you think?

Hugh has posted something like this back in December
(http://marc.info/?l=linux-mm&m=138718299304670&w=2).
I am not entirely happy about scanning potentially huge amount of 
swap records.

A trivial optimization would check the memsw counter and break out early
but it would still leave a potentially full scan possible. I guess we
shouldn't care much about when css_free is called. I think we should
split the reparenting into two parts. LRU reparent without any hard
requirements and charges reparent which guarantees that nothing is left
behind.
The first one called css_offline and the second one from css_free.

kmem accounting pins memcg as well btw so taking care of swap is not
sufficient to guarantee an early css_free.

> ---
>  include/linux/page_cgroup.h |   8 ++++
>  mm/memcontrol.c             | 101 +++++++++++++-------------------------------
>  mm/page_cgroup.c            |  41 ++++++++++++++++++
>  3 files changed, 78 insertions(+), 72 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 777a524716db..3ededda8934f 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -111,6 +111,8 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
>  					unsigned short old, unsigned short new);
>  extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
>  extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
> +extern unsigned long swap_cgroup_migrate(unsigned short old,
> +					 unsigned short new);
>  extern int swap_cgroup_swapon(int type, unsigned long max_pages);
>  extern void swap_cgroup_swapoff(int type);
>  #else
> @@ -127,6 +129,12 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
>  	return 0;
>  }
>  
> +static inline unsigned long swap_cgroup_migrate(unsigned short old,
> +						unsigned short new)
> +{
> +	return 0;
> +}
> +
>  static inline int
>  swap_cgroup_swapon(int type, unsigned long max_pages)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 53385cd4e6f0..e2a0d3986c74 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -892,11 +892,9 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
>  	return val;
>  }
>  
> -static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
> -					 bool charge)
> +static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg, long nr_pages)
>  {
> -	int val = (charge) ? 1 : -1;
> -	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
> +	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], nr_pages);
>  }
>  
>  static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
> @@ -4234,15 +4232,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
>  	 */
>  
>  	unlock_page_cgroup(pc);
> -	/*
> -	 * even after unlock, we have memcg->res.usage here and this memcg
> -	 * will never be freed, so it's safe to call css_get().
> -	 */
> +
>  	memcg_check_events(memcg, page);
> -	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
> -		mem_cgroup_swap_statistics(memcg, true);
> -		css_get(&memcg->css);
> -	}
> +
> +	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> +		mem_cgroup_swap_statistics(memcg, 1);
> +
>  	/*
>  	 * Migration does not charge the res_counter for the
>  	 * replacement page, so leave it alone when phasing out the
> @@ -4351,10 +4346,6 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
>  
>  	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
>  
> -	/*
> -	 * record memcg information,  if swapout && memcg != NULL,
> -	 * css_get() was called in uncharge().
> -	 */
>  	if (do_swap_account && swapout && memcg)
>  		swap_cgroup_record(ent, mem_cgroup_id(memcg));
>  }
> @@ -4383,8 +4374,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
>  		 */
>  		if (!mem_cgroup_is_root(memcg))
>  			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> -		mem_cgroup_swap_statistics(memcg, false);
> -		css_put(&memcg->css);
> +		mem_cgroup_swap_statistics(memcg, -1);
>  	}
>  	rcu_read_unlock();
>  }
> @@ -4412,20 +4402,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
>  	new_id = mem_cgroup_id(to);
>  
>  	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
> -		mem_cgroup_swap_statistics(from, false);
> -		mem_cgroup_swap_statistics(to, true);
> -		/*
> -		 * This function is only called from task migration context now.
> -		 * It postpones res_counter and refcount handling till the end
> -		 * of task migration(mem_cgroup_clear_mc()) for performance
> -		 * improvement. But we cannot postpone css_get(to)  because if
> -		 * the process that has been moved to @to does swap-in, the
> -		 * refcount of @to might be decreased to 0.
> -		 *
> -		 * We are in attach() phase, so the cgroup is guaranteed to be
> -		 * alive, so we can just call css_get().
> -		 */
> -		css_get(&to->css);
> +		mem_cgroup_swap_statistics(from, -1);
> +		mem_cgroup_swap_statistics(to, 1);
>  		return 0;
>  	}
>  	return -EINVAL;
> @@ -6611,7 +6589,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  	kmem_cgroup_css_offline(memcg);
>  
>  	mem_cgroup_invalidate_reclaim_iterators(memcg);
> -	mem_cgroup_reparent_charges(memcg);
>  	mem_cgroup_destroy_all_caches(memcg);
>  	vmpressure_cleanup(&memcg->vmpressure);
>  }
> @@ -6619,41 +6596,26 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +	unsigned long swaps_moved;
> +	struct mem_cgroup *parent;
> +
>  	/*
> -	 * XXX: css_offline() would be where we should reparent all
> -	 * memory to prepare the cgroup for destruction.  However,
> -	 * memcg does not do css_tryget() and res_counter charging
> -	 * under the same RCU lock region, which means that charging
> -	 * could race with offlining.  Offlining only happens to
> -	 * cgroups with no tasks in them but charges can show up
> -	 * without any tasks from the swapin path when the target
> -	 * memcg is looked up from the swapout record and not from the
> -	 * current task as it usually is.  A race like this can leak
> -	 * charges and put pages with stale cgroup pointers into
> -	 * circulation:
> -	 *
> -	 * #0                        #1
> -	 *                           lookup_swap_cgroup_id()
> -	 *                           rcu_read_lock()
> -	 *                           mem_cgroup_lookup()
> -	 *                           css_tryget()
> -	 *                           rcu_read_unlock()
> -	 * disable css_tryget()
> -	 * call_rcu()
> -	 *   offline_css()
> -	 *     reparent_charges()
> -	 *                           res_counter_charge()
> -	 *                           css_put()
> -	 *                             css_free()
> -	 *                           pc->mem_cgroup = dead memcg
> -	 *                           add page to lru
> -	 *
> -	 * The bulk of the charges are still moved in offline_css() to
> -	 * avoid pinning a lot of pages in case a long-term reference
> -	 * like a swapout record is deferring the css_free() to long
> -	 * after offlining.  But this makes sure we catch any charges
> -	 * made after offlining:
> +	 * Migrate all swap entries to the parent.  There are no more
> +	 * references to the css, so no new swap out records can show
> +	 * up.  Any concurrently faulting pages will either get this
> +	 * offline cgroup and charge the faulting task, or get the
> +	 * migrated parent id and charge the parent for the in-memory
> +	 * page.  uncharge_swap() will balance the res_counter in the
> +	 * parent either way, whether it still fixes this group's
> +	 * res_counter is irrelevant at this point.
>  	 */
> +	parent = parent_mem_cgroup(memcg);
> +	if (!parent)
> +		parent = root_mem_cgroup;
> +	swaps_moved = swap_cgroup_migrate(mem_cgroup_id(memcg),
> +					  mem_cgroup_id(parent));
> +	mem_cgroup_swap_statistics(parent, swaps_moved);
> +
>  	mem_cgroup_reparent_charges(memcg);
>  
>  	memcg_destroy_kmem(memcg);
> @@ -6966,7 +6928,6 @@ static void __mem_cgroup_clear_mc(void)
>  {
>  	struct mem_cgroup *from = mc.from;
>  	struct mem_cgroup *to = mc.to;
> -	int i;
>  
>  	/* we must uncharge all the leftover precharges from mc.to */
>  	if (mc.precharge) {
> @@ -6981,16 +6942,13 @@ static void __mem_cgroup_clear_mc(void)
>  		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
>  		mc.moved_charge = 0;
>  	}
> -	/* we must fixup refcnts and charges */
> +	/* we must fixup charges */
>  	if (mc.moved_swap) {
>  		/* uncharge swap account from the old cgroup */
>  		if (!mem_cgroup_is_root(mc.from))
>  			res_counter_uncharge(&mc.from->memsw,
>  						PAGE_SIZE * mc.moved_swap);
>  
> -		for (i = 0; i < mc.moved_swap; i++)
> -			css_put(&mc.from->css);
> -
>  		if (!mem_cgroup_is_root(mc.to)) {
>  			/*
>  			 * we charged both to->res and to->memsw, so we should
> @@ -6999,7 +6957,6 @@ static void __mem_cgroup_clear_mc(void)
>  			res_counter_uncharge(&mc.to->res,
>  						PAGE_SIZE * mc.moved_swap);
>  		}
> -		/* we've already done css_get(mc.to) */
>  		mc.moved_swap = 0;
>  	}
>  	memcg_oom_recover(from);
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index cfd162882c00..ca04feaae7fe 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -459,6 +459,47 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
>  	return lookup_swap_cgroup(ent, NULL)->id;
>  }
>  
> +/**
> + * swap_cgroup_migrate - migrate all entries of one id to another
> + * @old: old id
> + * @new: new id
> + *
> + * Caller has to be able to deal with swapon/swapoff racing.
> + *
> + * Returns number of migrated entries.
> + */
> +unsigned long swap_cgroup_migrate(unsigned short old, unsigned short new)
> +{
> +	unsigned long migrated = 0;
> +	unsigned int type;
> +
> +	for (type = 0; type < MAX_SWAPFILES; type++) {
> +		struct swap_cgroup_ctrl *ctrl;
> +		unsigned long flags;
> +		unsigned int page;
> +
> +		ctrl = &swap_cgroup_ctrl[type];
> +		spin_lock_irqsave(&ctrl->lock, flags);
> +		for (page = 0; page < ctrl->length; page++) {
> +			struct swap_cgroup *base;
> +			pgoff_t offset;
> +
> +			base = page_address(ctrl->map[page]);
> +			for (offset = 0; offset < SC_PER_PAGE; offset++) {
> +				struct swap_cgroup *sc;
> +
> +				sc = base + offset;
> +				if (sc->id == old) {
> +					sc->id = new;
> +					migrated++;
> +				}
> +			}
> +		}
> +		spin_unlock_irqrestore(&ctrl->lock, flags);
> +	}
> +	return migrated;
> +}
> +
>  int swap_cgroup_swapon(int type, unsigned long max_pages)
>  {
>  	void *array;
> -- 
> 1.8.5.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
  2014-02-07 16:43   ` Johannes Weiner
  (?)
@ 2014-02-12 22:59     ` Hugh Dickins
  -1 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-12 22:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Tejun Heo, Filipe Brandenburger, Li Zefan,
	Andrew Morton, Michal Hocko, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Fri, 7 Feb 2014, Johannes Weiner wrote:
> On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> > Sometimes the cleanup after memcg hierarchy testing gets stuck in
> > mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> > 
> > There may turn out to be several causes, but a major cause is this: the
> > workitem to offline parent can get run before workitem to offline child;
> > parent's mem_cgroup_reparent_charges() circles around waiting for the
> > child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> > which prevents the child from reaching its mem_cgroup_reparent_charges().
> > 
> > Just use an ordered workqueue for cgroup_destroy_wq.
> > 
> > Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> > Suggested-by: Filipe Brandenburger <filbranden@google.com>
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: stable@vger.kernel.org # 3.10+
> 
> I think this is a good idea for now and -stable:
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

You might be wondering why this patch didn't reach Linus yet.

It's because more thorough testing, by others here, found that it
wasn't always solving the problem: so I asked Tejun privately to
hold off from sending it in, until we'd worked out why not.

Most of our testing being on a v3,11-based kernel, it was perfectly
possible that the problem was merely our own e.g. missing Tejun's
8a2b75384444 ("workqueue: fix ordered workqueues in NUMA setups").

But that turned out not to be enough to fix it either. Then Filipe
pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched()
before we ever get to put the offline on to the workqueue: by the
time we get to the workqueue, the ordering has already been lost.

So, thanks for the Acks, but I'm afraid that this ordered workqueue
solution is just not good enough: we should simply forget that patch
and provide a different answer.

So I'm now posting a couple of alternative solutions: 1/2 from Filipe
at the memcg end, and 2/2 from me at the cgroup end.  Each of these
has stood up to better testing, so you can choose between them,
or work out a better answer.

(By the way, I have another little pair of memcg/cgroup fixes to post
shortly, nothing to do with these two: it would be less confusing if
I had some third fix to add in there, but sadly not.)

Hugh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-12 22:59     ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-12 22:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Tejun Heo, Filipe Brandenburger, Li Zefan,
	Andrew Morton, Michal Hocko, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

On Fri, 7 Feb 2014, Johannes Weiner wrote:
> On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> > Sometimes the cleanup after memcg hierarchy testing gets stuck in
> > mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> > 
> > There may turn out to be several causes, but a major cause is this: the
> > workitem to offline parent can get run before workitem to offline child;
> > parent's mem_cgroup_reparent_charges() circles around waiting for the
> > child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> > which prevents the child from reaching its mem_cgroup_reparent_charges().
> > 
> > Just use an ordered workqueue for cgroup_destroy_wq.
> > 
> > Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> > Suggested-by: Filipe Brandenburger <filbranden@google.com>
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: stable@vger.kernel.org # 3.10+
> 
> I think this is a good idea for now and -stable:
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

You might be wondering why this patch didn't reach Linus yet.

It's because more thorough testing, by others here, found that it
wasn't always solving the problem: so I asked Tejun privately to
hold off from sending it in, until we'd worked out why not.

Most of our testing being on a v3,11-based kernel, it was perfectly
possible that the problem was merely our own e.g. missing Tejun's
8a2b75384444 ("workqueue: fix ordered workqueues in NUMA setups").

But that turned out not to be enough to fix it either. Then Filipe
pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched()
before we ever get to put the offline on to the workqueue: by the
time we get to the workqueue, the ordering has already been lost.

So, thanks for the Acks, but I'm afraid that this ordered workqueue
solution is just not good enough: we should simply forget that patch
and provide a different answer.

So I'm now posting a couple of alternative solutions: 1/2 from Filipe
at the memcg end, and 2/2 from me at the cgroup end.  Each of these
has stood up to better testing, so you can choose between them,
or work out a better answer.

(By the way, I have another little pair of memcg/cgroup fixes to post
shortly, nothing to do with these two: it would be less confusing if
I had some third fix to add in there, but sadly not.)

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction
@ 2014-02-12 22:59     ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-12 22:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Tejun Heo, Filipe Brandenburger, Li Zefan,
	Andrew Morton, Michal Hocko, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Fri, 7 Feb 2014, Johannes Weiner wrote:
> On Thu, Feb 06, 2014 at 03:56:01PM -0800, Hugh Dickins wrote:
> > Sometimes the cleanup after memcg hierarchy testing gets stuck in
> > mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> > 
> > There may turn out to be several causes, but a major cause is this: the
> > workitem to offline parent can get run before workitem to offline child;
> > parent's mem_cgroup_reparent_charges() circles around waiting for the
> > child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> > which prevents the child from reaching its mem_cgroup_reparent_charges().
> > 
> > Just use an ordered workqueue for cgroup_destroy_wq.
> > 
> > Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> > Suggested-by: Filipe Brandenburger <filbranden-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> > Signed-off-by: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> > Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org # 3.10+
> 
> I think this is a good idea for now and -stable:
> Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>

You might be wondering why this patch didn't reach Linus yet.

It's because more thorough testing, by others here, found that it
wasn't always solving the problem: so I asked Tejun privately to
hold off from sending it in, until we'd worked out why not.

Most of our testing being on a v3,11-based kernel, it was perfectly
possible that the problem was merely our own e.g. missing Tejun's
8a2b75384444 ("workqueue: fix ordered workqueues in NUMA setups").

But that turned out not to be enough to fix it either. Then Filipe
pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched()
before we ever get to put the offline on to the workqueue: by the
time we get to the workqueue, the ordering has already been lost.

So, thanks for the Acks, but I'm afraid that this ordered workqueue
solution is just not good enough: we should simply forget that patch
and provide a different answer.

So I'm now posting a couple of alternative solutions: 1/2 from Filipe
at the memcg end, and 2/2 from me at the cgroup end.  Each of these
has stood up to better testing, so you can choose between them,
or work out a better answer.

(By the way, I have another little pair of memcg/cgroup fixes to post
shortly, nothing to do with these two: it would be less confusing if
I had some third fix to add in there, but sadly not.)

Hugh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 2/2] cgroup: bring back kill_cnt to order css destruction
  2014-02-12 22:59     ` Hugh Dickins
  (?)
@ 2014-02-12 23:06       ` Hugh Dickins
  -1 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-12 23:06 UTC (permalink / raw)
  To: Tejun Heo, Michal Hocko
  Cc: Johannes Weiner, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
	Shawn Bohrer, cgroups, linux-kernel, linux-mm

Sometimes the cleanup after memcg hierarchy testing gets stuck in
mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

There may turn out to be several causes, but a major cause is this: the
workitem to offline parent can get run before workitem to offline child;
parent's mem_cgroup_reparent_charges() circles around waiting for the
child's pages to be reparented to its lrus, but it's holding cgroup_mutex
which prevents the child from reaching its mem_cgroup_reparent_charges().

Further testing showed that an ordered workqueue for cgroup_destroy_wq
is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
stage on the way can mess up the order before reaching the workqueue.

Instead bring back v3.11's css kill_cnt, repurposing it to make sure
that offline_css() is not called for parent before it has been called
for all children.

Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Filipe Brandenburger <filbranden@google.com>
Cc: stable@vger.kernel.org # v3.10+ (but will need extra care)
---
This is an alternative to Filipe's 1/2: there's no need for both,
but each has its merits.  I prefer Filipe's, which is much easier to
understand: this one made more sense in v3.11, when it was just a matter
of extending the use of css_kill_cnt; but might be preferred if offlining
children before parent is thought to be a good idea generally.

 include/linux/cgroup.h |    3 +++
 kernel/cgroup.c        |   21 +++++++++++++++++++++
 2 files changed, 24 insertions(+)

--- 3.14-rc2/include/linux/cgroup.h	2014-02-02 18:49:07.033302094 -0800
+++ linux/include/linux/cgroup.h	2014-02-11 15:59:22.720393186 -0800
@@ -79,6 +79,9 @@ struct cgroup_subsys_state {
 
 	unsigned long flags;
 
+	/* ensure children are offlined before parent */
+	atomic_t kill_cnt;
+
 	/* percpu_ref killing and RCU release */
 	struct rcu_head rcu_head;
 	struct work_struct destroy_work;
--- 3.14-rc2/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
+++ linux/kernel/cgroup.c	2014-02-11 15:57:56.000391125 -0800
@@ -175,6 +175,7 @@ static int need_forkexit_callback __read
 
 static struct cftype cgroup_base_files[];
 
+static void css_killed_ref_fn(struct percpu_ref *ref);
 static void cgroup_destroy_css_killed(struct cgroup *cgrp);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
@@ -4043,6 +4044,7 @@ static void init_css(struct cgroup_subsy
 	css->cgroup = cgrp;
 	css->ss = ss;
 	css->flags = 0;
+	atomic_set(&css->kill_cnt, 1);
 
 	if (cgrp->parent)
 		css->parent = cgroup_css(cgrp->parent, ss);
@@ -4292,6 +4294,7 @@ static void css_killed_work_fn(struct wo
 {
 	struct cgroup_subsys_state *css =
 		container_of(work, struct cgroup_subsys_state, destroy_work);
+	struct cgroup_subsys_state *parent = css->parent;
 	struct cgroup *cgrp = css->cgroup;
 
 	mutex_lock(&cgroup_mutex);
@@ -4320,6 +4323,12 @@ static void css_killed_work_fn(struct wo
 	 * destruction happens only after all css's are released.
 	 */
 	css_put(css);
+
+	/*
+	 * Put the parent's kill_cnt reference from kill_css(), and
+	 * schedule its ->css_offline() if all children are now offline.
+	 */
+	css_killed_ref_fn(&parent->refcnt);
 }
 
 /* css kill confirmation processing requires process context, bounce */
@@ -4328,6 +4337,9 @@ static void css_killed_ref_fn(struct per
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
+	if (!atomic_dec_and_test(&css->kill_cnt))
+		return;
+
 	INIT_WORK(&css->destroy_work, css_killed_work_fn);
 	queue_work(cgroup_destroy_wq, &css->destroy_work);
 }
@@ -4362,6 +4374,15 @@ static void kill_css(struct cgroup_subsy
 	 * css is confirmed to be seen as killed on all CPUs.
 	 */
 	percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
+
+	/*
+	 * Make sure that ->css_offline() will not be called for parent
+	 * before it has been called for all children: this ordering
+	 * requirement is important for memcg, where parent's offline
+	 * might wait for a child's, leading to deadlock.
+	 */
+	atomic_inc(&css->parent->kill_cnt);
+	css_killed_ref_fn(&css->refcnt);
 }
 
 /**

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 2/2] cgroup: bring back kill_cnt to order css destruction
@ 2014-02-12 23:06       ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-12 23:06 UTC (permalink / raw)
  To: Tejun Heo, Michal Hocko
  Cc: Johannes Weiner, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
	Shawn Bohrer, cgroups, linux-kernel, linux-mm

Sometimes the cleanup after memcg hierarchy testing gets stuck in
mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

There may turn out to be several causes, but a major cause is this: the
workitem to offline parent can get run before workitem to offline child;
parent's mem_cgroup_reparent_charges() circles around waiting for the
child's pages to be reparented to its lrus, but it's holding cgroup_mutex
which prevents the child from reaching its mem_cgroup_reparent_charges().

Further testing showed that an ordered workqueue for cgroup_destroy_wq
is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
stage on the way can mess up the order before reaching the workqueue.

Instead bring back v3.11's css kill_cnt, repurposing it to make sure
that offline_css() is not called for parent before it has been called
for all children.

Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Filipe Brandenburger <filbranden@google.com>
Cc: stable@vger.kernel.org # v3.10+ (but will need extra care)
---
This is an alternative to Filipe's 1/2: there's no need for both,
but each has its merits.  I prefer Filipe's, which is much easier to
understand: this one made more sense in v3.11, when it was just a matter
of extending the use of css_kill_cnt; but might be preferred if offlining
children before parent is thought to be a good idea generally.

 include/linux/cgroup.h |    3 +++
 kernel/cgroup.c        |   21 +++++++++++++++++++++
 2 files changed, 24 insertions(+)

--- 3.14-rc2/include/linux/cgroup.h	2014-02-02 18:49:07.033302094 -0800
+++ linux/include/linux/cgroup.h	2014-02-11 15:59:22.720393186 -0800
@@ -79,6 +79,9 @@ struct cgroup_subsys_state {
 
 	unsigned long flags;
 
+	/* ensure children are offlined before parent */
+	atomic_t kill_cnt;
+
 	/* percpu_ref killing and RCU release */
 	struct rcu_head rcu_head;
 	struct work_struct destroy_work;
--- 3.14-rc2/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
+++ linux/kernel/cgroup.c	2014-02-11 15:57:56.000391125 -0800
@@ -175,6 +175,7 @@ static int need_forkexit_callback __read
 
 static struct cftype cgroup_base_files[];
 
+static void css_killed_ref_fn(struct percpu_ref *ref);
 static void cgroup_destroy_css_killed(struct cgroup *cgrp);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
@@ -4043,6 +4044,7 @@ static void init_css(struct cgroup_subsy
 	css->cgroup = cgrp;
 	css->ss = ss;
 	css->flags = 0;
+	atomic_set(&css->kill_cnt, 1);
 
 	if (cgrp->parent)
 		css->parent = cgroup_css(cgrp->parent, ss);
@@ -4292,6 +4294,7 @@ static void css_killed_work_fn(struct wo
 {
 	struct cgroup_subsys_state *css =
 		container_of(work, struct cgroup_subsys_state, destroy_work);
+	struct cgroup_subsys_state *parent = css->parent;
 	struct cgroup *cgrp = css->cgroup;
 
 	mutex_lock(&cgroup_mutex);
@@ -4320,6 +4323,12 @@ static void css_killed_work_fn(struct wo
 	 * destruction happens only after all css's are released.
 	 */
 	css_put(css);
+
+	/*
+	 * Put the parent's kill_cnt reference from kill_css(), and
+	 * schedule its ->css_offline() if all children are now offline.
+	 */
+	css_killed_ref_fn(&parent->refcnt);
 }
 
 /* css kill confirmation processing requires process context, bounce */
@@ -4328,6 +4337,9 @@ static void css_killed_ref_fn(struct per
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
+	if (!atomic_dec_and_test(&css->kill_cnt))
+		return;
+
 	INIT_WORK(&css->destroy_work, css_killed_work_fn);
 	queue_work(cgroup_destroy_wq, &css->destroy_work);
 }
@@ -4362,6 +4374,15 @@ static void kill_css(struct cgroup_subsy
 	 * css is confirmed to be seen as killed on all CPUs.
 	 */
 	percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
+
+	/*
+	 * Make sure that ->css_offline() will not be called for parent
+	 * before it has been called for all children: this ordering
+	 * requirement is important for memcg, where parent's offline
+	 * might wait for a child's, leading to deadlock.
+	 */
+	atomic_inc(&css->parent->kill_cnt);
+	css_killed_ref_fn(&css->refcnt);
 }
 
 /**

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 2/2] cgroup: bring back kill_cnt to order css destruction
@ 2014-02-12 23:06       ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-12 23:06 UTC (permalink / raw)
  To: Tejun Heo, Michal Hocko
  Cc: Johannes Weiner, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Greg Thelen, Michel Lespinasse, Markus Blank-Burian,
	Shawn Bohrer, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

Sometimes the cleanup after memcg hierarchy testing gets stuck in
mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

There may turn out to be several causes, but a major cause is this: the
workitem to offline parent can get run before workitem to offline child;
parent's mem_cgroup_reparent_charges() circles around waiting for the
child's pages to be reparented to its lrus, but it's holding cgroup_mutex
which prevents the child from reaching its mem_cgroup_reparent_charges().

Further testing showed that an ordered workqueue for cgroup_destroy_wq
is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
stage on the way can mess up the order before reaching the workqueue.

Instead bring back v3.11's css kill_cnt, repurposing it to make sure
that offline_css() is not called for parent before it has been called
for all children.

Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
Signed-off-by: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Filipe Brandenburger <filbranden-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org # v3.10+ (but will need extra care)
---
This is an alternative to Filipe's 1/2: there's no need for both,
but each has its merits.  I prefer Filipe's, which is much easier to
understand: this one made more sense in v3.11, when it was just a matter
of extending the use of css_kill_cnt; but might be preferred if offlining
children before parent is thought to be a good idea generally.

 include/linux/cgroup.h |    3 +++
 kernel/cgroup.c        |   21 +++++++++++++++++++++
 2 files changed, 24 insertions(+)

--- 3.14-rc2/include/linux/cgroup.h	2014-02-02 18:49:07.033302094 -0800
+++ linux/include/linux/cgroup.h	2014-02-11 15:59:22.720393186 -0800
@@ -79,6 +79,9 @@ struct cgroup_subsys_state {
 
 	unsigned long flags;
 
+	/* ensure children are offlined before parent */
+	atomic_t kill_cnt;
+
 	/* percpu_ref killing and RCU release */
 	struct rcu_head rcu_head;
 	struct work_struct destroy_work;
--- 3.14-rc2/kernel/cgroup.c	2014-02-02 18:49:07.737302111 -0800
+++ linux/kernel/cgroup.c	2014-02-11 15:57:56.000391125 -0800
@@ -175,6 +175,7 @@ static int need_forkexit_callback __read
 
 static struct cftype cgroup_base_files[];
 
+static void css_killed_ref_fn(struct percpu_ref *ref);
 static void cgroup_destroy_css_killed(struct cgroup *cgrp);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
@@ -4043,6 +4044,7 @@ static void init_css(struct cgroup_subsy
 	css->cgroup = cgrp;
 	css->ss = ss;
 	css->flags = 0;
+	atomic_set(&css->kill_cnt, 1);
 
 	if (cgrp->parent)
 		css->parent = cgroup_css(cgrp->parent, ss);
@@ -4292,6 +4294,7 @@ static void css_killed_work_fn(struct wo
 {
 	struct cgroup_subsys_state *css =
 		container_of(work, struct cgroup_subsys_state, destroy_work);
+	struct cgroup_subsys_state *parent = css->parent;
 	struct cgroup *cgrp = css->cgroup;
 
 	mutex_lock(&cgroup_mutex);
@@ -4320,6 +4323,12 @@ static void css_killed_work_fn(struct wo
 	 * destruction happens only after all css's are released.
 	 */
 	css_put(css);
+
+	/*
+	 * Put the parent's kill_cnt reference from kill_css(), and
+	 * schedule its ->css_offline() if all children are now offline.
+	 */
+	css_killed_ref_fn(&parent->refcnt);
 }
 
 /* css kill confirmation processing requires process context, bounce */
@@ -4328,6 +4337,9 @@ static void css_killed_ref_fn(struct per
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
+	if (!atomic_dec_and_test(&css->kill_cnt))
+		return;
+
 	INIT_WORK(&css->destroy_work, css_killed_work_fn);
 	queue_work(cgroup_destroy_wq, &css->destroy_work);
 }
@@ -4362,6 +4374,15 @@ static void kill_css(struct cgroup_subsy
 	 * css is confirmed to be seen as killed on all CPUs.
 	 */
 	percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
+
+	/*
+	 * Make sure that ->css_offline() will not be called for parent
+	 * before it has been called for all children: this ordering
+	 * requirement is important for memcg, where parent's offline
+	 * might wait for a child's, leading to deadlock.
+	 */
+	atomic_inc(&css->parent->kill_cnt);
+	css_killed_ref_fn(&css->refcnt);
 }
 
 /**

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH] Revert "cgroup: use an ordered workqueue for cgroup destruction"
  2014-02-12 22:59     ` Hugh Dickins
  (?)
@ 2014-02-13  0:09       ` Tejun Heo
  -1 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-02-13  0:09 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Michal Hocko, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

>From 1a11533fbd71792e8c5d36f6763fbce8df0d231d Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 12 Feb 2014 19:06:19 -0500

This reverts commit ab3f5faa6255a0eb4f832675507d9e295ca7e9ba.
Explanation from Hugh:

  It's because more thorough testing, by others here, found that it
  wasn't always solving the problem: so I asked Tejun privately to
  hold off from sending it in, until we'd worked out why not.

  Most of our testing being on a v3,11-based kernel, it was perfectly
  possible that the problem was merely our own e.g. missing Tejun's
  8a2b75384444 ("workqueue: fix ordered workqueues in NUMA setups").

  But that turned out not to be enough to fix it either. Then Filipe
  pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched()
  before we ever get to put the offline on to the workqueue: by the
  time we get to the workqueue, the ordering has already been lost.

  So, thanks for the Acks, but I'm afraid that this ordered workqueue
  solution is just not good enough: we should simply forget that patch
  and provide a different answer."

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
---
Queued to cgroup/for-3.14-fixes.

Thanks.

 kernel/cgroup.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 52719ce..68d8710 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4844,16 +4844,12 @@ static int __init cgroup_wq_init(void)
 	/*
 	 * There isn't much point in executing destruction path in
 	 * parallel.  Good chunk is serialized with cgroup_mutex anyway.
-	 *
-	 * XXX: Must be ordered to make sure parent is offlined after
-	 * children.  The ordering requirement is for memcg where a
-	 * parent's offline may wait for a child's leading to deadlock.  In
-	 * the long term, this should be fixed from memcg side.
+	 * Use 1 for @max_active.
 	 *
 	 * We would prefer to do this in cgroup_init() above, but that
 	 * is called before init_workqueues(): so leave this until after.
 	 */
-	cgroup_destroy_wq = alloc_ordered_workqueue("cgroup_destroy", 0);
+	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
 	BUG_ON(!cgroup_destroy_wq);
 
 	/*
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH] Revert "cgroup: use an ordered workqueue for cgroup destruction"
@ 2014-02-13  0:09       ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-02-13  0:09 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Michal Hocko, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm



^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH] Revert "cgroup: use an ordered workqueue for cgroup destruction"
@ 2014-02-13  0:09       ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-02-13  0:09 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, Filipe Brandenburger, Li Zefan, Andrew Morton,
	Michal Hocko, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

From 1a11533fbd71792e8c5d36f6763fbce8df0d231d Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 12 Feb 2014 19:06:19 -0500

This reverts commit ab3f5faa6255a0eb4f832675507d9e295ca7e9ba.
Explanation from Hugh:

  It's because more thorough testing, by others here, found that it
  wasn't always solving the problem: so I asked Tejun privately to
  hold off from sending it in, until we'd worked out why not.

  Most of our testing being on a v3,11-based kernel, it was perfectly
  possible that the problem was merely our own e.g. missing Tejun's
  8a2b75384444 ("workqueue: fix ordered workqueues in NUMA setups").

  But that turned out not to be enough to fix it either. Then Filipe
  pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched()
  before we ever get to put the offline on to the workqueue: by the
  time we get to the workqueue, the ordering has already been lost.

  So, thanks for the Acks, but I'm afraid that this ordered workqueue
  solution is just not good enough: we should simply forget that patch
  and provide a different answer."

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
---
Queued to cgroup/for-3.14-fixes.

Thanks.

 kernel/cgroup.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 52719ce..68d8710 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4844,16 +4844,12 @@ static int __init cgroup_wq_init(void)
 	/*
 	 * There isn't much point in executing destruction path in
 	 * parallel.  Good chunk is serialized with cgroup_mutex anyway.
-	 *
-	 * XXX: Must be ordered to make sure parent is offlined after
-	 * children.  The ordering requirement is for memcg where a
-	 * parent's offline may wait for a child's leading to deadlock.  In
-	 * the long term, this should be fixed from memcg side.
+	 * Use 1 for @max_active.
 	 *
 	 * We would prefer to do this in cgroup_init() above, but that
 	 * is called before init_workqueues(): so leave this until after.
 	 */
-	cgroup_destroy_wq = alloc_ordered_workqueue("cgroup_destroy", 0);
+	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
 	BUG_ON(!cgroup_destroy_wq);
 
 	/*
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] cgroup: bring back kill_cnt to order css destruction
  2014-02-12 23:06       ` Hugh Dickins
@ 2014-02-13  0:28         ` Tejun Heo
  -1 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-02-13  0:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Johannes Weiner, Filipe Brandenburger, Li Zefan,
	Andrew Morton, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

Hello, Hugh.

On Wed, Feb 12, 2014 at 03:06:26PM -0800, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> 
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
> 
> Further testing showed that an ordered workqueue for cgroup_destroy_wq
> is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
> stage on the way can mess up the order before reaching the workqueue.
> 
> Instead bring back v3.11's css kill_cnt, repurposing it to make sure
> that offline_css() is not called for parent before it has been called
> for all children.
> 
> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Reviewed-by: Filipe Brandenburger <filbranden@google.com>
> Cc: stable@vger.kernel.org # v3.10+ (but will need extra care)
> ---
> This is an alternative to Filipe's 1/2: there's no need for both,
> but each has its merits.  I prefer Filipe's, which is much easier to
> understand: this one made more sense in v3.11, when it was just a matter
> of extending the use of css_kill_cnt; but might be preferred if offlining
> children before parent is thought to be a good idea generally.

Not that your implementation is bad or anything but the patch itself
somehow makes me cringe a bit.  It's probably just because it has to
add to the already overly complicated offline path.  Guaranteeing
strict offline ordering might be a good idea but at least for the
immediate bug fix, I agree that the memcg specific fix seems better
suited.  Let's apply that one and reconsider this one if it turns out
we do need strict offline reordering.

Thanks a lot!

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] cgroup: bring back kill_cnt to order css destruction
@ 2014-02-13  0:28         ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-02-13  0:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Johannes Weiner, Filipe Brandenburger, Li Zefan,
	Andrew Morton, Greg Thelen, Michel Lespinasse,
	Markus Blank-Burian, Shawn Bohrer, cgroups, linux-kernel,
	linux-mm

Hello, Hugh.

On Wed, Feb 12, 2014 at 03:06:26PM -0800, Hugh Dickins wrote:
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> 
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
> 
> Further testing showed that an ordered workqueue for cgroup_destroy_wq
> is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
> stage on the way can mess up the order before reaching the workqueue.
> 
> Instead bring back v3.11's css kill_cnt, repurposing it to make sure
> that offline_css() is not called for parent before it has been called
> for all children.
> 
> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Reviewed-by: Filipe Brandenburger <filbranden@google.com>
> Cc: stable@vger.kernel.org # v3.10+ (but will need extra care)
> ---
> This is an alternative to Filipe's 1/2: there's no need for both,
> but each has its merits.  I prefer Filipe's, which is much easier to
> understand: this one made more sense in v3.11, when it was just a matter
> of extending the use of css_kill_cnt; but might be preferred if offlining
> children before parent is thought to be a good idea generally.

Not that your implementation is bad or anything but the patch itself
somehow makes me cringe a bit.  It's probably just because it has to
add to the already overly complicated offline path.  Guaranteeing
strict offline ordering might be a good idea but at least for the
immediate bug fix, I agree that the memcg specific fix seems better
suited.  Let's apply that one and reconsider this one if it turns out
we do need strict offline reordering.

Thanks a lot!

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] cgroup: bring back kill_cnt to order css destruction
  2014-02-13  0:28         ` Tejun Heo
@ 2014-02-13  0:38           ` Hugh Dickins
  -1 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-13  0:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hugh Dickins, Michal Hocko, Johannes Weiner,
	Filipe Brandenburger, Li Zefan, Andrew Morton, Greg Thelen,
	Michel Lespinasse, Markus Blank-Burian, Shawn Bohrer, cgroups,
	linux-kernel, linux-mm

On Wed, 12 Feb 2014, Tejun Heo wrote:
> 
> Not that your implementation is bad or anything but the patch itself
> somehow makes me cringe a bit.  It's probably just because it has to
> add to the already overly complicated offline path.  Guaranteeing
> strict offline ordering might be a good idea but at least for the
> immediate bug fix, I agree that the memcg specific fix seems better
> suited.  Let's apply that one and reconsider this one if it turns out
> we do need strict offline reordering.

Yes, I agree completely - thanks.

Hugh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] cgroup: bring back kill_cnt to order css destruction
@ 2014-02-13  0:38           ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2014-02-13  0:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hugh Dickins, Michal Hocko, Johannes Weiner,
	Filipe Brandenburger, Li Zefan, Andrew Morton, Greg Thelen,
	Michel Lespinasse, Markus Blank-Burian, Shawn Bohrer, cgroups,
	linux-kernel, linux-mm

On Wed, 12 Feb 2014, Tejun Heo wrote:
> 
> Not that your implementation is bad or anything but the patch itself
> somehow makes me cringe a bit.  It's probably just because it has to
> add to the already overly complicated offline path.  Guaranteeing
> strict offline ordering might be a good idea but at least for the
> immediate bug fix, I agree that the memcg specific fix seems better
> suited.  Let's apply that one and reconsider this one if it turns out
> we do need strict offline reordering.

Yes, I agree completely - thanks.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2014-02-13  0:39 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-06 23:56 [PATCH] cgroup: use an ordered workqueue for cgroup destruction Hugh Dickins
2014-02-06 23:56 ` Hugh Dickins
2014-02-06 23:56 ` Hugh Dickins
2014-02-07 13:45 ` Michal Hocko
2014-02-07 13:45   ` Michal Hocko
2014-02-07 13:45   ` Michal Hocko
2014-02-07 14:04 ` Tejun Heo
2014-02-07 14:04   ` Tejun Heo
2014-02-07 14:37   ` Michal Hocko
2014-02-07 14:37     ` Michal Hocko
2014-02-07 15:13     ` Tejun Heo
2014-02-07 15:13       ` Tejun Heo
2014-02-07 15:13       ` Tejun Heo
2014-02-07 15:28       ` Michal Hocko
2014-02-07 15:28         ` Michal Hocko
2014-02-07 20:20   ` Hugh Dickins
2014-02-07 20:20     ` Hugh Dickins
2014-02-07 20:20     ` Hugh Dickins
2014-02-07 20:35     ` Tejun Heo
2014-02-07 20:35       ` Tejun Heo
2014-02-07 21:06       ` Hugh Dickins
2014-02-07 21:06         ` Hugh Dickins
2014-02-07 15:21 ` Tejun Heo
2014-02-07 15:21   ` Tejun Heo
2014-02-07 16:43 ` Johannes Weiner
2014-02-07 16:43   ` Johannes Weiner
2014-02-07 16:43   ` Johannes Weiner
2014-02-10 15:46   ` Michal Hocko
2014-02-10 15:46     ` Michal Hocko
2014-02-10 15:46     ` Michal Hocko
2014-02-12 22:59   ` Hugh Dickins
2014-02-12 22:59     ` Hugh Dickins
2014-02-12 22:59     ` Hugh Dickins
2014-02-12 23:06     ` [PATCH 2/2] cgroup: bring back kill_cnt to order css destruction Hugh Dickins
2014-02-12 23:06       ` Hugh Dickins
2014-02-12 23:06       ` Hugh Dickins
2014-02-13  0:28       ` Tejun Heo
2014-02-13  0:28         ` Tejun Heo
2014-02-13  0:38         ` Hugh Dickins
2014-02-13  0:38           ` Hugh Dickins
2014-02-13  0:09     ` [PATCH] Revert "cgroup: use an ordered workqueue for cgroup destruction" Tejun Heo
2014-02-13  0:09       ` Tejun Heo
2014-02-13  0:09       ` Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.