All of lore.kernel.org
 help / color / mirror / Atom feed
* [Regression] 3.19-rc3 : memcg: Hang in mount memcg
@ 2015-01-09 17:43 ` Suzuki K. Poulose
  0 siblings, 0 replies; 40+ messages in thread
From: Suzuki K. Poulose @ 2015-01-09 17:43 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner; +Cc: linux-mm, linux-kernel, Will Deacon

Hi

We have hit a hang on ARM64 defconfig, while running LTP tests on 
3.19-rc3. We are
in the process of a git bisect and will update the results as and
when we find the commit.

During the ksm ltp run, the test hangs trying to mount memcg with the 
following strace
output:

mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR 
(To be restarted)
mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR 
(To be restarted)
[ ... repeated forever ... ]

At this point, one can try mounting the memcg to verify the problem.
# mount -t cgroup -o memory memcg memcg_dir
--hangs--

Strangely, if we run the mount command from a cold boot (i.e. without 
running LTP first),
then it succeeds.

Upon a quick look we are hitting the following code :
kernel/cgroup.c: cgroup_mount() :

1779         for_each_subsys(ss, i) {
1780                 if (!(opts.subsys_mask & (1 << i)) ||
1781                     ss->root == &cgrp_dfl_root)
1782                         continue;
1783
1784                 if 
(!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
1785                         mutex_unlock(&cgroup_mutex);
1786                         msleep(10);
1787                         ret = restart_syscall(); <=====
1788                         goto out_free;
1789                 }
1790                 cgroup_put(&ss->root->cgrp);
1791         }

with ss->root->cgrp.self.refct.percpu_count_ptr == __PERCPU_REF_ATOMIC_DEAD

Any ideas?

Thanks
Suzuki


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Regression] 3.19-rc3 : memcg: Hang in mount memcg
@ 2015-01-09 17:43 ` Suzuki K. Poulose
  0 siblings, 0 replies; 40+ messages in thread
From: Suzuki K. Poulose @ 2015-01-09 17:43 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner; +Cc: linux-mm, linux-kernel, Will Deacon

Hi

We have hit a hang on ARM64 defconfig, while running LTP tests on 
3.19-rc3. We are
in the process of a git bisect and will update the results as and
when we find the commit.

During the ksm ltp run, the test hangs trying to mount memcg with the 
following strace
output:

mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR 
(To be restarted)
mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR 
(To be restarted)
[ ... repeated forever ... ]

At this point, one can try mounting the memcg to verify the problem.
# mount -t cgroup -o memory memcg memcg_dir
--hangs--

Strangely, if we run the mount command from a cold boot (i.e. without 
running LTP first),
then it succeeds.

Upon a quick look we are hitting the following code :
kernel/cgroup.c: cgroup_mount() :

1779         for_each_subsys(ss, i) {
1780                 if (!(opts.subsys_mask & (1 << i)) ||
1781                     ss->root == &cgrp_dfl_root)
1782                         continue;
1783
1784                 if 
(!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
1785                         mutex_unlock(&cgroup_mutex);
1786                         msleep(10);
1787                         ret = restart_syscall(); <=====
1788                         goto out_free;
1789                 }
1790                 cgroup_put(&ss->root->cgrp);
1791         }

with ss->root->cgrp.self.refct.percpu_count_ptr == __PERCPU_REF_ATOMIC_DEAD

Any ideas?

Thanks
Suzuki

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
  2015-01-09 17:43 ` Suzuki K. Poulose
@ 2015-01-09 21:46   ` Tejun Heo
  -1 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2015-01-09 21:46 UTC (permalink / raw)
  To: Suzuki K. Poulose; +Cc: Johannes Weiner, linux-mm, linux-kernel, Will Deacon

On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote:
> We have hit a hang on ARM64 defconfig, while running LTP tests on 3.19-rc3.
> We are
> in the process of a git bisect and will update the results as and
> when we find the commit.
> 
> During the ksm ltp run, the test hangs trying to mount memcg with the
> following strace
> output:
> 
> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To
> be restarted)
> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To
> be restarted)
> [ ... repeated forever ... ]
> 
> At this point, one can try mounting the memcg to verify the problem.
> # mount -t cgroup -o memory memcg memcg_dir
> --hangs--
> 
> Strangely, if we run the mount command from a cold boot (i.e. without
> running LTP first),
> then it succeeds.

I don't know what LTP is doing and this could actually be hitting on
an actual bug but if it's trying to move memcg back from unified
hierarchy to an old one, that might hang - it should prolly made to
just fail at that point.  Anyways, any chance you can find out what
happened, in terms of cgroup mounting, to memcg upto that point?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
@ 2015-01-09 21:46   ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2015-01-09 21:46 UTC (permalink / raw)
  To: Suzuki K. Poulose; +Cc: Johannes Weiner, linux-mm, linux-kernel, Will Deacon

On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote:
> We have hit a hang on ARM64 defconfig, while running LTP tests on 3.19-rc3.
> We are
> in the process of a git bisect and will update the results as and
> when we find the commit.
> 
> During the ksm ltp run, the test hangs trying to mount memcg with the
> following strace
> output:
> 
> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To
> be restarted)
> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To
> be restarted)
> [ ... repeated forever ... ]
> 
> At this point, one can try mounting the memcg to verify the problem.
> # mount -t cgroup -o memory memcg memcg_dir
> --hangs--
> 
> Strangely, if we run the mount command from a cold boot (i.e. without
> running LTP first),
> then it succeeds.

I don't know what LTP is doing and this could actually be hitting on
an actual bug but if it's trying to move memcg back from unified
hierarchy to an old one, that might hang - it should prolly made to
just fail at that point.  Anyways, any chance you can find out what
happened, in terms of cgroup mounting, to memcg upto that point?

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
  2015-01-09 17:43 ` Suzuki K. Poulose
@ 2015-01-10  8:55   ` Vladimir Davydov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2015-01-10  8:55 UTC (permalink / raw)
  To: Suzuki K. Poulose
  Cc: Tejun Heo, Johannes Weiner, linux-mm, linux-kernel, Will Deacon

On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote:
> Hi
> 
> We have hit a hang on ARM64 defconfig, while running LTP tests on
> 3.19-rc3. We are
> in the process of a git bisect and will update the results as and
> when we find the commit.
> 
> During the ksm ltp run, the test hangs trying to mount memcg with
> the following strace
> output:
> 
> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ?
> ERESTARTNOINTR (To be restarted)
> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ?
> ERESTARTNOINTR (To be restarted)
> [ ... repeated forever ... ]
> 
> At this point, one can try mounting the memcg to verify the problem.
> # mount -t cgroup -o memory memcg memcg_dir
> --hangs--
> 
> Strangely, if we run the mount command from a cold boot (i.e.
> without running LTP first),
> then it succeeds.
> 
> Upon a quick look we are hitting the following code :
> kernel/cgroup.c: cgroup_mount() :
> 
> 1779         for_each_subsys(ss, i) {
> 1780                 if (!(opts.subsys_mask & (1 << i)) ||
> 1781                     ss->root == &cgrp_dfl_root)
> 1782                         continue;
> 1783
> 1784                 if
> (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
> 1785                         mutex_unlock(&cgroup_mutex);
> 1786                         msleep(10);
> 1787                         ret = restart_syscall(); <=====
> 1788                         goto out_free;
> 1789                 }
> 1790                 cgroup_put(&ss->root->cgrp);
> 1791         }
> 
> with ss->root->cgrp.self.refct.percpu_count_ptr == __PERCPU_REF_ATOMIC_DEAD
> 
> Any ideas?

The problem is that the memory cgroup controller takes a css reference
per each charged page and does not reparent charged pages on css
offline, while cgroup_mount/cgroup_kill_sb expect all css references to
offline cgroups to be gone soon, restarting the syscall if the ref count
!= 0. As a result, if you create a memory cgroup, charge some page cache
to it, and then remove it, unmount/mount will hang forever.

May be, we should kill the ref counter to the memory controller root in
cgroup_kill_sb only if there is no children at all, neither online nor
offline.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
@ 2015-01-10  8:55   ` Vladimir Davydov
  0 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2015-01-10  8:55 UTC (permalink / raw)
  To: Suzuki K. Poulose
  Cc: Tejun Heo, Johannes Weiner, linux-mm, linux-kernel, Will Deacon

On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote:
> Hi
> 
> We have hit a hang on ARM64 defconfig, while running LTP tests on
> 3.19-rc3. We are
> in the process of a git bisect and will update the results as and
> when we find the commit.
> 
> During the ksm ltp run, the test hangs trying to mount memcg with
> the following strace
> output:
> 
> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ?
> ERESTARTNOINTR (To be restarted)
> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ?
> ERESTARTNOINTR (To be restarted)
> [ ... repeated forever ... ]
> 
> At this point, one can try mounting the memcg to verify the problem.
> # mount -t cgroup -o memory memcg memcg_dir
> --hangs--
> 
> Strangely, if we run the mount command from a cold boot (i.e.
> without running LTP first),
> then it succeeds.
> 
> Upon a quick look we are hitting the following code :
> kernel/cgroup.c: cgroup_mount() :
> 
> 1779         for_each_subsys(ss, i) {
> 1780                 if (!(opts.subsys_mask & (1 << i)) ||
> 1781                     ss->root == &cgrp_dfl_root)
> 1782                         continue;
> 1783
> 1784                 if
> (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
> 1785                         mutex_unlock(&cgroup_mutex);
> 1786                         msleep(10);
> 1787                         ret = restart_syscall(); <=====
> 1788                         goto out_free;
> 1789                 }
> 1790                 cgroup_put(&ss->root->cgrp);
> 1791         }
> 
> with ss->root->cgrp.self.refct.percpu_count_ptr == __PERCPU_REF_ATOMIC_DEAD
> 
> Any ideas?

The problem is that the memory cgroup controller takes a css reference
per each charged page and does not reparent charged pages on css
offline, while cgroup_mount/cgroup_kill_sb expect all css references to
offline cgroups to be gone soon, restarting the syscall if the ref count
!= 0. As a result, if you create a memory cgroup, charge some page cache
to it, and then remove it, unmount/mount will hang forever.

May be, we should kill the ref counter to the memory controller root in
cgroup_kill_sb only if there is no children at all, neither online nor
offline.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
  2015-01-10  8:55   ` Vladimir Davydov
@ 2015-01-10 21:43     ` Tejun Heo
  -1 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2015-01-10 21:43 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Suzuki K. Poulose, Johannes Weiner, linux-mm, linux-kernel, Will Deacon

Currently, if a hierarchy doesn't have any live children when it's
unmounted, the hierarchy starts dying by killing its refcnt.  The
expectation is that even if there are lingering dead children which
are lingering due to remaining references, they'll be put in a finite
amount of time.  When the children are finally released, the hierarchy
is destroyed and all controllers bound to it also are released.

However, for memcg, the premise that the lingering refs will be put in
a finite amount time is not true.  In the absense of memory pressure,
dead memcg's may hang around indefinitely pinned by its pages.  This
unfortunately may lead to indefinite hang on the next mount attempt
involving memcg as the mount logic waits for it to get released.

While we can change hierarchy destruction logic such that a hierarchy
is only destroyed when it's not mounted anywhere and all its children,
live or dead, are gone, this makes whether the hierarchy gets
destroyed or not to be determined by factors opaque to userland.
Userland may or may not get a new hierarchy on the next mount attempt.
Worse, if it explicitly wants to create a new hierarchy with different
options or controller compositions involving memcg, it will fail in an
essentially arbitrary manner.

We want to guarantee that a hierarchy is destroyed once the
conditions, unmounted and no visible children, are met.  To aid it,
this patch introduces a new callback cgroup_subsys->unbind() which is
invoked right before the hierarchy a subsystem is bound to starts
dying.  memcg can implement this callback and initiate draining of
remaining refs so that the hierarchy can eventually be released in a
finite amount of time.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
---
Hello,

> May be, we should kill the ref counter to the memory controller root in
> cgroup_kill_sb only if there is no children at all, neither online nor
> offline.

Ah, thanks for the analysis, but I really wanna avoid making hierarchy
destruction conditions opaque to userland.  This is userland visible
behavior.  It shouldn't be determined by kernel internals invisible
outside.  This patch adds ss->unbind() which memcg can hook into to
kick off draining of residual refs.  If this would work, I'll add this
patch to cgroup/for-3.19-fixes, possibly with stable cc'd.

Thanks.

 Documentation/cgroups/cgroups.txt |   12 +++++++++++-
 include/linux/cgroup.h            |    1 +
 kernel/cgroup.c                   |   14 ++++++++++++--
 3 files changed, 24 insertions(+), 3 deletions(-)

--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -637,7 +637,7 @@ void exit(struct task_struct *task)
 
 Called during task exit.
 
-void bind(struct cgroup *root)
+void bind(struct cgroup_subsys_state *root_css)
 (cgroup_mutex held by caller)
 
 Called when a cgroup subsystem is rebound to a different hierarchy
@@ -645,6 +645,16 @@ and root cgroup. Currently this will onl
 the default hierarchy (which never has sub-cgroups) and a hierarchy
 that is being created/destroyed (and hence has no sub-cgroups).
 
+void unbind(struct cgroup_subsys_state *root_css)
+(cgroup_mutex held by caller)
+
+Called when a cgroup subsys is unbound from its current hierarchy. The
+subsystem must guarantee that all offline cgroups are going to be
+released in a finite amount of time after this function is called.
+Such draining can be asynchronous. The following binding of the
+subsystem to a hierarchy will be delayed till the draining is
+complete.
+
 4. Extended attribute usage
 ===========================
 
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -654,6 +654,7 @@ struct cgroup_subsys {
 		     struct cgroup_subsys_state *old_css,
 		     struct task_struct *task);
 	void (*bind)(struct cgroup_subsys_state *root_css);
+	void (*unbind)(struct cgroup_subsys_state *root_css);
 
 	int disabled;
 	int early_init;
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1910,10 +1910,20 @@ static void cgroup_kill_sb(struct super_
 	 * And don't kill the default root.
 	 */
 	if (css_has_online_children(&root->cgrp.self) ||
-	    root == &cgrp_dfl_root)
+	    root == &cgrp_dfl_root) {
 		cgroup_put(&root->cgrp);
-	else
+	} else {
+		struct cgroup_subsys *ss;
+		int i;
+
+		mutex_lock(&cgroup_mutex);
+		for_each_subsys(ss, i)
+			if ((root->subsys_mask & (1UL << i)) && ss->unbind)
+				ss->unbind(cgroup_css(&root->cgrp, ss));
+		mutex_unlock(&cgroup_mutex);
+
 		percpu_ref_kill(&root->cgrp.self.refcnt);
+	}
 
 	kernfs_kill_sb(sb);
 }

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
@ 2015-01-10 21:43     ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2015-01-10 21:43 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Suzuki K. Poulose, Johannes Weiner, linux-mm, linux-kernel, Will Deacon

Currently, if a hierarchy doesn't have any live children when it's
unmounted, the hierarchy starts dying by killing its refcnt.  The
expectation is that even if there are lingering dead children which
are lingering due to remaining references, they'll be put in a finite
amount of time.  When the children are finally released, the hierarchy
is destroyed and all controllers bound to it also are released.

However, for memcg, the premise that the lingering refs will be put in
a finite amount time is not true.  In the absense of memory pressure,
dead memcg's may hang around indefinitely pinned by its pages.  This
unfortunately may lead to indefinite hang on the next mount attempt
involving memcg as the mount logic waits for it to get released.

While we can change hierarchy destruction logic such that a hierarchy
is only destroyed when it's not mounted anywhere and all its children,
live or dead, are gone, this makes whether the hierarchy gets
destroyed or not to be determined by factors opaque to userland.
Userland may or may not get a new hierarchy on the next mount attempt.
Worse, if it explicitly wants to create a new hierarchy with different
options or controller compositions involving memcg, it will fail in an
essentially arbitrary manner.

We want to guarantee that a hierarchy is destroyed once the
conditions, unmounted and no visible children, are met.  To aid it,
this patch introduces a new callback cgroup_subsys->unbind() which is
invoked right before the hierarchy a subsystem is bound to starts
dying.  memcg can implement this callback and initiate draining of
remaining refs so that the hierarchy can eventually be released in a
finite amount of time.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
---
Hello,

> May be, we should kill the ref counter to the memory controller root in
> cgroup_kill_sb only if there is no children at all, neither online nor
> offline.

Ah, thanks for the analysis, but I really wanna avoid making hierarchy
destruction conditions opaque to userland.  This is userland visible
behavior.  It shouldn't be determined by kernel internals invisible
outside.  This patch adds ss->unbind() which memcg can hook into to
kick off draining of residual refs.  If this would work, I'll add this
patch to cgroup/for-3.19-fixes, possibly with stable cc'd.

Thanks.

 Documentation/cgroups/cgroups.txt |   12 +++++++++++-
 include/linux/cgroup.h            |    1 +
 kernel/cgroup.c                   |   14 ++++++++++++--
 3 files changed, 24 insertions(+), 3 deletions(-)

--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -637,7 +637,7 @@ void exit(struct task_struct *task)
 
 Called during task exit.
 
-void bind(struct cgroup *root)
+void bind(struct cgroup_subsys_state *root_css)
 (cgroup_mutex held by caller)
 
 Called when a cgroup subsystem is rebound to a different hierarchy
@@ -645,6 +645,16 @@ and root cgroup. Currently this will onl
 the default hierarchy (which never has sub-cgroups) and a hierarchy
 that is being created/destroyed (and hence has no sub-cgroups).
 
+void unbind(struct cgroup_subsys_state *root_css)
+(cgroup_mutex held by caller)
+
+Called when a cgroup subsys is unbound from its current hierarchy. The
+subsystem must guarantee that all offline cgroups are going to be
+released in a finite amount of time after this function is called.
+Such draining can be asynchronous. The following binding of the
+subsystem to a hierarchy will be delayed till the draining is
+complete.
+
 4. Extended attribute usage
 ===========================
 
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -654,6 +654,7 @@ struct cgroup_subsys {
 		     struct cgroup_subsys_state *old_css,
 		     struct task_struct *task);
 	void (*bind)(struct cgroup_subsys_state *root_css);
+	void (*unbind)(struct cgroup_subsys_state *root_css);
 
 	int disabled;
 	int early_init;
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1910,10 +1910,20 @@ static void cgroup_kill_sb(struct super_
 	 * And don't kill the default root.
 	 */
 	if (css_has_online_children(&root->cgrp.self) ||
-	    root == &cgrp_dfl_root)
+	    root == &cgrp_dfl_root) {
 		cgroup_put(&root->cgrp);
-	else
+	} else {
+		struct cgroup_subsys *ss;
+		int i;
+
+		mutex_lock(&cgroup_mutex);
+		for_each_subsys(ss, i)
+			if ((root->subsys_mask & (1UL << i)) && ss->unbind)
+				ss->unbind(cgroup_css(&root->cgrp, ss));
+		mutex_unlock(&cgroup_mutex);
+
 		percpu_ref_kill(&root->cgrp.self.refcnt);
+	}
 
 	kernfs_kill_sb(sb);
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
  2015-01-10 21:43     ` Tejun Heo
@ 2015-01-11 20:55       ` Johannes Weiner
  -1 siblings, 0 replies; 40+ messages in thread
From: Johannes Weiner @ 2015-01-11 20:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vladimir Davydov, Suzuki K. Poulose, linux-mm, linux-kernel, Will Deacon

On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote:
> Currently, if a hierarchy doesn't have any live children when it's
> unmounted, the hierarchy starts dying by killing its refcnt.  The
> expectation is that even if there are lingering dead children which
> are lingering due to remaining references, they'll be put in a finite
> amount of time.  When the children are finally released, the hierarchy
> is destroyed and all controllers bound to it also are released.
> 
> However, for memcg, the premise that the lingering refs will be put in
> a finite amount time is not true.  In the absense of memory pressure,
> dead memcg's may hang around indefinitely pinned by its pages.  This
> unfortunately may lead to indefinite hang on the next mount attempt
> involving memcg as the mount logic waits for it to get released.
> 
> While we can change hierarchy destruction logic such that a hierarchy
> is only destroyed when it's not mounted anywhere and all its children,
> live or dead, are gone, this makes whether the hierarchy gets
> destroyed or not to be determined by factors opaque to userland.
> Userland may or may not get a new hierarchy on the next mount attempt.
> Worse, if it explicitly wants to create a new hierarchy with different
> options or controller compositions involving memcg, it will fail in an
> essentially arbitrary manner.
> 
> We want to guarantee that a hierarchy is destroyed once the
> conditions, unmounted and no visible children, are met.  To aid it,
> this patch introduces a new callback cgroup_subsys->unbind() which is
> invoked right before the hierarchy a subsystem is bound to starts
> dying.  memcg can implement this callback and initiate draining of
> remaining refs so that the hierarchy can eventually be released in a
> finite amount of time.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Vladimir Davydov <vdavydov@parallels.com>
> ---
> Hello,
> 
> > May be, we should kill the ref counter to the memory controller root in
> > cgroup_kill_sb only if there is no children at all, neither online nor
> > offline.
> 
> Ah, thanks for the analysis, but I really wanna avoid making hierarchy
> destruction conditions opaque to userland.  This is userland visible
> behavior.  It shouldn't be determined by kernel internals invisible
> outside.  This patch adds ss->unbind() which memcg can hook into to
> kick off draining of residual refs.  If this would work, I'll add this
> patch to cgroup/for-3.19-fixes, possibly with stable cc'd.

How about this ->unbind() for memcg?

>From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Sun, 11 Jan 2015 10:29:05 -0500
Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during
 unbind

Cgroup core assumes that any outstanding css references after
offlining are temporary in nature, and e.g. mount waits for them to
disappear and release the root cgroup.  But leftover page cache and
swapout records in an offlined memcg are only dropped when the pages
get reclaimed under pressure or the swapped out pages get faulted in
from other cgroups, and so those cgroup operations can hang forever.

Implement the ->unbind() callback to actively get rid of outstanding
references when cgroup core wants them gone.  Swap out records are
deleted, such that the swap-in path will charge those pages to the
faulting task.  Page cache pages are moved to the root memory cgroup.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/swap_cgroup.h |   6 +++
 mm/memcontrol.c             | 126 ++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_cgroup.c            |  38 +++++++++++++
 3 files changed, 170 insertions(+)

diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index 145306bdc92f..ffe0866d2997 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -9,6 +9,7 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 					unsigned short old, unsigned short new);
 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
+extern unsigned long swap_cgroup_zap_records(unsigned short id);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
 
@@ -26,6 +27,11 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
 	return 0;
 }
 
+static inline unsigned long swap_cgroup_zap_records(unsigned short id)
+{
+	return 0;
+}
+
 static inline int
 swap_cgroup_swapon(int type, unsigned long max_pages)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 692e96407627..40c426add613 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5197,6 +5197,131 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
 		mem_cgroup_from_css(root_css)->use_hierarchy = true;
 }
 
+static void unbind_lru_list(struct mem_cgroup *memcg,
+			    struct zone *zone, enum lru_list lru)
+{
+	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	struct list_head *list = &lruvec->lists[lru];
+
+	while (!list_empty(list)) {
+		unsigned int nr_pages;
+		unsigned long flags;
+		struct page *page;
+
+		spin_lock_irqsave(&zone->lru_lock, flags);
+		if (list_empty(list)) {
+			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			break;
+		}
+		page = list_last_entry(list, struct page, lru);
+		if (!get_page_unless_zero(page)) {
+			list_move(&page->lru, list);
+			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			continue;
+		}
+		BUG_ON(!PageLRU(page));
+		ClearPageLRU(page);
+		del_page_from_lru_list(page, lruvec, lru);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
+
+		compound_lock(page);
+		nr_pages = hpage_nr_pages(page);
+
+		if (!mem_cgroup_move_account(page, nr_pages,
+					     memcg, root_mem_cgroup)) {
+			/*
+			 * root_mem_cgroup page counters are not used,
+			 * otherwise we'd have to charge them here.
+			 */
+			page_counter_uncharge(&memcg->memory, nr_pages);
+			if (do_swap_account)
+				page_counter_uncharge(&memcg->memsw, nr_pages);
+			css_put_many(&memcg->css, nr_pages);
+		}
+
+		compound_unlock(page);
+
+		putback_lru_page(page);
+	}
+}
+
+static void unbind_work_fn(struct work_struct *work)
+{
+	struct cgroup_subsys_state *css;
+retry:
+	drain_all_stock(root_mem_cgroup);
+
+	rcu_read_lock();
+	css_for_each_child(css, &root_mem_cgroup->css) {
+		struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+		/* Drop references from swap-out records */
+		if (do_swap_account) {
+			long zapped;
+
+			zapped = swap_cgroup_zap_records(memcg->css.id);
+			page_counter_uncharge(&memcg->memsw, zapped);
+			css_put_many(&memcg->css, zapped);
+		}
+
+		/* Drop references from leftover LRU pages */
+		css_get(css);
+		rcu_read_unlock();
+		atomic_inc(&memcg->moving_account);
+		synchronize_rcu();
+		while (page_counter_read(&memcg->memory) -
+		       page_counter_read(&memcg->kmem) > 0) {
+			struct zone *zone;
+			enum lru_list lru;
+
+			lru_add_drain_all();
+
+			for_each_zone(zone)
+				for_each_lru(lru)
+					unbind_lru_list(memcg, zone, lru);
+
+			cond_resched();
+		}
+		atomic_dec(&memcg->moving_account);
+		rcu_read_lock();
+		css_put(css);
+	}
+	rcu_read_unlock();
+	/*
+	 * Swap-in is racy:
+	 *
+	 * #0                        #1
+	 *                           lookup_swap_cgroup_id()
+	 *                           rcu_read_lock()
+	 *                           mem_cgroup_lookup()
+	 *                           css_tryget_online()
+	 *                           rcu_read_unlock()
+	 * cgroup_kill_sb()
+	 *   !css_has_online_children()
+	 *     ->unbind()
+	 *                           page_counter_try_charge()
+	 *                           css_put()
+	 *                             css_free()
+	 *                           pc->mem_cgroup = dead memcg
+	 *                           add page to lru
+	 *
+	 * Loop until until all references established from previously
+	 * existing swap-out records have been transferred to pages on
+	 * the LRU and then uncharged from there.
+	 */
+	if (!list_empty(&root_mem_cgroup->css.children)) {
+		msleep(10);
+		goto retry;
+	}
+}
+
+static DECLARE_WORK(unbind_work, unbind_work_fn);
+
+static void mem_cgroup_unbind(struct cgroup_subsys_state *root_css)
+{
+	schedule_work(&unbind_work);
+}
+
 static u64 memory_current_read(struct cgroup_subsys_state *css,
 			       struct cftype *cft)
 {
@@ -5360,6 +5485,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.attach = mem_cgroup_move_task,
 	.bind = mem_cgroup_bind,
+	.unbind = mem_cgroup_unbind,
 	.dfl_cftypes = memory_files,
 	.legacy_cftypes = mem_cgroup_legacy_files,
 	.early_init = 0,
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index b5f7f24b8dd1..665923a558c4 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -140,6 +140,44 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
 	return lookup_swap_cgroup(ent, NULL)->id;
 }
 
+/**
+ * swap_cgroup_zap_records - delete all swapout records of one cgroup
+ * @id: memcg id
+ *
+ * Returns the number of deleted records.
+ */
+unsigned long swap_cgroup_zap_records(unsigned short id)
+{
+	unsigned long zapped = 0;
+	unsigned int type;
+
+	for (type = 0; type < MAX_SWAPFILES; type++) {
+		struct swap_cgroup_ctrl *ctrl;
+		unsigned long flags;
+		unsigned int page;
+
+		ctrl = &swap_cgroup_ctrl[type];
+		spin_lock_irqsave(&ctrl->lock, flags);
+		for (page = 0; page < ctrl->length; page++) {
+			struct swap_cgroup *base;
+			pgoff_t offset;
+
+			base = page_address(ctrl->map[page]);
+			for (offset = 0; offset < SC_PER_PAGE; offset++) {
+				struct swap_cgroup *sc;
+
+				sc = base + offset;
+				if (sc->id == id) {
+					sc->id = 0;
+					zapped++;
+				}
+			}
+		}
+		spin_unlock_irqrestore(&ctrl->lock, flags);
+	}
+	return zapped;
+}
+
 int swap_cgroup_swapon(int type, unsigned long max_pages)
 {
 	void *array;
-- 
2.2.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
@ 2015-01-11 20:55       ` Johannes Weiner
  0 siblings, 0 replies; 40+ messages in thread
From: Johannes Weiner @ 2015-01-11 20:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vladimir Davydov, Suzuki K. Poulose, linux-mm, linux-kernel, Will Deacon

On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote:
> Currently, if a hierarchy doesn't have any live children when it's
> unmounted, the hierarchy starts dying by killing its refcnt.  The
> expectation is that even if there are lingering dead children which
> are lingering due to remaining references, they'll be put in a finite
> amount of time.  When the children are finally released, the hierarchy
> is destroyed and all controllers bound to it also are released.
> 
> However, for memcg, the premise that the lingering refs will be put in
> a finite amount time is not true.  In the absense of memory pressure,
> dead memcg's may hang around indefinitely pinned by its pages.  This
> unfortunately may lead to indefinite hang on the next mount attempt
> involving memcg as the mount logic waits for it to get released.
> 
> While we can change hierarchy destruction logic such that a hierarchy
> is only destroyed when it's not mounted anywhere and all its children,
> live or dead, are gone, this makes whether the hierarchy gets
> destroyed or not to be determined by factors opaque to userland.
> Userland may or may not get a new hierarchy on the next mount attempt.
> Worse, if it explicitly wants to create a new hierarchy with different
> options or controller compositions involving memcg, it will fail in an
> essentially arbitrary manner.
> 
> We want to guarantee that a hierarchy is destroyed once the
> conditions, unmounted and no visible children, are met.  To aid it,
> this patch introduces a new callback cgroup_subsys->unbind() which is
> invoked right before the hierarchy a subsystem is bound to starts
> dying.  memcg can implement this callback and initiate draining of
> remaining refs so that the hierarchy can eventually be released in a
> finite amount of time.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Vladimir Davydov <vdavydov@parallels.com>
> ---
> Hello,
> 
> > May be, we should kill the ref counter to the memory controller root in
> > cgroup_kill_sb only if there is no children at all, neither online nor
> > offline.
> 
> Ah, thanks for the analysis, but I really wanna avoid making hierarchy
> destruction conditions opaque to userland.  This is userland visible
> behavior.  It shouldn't be determined by kernel internals invisible
> outside.  This patch adds ss->unbind() which memcg can hook into to
> kick off draining of residual refs.  If this would work, I'll add this
> patch to cgroup/for-3.19-fixes, possibly with stable cc'd.

How about this ->unbind() for memcg?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
  2015-01-11 20:55       ` Johannes Weiner
@ 2015-01-12  8:01         ` Vladimir Davydov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2015-01-12  8:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Suzuki K. Poulose, linux-mm, linux-kernel, Will Deacon

On Sun, Jan 11, 2015 at 03:55:43PM -0500, Johannes Weiner wrote:
> On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote:
> > > May be, we should kill the ref counter to the memory controller root in
> > > cgroup_kill_sb only if there is no children at all, neither online nor
> > > offline.
> > 
> > Ah, thanks for the analysis, but I really wanna avoid making hierarchy
> > destruction conditions opaque to userland.  This is userland visible
> > behavior.  It shouldn't be determined by kernel internals invisible
> > outside.  This patch adds ss->unbind() which memcg can hook into to
> > kick off draining of residual refs.  If this would work, I'll add this
> > patch to cgroup/for-3.19-fixes, possibly with stable cc'd.
> 
> How about this ->unbind() for memcg?
> 
> From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Sun, 11 Jan 2015 10:29:05 -0500
> Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during
>  unbind
> 
> Cgroup core assumes that any outstanding css references after
> offlining are temporary in nature, and e.g. mount waits for them to
> disappear and release the root cgroup.  But leftover page cache and
> swapout records in an offlined memcg are only dropped when the pages
> get reclaimed under pressure or the swapped out pages get faulted in
> from other cgroups, and so those cgroup operations can hang forever.
> 
> Implement the ->unbind() callback to actively get rid of outstanding
> references when cgroup core wants them gone.  Swap out records are
> deleted, such that the swap-in path will charge those pages to the
> faulting task.  Page cache pages are moved to the root memory cgroup.

... and kmem pages are ignored. I reckon we could reparent them (I
submitted the patch set some time ago), but that's going to be tricky
and will complicate regular kmem charge/uncharge paths, as well as
list_lru_add/del. I don't think we can put up with it, provided we only
want reparenting on unmount, do we not?

Come to think of it, I wonder how many users actually want to mount
different controllers subset after unmount. Because we could allow
mounting the same subset perfectly well, even if it includes memcg. BTW,
AFAIU in the unified hierarchy we won't have this problem at all,
because by definition it mounts all controllers IIRC, so do we need to
bother fixing this in such a complicated manner at all for the setup
that's going to be deprecated anyway?

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
@ 2015-01-12  8:01         ` Vladimir Davydov
  0 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2015-01-12  8:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Suzuki K. Poulose, linux-mm, linux-kernel, Will Deacon

On Sun, Jan 11, 2015 at 03:55:43PM -0500, Johannes Weiner wrote:
> On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote:
> > > May be, we should kill the ref counter to the memory controller root in
> > > cgroup_kill_sb only if there is no children at all, neither online nor
> > > offline.
> > 
> > Ah, thanks for the analysis, but I really wanna avoid making hierarchy
> > destruction conditions opaque to userland.  This is userland visible
> > behavior.  It shouldn't be determined by kernel internals invisible
> > outside.  This patch adds ss->unbind() which memcg can hook into to
> > kick off draining of residual refs.  If this would work, I'll add this
> > patch to cgroup/for-3.19-fixes, possibly with stable cc'd.
> 
> How about this ->unbind() for memcg?
> 
> From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Sun, 11 Jan 2015 10:29:05 -0500
> Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during
>  unbind
> 
> Cgroup core assumes that any outstanding css references after
> offlining are temporary in nature, and e.g. mount waits for them to
> disappear and release the root cgroup.  But leftover page cache and
> swapout records in an offlined memcg are only dropped when the pages
> get reclaimed under pressure or the swapped out pages get faulted in
> from other cgroups, and so those cgroup operations can hang forever.
> 
> Implement the ->unbind() callback to actively get rid of outstanding
> references when cgroup core wants them gone.  Swap out records are
> deleted, such that the swap-in path will charge those pages to the
> faulting task.  Page cache pages are moved to the root memory cgroup.

... and kmem pages are ignored. I reckon we could reparent them (I
submitted the patch set some time ago), but that's going to be tricky
and will complicate regular kmem charge/uncharge paths, as well as
list_lru_add/del. I don't think we can put up with it, provided we only
want reparenting on unmount, do we not?

Come to think of it, I wonder how many users actually want to mount
different controllers subset after unmount. Because we could allow
mounting the same subset perfectly well, even if it includes memcg. BTW,
AFAIU in the unified hierarchy we won't have this problem at all,
because by definition it mounts all controllers IIRC, so do we need to
bother fixing this in such a complicated manner at all for the setup
that's going to be deprecated anyway?

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
  2015-01-12  8:01         ` Vladimir Davydov
@ 2015-01-12 11:28           ` Tejun Heo
  -1 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2015-01-12 11:28 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Suzuki K. Poulose, linux-mm, linux-kernel, Will Deacon

Hello, Vladimir.

On Mon, Jan 12, 2015 at 11:01:14AM +0300, Vladimir Davydov wrote:
> Come to think of it, I wonder how many users actually want to mount
> different controllers subset after unmount. Because we could allow

It wouldn't be a common use case but, on the face of it, we still
support it.  If we collecctively decide that once a sub cgroup is
created for any controller no further hierarchy configuration for that
controller is allowed, that'd work too, but one way or the other, the
behavior, I believe, should be well-defined.  As it currently stands,
the conditions and failure mode are opaque to userland, which is never
a good thing.

> mounting the same subset perfectly well, even if it includes memcg. BTW,
> AFAIU in the unified hierarchy we won't have this problem at all,
> because by definition it mounts all controllers IIRC, so do we need to
> bother fixing this in such a complicated manner at all for the setup
> that's going to be deprecated anyway?

There will likely be a quite long transition period and if and when
the old things can be removed, this added cleanup logic can go away
with it.  It depends on how complex the implementation would get but
as long as it isn't too much and stays mostly isolated from the saner
paths, I think it's probably the right thing to do.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
@ 2015-01-12 11:28           ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2015-01-12 11:28 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Suzuki K. Poulose, linux-mm, linux-kernel, Will Deacon

Hello, Vladimir.

On Mon, Jan 12, 2015 at 11:01:14AM +0300, Vladimir Davydov wrote:
> Come to think of it, I wonder how many users actually want to mount
> different controllers subset after unmount. Because we could allow

It wouldn't be a common use case but, on the face of it, we still
support it.  If we collecctively decide that once a sub cgroup is
created for any controller no further hierarchy configuration for that
controller is allowed, that'd work too, but one way or the other, the
behavior, I believe, should be well-defined.  As it currently stands,
the conditions and failure mode are opaque to userland, which is never
a good thing.

> mounting the same subset perfectly well, even if it includes memcg. BTW,
> AFAIU in the unified hierarchy we won't have this problem at all,
> because by definition it mounts all controllers IIRC, so do we need to
> bother fixing this in such a complicated manner at all for the setup
> that's going to be deprecated anyway?

There will likely be a quite long transition period and if and when
the old things can be removed, this added cleanup logic can go away
with it.  It depends on how complex the implementation would get but
as long as it isn't too much and stays mostly isolated from the saner
paths, I think it's probably the right thing to do.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
  2015-01-12 11:28           ` Tejun Heo
@ 2015-01-12 12:59             ` Vladimir Davydov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2015-01-12 12:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Suzuki K. Poulose, linux-mm, linux-kernel, Will Deacon

On Mon, Jan 12, 2015 at 06:28:45AM -0500, Tejun Heo wrote:
> On Mon, Jan 12, 2015 at 11:01:14AM +0300, Vladimir Davydov wrote:
> > Come to think of it, I wonder how many users actually want to mount
> > different controllers subset after unmount. Because we could allow
> 
> It wouldn't be a common use case but, on the face of it, we still
> support it.  If we collecctively decide that once a sub cgroup is
> created for any controller no further hierarchy configuration for that
> controller is allowed, that'd work too, but one way or the other, the
> behavior, I believe, should be well-defined.  As it currently stands,
> the conditions and failure mode are opaque to userland, which is never
> a good thing.
> 
> > mounting the same subset perfectly well, even if it includes memcg. BTW,
> > AFAIU in the unified hierarchy we won't have this problem at all,
> > because by definition it mounts all controllers IIRC, so do we need to
> > bother fixing this in such a complicated manner at all for the setup
> > that's going to be deprecated anyway?
> 
> There will likely be a quite long transition period and if and when
> the old things can be removed, this added cleanup logic can go away
> with it.  It depends on how complex the implementation would get but
> as long as it isn't too much and stays mostly isolated from the saner
> paths, I think it's probably the right thing to do.

We can't just move kmem objects from a per-memcg kmem_cache to the
global one fixing page counters, because in contrast to page cache and
swap we don't even track all kmem allocations. So we have to keep all
per-memcg kmem_cache's somewhere after unmount until they can finally be
destroyed, but the whole logic behind per-memcg kmem_cache's destruction
is currently tightly interwoven with that of css's (we destroy
kmem_cache's from css_free), and there won't be any css's after unmount.

That said, it isn't possible to add a couple of isolated functions,
which will live their own lives and can be easily removed once we've
switched to the unified hierarchy. Quite the contrary, implementing of
kmem reparenting would make me rethink and complicate kmemcg code all
over the place. That's why I'm rather reluctant to do it.

I haven't dug deep into the cgroup core, but may be we could detach the
old root in cgroup_kill_sb() and leave it dangling until the last
reference to it has gone?

BTW, IIRC the problem always existed for kmem-active memory cgroups,
because we never had kmem reparenting. May be, we could therefore just
document somewhere that kmem accounting is highly discouraged to be used
in the legacy hierarchy and merge these two patches as is to handle page
cache and swap charges? We won't break anything, because it was always
broken :-)

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
@ 2015-01-12 12:59             ` Vladimir Davydov
  0 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2015-01-12 12:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Suzuki K. Poulose, linux-mm, linux-kernel, Will Deacon

On Mon, Jan 12, 2015 at 06:28:45AM -0500, Tejun Heo wrote:
> On Mon, Jan 12, 2015 at 11:01:14AM +0300, Vladimir Davydov wrote:
> > Come to think of it, I wonder how many users actually want to mount
> > different controllers subset after unmount. Because we could allow
> 
> It wouldn't be a common use case but, on the face of it, we still
> support it.  If we collecctively decide that once a sub cgroup is
> created for any controller no further hierarchy configuration for that
> controller is allowed, that'd work too, but one way or the other, the
> behavior, I believe, should be well-defined.  As it currently stands,
> the conditions and failure mode are opaque to userland, which is never
> a good thing.
> 
> > mounting the same subset perfectly well, even if it includes memcg. BTW,
> > AFAIU in the unified hierarchy we won't have this problem at all,
> > because by definition it mounts all controllers IIRC, so do we need to
> > bother fixing this in such a complicated manner at all for the setup
> > that's going to be deprecated anyway?
> 
> There will likely be a quite long transition period and if and when
> the old things can be removed, this added cleanup logic can go away
> with it.  It depends on how complex the implementation would get but
> as long as it isn't too much and stays mostly isolated from the saner
> paths, I think it's probably the right thing to do.

We can't just move kmem objects from a per-memcg kmem_cache to the
global one fixing page counters, because in contrast to page cache and
swap we don't even track all kmem allocations. So we have to keep all
per-memcg kmem_cache's somewhere after unmount until they can finally be
destroyed, but the whole logic behind per-memcg kmem_cache's destruction
is currently tightly interwoven with that of css's (we destroy
kmem_cache's from css_free), and there won't be any css's after unmount.

That said, it isn't possible to add a couple of isolated functions,
which will live their own lives and can be easily removed once we've
switched to the unified hierarchy. Quite the contrary, implementing of
kmem reparenting would make me rethink and complicate kmemcg code all
over the place. That's why I'm rather reluctant to do it.

I haven't dug deep into the cgroup core, but may be we could detach the
old root in cgroup_kill_sb() and leave it dangling until the last
reference to it has gone?

BTW, IIRC the problem always existed for kmem-active memory cgroups,
because we never had kmem reparenting. May be, we could therefore just
document somewhere that kmem accounting is highly discouraged to be used
in the legacy hierarchy and merge these two patches as is to handle page
cache and swap charges? We won't break anything, because it was always
broken :-)

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
  2015-01-12 12:59             ` Vladimir Davydov
@ 2015-01-12 13:05               ` Tejun Heo
  -1 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2015-01-12 13:05 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Suzuki K. Poulose, linux-mm, linux-kernel, Will Deacon

On Mon, Jan 12, 2015 at 03:59:56PM +0300, Vladimir Davydov wrote:
> I haven't dug deep into the cgroup core, but may be we could detach the
> old root in cgroup_kill_sb() and leave it dangling until the last
> reference to it has gone?

The root isn't the problem here.  Individual controllers are as
there's only one copy of each and we most likely don't want to carry
over child csses from one hierarchy to the next as the controller may
operate under a different set of rules.

> BTW, IIRC the problem always existed for kmem-active memory cgroups,
> because we never had kmem reparenting. May be, we could therefore just
> document somewhere that kmem accounting is highly discouraged to be used
> in the legacy hierarchy and merge these two patches as is to handle page
> cache and swap charges? We won't break anything, because it was always
> broken :-)

If we're going that route, I think it'd be better to declare hierarchy
lifetime rules as essentially opaque to userland and destroy
hierarchies only when all its children, dead or alive, are gone.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
@ 2015-01-12 13:05               ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2015-01-12 13:05 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Suzuki K. Poulose, linux-mm, linux-kernel, Will Deacon

On Mon, Jan 12, 2015 at 03:59:56PM +0300, Vladimir Davydov wrote:
> I haven't dug deep into the cgroup core, but may be we could detach the
> old root in cgroup_kill_sb() and leave it dangling until the last
> reference to it has gone?

The root isn't the problem here.  Individual controllers are as
there's only one copy of each and we most likely don't want to carry
over child csses from one hierarchy to the next as the controller may
operate under a different set of rules.

> BTW, IIRC the problem always existed for kmem-active memory cgroups,
> because we never had kmem reparenting. May be, we could therefore just
> document somewhere that kmem accounting is highly discouraged to be used
> in the legacy hierarchy and merge these two patches as is to handle page
> cache and swap charges? We won't break anything, because it was always
> broken :-)

If we're going that route, I think it'd be better to declare hierarchy
lifetime rules as essentially opaque to userland and destroy
hierarchies only when all its children, dead or alive, are gone.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
  2015-01-09 21:46   ` Tejun Heo
@ 2015-01-12 17:02     ` Suzuki K. Poulose
  -1 siblings, 0 replies; 40+ messages in thread
From: Suzuki K. Poulose @ 2015-01-12 17:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, vdavydov, hannes, Will.Deacon, linux-mm, suzuki.poulose

On Fri, Jan 09, 2015 at 09:46:49PM +0000, Tejun Heo wrote:
> On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote:
> > We have hit a hang on ARM64 defconfig, while running LTP tests on 3.19-rc3.
> > We are
> > in the process of a git bisect and will update the results as and
> > when we find the commit.
> >
> > During the ksm ltp run, the test hangs trying to mount memcg with the
> > following strace
> > output:
> >
> > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To
> > be restarted)
> > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To
> > be restarted)
> > [ ... repeated forever ... ]
> >
> > At this point, one can try mounting the memcg to verify the problem.
> > # mount -t cgroup -o memory memcg memcg_dir
> > --hangs--
> >
> > Strangely, if we run the mount command from a cold boot (i.e. without
> > running LTP first),
> > then it succeeds.
>
> I don't know what LTP is doing and this could actually be hitting on
> an actual bug but if it's trying to move memcg back from unified
> hierarchy to an old one, that might hang - it should prolly made to
> just fail at that point.  Anyways, any chance you can find out what
> happened, in terms of cgroup mounting, to memcg upto that point?
>

This is what the test(ksm03) does, roughly from strace :

faccessat(AT_FDCWD, "/sys/kernel/mm/ksm/", F_OK) = 0
faccessat(AT_FDCWD, "/sys/kernel/mm/ksm/merge_across_nodes", F_OK) = -1 ENOENT (No such file or directory)
mkdirat(AT_FDCWD, "/dev/cgroup", 0777)  = 0
mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = 0

--- set memory limit. Create a new set /dev/cgroups/1 and moves test to that group ---
mkdirat(AT_FDCWD, "/dev/cgroup/1", 0777) = 0
openat(AT_FDCWD, "/dev/cgroup/1/memory.limit_in_bytes", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fstat(3, {st_dev=makedev(0, 24), st_ino=41, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2015/01/12-15:10:13, st_mtime=2015/01/12-15:10:13, st_ctime=2015/01/12-15:10:13}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb2903000
write(3, "1073741824", 10)              = 10
close(3)                                = 0
munmap(0x7fb2903000, 65536)             = 0
getpid()                                = 1324
openat(AT_FDCWD, "/dev/cgroup/1/tasks", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fstat(3, {st_dev=makedev(0, 24), st_ino=37, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2015/01/12-15:10:13, st_mtime=2015/01/12-15:10:13, st_ctime=2015/01/12-15:10:13}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb2903000
write(3, "1324", 4)                     = 4
close(3)                                = 0
munmap(0x7fb2903000, 65536)             = 0

clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb2a7f0d0) = 1325
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb2a7f0d0) = 1326
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb2a7f0d0) = 1327

--- Creates 3 children, perform a lot of memory operations with shared pages
    verify the ksm for activity and wait for children to exit ---

wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WSTOPPED|WCONTINUED, NULL) = 1325
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WSTOPPED|WCONTINUED, NULL) = 1326
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WSTOPPED|WCONTINUED, NULL) = 1327
wait4(-1, 0x7fe5625f3c, WSTOPPED|WCONTINUED, NULL) = -1 ECHILD (No child processes)

--- cleanup: Move tasks under /dev/cgroups/1/ to /dev/cgroups/ and delete subdir, umount cgroup ---

faccessat(AT_FDCWD, "/sys/kernel/mm/ksm/merge_across_nodes", F_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/dev/cgroup/tasks", O_WRONLY) = 205
openat(AT_FDCWD, "/dev/cgroup/1/tasks", O_RDONLY) = 206
fstat(206, {st_dev=makedev(0, 24), st_ino=37, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2015/01/12-15:10:13, st_mtime=2015/01/12-15:10:13, st_ctime=2015/01/12-15:10:13}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb1c53000
read(206, "1324\n", 4096)               = 5
write(205, "1324", 4)                   = 4
read(206, "", 4096)                     = 0
close(205)                              = 0
close(206)                              = 0
munmap(0x7fb1c53000, 65536)             = 0
unlinkat(AT_FDCWD, "/dev/cgroup/1", AT_REMOVEDIR) = 0
umount2("/dev/cgroup", 0)               = 0
unlinkat(AT_FDCWD, "/dev/cgroup", AT_REMOVEDIR) = 0
exit_group(0)                           = ?


The next invocation of the same test fails to mount the cgroup memory.

Thanks
Suzuki

> Thanks.
>
> --
> tejun
>

-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.

ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No:  2557590
ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No:  2548782


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
@ 2015-01-12 17:02     ` Suzuki K. Poulose
  0 siblings, 0 replies; 40+ messages in thread
From: Suzuki K. Poulose @ 2015-01-12 17:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, vdavydov, hannes, Will.Deacon, linux-mm, suzuki.poulose

On Fri, Jan 09, 2015 at 09:46:49PM +0000, Tejun Heo wrote:
> On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote:
> > We have hit a hang on ARM64 defconfig, while running LTP tests on 3.19-rc3.
> > We are
> > in the process of a git bisect and will update the results as and
> > when we find the commit.
> >
> > During the ksm ltp run, the test hangs trying to mount memcg with the
> > following strace
> > output:
> >
> > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To
> > be restarted)
> > mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ? ERESTARTNOINTR (To
> > be restarted)
> > [ ... repeated forever ... ]
> >
> > At this point, one can try mounting the memcg to verify the problem.
> > # mount -t cgroup -o memory memcg memcg_dir
> > --hangs--
> >
> > Strangely, if we run the mount command from a cold boot (i.e. without
> > running LTP first),
> > then it succeeds.
>
> I don't know what LTP is doing and this could actually be hitting on
> an actual bug but if it's trying to move memcg back from unified
> hierarchy to an old one, that might hang - it should prolly made to
> just fail at that point.  Anyways, any chance you can find out what
> happened, in terms of cgroup mounting, to memcg upto that point?
>

This is what the test(ksm03) does, roughly from strace :

faccessat(AT_FDCWD, "/sys/kernel/mm/ksm/", F_OK) = 0
faccessat(AT_FDCWD, "/sys/kernel/mm/ksm/merge_across_nodes", F_OK) = -1 ENOENT (No such file or directory)
mkdirat(AT_FDCWD, "/dev/cgroup", 0777)  = 0
mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = 0

--- set memory limit. Create a new set /dev/cgroups/1 and moves test to that group ---
mkdirat(AT_FDCWD, "/dev/cgroup/1", 0777) = 0
openat(AT_FDCWD, "/dev/cgroup/1/memory.limit_in_bytes", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fstat(3, {st_dev=makedev(0, 24), st_ino=41, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2015/01/12-15:10:13, st_mtime=2015/01/12-15:10:13, st_ctime=2015/01/12-15:10:13}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb2903000
write(3, "1073741824", 10)              = 10
close(3)                                = 0
munmap(0x7fb2903000, 65536)             = 0
getpid()                                = 1324
openat(AT_FDCWD, "/dev/cgroup/1/tasks", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fstat(3, {st_dev=makedev(0, 24), st_ino=37, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2015/01/12-15:10:13, st_mtime=2015/01/12-15:10:13, st_ctime=2015/01/12-15:10:13}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb2903000
write(3, "1324", 4)                     = 4
close(3)                                = 0
munmap(0x7fb2903000, 65536)             = 0

clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb2a7f0d0) = 1325
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb2a7f0d0) = 1326
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb2a7f0d0) = 1327

--- Creates 3 children, perform a lot of memory operations with shared pages
    verify the ksm for activity and wait for children to exit ---

wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WSTOPPED|WCONTINUED, NULL) = 1325
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WSTOPPED|WCONTINUED, NULL) = 1326
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WSTOPPED|WCONTINUED, NULL) = 1327
wait4(-1, 0x7fe5625f3c, WSTOPPED|WCONTINUED, NULL) = -1 ECHILD (No child processes)

--- cleanup: Move tasks under /dev/cgroups/1/ to /dev/cgroups/ and delete subdir, umount cgroup ---

faccessat(AT_FDCWD, "/sys/kernel/mm/ksm/merge_across_nodes", F_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/dev/cgroup/tasks", O_WRONLY) = 205
openat(AT_FDCWD, "/dev/cgroup/1/tasks", O_RDONLY) = 206
fstat(206, {st_dev=makedev(0, 24), st_ino=37, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2015/01/12-15:10:13, st_mtime=2015/01/12-15:10:13, st_ctime=2015/01/12-15:10:13}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb1c53000
read(206, "1324\n", 4096)               = 5
write(205, "1324", 4)                   = 4
read(206, "", 4096)                     = 0
close(205)                              = 0
close(206)                              = 0
munmap(0x7fb1c53000, 65536)             = 0
unlinkat(AT_FDCWD, "/dev/cgroup/1", AT_REMOVEDIR) = 0
umount2("/dev/cgroup", 0)               = 0
unlinkat(AT_FDCWD, "/dev/cgroup", AT_REMOVEDIR) = 0
exit_group(0)                           = ?


The next invocation of the same test fails to mount the cgroup memory.

Thanks
Suzuki

> Thanks.
>
> --
> tejun
>

-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.

ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No:  2557590
ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No:  2548782

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
  2015-01-11 20:55       ` Johannes Weiner
@ 2015-01-14 11:16         ` Suzuki K. Poulose
  -1 siblings, 0 replies; 40+ messages in thread
From: Suzuki K. Poulose @ 2015-01-14 11:16 UTC (permalink / raw)
  To: Johannes Weiner, Tejun Heo
  Cc: Vladimir Davydov, linux-mm, linux-kernel, Will Deacon

On 11/01/15 20:55, Johannes Weiner wrote:
> On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote:
>> Currently, if a hierarchy doesn't have any live children when it's
>> unmounted, the hierarchy starts dying by killing its refcnt.  The
>> expectation is that even if there are lingering dead children which
>> are lingering due to remaining references, they'll be put in a finite
>> amount of time.  When the children are finally released, the hierarchy
>> is destroyed and all controllers bound to it also are released.
>>
>> However, for memcg, the premise that the lingering refs will be put in
>> a finite amount time is not true.  In the absense of memory pressure,
>> dead memcg's may hang around indefinitely pinned by its pages.  This
>> unfortunately may lead to indefinite hang on the next mount attempt
>> involving memcg as the mount logic waits for it to get released.
>>
>> While we can change hierarchy destruction logic such that a hierarchy
>> is only destroyed when it's not mounted anywhere and all its children,
>> live or dead, are gone, this makes whether the hierarchy gets
>> destroyed or not to be determined by factors opaque to userland.
>> Userland may or may not get a new hierarchy on the next mount attempt.
>> Worse, if it explicitly wants to create a new hierarchy with different
>> options or controller compositions involving memcg, it will fail in an
>> essentially arbitrary manner.
>>
>> We want to guarantee that a hierarchy is destroyed once the
>> conditions, unmounted and no visible children, are met.  To aid it,
>> this patch introduces a new callback cgroup_subsys->unbind() which is
>> invoked right before the hierarchy a subsystem is bound to starts
>> dying.  memcg can implement this callback and initiate draining of
>> remaining refs so that the hierarchy can eventually be released in a
>> finite amount of time.
>>
>> Signed-off-by: Tejun Heo <tj@kernel.org>
>> Cc: Li Zefan <lizefan@huawei.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@suse.cz>
>> Cc: Vladimir Davydov <vdavydov@parallels.com>
>> ---
>> Hello,
>>
>>> May be, we should kill the ref counter to the memory controller root in
>>> cgroup_kill_sb only if there is no children at all, neither online nor
>>> offline.
>>
>> Ah, thanks for the analysis, but I really wanna avoid making hierarchy
>> destruction conditions opaque to userland.  This is userland visible
>> behavior.  It shouldn't be determined by kernel internals invisible
>> outside.  This patch adds ss->unbind() which memcg can hook into to
>> kick off draining of residual refs.  If this would work, I'll add this
>> patch to cgroup/for-3.19-fixes, possibly with stable cc'd.
>
> How about this ->unbind() for memcg?
>
>  From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Sun, 11 Jan 2015 10:29:05 -0500
> Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during
>   unbind
>
This patch doesn't cleanly apply on 3.19-rc4 for me (hunks in 
mm/memcontrol.c). I have manually applied it.

With these two patches in, I am still getting the failure. Also, the 
kworker thread is taking up 100% time (unbind_work) and continues to do 
so even after 6minutes.

Is there something I missed ?

Thanks
Suzuki




> Cgroup core assumes that any outstanding css references after
> offlining are temporary in nature, and e.g. mount waits for them to
> disappear and release the root cgroup.  But leftover page cache and
> swapout records in an offlined memcg are only dropped when the pages
> get reclaimed under pressure or the swapped out pages get faulted in
> from other cgroups, and so those cgroup operations can hang forever.
>
> Implement the ->unbind() callback to actively get rid of outstanding
> references when cgroup core wants them gone.  Swap out records are
> deleted, such that the swap-in path will charge those pages to the
> faulting task.  Page cache pages are moved to the root memory cgroup.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>   include/linux/swap_cgroup.h |   6 +++
>   mm/memcontrol.c             | 126 ++++++++++++++++++++++++++++++++++++++++++++
>   mm/swap_cgroup.c            |  38 +++++++++++++
>   3 files changed, 170 insertions(+)
>
> diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
> index 145306bdc92f..ffe0866d2997 100644
> --- a/include/linux/swap_cgroup.h
> +++ b/include/linux/swap_cgroup.h
> @@ -9,6 +9,7 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
>   					unsigned short old, unsigned short new);
>   extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
>   extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
> +extern unsigned long swap_cgroup_zap_records(unsigned short id);
>   extern int swap_cgroup_swapon(int type, unsigned long max_pages);
>   extern void swap_cgroup_swapoff(int type);
>
> @@ -26,6 +27,11 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
>   	return 0;
>   }
>
> +static inline unsigned long swap_cgroup_zap_records(unsigned short id)
> +{
> +	return 0;
> +}
> +
>   static inline int
>   swap_cgroup_swapon(int type, unsigned long max_pages)
>   {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 692e96407627..40c426add613 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5197,6 +5197,131 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
>   		mem_cgroup_from_css(root_css)->use_hierarchy = true;
>   }
>
> +static void unbind_lru_list(struct mem_cgroup *memcg,
> +			    struct zone *zone, enum lru_list lru)
> +{
> +	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> +	struct list_head *list = &lruvec->lists[lru];
> +
> +	while (!list_empty(list)) {
> +		unsigned int nr_pages;
> +		unsigned long flags;
> +		struct page *page;
> +
> +		spin_lock_irqsave(&zone->lru_lock, flags);
> +		if (list_empty(list)) {
> +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> +			break;
> +		}
> +		page = list_last_entry(list, struct page, lru);
> +		if (!get_page_unless_zero(page)) {
> +			list_move(&page->lru, list);
> +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> +			continue;
> +		}
> +		BUG_ON(!PageLRU(page));
> +		ClearPageLRU(page);
> +		del_page_from_lru_list(page, lruvec, lru);
> +		spin_unlock_irqrestore(&zone->lru_lock, flags);
> +
> +		compound_lock(page);
> +		nr_pages = hpage_nr_pages(page);
> +
> +		if (!mem_cgroup_move_account(page, nr_pages,
> +					     memcg, root_mem_cgroup)) {
> +			/*
> +			 * root_mem_cgroup page counters are not used,
> +			 * otherwise we'd have to charge them here.
> +			 */
> +			page_counter_uncharge(&memcg->memory, nr_pages);
> +			if (do_swap_account)
> +				page_counter_uncharge(&memcg->memsw, nr_pages);
> +			css_put_many(&memcg->css, nr_pages);
> +		}
> +
> +		compound_unlock(page);
> +
> +		putback_lru_page(page);
> +	}
> +}
> +
> +static void unbind_work_fn(struct work_struct *work)
> +{
> +	struct cgroup_subsys_state *css;
> +retry:
> +	drain_all_stock(root_mem_cgroup);
> +
> +	rcu_read_lock();
> +	css_for_each_child(css, &root_mem_cgroup->css) {
> +		struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> +		/* Drop references from swap-out records */
> +		if (do_swap_account) {
> +			long zapped;
> +
> +			zapped = swap_cgroup_zap_records(memcg->css.id);
> +			page_counter_uncharge(&memcg->memsw, zapped);
> +			css_put_many(&memcg->css, zapped);
> +		}
> +
> +		/* Drop references from leftover LRU pages */
> +		css_get(css);
> +		rcu_read_unlock();
> +		atomic_inc(&memcg->moving_account);
> +		synchronize_rcu();
> +		while (page_counter_read(&memcg->memory) -
> +		       page_counter_read(&memcg->kmem) > 0) {
> +			struct zone *zone;
> +			enum lru_list lru;
> +
> +			lru_add_drain_all();
> +
> +			for_each_zone(zone)
> +				for_each_lru(lru)
> +					unbind_lru_list(memcg, zone, lru);
> +
> +			cond_resched();
> +		}
> +		atomic_dec(&memcg->moving_account);
> +		rcu_read_lock();
> +		css_put(css);
> +	}
> +	rcu_read_unlock();
> +	/*
> +	 * Swap-in is racy:
> +	 *
> +	 * #0                        #1
> +	 *                           lookup_swap_cgroup_id()
> +	 *                           rcu_read_lock()
> +	 *                           mem_cgroup_lookup()
> +	 *                           css_tryget_online()
> +	 *                           rcu_read_unlock()
> +	 * cgroup_kill_sb()
> +	 *   !css_has_online_children()
> +	 *     ->unbind()
> +	 *                           page_counter_try_charge()
> +	 *                           css_put()
> +	 *                             css_free()
> +	 *                           pc->mem_cgroup = dead memcg
> +	 *                           add page to lru
> +	 *
> +	 * Loop until until all references established from previously
> +	 * existing swap-out records have been transferred to pages on
> +	 * the LRU and then uncharged from there.
> +	 */
> +	if (!list_empty(&root_mem_cgroup->css.children)) {
> +		msleep(10);
> +		goto retry;
> +	}
> +}
> +
> +static DECLARE_WORK(unbind_work, unbind_work_fn);
> +
> +static void mem_cgroup_unbind(struct cgroup_subsys_state *root_css)
> +{
> +	schedule_work(&unbind_work);
> +}
> +
>   static u64 memory_current_read(struct cgroup_subsys_state *css,
>   			       struct cftype *cft)
>   {
> @@ -5360,6 +5485,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
>   	.cancel_attach = mem_cgroup_cancel_attach,
>   	.attach = mem_cgroup_move_task,
>   	.bind = mem_cgroup_bind,
> +	.unbind = mem_cgroup_unbind,
>   	.dfl_cftypes = memory_files,
>   	.legacy_cftypes = mem_cgroup_legacy_files,
>   	.early_init = 0,
> diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
> index b5f7f24b8dd1..665923a558c4 100644
> --- a/mm/swap_cgroup.c
> +++ b/mm/swap_cgroup.c
> @@ -140,6 +140,44 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
>   	return lookup_swap_cgroup(ent, NULL)->id;
>   }
>
> +/**
> + * swap_cgroup_zap_records - delete all swapout records of one cgroup
> + * @id: memcg id
> + *
> + * Returns the number of deleted records.
> + */
> +unsigned long swap_cgroup_zap_records(unsigned short id)
> +{
> +	unsigned long zapped = 0;
> +	unsigned int type;
> +
> +	for (type = 0; type < MAX_SWAPFILES; type++) {
> +		struct swap_cgroup_ctrl *ctrl;
> +		unsigned long flags;
> +		unsigned int page;
> +
> +		ctrl = &swap_cgroup_ctrl[type];
> +		spin_lock_irqsave(&ctrl->lock, flags);
> +		for (page = 0; page < ctrl->length; page++) {
> +			struct swap_cgroup *base;
> +			pgoff_t offset;
> +
> +			base = page_address(ctrl->map[page]);
> +			for (offset = 0; offset < SC_PER_PAGE; offset++) {
> +				struct swap_cgroup *sc;
> +
> +				sc = base + offset;
> +				if (sc->id == id) {
> +					sc->id = 0;
> +					zapped++;
> +				}
> +			}
> +		}
> +		spin_unlock_irqrestore(&ctrl->lock, flags);
> +	}
> +	return zapped;
> +}
> +
>   int swap_cgroup_swapon(int type, unsigned long max_pages)
>   {
>   	void *array;
>



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
@ 2015-01-14 11:16         ` Suzuki K. Poulose
  0 siblings, 0 replies; 40+ messages in thread
From: Suzuki K. Poulose @ 2015-01-14 11:16 UTC (permalink / raw)
  To: Johannes Weiner, Tejun Heo
  Cc: Vladimir Davydov, linux-mm, linux-kernel, Will Deacon

On 11/01/15 20:55, Johannes Weiner wrote:
> On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote:
>> Currently, if a hierarchy doesn't have any live children when it's
>> unmounted, the hierarchy starts dying by killing its refcnt.  The
>> expectation is that even if there are lingering dead children which
>> are lingering due to remaining references, they'll be put in a finite
>> amount of time.  When the children are finally released, the hierarchy
>> is destroyed and all controllers bound to it also are released.
>>
>> However, for memcg, the premise that the lingering refs will be put in
>> a finite amount time is not true.  In the absense of memory pressure,
>> dead memcg's may hang around indefinitely pinned by its pages.  This
>> unfortunately may lead to indefinite hang on the next mount attempt
>> involving memcg as the mount logic waits for it to get released.
>>
>> While we can change hierarchy destruction logic such that a hierarchy
>> is only destroyed when it's not mounted anywhere and all its children,
>> live or dead, are gone, this makes whether the hierarchy gets
>> destroyed or not to be determined by factors opaque to userland.
>> Userland may or may not get a new hierarchy on the next mount attempt.
>> Worse, if it explicitly wants to create a new hierarchy with different
>> options or controller compositions involving memcg, it will fail in an
>> essentially arbitrary manner.
>>
>> We want to guarantee that a hierarchy is destroyed once the
>> conditions, unmounted and no visible children, are met.  To aid it,
>> this patch introduces a new callback cgroup_subsys->unbind() which is
>> invoked right before the hierarchy a subsystem is bound to starts
>> dying.  memcg can implement this callback and initiate draining of
>> remaining refs so that the hierarchy can eventually be released in a
>> finite amount of time.
>>
>> Signed-off-by: Tejun Heo <tj@kernel.org>
>> Cc: Li Zefan <lizefan@huawei.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@suse.cz>
>> Cc: Vladimir Davydov <vdavydov@parallels.com>
>> ---
>> Hello,
>>
>>> May be, we should kill the ref counter to the memory controller root in
>>> cgroup_kill_sb only if there is no children at all, neither online nor
>>> offline.
>>
>> Ah, thanks for the analysis, but I really wanna avoid making hierarchy
>> destruction conditions opaque to userland.  This is userland visible
>> behavior.  It shouldn't be determined by kernel internals invisible
>> outside.  This patch adds ss->unbind() which memcg can hook into to
>> kick off draining of residual refs.  If this would work, I'll add this
>> patch to cgroup/for-3.19-fixes, possibly with stable cc'd.
>
> How about this ->unbind() for memcg?
>
>  From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Sun, 11 Jan 2015 10:29:05 -0500
> Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during
>   unbind
>
This patch doesn't cleanly apply on 3.19-rc4 for me (hunks in 
mm/memcontrol.c). I have manually applied it.

With these two patches in, I am still getting the failure. Also, the 
kworker thread is taking up 100% time (unbind_work) and continues to do 
so even after 6minutes.

Is there something I missed ?

Thanks
Suzuki




> Cgroup core assumes that any outstanding css references after
> offlining are temporary in nature, and e.g. mount waits for them to
> disappear and release the root cgroup.  But leftover page cache and
> swapout records in an offlined memcg are only dropped when the pages
> get reclaimed under pressure or the swapped out pages get faulted in
> from other cgroups, and so those cgroup operations can hang forever.
>
> Implement the ->unbind() callback to actively get rid of outstanding
> references when cgroup core wants them gone.  Swap out records are
> deleted, such that the swap-in path will charge those pages to the
> faulting task.  Page cache pages are moved to the root memory cgroup.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>   include/linux/swap_cgroup.h |   6 +++
>   mm/memcontrol.c             | 126 ++++++++++++++++++++++++++++++++++++++++++++
>   mm/swap_cgroup.c            |  38 +++++++++++++
>   3 files changed, 170 insertions(+)
>
> diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
> index 145306bdc92f..ffe0866d2997 100644
> --- a/include/linux/swap_cgroup.h
> +++ b/include/linux/swap_cgroup.h
> @@ -9,6 +9,7 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
>   					unsigned short old, unsigned short new);
>   extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
>   extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
> +extern unsigned long swap_cgroup_zap_records(unsigned short id);
>   extern int swap_cgroup_swapon(int type, unsigned long max_pages);
>   extern void swap_cgroup_swapoff(int type);
>
> @@ -26,6 +27,11 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
>   	return 0;
>   }
>
> +static inline unsigned long swap_cgroup_zap_records(unsigned short id)
> +{
> +	return 0;
> +}
> +
>   static inline int
>   swap_cgroup_swapon(int type, unsigned long max_pages)
>   {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 692e96407627..40c426add613 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5197,6 +5197,131 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
>   		mem_cgroup_from_css(root_css)->use_hierarchy = true;
>   }
>
> +static void unbind_lru_list(struct mem_cgroup *memcg,
> +			    struct zone *zone, enum lru_list lru)
> +{
> +	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> +	struct list_head *list = &lruvec->lists[lru];
> +
> +	while (!list_empty(list)) {
> +		unsigned int nr_pages;
> +		unsigned long flags;
> +		struct page *page;
> +
> +		spin_lock_irqsave(&zone->lru_lock, flags);
> +		if (list_empty(list)) {
> +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> +			break;
> +		}
> +		page = list_last_entry(list, struct page, lru);
> +		if (!get_page_unless_zero(page)) {
> +			list_move(&page->lru, list);
> +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> +			continue;
> +		}
> +		BUG_ON(!PageLRU(page));
> +		ClearPageLRU(page);
> +		del_page_from_lru_list(page, lruvec, lru);
> +		spin_unlock_irqrestore(&zone->lru_lock, flags);
> +
> +		compound_lock(page);
> +		nr_pages = hpage_nr_pages(page);
> +
> +		if (!mem_cgroup_move_account(page, nr_pages,
> +					     memcg, root_mem_cgroup)) {
> +			/*
> +			 * root_mem_cgroup page counters are not used,
> +			 * otherwise we'd have to charge them here.
> +			 */
> +			page_counter_uncharge(&memcg->memory, nr_pages);
> +			if (do_swap_account)
> +				page_counter_uncharge(&memcg->memsw, nr_pages);
> +			css_put_many(&memcg->css, nr_pages);
> +		}
> +
> +		compound_unlock(page);
> +
> +		putback_lru_page(page);
> +	}
> +}
> +
> +static void unbind_work_fn(struct work_struct *work)
> +{
> +	struct cgroup_subsys_state *css;
> +retry:
> +	drain_all_stock(root_mem_cgroup);
> +
> +	rcu_read_lock();
> +	css_for_each_child(css, &root_mem_cgroup->css) {
> +		struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> +		/* Drop references from swap-out records */
> +		if (do_swap_account) {
> +			long zapped;
> +
> +			zapped = swap_cgroup_zap_records(memcg->css.id);
> +			page_counter_uncharge(&memcg->memsw, zapped);
> +			css_put_many(&memcg->css, zapped);
> +		}
> +
> +		/* Drop references from leftover LRU pages */
> +		css_get(css);
> +		rcu_read_unlock();
> +		atomic_inc(&memcg->moving_account);
> +		synchronize_rcu();
> +		while (page_counter_read(&memcg->memory) -
> +		       page_counter_read(&memcg->kmem) > 0) {
> +			struct zone *zone;
> +			enum lru_list lru;
> +
> +			lru_add_drain_all();
> +
> +			for_each_zone(zone)
> +				for_each_lru(lru)
> +					unbind_lru_list(memcg, zone, lru);
> +
> +			cond_resched();
> +		}
> +		atomic_dec(&memcg->moving_account);
> +		rcu_read_lock();
> +		css_put(css);
> +	}
> +	rcu_read_unlock();
> +	/*
> +	 * Swap-in is racy:
> +	 *
> +	 * #0                        #1
> +	 *                           lookup_swap_cgroup_id()
> +	 *                           rcu_read_lock()
> +	 *                           mem_cgroup_lookup()
> +	 *                           css_tryget_online()
> +	 *                           rcu_read_unlock()
> +	 * cgroup_kill_sb()
> +	 *   !css_has_online_children()
> +	 *     ->unbind()
> +	 *                           page_counter_try_charge()
> +	 *                           css_put()
> +	 *                             css_free()
> +	 *                           pc->mem_cgroup = dead memcg
> +	 *                           add page to lru
> +	 *
> +	 * Loop until until all references established from previously
> +	 * existing swap-out records have been transferred to pages on
> +	 * the LRU and then uncharged from there.
> +	 */
> +	if (!list_empty(&root_mem_cgroup->css.children)) {
> +		msleep(10);
> +		goto retry;
> +	}
> +}
> +
> +static DECLARE_WORK(unbind_work, unbind_work_fn);
> +
> +static void mem_cgroup_unbind(struct cgroup_subsys_state *root_css)
> +{
> +	schedule_work(&unbind_work);
> +}
> +
>   static u64 memory_current_read(struct cgroup_subsys_state *css,
>   			       struct cftype *cft)
>   {
> @@ -5360,6 +5485,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
>   	.cancel_attach = mem_cgroup_cancel_attach,
>   	.attach = mem_cgroup_move_task,
>   	.bind = mem_cgroup_bind,
> +	.unbind = mem_cgroup_unbind,
>   	.dfl_cftypes = memory_files,
>   	.legacy_cftypes = mem_cgroup_legacy_files,
>   	.early_init = 0,
> diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
> index b5f7f24b8dd1..665923a558c4 100644
> --- a/mm/swap_cgroup.c
> +++ b/mm/swap_cgroup.c
> @@ -140,6 +140,44 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
>   	return lookup_swap_cgroup(ent, NULL)->id;
>   }
>
> +/**
> + * swap_cgroup_zap_records - delete all swapout records of one cgroup
> + * @id: memcg id
> + *
> + * Returns the number of deleted records.
> + */
> +unsigned long swap_cgroup_zap_records(unsigned short id)
> +{
> +	unsigned long zapped = 0;
> +	unsigned int type;
> +
> +	for (type = 0; type < MAX_SWAPFILES; type++) {
> +		struct swap_cgroup_ctrl *ctrl;
> +		unsigned long flags;
> +		unsigned int page;
> +
> +		ctrl = &swap_cgroup_ctrl[type];
> +		spin_lock_irqsave(&ctrl->lock, flags);
> +		for (page = 0; page < ctrl->length; page++) {
> +			struct swap_cgroup *base;
> +			pgoff_t offset;
> +
> +			base = page_address(ctrl->map[page]);
> +			for (offset = 0; offset < SC_PER_PAGE; offset++) {
> +				struct swap_cgroup *sc;
> +
> +				sc = base + offset;
> +				if (sc->id == id) {
> +					sc->id = 0;
> +					zapped++;
> +				}
> +			}
> +		}
> +		spin_unlock_irqrestore(&ctrl->lock, flags);
> +	}
> +	return zapped;
> +}
> +
>   int swap_cgroup_swapon(int type, unsigned long max_pages)
>   {
>   	void *array;
>


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
  2015-01-10 21:43     ` Tejun Heo
@ 2015-01-15 17:26       ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2015-01-15 17:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vladimir Davydov, Suzuki K. Poulose, Johannes Weiner, linux-mm,
	linux-kernel, Will Deacon

On Sat 10-01-15 16:43:16, Tejun Heo wrote:
> Currently, if a hierarchy doesn't have any live children when it's
> unmounted, the hierarchy starts dying by killing its refcnt.  The
> expectation is that even if there are lingering dead children which
> are lingering due to remaining references, they'll be put in a finite
> amount of time.  When the children are finally released, the hierarchy
> is destroyed and all controllers bound to it also are released.
> 
> However, for memcg, the premise that the lingering refs will be put in
> a finite amount time is not true.  In the absense of memory pressure,
> dead memcg's may hang around indefinitely pinned by its pages.  This
> unfortunately may lead to indefinite hang on the next mount attempt
> involving memcg as the mount logic waits for it to get released.
> 
> While we can change hierarchy destruction logic such that a hierarchy
> is only destroyed when it's not mounted anywhere and all its children,
> live or dead, are gone, this makes whether the hierarchy gets
> destroyed or not to be determined by factors opaque to userland.
> Userland may or may not get a new hierarchy on the next mount attempt.
> Worse, if it explicitly wants to create a new hierarchy with different
> options or controller compositions involving memcg, it will fail in an
> essentially arbitrary manner.
> 
> We want to guarantee that a hierarchy is destroyed once the
> conditions, unmounted and no visible children, are met.  To aid it,
> this patch introduces a new callback cgroup_subsys->unbind() which is
> invoked right before the hierarchy a subsystem is bound to starts
> dying.  memcg can implement this callback and initiate draining of
> remaining refs so that the hierarchy can eventually be released in a
> finite amount of time.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Vladimir Davydov <vdavydov@parallels.com>

Ohh, I have missed this one as I wasn't on the CC list.

FWIW this approach makes sense to me. I just think that we should have a
way to fail. E.g. kmem pages are impossible to reclaim because there
might be some objects lingering somewhere not bound to a task context
and reparenting is hard as Vladimir has pointed out several times
already.
Normal LRU pages should be reclaimable or reparented to the root easily.

I cannot judge the implementation but I agree with the fact that memcg
controller should be the one to take an action.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
@ 2015-01-15 17:26       ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2015-01-15 17:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vladimir Davydov, Suzuki K. Poulose, Johannes Weiner, linux-mm,
	linux-kernel, Will Deacon

On Sat 10-01-15 16:43:16, Tejun Heo wrote:
> Currently, if a hierarchy doesn't have any live children when it's
> unmounted, the hierarchy starts dying by killing its refcnt.  The
> expectation is that even if there are lingering dead children which
> are lingering due to remaining references, they'll be put in a finite
> amount of time.  When the children are finally released, the hierarchy
> is destroyed and all controllers bound to it also are released.
> 
> However, for memcg, the premise that the lingering refs will be put in
> a finite amount time is not true.  In the absense of memory pressure,
> dead memcg's may hang around indefinitely pinned by its pages.  This
> unfortunately may lead to indefinite hang on the next mount attempt
> involving memcg as the mount logic waits for it to get released.
> 
> While we can change hierarchy destruction logic such that a hierarchy
> is only destroyed when it's not mounted anywhere and all its children,
> live or dead, are gone, this makes whether the hierarchy gets
> destroyed or not to be determined by factors opaque to userland.
> Userland may or may not get a new hierarchy on the next mount attempt.
> Worse, if it explicitly wants to create a new hierarchy with different
> options or controller compositions involving memcg, it will fail in an
> essentially arbitrary manner.
> 
> We want to guarantee that a hierarchy is destroyed once the
> conditions, unmounted and no visible children, are met.  To aid it,
> this patch introduces a new callback cgroup_subsys->unbind() which is
> invoked right before the hierarchy a subsystem is bound to starts
> dying.  memcg can implement this callback and initiate draining of
> remaining refs so that the hierarchy can eventually be released in a
> finite amount of time.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Vladimir Davydov <vdavydov@parallels.com>

Ohh, I have missed this one as I wasn't on the CC list.

FWIW this approach makes sense to me. I just think that we should have a
way to fail. E.g. kmem pages are impossible to reclaim because there
might be some objects lingering somewhere not bound to a task context
and reparenting is hard as Vladimir has pointed out several times
already.
Normal LRU pages should be reclaimable or reparented to the root easily.

I cannot judge the implementation but I agree with the fact that memcg
controller should be the one to take an action.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
  2015-01-11 20:55       ` Johannes Weiner
@ 2015-01-15 17:56         ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2015-01-15 17:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Vladimir Davydov, Suzuki K. Poulose, linux-mm,
	linux-kernel, Will Deacon

On Sun 11-01-15 15:55:43, Johannes Weiner wrote:
> From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Sun, 11 Jan 2015 10:29:05 -0500
> Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during
>  unbind
> 
> Cgroup core assumes that any outstanding css references after
> offlining are temporary in nature, and e.g. mount waits for them to
> disappear and release the root cgroup.  But leftover page cache and
> swapout records in an offlined memcg are only dropped when the pages
> get reclaimed under pressure or the swapped out pages get faulted in
> from other cgroups, and so those cgroup operations can hang forever.
> 
> Implement the ->unbind() callback to actively get rid of outstanding
> references when cgroup core wants them gone.  Swap out records are
> deleted, such that the swap-in path will charge those pages to the
> faulting task. 

OK, that makes sense to me.

> Page cache pages are moved to the root memory cgroup.

OK, this is better than reclaiming them.

[...]
> +static void unbind_lru_list(struct mem_cgroup *memcg,
> +			    struct zone *zone, enum lru_list lru)
> +{
> +	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> +	struct list_head *list = &lruvec->lists[lru];
> +
> +	while (!list_empty(list)) {
> +		unsigned int nr_pages;
> +		unsigned long flags;
> +		struct page *page;
> +
> +		spin_lock_irqsave(&zone->lru_lock, flags);
> +		if (list_empty(list)) {
> +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> +			break;
> +		}
> +		page = list_last_entry(list, struct page, lru);

taking lru_lock for each page calls for troubles. The lock would bounce
like crazy. It shouldn't be a big problem to list_move to a local list
and then work on that one without the lock. Those pages wouldn't be
visible for the reclaim but that would be only temporary. Or if that is
not acceptable then just batch at least some number of pages (popular
SWAP_CLUSTER_MAX).

> +		if (!get_page_unless_zero(page)) {
> +			list_move(&page->lru, list);
> +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> +			continue;
> +		}
> +		BUG_ON(!PageLRU(page));
> +		ClearPageLRU(page);
> +		del_page_from_lru_list(page, lruvec, lru);
> +		spin_unlock_irqrestore(&zone->lru_lock, flags);
> +
> +		compound_lock(page);
> +		nr_pages = hpage_nr_pages(page);
> +
> +		if (!mem_cgroup_move_account(page, nr_pages,
> +					     memcg, root_mem_cgroup)) {
> +			/*
> +			 * root_mem_cgroup page counters are not used,
> +			 * otherwise we'd have to charge them here.
> +			 */
> +			page_counter_uncharge(&memcg->memory, nr_pages);
> +			if (do_swap_account)
> +				page_counter_uncharge(&memcg->memsw, nr_pages);
> +			css_put_many(&memcg->css, nr_pages);
> +		}
> +
> +		compound_unlock(page);
> +
> +		putback_lru_page(page);
> +	}
> +}
> +
> +static void unbind_work_fn(struct work_struct *work)
> +{
> +	struct cgroup_subsys_state *css;
> +retry:
> +	drain_all_stock(root_mem_cgroup);
> +
> +	rcu_read_lock();
> +	css_for_each_child(css, &root_mem_cgroup->css) {
> +		struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> +		/* Drop references from swap-out records */
> +		if (do_swap_account) {
> +			long zapped;
> +
> +			zapped = swap_cgroup_zap_records(memcg->css.id);
> +			page_counter_uncharge(&memcg->memsw, zapped);
> +			css_put_many(&memcg->css, zapped);
> +		}
> +
> +		/* Drop references from leftover LRU pages */
> +		css_get(css);
> +		rcu_read_unlock();
> +		atomic_inc(&memcg->moving_account);
> +		synchronize_rcu();

Why do we need this? Who can migrate to/from offline memcgs? 

> +		while (page_counter_read(&memcg->memory) -
> +		       page_counter_read(&memcg->kmem) > 0) {
> +			struct zone *zone;
> +			enum lru_list lru;
> +
> +			lru_add_drain_all();
> +
> +			for_each_zone(zone)
> +				for_each_lru(lru)
> +					unbind_lru_list(memcg, zone, lru);
> +
> +			cond_resched();
> +		}
> +		atomic_dec(&memcg->moving_account);
> +		rcu_read_lock();
> +		css_put(css);
> +	}
> +	rcu_read_unlock();
> +	/*
> +	 * Swap-in is racy:
> +	 *
> +	 * #0                        #1
> +	 *                           lookup_swap_cgroup_id()
> +	 *                           rcu_read_lock()
> +	 *                           mem_cgroup_lookup()
> +	 *                           css_tryget_online()
> +	 *                           rcu_read_unlock()
> +	 * cgroup_kill_sb()
> +	 *   !css_has_online_children()
> +	 *     ->unbind()
> +	 *                           page_counter_try_charge()
> +	 *                           css_put()
> +	 *                             css_free()
> +	 *                           pc->mem_cgroup = dead memcg
> +	 *                           add page to lru
> +	 *
> +	 * Loop until until all references established from previously
> +	 * existing swap-out records have been transferred to pages on
> +	 * the LRU and then uncharged from there.
> +	 */
> +	if (!list_empty(&root_mem_cgroup->css.children)) {

But what if kmem pages pin the memcg? We would loop for ever. Or am I
missing something?

> +		msleep(10);
> +		goto retry;
> +	}
> +}
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
@ 2015-01-15 17:56         ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2015-01-15 17:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Vladimir Davydov, Suzuki K. Poulose, linux-mm,
	linux-kernel, Will Deacon

On Sun 11-01-15 15:55:43, Johannes Weiner wrote:
> From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Sun, 11 Jan 2015 10:29:05 -0500
> Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during
>  unbind
> 
> Cgroup core assumes that any outstanding css references after
> offlining are temporary in nature, and e.g. mount waits for them to
> disappear and release the root cgroup.  But leftover page cache and
> swapout records in an offlined memcg are only dropped when the pages
> get reclaimed under pressure or the swapped out pages get faulted in
> from other cgroups, and so those cgroup operations can hang forever.
> 
> Implement the ->unbind() callback to actively get rid of outstanding
> references when cgroup core wants them gone.  Swap out records are
> deleted, such that the swap-in path will charge those pages to the
> faulting task. 

OK, that makes sense to me.

> Page cache pages are moved to the root memory cgroup.

OK, this is better than reclaiming them.

[...]
> +static void unbind_lru_list(struct mem_cgroup *memcg,
> +			    struct zone *zone, enum lru_list lru)
> +{
> +	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> +	struct list_head *list = &lruvec->lists[lru];
> +
> +	while (!list_empty(list)) {
> +		unsigned int nr_pages;
> +		unsigned long flags;
> +		struct page *page;
> +
> +		spin_lock_irqsave(&zone->lru_lock, flags);
> +		if (list_empty(list)) {
> +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> +			break;
> +		}
> +		page = list_last_entry(list, struct page, lru);

taking lru_lock for each page calls for troubles. The lock would bounce
like crazy. It shouldn't be a big problem to list_move to a local list
and then work on that one without the lock. Those pages wouldn't be
visible for the reclaim but that would be only temporary. Or if that is
not acceptable then just batch at least some number of pages (popular
SWAP_CLUSTER_MAX).

> +		if (!get_page_unless_zero(page)) {
> +			list_move(&page->lru, list);
> +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> +			continue;
> +		}
> +		BUG_ON(!PageLRU(page));
> +		ClearPageLRU(page);
> +		del_page_from_lru_list(page, lruvec, lru);
> +		spin_unlock_irqrestore(&zone->lru_lock, flags);
> +
> +		compound_lock(page);
> +		nr_pages = hpage_nr_pages(page);
> +
> +		if (!mem_cgroup_move_account(page, nr_pages,
> +					     memcg, root_mem_cgroup)) {
> +			/*
> +			 * root_mem_cgroup page counters are not used,
> +			 * otherwise we'd have to charge them here.
> +			 */
> +			page_counter_uncharge(&memcg->memory, nr_pages);
> +			if (do_swap_account)
> +				page_counter_uncharge(&memcg->memsw, nr_pages);
> +			css_put_many(&memcg->css, nr_pages);
> +		}
> +
> +		compound_unlock(page);
> +
> +		putback_lru_page(page);
> +	}
> +}
> +
> +static void unbind_work_fn(struct work_struct *work)
> +{
> +	struct cgroup_subsys_state *css;
> +retry:
> +	drain_all_stock(root_mem_cgroup);
> +
> +	rcu_read_lock();
> +	css_for_each_child(css, &root_mem_cgroup->css) {
> +		struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> +		/* Drop references from swap-out records */
> +		if (do_swap_account) {
> +			long zapped;
> +
> +			zapped = swap_cgroup_zap_records(memcg->css.id);
> +			page_counter_uncharge(&memcg->memsw, zapped);
> +			css_put_many(&memcg->css, zapped);
> +		}
> +
> +		/* Drop references from leftover LRU pages */
> +		css_get(css);
> +		rcu_read_unlock();
> +		atomic_inc(&memcg->moving_account);
> +		synchronize_rcu();

Why do we need this? Who can migrate to/from offline memcgs? 

> +		while (page_counter_read(&memcg->memory) -
> +		       page_counter_read(&memcg->kmem) > 0) {
> +			struct zone *zone;
> +			enum lru_list lru;
> +
> +			lru_add_drain_all();
> +
> +			for_each_zone(zone)
> +				for_each_lru(lru)
> +					unbind_lru_list(memcg, zone, lru);
> +
> +			cond_resched();
> +		}
> +		atomic_dec(&memcg->moving_account);
> +		rcu_read_lock();
> +		css_put(css);
> +	}
> +	rcu_read_unlock();
> +	/*
> +	 * Swap-in is racy:
> +	 *
> +	 * #0                        #1
> +	 *                           lookup_swap_cgroup_id()
> +	 *                           rcu_read_lock()
> +	 *                           mem_cgroup_lookup()
> +	 *                           css_tryget_online()
> +	 *                           rcu_read_unlock()
> +	 * cgroup_kill_sb()
> +	 *   !css_has_online_children()
> +	 *     ->unbind()
> +	 *                           page_counter_try_charge()
> +	 *                           css_put()
> +	 *                             css_free()
> +	 *                           pc->mem_cgroup = dead memcg
> +	 *                           add page to lru
> +	 *
> +	 * Loop until until all references established from previously
> +	 * existing swap-out records have been transferred to pages on
> +	 * the LRU and then uncharged from there.
> +	 */
> +	if (!list_empty(&root_mem_cgroup->css.children)) {

But what if kmem pages pin the memcg? We would loop for ever. Or am I
missing something?

> +		msleep(10);
> +		goto retry;
> +	}
> +}
[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
  2015-01-10  8:55   ` Vladimir Davydov
@ 2015-01-19 12:51     ` Suzuki K. Poulose
  -1 siblings, 0 replies; 40+ messages in thread
From: Suzuki K. Poulose @ 2015-01-19 12:51 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Tejun Heo, Johannes Weiner, linux-mm, linux-kernel, Will Deacon,
	mhocko, akpm

On 10/01/15 08:55, Vladimir Davydov wrote:
> On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote:
>> Hi
>>
>> We have hit a hang on ARM64 defconfig, while running LTP tests on
>> 3.19-rc3. We are
>> in the process of a git bisect and will update the results as and
>> when we find the commit.
>>
>> During the ksm ltp run, the test hangs trying to mount memcg with
>> the following strace
>> output:
>>
>> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ?
>> ERESTARTNOINTR (To be restarted)
>> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ?
>> ERESTARTNOINTR (To be restarted)
>> [ ... repeated forever ... ]
>>
>> At this point, one can try mounting the memcg to verify the problem.
>> # mount -t cgroup -o memory memcg memcg_dir
>> --hangs--
>>
>> Strangely, if we run the mount command from a cold boot (i.e.
>> without running LTP first),
>> then it succeeds.
>>
>> Upon a quick look we are hitting the following code :
>> kernel/cgroup.c: cgroup_mount() :
>>
>> 1779         for_each_subsys(ss, i) {
>> 1780                 if (!(opts.subsys_mask & (1 << i)) ||
>> 1781                     ss->root == &cgrp_dfl_root)
>> 1782                         continue;
>> 1783
>> 1784                 if
>> (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
>> 1785                         mutex_unlock(&cgroup_mutex);
>> 1786                         msleep(10);
>> 1787                         ret = restart_syscall(); <=====
>> 1788                         goto out_free;
>> 1789                 }
>> 1790                 cgroup_put(&ss->root->cgrp);
>> 1791         }
>>
>> with ss->root->cgrp.self.refct.percpu_count_ptr == __PERCPU_REF_ATOMIC_DEAD
>>
>> Any ideas?
>
> The problem is that the memory cgroup controller takes a css reference
> per each charged page and does not reparent charged pages on css
> offline, while cgroup_mount/cgroup_kill_sb expect all css references to
> offline cgroups to be gone soon, restarting the syscall if the ref count
> != 0. As a result, if you create a memory cgroup, charge some page cache
> to it, and then remove it, unmount/mount will hang forever.
>
> May be, we should kill the ref counter to the memory controller root in
> cgroup_kill_sb only if there is no children at all, neither online nor
> offline.
>

Still reproducible on 3.19-rc5 with the same setup. From git bisect, the 
last good commit is :

commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735
Author: Pranith Kumar <bobby.prani@gmail.com>
Date:   Wed Dec 10 15:42:28 2014 -0800

     slab: replace smp_read_barrier_depends() with lockless_dereference()



Thanks
Suzuki


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
@ 2015-01-19 12:51     ` Suzuki K. Poulose
  0 siblings, 0 replies; 40+ messages in thread
From: Suzuki K. Poulose @ 2015-01-19 12:51 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Tejun Heo, Johannes Weiner, linux-mm, linux-kernel, Will Deacon,
	mhocko, akpm

On 10/01/15 08:55, Vladimir Davydov wrote:
> On Fri, Jan 09, 2015 at 05:43:17PM +0000, Suzuki K. Poulose wrote:
>> Hi
>>
>> We have hit a hang on ARM64 defconfig, while running LTP tests on
>> 3.19-rc3. We are
>> in the process of a git bisect and will update the results as and
>> when we find the commit.
>>
>> During the ksm ltp run, the test hangs trying to mount memcg with
>> the following strace
>> output:
>>
>> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ?
>> ERESTARTNOINTR (To be restarted)
>> mount("memcg", "/dev/cgroup", "cgroup", 0, "memory") = ?
>> ERESTARTNOINTR (To be restarted)
>> [ ... repeated forever ... ]
>>
>> At this point, one can try mounting the memcg to verify the problem.
>> # mount -t cgroup -o memory memcg memcg_dir
>> --hangs--
>>
>> Strangely, if we run the mount command from a cold boot (i.e.
>> without running LTP first),
>> then it succeeds.
>>
>> Upon a quick look we are hitting the following code :
>> kernel/cgroup.c: cgroup_mount() :
>>
>> 1779         for_each_subsys(ss, i) {
>> 1780                 if (!(opts.subsys_mask & (1 << i)) ||
>> 1781                     ss->root == &cgrp_dfl_root)
>> 1782                         continue;
>> 1783
>> 1784                 if
>> (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
>> 1785                         mutex_unlock(&cgroup_mutex);
>> 1786                         msleep(10);
>> 1787                         ret = restart_syscall(); <=====
>> 1788                         goto out_free;
>> 1789                 }
>> 1790                 cgroup_put(&ss->root->cgrp);
>> 1791         }
>>
>> with ss->root->cgrp.self.refct.percpu_count_ptr == __PERCPU_REF_ATOMIC_DEAD
>>
>> Any ideas?
>
> The problem is that the memory cgroup controller takes a css reference
> per each charged page and does not reparent charged pages on css
> offline, while cgroup_mount/cgroup_kill_sb expect all css references to
> offline cgroups to be gone soon, restarting the syscall if the ref count
> != 0. As a result, if you create a memory cgroup, charge some page cache
> to it, and then remove it, unmount/mount will hang forever.
>
> May be, we should kill the ref counter to the memory controller root in
> cgroup_kill_sb only if there is no children at all, neither online nor
> offline.
>

Still reproducible on 3.19-rc5 with the same setup. From git bisect, the 
last good commit is :

commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735
Author: Pranith Kumar <bobby.prani@gmail.com>
Date:   Wed Dec 10 15:42:28 2014 -0800

     slab: replace smp_read_barrier_depends() with lockless_dereference()



Thanks
Suzuki

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
  2015-01-19 12:51     ` Suzuki K. Poulose
@ 2015-01-21 16:39       ` Will Deacon
  -1 siblings, 0 replies; 40+ messages in thread
From: Will Deacon @ 2015-01-21 16:39 UTC (permalink / raw)
  To: Suzuki K. Poulose
  Cc: Vladimir Davydov, Tejun Heo, Johannes Weiner, linux-mm,
	linux-kernel, mhocko, akpm

On Mon, Jan 19, 2015 at 12:51:27PM +0000, Suzuki K. Poulose wrote:
> On 10/01/15 08:55, Vladimir Davydov wrote:
> > The problem is that the memory cgroup controller takes a css reference
> > per each charged page and does not reparent charged pages on css
> > offline, while cgroup_mount/cgroup_kill_sb expect all css references to
> > offline cgroups to be gone soon, restarting the syscall if the ref count
> > != 0. As a result, if you create a memory cgroup, charge some page cache
> > to it, and then remove it, unmount/mount will hang forever.
> >
> > May be, we should kill the ref counter to the memory controller root in
> > cgroup_kill_sb only if there is no children at all, neither online nor
> > offline.
> >
> 
> Still reproducible on 3.19-rc5 with the same setup.

Yeah, I'm seeing the same failure on my setup too.

> From git bisect, the last good commit is :
> 
> commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735
> Author: Pranith Kumar <bobby.prani@gmail.com>
> Date:   Wed Dec 10 15:42:28 2014 -0800
> 
>      slab: replace smp_read_barrier_depends() with lockless_dereference()

So that points at 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
as the offending commit.

Will

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
@ 2015-01-21 16:39       ` Will Deacon
  0 siblings, 0 replies; 40+ messages in thread
From: Will Deacon @ 2015-01-21 16:39 UTC (permalink / raw)
  To: Suzuki K. Poulose
  Cc: Vladimir Davydov, Tejun Heo, Johannes Weiner, linux-mm,
	linux-kernel, mhocko, akpm

On Mon, Jan 19, 2015 at 12:51:27PM +0000, Suzuki K. Poulose wrote:
> On 10/01/15 08:55, Vladimir Davydov wrote:
> > The problem is that the memory cgroup controller takes a css reference
> > per each charged page and does not reparent charged pages on css
> > offline, while cgroup_mount/cgroup_kill_sb expect all css references to
> > offline cgroups to be gone soon, restarting the syscall if the ref count
> > != 0. As a result, if you create a memory cgroup, charge some page cache
> > to it, and then remove it, unmount/mount will hang forever.
> >
> > May be, we should kill the ref counter to the memory controller root in
> > cgroup_kill_sb only if there is no children at all, neither online nor
> > offline.
> >
> 
> Still reproducible on 3.19-rc5 with the same setup.

Yeah, I'm seeing the same failure on my setup too.

> From git bisect, the last good commit is :
> 
> commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735
> Author: Pranith Kumar <bobby.prani@gmail.com>
> Date:   Wed Dec 10 15:42:28 2014 -0800
> 
>      slab: replace smp_read_barrier_depends() with lockless_dereference()

So that points at 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
as the offending commit.

Will

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
  2015-01-21 16:39       ` Will Deacon
@ 2015-01-22 13:45         ` Johannes Weiner
  -1 siblings, 0 replies; 40+ messages in thread
From: Johannes Weiner @ 2015-01-22 13:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Suzuki K. Poulose, Vladimir Davydov, Tejun Heo, linux-mm,
	linux-kernel, mhocko, akpm

On Wed, Jan 21, 2015 at 04:39:55PM +0000, Will Deacon wrote:
> On Mon, Jan 19, 2015 at 12:51:27PM +0000, Suzuki K. Poulose wrote:
> > On 10/01/15 08:55, Vladimir Davydov wrote:
> > > The problem is that the memory cgroup controller takes a css reference
> > > per each charged page and does not reparent charged pages on css
> > > offline, while cgroup_mount/cgroup_kill_sb expect all css references to
> > > offline cgroups to be gone soon, restarting the syscall if the ref count
> > > != 0. As a result, if you create a memory cgroup, charge some page cache
> > > to it, and then remove it, unmount/mount will hang forever.
> > >
> > > May be, we should kill the ref counter to the memory controller root in
> > > cgroup_kill_sb only if there is no children at all, neither online nor
> > > offline.
> > >
> > 
> > Still reproducible on 3.19-rc5 with the same setup.
> 
> Yeah, I'm seeing the same failure on my setup too.
> 
> > From git bisect, the last good commit is :
> > 
> > commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735
> > Author: Pranith Kumar <bobby.prani@gmail.com>
> > Date:   Wed Dec 10 15:42:28 2014 -0800
> > 
> >      slab: replace smp_read_barrier_depends() with lockless_dereference()
> 
> So that points at 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
> as the offending commit.

With b2052564e66d ("mm: memcontrol: continue cache reclaim from
offlined groups"), page cache can pin an old css and its ancestors
indefinitely, making that hang in a second mount() very likely.

However, swap entries have also been doing that for quite a while now,
and as Vladimir pointed out, the same is true for kernel memory.  This
latest change just makes this existing bug easier to trigger.

I think we have to update the lifetime rules to reflect reality here:
memory and swap lifetime is indefinite, so once the memory controller
is used, it has state that is independent from whether its mounted or
not.  We can support an identical remount, but have to fail mounting
with new parameters that would change the behavior of the controller.

Suzuki, Will, could you give the following patch a shot?

Tejun, would that route be acceptable to you?

Thanks

---
>From c5e88d02d185c52748df664aa30a2c5f8949b0f7 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 22 Jan 2015 08:16:31 -0500
Subject: [patch] kernel: cgroup: prevent mount hang due to memory controller
 lifetime

Since b2052564e66d ("mm: memcontrol: continue cache reclaim from
offlined groups"), re-mounting the memory controller after using it is
very likely to hang.

The cgroup core assumes that any remaining references after deleting a
cgroup are temporary in nature, and synchroneously waits for them, but
the above-mentioned commit has left-over page cache pin its css until
it is reclaimed naturally.  That being said, swap entries and charged
kernel memory have been doing the same indefinite pinning forever, the
bug is just more likely to trigger with left-over page cache.

Reparenting kernel memory is highly impractical, which leaves changing
the cgroup assumptions to reflect this: once a controller has been
mounted and used, it has internal state that is independent from mount
and cgroup lifetime.  It can be unmounted and remounted, but it can't
be reconfigured during subsequent mounts.

Don't offline the controller root as long as there are any children,
dead or alive.  A remount will no longer wait for these old references
to drain, it will simply mount the persistent controller state again.

Reported-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/cgroup.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bb263d0caab3..9a09308c8066 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1819,8 +1819,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 			goto out_unlock;
 		}
 
-		if (root->flags ^ opts.flags)
-			pr_warn("new mount options do not match the existing superblock, will be ignored\n");
+		if (root->flags ^ opts.flags) {
+			pr_warn("new mount options do not match the existing superblock\n");
+			ret = -EBUSY;
+			goto out_unlock;
+		}
 
 		/*
 		 * We want to reuse @root whose lifetime is governed by its
@@ -1909,7 +1912,7 @@ static void cgroup_kill_sb(struct super_block *sb)
 	 *
 	 * And don't kill the default root.
 	 */
-	if (css_has_online_children(&root->cgrp.self) ||
+	if (!list_empty(&root->cgrp.self.children) ||
 	    root == &cgrp_dfl_root)
 		cgroup_put(&root->cgrp);
 	else
-- 
2.2.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
@ 2015-01-22 13:45         ` Johannes Weiner
  0 siblings, 0 replies; 40+ messages in thread
From: Johannes Weiner @ 2015-01-22 13:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Suzuki K. Poulose, Vladimir Davydov, Tejun Heo, linux-mm,
	linux-kernel, mhocko, akpm

On Wed, Jan 21, 2015 at 04:39:55PM +0000, Will Deacon wrote:
> On Mon, Jan 19, 2015 at 12:51:27PM +0000, Suzuki K. Poulose wrote:
> > On 10/01/15 08:55, Vladimir Davydov wrote:
> > > The problem is that the memory cgroup controller takes a css reference
> > > per each charged page and does not reparent charged pages on css
> > > offline, while cgroup_mount/cgroup_kill_sb expect all css references to
> > > offline cgroups to be gone soon, restarting the syscall if the ref count
> > > != 0. As a result, if you create a memory cgroup, charge some page cache
> > > to it, and then remove it, unmount/mount will hang forever.
> > >
> > > May be, we should kill the ref counter to the memory controller root in
> > > cgroup_kill_sb only if there is no children at all, neither online nor
> > > offline.
> > >
> > 
> > Still reproducible on 3.19-rc5 with the same setup.
> 
> Yeah, I'm seeing the same failure on my setup too.
> 
> > From git bisect, the last good commit is :
> > 
> > commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735
> > Author: Pranith Kumar <bobby.prani@gmail.com>
> > Date:   Wed Dec 10 15:42:28 2014 -0800
> > 
> >      slab: replace smp_read_barrier_depends() with lockless_dereference()
> 
> So that points at 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
> as the offending commit.

With b2052564e66d ("mm: memcontrol: continue cache reclaim from
offlined groups"), page cache can pin an old css and its ancestors
indefinitely, making that hang in a second mount() very likely.

However, swap entries have also been doing that for quite a while now,
and as Vladimir pointed out, the same is true for kernel memory.  This
latest change just makes this existing bug easier to trigger.

I think we have to update the lifetime rules to reflect reality here:
memory and swap lifetime is indefinite, so once the memory controller
is used, it has state that is independent from whether its mounted or
not.  We can support an identical remount, but have to fail mounting
with new parameters that would change the behavior of the controller.

Suzuki, Will, could you give the following patch a shot?

Tejun, would that route be acceptable to you?

Thanks

---

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
  2015-01-22 13:45         ` Johannes Weiner
@ 2015-01-22 14:34           ` Tejun Heo
  -1 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2015-01-22 14:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Will Deacon, Suzuki K. Poulose, Vladimir Davydov, linux-mm,
	linux-kernel, mhocko, akpm

Hello,

On Thu, Jan 22, 2015 at 08:45:50AM -0500, Johannes Weiner wrote:
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index bb263d0caab3..9a09308c8066 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1819,8 +1819,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>  			goto out_unlock;
>  		}
>  
> -		if (root->flags ^ opts.flags)
> -			pr_warn("new mount options do not match the existing superblock, will be ignored\n");
> +		if (root->flags ^ opts.flags) {
> +			pr_warn("new mount options do not match the existing superblock\n");
> +			ret = -EBUSY;
> +			goto out_unlock;
> +		}

Do we really need the above chunk?

> @@ -1909,7 +1912,7 @@ static void cgroup_kill_sb(struct super_block *sb)
>  	 *
>  	 * And don't kill the default root.
>  	 */
> -	if (css_has_online_children(&root->cgrp.self) ||
> +	if (!list_empty(&root->cgrp.self.children) ||
>  	    root == &cgrp_dfl_root)
>  		cgroup_put(&root->cgrp);

I tried to do something a bit more advanced so that eventual async
release of dying children, if they happen, can also release the
hierarchy but I don't think it really matters unless we can forcefully
drain.  So, shouldn't just the above part be enough?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
@ 2015-01-22 14:34           ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2015-01-22 14:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Will Deacon, Suzuki K. Poulose, Vladimir Davydov, linux-mm,
	linux-kernel, mhocko, akpm

Hello,

On Thu, Jan 22, 2015 at 08:45:50AM -0500, Johannes Weiner wrote:
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index bb263d0caab3..9a09308c8066 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1819,8 +1819,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>  			goto out_unlock;
>  		}
>  
> -		if (root->flags ^ opts.flags)
> -			pr_warn("new mount options do not match the existing superblock, will be ignored\n");
> +		if (root->flags ^ opts.flags) {
> +			pr_warn("new mount options do not match the existing superblock\n");
> +			ret = -EBUSY;
> +			goto out_unlock;
> +		}

Do we really need the above chunk?

> @@ -1909,7 +1912,7 @@ static void cgroup_kill_sb(struct super_block *sb)
>  	 *
>  	 * And don't kill the default root.
>  	 */
> -	if (css_has_online_children(&root->cgrp.self) ||
> +	if (!list_empty(&root->cgrp.self.children) ||
>  	    root == &cgrp_dfl_root)
>  		cgroup_put(&root->cgrp);

I tried to do something a bit more advanced so that eventual async
release of dying children, if they happen, can also release the
hierarchy but I don't think it really matters unless we can forcefully
drain.  So, shouldn't just the above part be enough?

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
  2015-01-22 14:34           ` Tejun Heo
@ 2015-01-22 15:19             ` Johannes Weiner
  -1 siblings, 0 replies; 40+ messages in thread
From: Johannes Weiner @ 2015-01-22 15:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Will Deacon, Suzuki K. Poulose, Vladimir Davydov, linux-mm,
	linux-kernel, mhocko, akpm

Hi,

On Thu, Jan 22, 2015 at 09:34:54AM -0500, Tejun Heo wrote:
> On Thu, Jan 22, 2015 at 08:45:50AM -0500, Johannes Weiner wrote:
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index bb263d0caab3..9a09308c8066 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -1819,8 +1819,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> >  			goto out_unlock;
> >  		}
> >  
> > -		if (root->flags ^ opts.flags)
> > -			pr_warn("new mount options do not match the existing superblock, will be ignored\n");
> > +		if (root->flags ^ opts.flags) {
> > +			pr_warn("new mount options do not match the existing superblock\n");
> > +			ret = -EBUSY;
> > +			goto out_unlock;
> > +		}
> 
> Do we really need the above chunk?

Inform and ignore or fail hard?  I guess we can drop this hunk and
keep with the current behavior.

> > @@ -1909,7 +1912,7 @@ static void cgroup_kill_sb(struct super_block *sb)
> >  	 *
> >  	 * And don't kill the default root.
> >  	 */
> > -	if (css_has_online_children(&root->cgrp.self) ||
> > +	if (!list_empty(&root->cgrp.self.children) ||
> >  	    root == &cgrp_dfl_root)
> >  		cgroup_put(&root->cgrp);
> 
> I tried to do something a bit more advanced so that eventual async
> release of dying children, if they happen, can also release the
> hierarchy but I don't think it really matters unless we can forcefully
> drain.  So, shouldn't just the above part be enough?

Yep, I'd be fine with that.

---

>From 3d7ae5aeb16ce6118d8bff17194e791339a1f06c Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 22 Jan 2015 08:16:31 -0500
Subject: [patch] kernel: cgroup: prevent mount hang due to memory controller
 lifetime

Since b2052564e66d ("mm: memcontrol: continue cache reclaim from
offlined groups"), re-mounting the memory controller after using it is
very likely to hang.

The cgroup core assumes that any remaining references after deleting a
cgroup are temporary in nature, and synchroneously waits for them, but
the above-mentioned commit has left-over page cache pin its css until
it is reclaimed naturally.  That being said, swap entries and charged
kernel memory have been doing the same indefinite pinning forever, the
bug is just more likely to trigger with left-over page cache.

Reparenting kernel memory is highly impractical, which leaves changing
the cgroup assumptions to reflect this: once a controller has been
mounted and used, it has internal state that is independent from mount
and cgroup lifetime.  It can be unmounted and remounted, but it can't
be reconfigured during subsequent mounts.

Don't offline the controller root as long as there are any children,
dead or alive.  A remount will no longer wait for these old references
to drain, it will simply mount the persistent controller state again.

Reported-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bb263d0caab3..04cfe8ace520 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1909,7 +1909,7 @@ static void cgroup_kill_sb(struct super_block *sb)
 	 *
 	 * And don't kill the default root.
 	 */
-	if (css_has_online_children(&root->cgrp.self) ||
+	if (!list_empty(&root->cgrp.self.children) ||
 	    root == &cgrp_dfl_root)
 		cgroup_put(&root->cgrp);
 	else
-- 
2.2.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
@ 2015-01-22 15:19             ` Johannes Weiner
  0 siblings, 0 replies; 40+ messages in thread
From: Johannes Weiner @ 2015-01-22 15:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Will Deacon, Suzuki K. Poulose, Vladimir Davydov, linux-mm,
	linux-kernel, mhocko, akpm

Hi,

On Thu, Jan 22, 2015 at 09:34:54AM -0500, Tejun Heo wrote:
> On Thu, Jan 22, 2015 at 08:45:50AM -0500, Johannes Weiner wrote:
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index bb263d0caab3..9a09308c8066 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -1819,8 +1819,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> >  			goto out_unlock;
> >  		}
> >  
> > -		if (root->flags ^ opts.flags)
> > -			pr_warn("new mount options do not match the existing superblock, will be ignored\n");
> > +		if (root->flags ^ opts.flags) {
> > +			pr_warn("new mount options do not match the existing superblock\n");
> > +			ret = -EBUSY;
> > +			goto out_unlock;
> > +		}
> 
> Do we really need the above chunk?

Inform and ignore or fail hard?  I guess we can drop this hunk and
keep with the current behavior.

> > @@ -1909,7 +1912,7 @@ static void cgroup_kill_sb(struct super_block *sb)
> >  	 *
> >  	 * And don't kill the default root.
> >  	 */
> > -	if (css_has_online_children(&root->cgrp.self) ||
> > +	if (!list_empty(&root->cgrp.self.children) ||
> >  	    root == &cgrp_dfl_root)
> >  		cgroup_put(&root->cgrp);
> 
> I tried to do something a bit more advanced so that eventual async
> release of dying children, if they happen, can also release the
> hierarchy but I don't think it really matters unless we can forcefully
> drain.  So, shouldn't just the above part be enough?

Yep, I'd be fine with that.

---

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
  2015-01-22 15:19             ` Johannes Weiner
@ 2015-01-22 15:28               ` Tejun Heo
  -1 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2015-01-22 15:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Will Deacon, Suzuki K. Poulose, Vladimir Davydov, linux-mm,
	linux-kernel, mhocko, akpm

On Thu, Jan 22, 2015 at 10:19:43AM -0500, Johannes Weiner wrote:
> From 3d7ae5aeb16ce6118d8bff17194e791339a1f06c Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Thu, 22 Jan 2015 08:16:31 -0500
> Subject: [patch] kernel: cgroup: prevent mount hang due to memory controller
>  lifetime
> 
> Since b2052564e66d ("mm: memcontrol: continue cache reclaim from
> offlined groups"), re-mounting the memory controller after using it is
> very likely to hang.
> 
> The cgroup core assumes that any remaining references after deleting a
> cgroup are temporary in nature, and synchroneously waits for them, but
> the above-mentioned commit has left-over page cache pin its css until
> it is reclaimed naturally.  That being said, swap entries and charged
> kernel memory have been doing the same indefinite pinning forever, the
> bug is just more likely to trigger with left-over page cache.
> 
> Reparenting kernel memory is highly impractical, which leaves changing
> the cgroup assumptions to reflect this: once a controller has been
> mounted and used, it has internal state that is independent from mount
> and cgroup lifetime.  It can be unmounted and remounted, but it can't
> be reconfigured during subsequent mounts.
> 
> Don't offline the controller root as long as there are any children,
> dead or alive.  A remount will no longer wait for these old references
> to drain, it will simply mount the persistent controller state again.
> 
> Reported-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
> Reported-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Applied to cgroup/for-3.19-fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
@ 2015-01-22 15:28               ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2015-01-22 15:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Will Deacon, Suzuki K. Poulose, Vladimir Davydov, linux-mm,
	linux-kernel, mhocko, akpm

On Thu, Jan 22, 2015 at 10:19:43AM -0500, Johannes Weiner wrote:
> From 3d7ae5aeb16ce6118d8bff17194e791339a1f06c Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Thu, 22 Jan 2015 08:16:31 -0500
> Subject: [patch] kernel: cgroup: prevent mount hang due to memory controller
>  lifetime
> 
> Since b2052564e66d ("mm: memcontrol: continue cache reclaim from
> offlined groups"), re-mounting the memory controller after using it is
> very likely to hang.
> 
> The cgroup core assumes that any remaining references after deleting a
> cgroup are temporary in nature, and synchroneously waits for them, but
> the above-mentioned commit has left-over page cache pin its css until
> it is reclaimed naturally.  That being said, swap entries and charged
> kernel memory have been doing the same indefinite pinning forever, the
> bug is just more likely to trigger with left-over page cache.
> 
> Reparenting kernel memory is highly impractical, which leaves changing
> the cgroup assumptions to reflect this: once a controller has been
> mounted and used, it has internal state that is independent from mount
> and cgroup lifetime.  It can be unmounted and remounted, but it can't
> be reconfigured during subsequent mounts.
> 
> Don't offline the controller root as long as there are any children,
> dead or alive.  A remount will no longer wait for these old references
> to drain, it will simply mount the persistent controller state again.
> 
> Reported-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
> Reported-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Applied to cgroup/for-3.19-fixes.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
  2015-01-22 13:45         ` Johannes Weiner
@ 2015-01-23 15:00           ` Suzuki K. Poulose
  -1 siblings, 0 replies; 40+ messages in thread
From: Suzuki K. Poulose @ 2015-01-23 15:00 UTC (permalink / raw)
  To: Johannes Weiner, Will Deacon
  Cc: Vladimir Davydov, Tejun Heo, linux-mm, linux-kernel, mhocko, akpm

On 22/01/15 13:45, Johannes Weiner wrote:
> On Wed, Jan 21, 2015 at 04:39:55PM +0000, Will Deacon wrote:
>> On Mon, Jan 19, 2015 at 12:51:27PM +0000, Suzuki K. Poulose wrote:
>>> On 10/01/15 08:55, Vladimir Davydov wrote:
>>>> The problem is that the memory cgroup controller takes a css reference
>>>> per each charged page and does not reparent charged pages on css
>>>> offline, while cgroup_mount/cgroup_kill_sb expect all css references to
>>>> offline cgroups to be gone soon, restarting the syscall if the ref count
>>>> != 0. As a result, if you create a memory cgroup, charge some page cache
>>>> to it, and then remove it, unmount/mount will hang forever.
>>>>
>>>> May be, we should kill the ref counter to the memory controller root in
>>>> cgroup_kill_sb only if there is no children at all, neither online nor
>>>> offline.
>>>>
>>>
>>> Still reproducible on 3.19-rc5 with the same setup.
>>
>> Yeah, I'm seeing the same failure on my setup too.
>>
>>>  From git bisect, the last good commit is :
>>>
>>> commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735
>>> Author: Pranith Kumar <bobby.prani@gmail.com>
>>> Date:   Wed Dec 10 15:42:28 2014 -0800
>>>
>>>       slab: replace smp_read_barrier_depends() with lockless_dereference()
>>
>> So that points at 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
>> as the offending commit.
>
> With b2052564e66d ("mm: memcontrol: continue cache reclaim from
> offlined groups"), page cache can pin an old css and its ancestors
> indefinitely, making that hang in a second mount() very likely.
>
> However, swap entries have also been doing that for quite a while now,
> and as Vladimir pointed out, the same is true for kernel memory.  This
> latest change just makes this existing bug easier to trigger.
>
> I think we have to update the lifetime rules to reflect reality here:
> memory and swap lifetime is indefinite, so once the memory controller
> is used, it has state that is independent from whether its mounted or
> not.  We can support an identical remount, but have to fail mounting
> with new parameters that would change the behavior of the controller.
>
> Suzuki, Will, could you give the following patch a shot?


>
> Tejun, would that route be acceptable to you?
>
> Thanks
>
> ---
>  From c5e88d02d185c52748df664aa30a2c5f8949b0f7 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Thu, 22 Jan 2015 08:16:31 -0500
> Subject: [patch] kernel: cgroup: prevent mount hang due to memory controller
>   lifetime
>

>
> Don't offline the controller root as long as there are any children,
> dead or alive.  A remount will no longer wait for these old references
> to drain, it will simply mount the persistent controller state again.
>
> Reported-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
> Reported-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
This one fixes the issue.

Tested-by : Suzuki K. Poulose <suzuki.poulose@arm.com>

Thanks
Suzuki




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Regression] 3.19-rc3 : memcg: Hang in mount memcg
@ 2015-01-23 15:00           ` Suzuki K. Poulose
  0 siblings, 0 replies; 40+ messages in thread
From: Suzuki K. Poulose @ 2015-01-23 15:00 UTC (permalink / raw)
  To: Johannes Weiner, Will Deacon
  Cc: Vladimir Davydov, Tejun Heo, linux-mm, linux-kernel, mhocko, akpm

On 22/01/15 13:45, Johannes Weiner wrote:
> On Wed, Jan 21, 2015 at 04:39:55PM +0000, Will Deacon wrote:
>> On Mon, Jan 19, 2015 at 12:51:27PM +0000, Suzuki K. Poulose wrote:
>>> On 10/01/15 08:55, Vladimir Davydov wrote:
>>>> The problem is that the memory cgroup controller takes a css reference
>>>> per each charged page and does not reparent charged pages on css
>>>> offline, while cgroup_mount/cgroup_kill_sb expect all css references to
>>>> offline cgroups to be gone soon, restarting the syscall if the ref count
>>>> != 0. As a result, if you create a memory cgroup, charge some page cache
>>>> to it, and then remove it, unmount/mount will hang forever.
>>>>
>>>> May be, we should kill the ref counter to the memory controller root in
>>>> cgroup_kill_sb only if there is no children at all, neither online nor
>>>> offline.
>>>>
>>>
>>> Still reproducible on 3.19-rc5 with the same setup.
>>
>> Yeah, I'm seeing the same failure on my setup too.
>>
>>>  From git bisect, the last good commit is :
>>>
>>> commit 8df0c2dcf61781d2efa8e6e5b06870f6c6785735
>>> Author: Pranith Kumar <bobby.prani@gmail.com>
>>> Date:   Wed Dec 10 15:42:28 2014 -0800
>>>
>>>       slab: replace smp_read_barrier_depends() with lockless_dereference()
>>
>> So that points at 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
>> as the offending commit.
>
> With b2052564e66d ("mm: memcontrol: continue cache reclaim from
> offlined groups"), page cache can pin an old css and its ancestors
> indefinitely, making that hang in a second mount() very likely.
>
> However, swap entries have also been doing that for quite a while now,
> and as Vladimir pointed out, the same is true for kernel memory.  This
> latest change just makes this existing bug easier to trigger.
>
> I think we have to update the lifetime rules to reflect reality here:
> memory and swap lifetime is indefinite, so once the memory controller
> is used, it has state that is independent from whether its mounted or
> not.  We can support an identical remount, but have to fail mounting
> with new parameters that would change the behavior of the controller.
>
> Suzuki, Will, could you give the following patch a shot?


>
> Tejun, would that route be acceptable to you?
>
> Thanks
>
> ---
>  From c5e88d02d185c52748df664aa30a2c5f8949b0f7 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Thu, 22 Jan 2015 08:16:31 -0500
> Subject: [patch] kernel: cgroup: prevent mount hang due to memory controller
>   lifetime
>

>
> Don't offline the controller root as long as there are any children,
> dead or alive.  A remount will no longer wait for these old references
> to drain, it will simply mount the persistent controller state again.
>
> Reported-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
> Reported-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
This one fixes the issue.

Tested-by : Suzuki K. Poulose <suzuki.poulose@arm.com>

Thanks
Suzuki



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2015-01-23 15:00 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-09 17:43 [Regression] 3.19-rc3 : memcg: Hang in mount memcg Suzuki K. Poulose
2015-01-09 17:43 ` Suzuki K. Poulose
2015-01-09 21:46 ` Tejun Heo
2015-01-09 21:46   ` Tejun Heo
2015-01-12 17:02   ` Suzuki K. Poulose
2015-01-12 17:02     ` Suzuki K. Poulose
2015-01-10  8:55 ` Vladimir Davydov
2015-01-10  8:55   ` Vladimir Davydov
2015-01-10 21:43   ` [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback Tejun Heo
2015-01-10 21:43     ` Tejun Heo
2015-01-11 20:55     ` Johannes Weiner
2015-01-11 20:55       ` Johannes Weiner
2015-01-12  8:01       ` Vladimir Davydov
2015-01-12  8:01         ` Vladimir Davydov
2015-01-12 11:28         ` Tejun Heo
2015-01-12 11:28           ` Tejun Heo
2015-01-12 12:59           ` Vladimir Davydov
2015-01-12 12:59             ` Vladimir Davydov
2015-01-12 13:05             ` Tejun Heo
2015-01-12 13:05               ` Tejun Heo
2015-01-14 11:16       ` Suzuki K. Poulose
2015-01-14 11:16         ` Suzuki K. Poulose
2015-01-15 17:56       ` Michal Hocko
2015-01-15 17:56         ` Michal Hocko
2015-01-15 17:26     ` Michal Hocko
2015-01-15 17:26       ` Michal Hocko
2015-01-19 12:51   ` [Regression] 3.19-rc3 : memcg: Hang in mount memcg Suzuki K. Poulose
2015-01-19 12:51     ` Suzuki K. Poulose
2015-01-21 16:39     ` Will Deacon
2015-01-21 16:39       ` Will Deacon
2015-01-22 13:45       ` Johannes Weiner
2015-01-22 13:45         ` Johannes Weiner
2015-01-22 14:34         ` Tejun Heo
2015-01-22 14:34           ` Tejun Heo
2015-01-22 15:19           ` Johannes Weiner
2015-01-22 15:19             ` Johannes Weiner
2015-01-22 15:28             ` Tejun Heo
2015-01-22 15:28               ` Tejun Heo
2015-01-23 15:00         ` Suzuki K. Poulose
2015-01-23 15:00           ` Suzuki K. Poulose

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.