All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] cgroup: fix top cgroup refcnt leak
@ 2018-12-28 23:59 Andrei Vagin
  2018-12-29  0:04 ` [PATCH vfs/for-next v2] " Andrei Vagin
  2019-01-02 22:26 ` [PATCH vfs/for-next v2] " David Howells
  0 siblings, 2 replies; 29+ messages in thread
From: Andrei Vagin @ 2018-12-28 23:59 UTC (permalink / raw)
  To: Alexander Viro, David Howells
  Cc: linux-fsdevel, cgroups, Andrei Vagin, Li Zefan

It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
commit was reverted by mistake.

$ mkdir /tmp/cgroup
$ mkdir /tmp/cgroup2
$ mount -t cgroup -o none,name=test test /tmp/cgroup
$ mount -t cgroup -o none,name=test test /tmp/cgroup2
$ umount /tmp/cgroup
$ umount /tmp/cgroup2
$ cat /proc/self/cgroup | grep test
12:name=test:/

You can see the test cgroup was not freed.

Cc: Li Zefan <lizefan@huawei.com>
Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
---
 kernel/cgroup/cgroup.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index fb0717696895..dbb8805bf66c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2045,8 +2045,11 @@ int cgroup_do_get_tree(struct fs_context *fc)
 	}
 
 	ret = 0;
-	if (ctx->kfc.new_sb_created)
+	if (ctx->kfc.new_sb_created) {
 		goto out_cgrp;
+	} else {
+		cgroup_put(&ctx->root->cgrp);
+	}
 	apply_cgroup_root_flags(ctx->flags);
 	return 0;
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak
  2018-12-28 23:59 [PATCH] cgroup: fix top cgroup refcnt leak Andrei Vagin
@ 2018-12-29  0:04 ` Andrei Vagin
  2018-12-30 19:41   ` Andrei Vagin
  2019-01-02  2:28   ` Al Viro
  2019-01-02 22:26 ` [PATCH vfs/for-next v2] " David Howells
  1 sibling, 2 replies; 29+ messages in thread
From: Andrei Vagin @ 2018-12-29  0:04 UTC (permalink / raw)
  To: Alexander Viro, David Howells
  Cc: linux-fsdevel, cgroups, Andrei Vagin, Li Zefan

It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
commit was reverted by mistake.

$ mkdir /tmp/cgroup
$ mkdir /tmp/cgroup2
$ mount -t cgroup -o none,name=test test /tmp/cgroup
$ mount -t cgroup -o none,name=test test /tmp/cgroup2
$ umount /tmp/cgroup
$ umount /tmp/cgroup2
$ cat /proc/self/cgroup | grep test
12:name=test:/

You can see the test cgroup was not freed.

Cc: Li Zefan <lizefan@huawei.com>
Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
---

v2: clean up code and add the vfs/for-next tag

 kernel/cgroup/cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index fb0717696895..f63974a3725f 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2047,6 +2047,9 @@ int cgroup_do_get_tree(struct fs_context *fc)
 	ret = 0;
 	if (ctx->kfc.new_sb_created)
 		goto out_cgrp;
+	else
+		cgroup_put(&ctx->root->cgrp);
+
 	apply_cgroup_root_flags(ctx->flags);
 	return 0;
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak
  2018-12-29  0:04 ` [PATCH vfs/for-next v2] " Andrei Vagin
@ 2018-12-30 19:41   ` Andrei Vagin
  2019-01-02  2:28   ` Al Viro
  1 sibling, 0 replies; 29+ messages in thread
From: Andrei Vagin @ 2018-12-30 19:41 UTC (permalink / raw)
  To: Alexander Viro, David Howells; +Cc: linux-fsdevel, cgroups, Li Zefan

Alexander and David,

This patch fixed the problem which was reported more than a mounth ago
https://patchwork.kernel.org/patch/10610671/

Now this code is in the linux-next and the problem blocks running CRIU
tests on the linux next kernels:

https://travis-ci.org/avagin/linux/builds

Thanks,
Andrei

On Fri, Dec 28, 2018 at 04:04:00PM -0800, Andrei Vagin wrote:
> It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> commit was reverted by mistake.
> 
> $ mkdir /tmp/cgroup
> $ mkdir /tmp/cgroup2
> $ mount -t cgroup -o none,name=test test /tmp/cgroup
> $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> $ umount /tmp/cgroup
> $ umount /tmp/cgroup2
> $ cat /proc/self/cgroup | grep test
> 12:name=test:/
> 
> You can see the test cgroup was not freed.
> 
> Cc: Li Zefan <lizefan@huawei.com>
> Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> Signed-off-by: Andrei Vagin <avagin@gmail.com>
> ---
> 
> v2: clean up code and add the vfs/for-next tag
> 
>  kernel/cgroup/cgroup.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index fb0717696895..f63974a3725f 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2047,6 +2047,9 @@ int cgroup_do_get_tree(struct fs_context *fc)
>  	ret = 0;
>  	if (ctx->kfc.new_sb_created)
>  		goto out_cgrp;
> +	else
> +		cgroup_put(&ctx->root->cgrp);
> +
>  	apply_cgroup_root_flags(ctx->flags);
>  	return 0;
>  
> -- 
> 2.17.2
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak
  2018-12-29  0:04 ` [PATCH vfs/for-next v2] " Andrei Vagin
  2018-12-30 19:41   ` Andrei Vagin
@ 2019-01-02  2:28   ` Al Viro
  2019-01-02 18:14     ` [PATCH vfs/for-next v3] " Andrei Vagin
                       ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: Al Viro @ 2019-01-02  2:28 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: David Howells, linux-fsdevel, cgroups, Li Zefan

On Fri, Dec 28, 2018 at 04:04:00PM -0800, Andrei Vagin wrote:
> It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> commit was reverted by mistake.
> 
> $ mkdir /tmp/cgroup
> $ mkdir /tmp/cgroup2
> $ mount -t cgroup -o none,name=test test /tmp/cgroup
> $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> $ umount /tmp/cgroup
> $ umount /tmp/cgroup2
> $ cat /proc/self/cgroup | grep test
> 12:name=test:/
> 
> You can see the test cgroup was not freed.
> 
> Cc: Li Zefan <lizefan@huawei.com>
> Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> Signed-off-by: Andrei Vagin <avagin@gmail.com>
> ---
> 
> v2: clean up code and add the vfs/for-next tag
> 
>  kernel/cgroup/cgroup.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index fb0717696895..f63974a3725f 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2047,6 +2047,9 @@ int cgroup_do_get_tree(struct fs_context *fc)
>  	ret = 0;
>  	if (ctx->kfc.new_sb_created)
>  		goto out_cgrp;
> +	else
> +		cgroup_put(&ctx->root->cgrp);
> +
>  	apply_cgroup_root_flags(ctx->flags);
>  	return 0;

That looks horrible, especially since out_cgrp is return ret;
If anything, it should be
	if (!ctx->kfc.new_sb_created) {
		cgroup_put(&ctx->root->cgrp);
		apply_cgroup_root_flags(ctx->flags);
	}
	return 0;

What I don't understand is why apply_cgroup_root_flags() is not
called in "new superblock" case here.  It used to, prior to that
conversion...

Another fishy place I see there is
                nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb);
                if (IS_ERR(nsdentry))
                        return PTR_ERR(nsdentry);
                dput(fc->root);
                fc->root = nsdentry;
What happens if we get here with non-NULL fc->root (and we'd better,
after successful from kernfs_get_tree() a bit earlier) and hit that
failure exit?  A leak?

With apologies for being MIA for a week - it had been insane here...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH vfs/for-next v3] cgroup: fix top cgroup refcnt leak
  2019-01-02  2:28   ` Al Viro
@ 2019-01-02 18:14     ` Andrei Vagin
  2019-01-02 19:37     ` [PATCH vfs/for-next v2] " Andrei Vagin
  2019-01-02 19:37     ` [PATCH vfs/for-next v4] " Andrei Vagin
  2 siblings, 0 replies; 29+ messages in thread
From: Andrei Vagin @ 2019-01-02 18:14 UTC (permalink / raw)
  To: Alexander Viro, David Howells
  Cc: linux-fsdevel, cgroups, Andrei Vagin, Li Zefan

It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
commit was reverted by mistake.

$ mkdir /tmp/cgroup
$ mkdir /tmp/cgroup2
$ mount -t cgroup -o none,name=test test /tmp/cgroup
$ mount -t cgroup -o none,name=test test /tmp/cgroup2
$ umount /tmp/cgroup
$ umount /tmp/cgroup2
$ cat /proc/self/cgroup | grep test
12:name=test:/

You can see the test cgroup was not freed.

Cc: Li Zefan <lizefan@huawei.com>
Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
---

v2: clean up code and add the vfs/for-next tag
v3: fix a reference leak when kernfs_node_dentry fails 

 kernel/cgroup/cgroup.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index fb0717696895..b43479d1f9a3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc)
 
 	ret = kernfs_get_tree(fc);
 	if (ret < 0)
-		goto out_cgrp;
+		return ret;
 
 	/*
 	 * In non-init cgroup namespace, instead of root cgroup's dentry,
@@ -2038,19 +2038,22 @@ int cgroup_do_get_tree(struct fs_context *fc)
 		mutex_unlock(&cgroup_mutex);
 
 		nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb);
-		if (IS_ERR(nsdentry))
-			return PTR_ERR(nsdentry);
+		if (IS_ERR(nsdentry)) {
+			ret = PTR_ERR(nsdentry);
+			goto out_cgrp;
+		}
 		dput(fc->root);
 		fc->root = nsdentry;
 	}
 
 	ret = 0;
-	if (ctx->kfc.new_sb_created)
-		goto out_cgrp;
-	apply_cgroup_root_flags(ctx->flags);
-	return 0;
+	if (!ctx->kfc.new_sb_created)
+		apply_cgroup_root_flags(ctx->flags);
 
 out_cgrp:
+	if (!ctx->kfc.new_sb_created)
+		cgroup_put(&ctx->root->cgrp);
+
 	return ret;
 }
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak
  2019-01-02  2:28   ` Al Viro
  2019-01-02 18:14     ` [PATCH vfs/for-next v3] " Andrei Vagin
@ 2019-01-02 19:37     ` Andrei Vagin
  2019-01-02 19:37     ` [PATCH vfs/for-next v4] " Andrei Vagin
  2 siblings, 0 replies; 29+ messages in thread
From: Andrei Vagin @ 2019-01-02 19:37 UTC (permalink / raw)
  To: Al Viro; +Cc: David Howells, linux-fsdevel, cgroups, Li Zefan

On Wed, Jan 02, 2019 at 02:28:04AM +0000, Al Viro wrote:
> On Fri, Dec 28, 2018 at 04:04:00PM -0800, Andrei Vagin wrote:
> > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> > commit was reverted by mistake.
> > 
> > $ mkdir /tmp/cgroup
> > $ mkdir /tmp/cgroup2
> > $ mount -t cgroup -o none,name=test test /tmp/cgroup
> > $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> > $ umount /tmp/cgroup
> > $ umount /tmp/cgroup2
> > $ cat /proc/self/cgroup | grep test
> > 12:name=test:/
> > 
> > You can see the test cgroup was not freed.
> > 
> > Cc: Li Zefan <lizefan@huawei.com>
> > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> > Signed-off-by: Andrei Vagin <avagin@gmail.com>
> > ---
> > 
> > v2: clean up code and add the vfs/for-next tag
> > 
> >  kernel/cgroup/cgroup.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index fb0717696895..f63974a3725f 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -2047,6 +2047,9 @@ int cgroup_do_get_tree(struct fs_context *fc)
> >  	ret = 0;
> >  	if (ctx->kfc.new_sb_created)
> >  		goto out_cgrp;
> > +	else
> > +		cgroup_put(&ctx->root->cgrp);
> > +
> >  	apply_cgroup_root_flags(ctx->flags);
> >  	return 0;
> 
> That looks horrible, especially since out_cgrp is return ret;
> If anything, it should be
> 	if (!ctx->kfc.new_sb_created) {
> 		cgroup_put(&ctx->root->cgrp);
> 		apply_cgroup_root_flags(ctx->flags);
> 	}
> 	return 0;
> 
> What I don't understand is why apply_cgroup_root_flags() is not
> called in "new superblock" case here.  It used to, prior to that
> conversion...

It is a good question and I don't have an answer on it. I think David
can tell more about this.

> 
> Another fishy place I see there is
>                 nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb);
>                 if (IS_ERR(nsdentry))
>                         return PTR_ERR(nsdentry);
>                 dput(fc->root);
>                 fc->root = nsdentry;
> What happens if we get here with non-NULL fc->root (and we'd better,
> after successful from kernfs_get_tree() a bit earlier) and hit that
> failure exit?  A leak?

Yes, here is a leak too. I fixed it and sent v3, but then I decided that
it would be good to test this error path and found one more problem:

[   22.669696] ================================================
[   22.670468] WARNING: lock held when returning to user space!
[   22.671225] 4.20.0-rc1-00081-g01a72fa4bd0e-dirty #12 Not tainted
[   22.672018] ------------------------------------------------
[   22.672817] mount/1148 is leaving the kernel with locks still held!
[   22.673660] 1 lock held by mount/1148:
[   22.674165]  #0: 00000000c07be72c (&fc->fs_type->s_umount_key#41){+.+.}, at: grab_super+0x29/0x90

deactivate_locked_super() has to be called on the error path.  I sent v4 with
this fix. I'm sorry for the noise in the mailing list, I had to test this code
before sending v3;

> With apologies for being MIA for a week - it had been insane here...

Good to see you back in stride.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH vfs/for-next v4] cgroup: fix top cgroup refcnt leak
  2019-01-02  2:28   ` Al Viro
  2019-01-02 18:14     ` [PATCH vfs/for-next v3] " Andrei Vagin
  2019-01-02 19:37     ` [PATCH vfs/for-next v2] " Andrei Vagin
@ 2019-01-02 19:37     ` Andrei Vagin
  2019-01-02 20:02       ` Al Viro
  2 siblings, 1 reply; 29+ messages in thread
From: Andrei Vagin @ 2019-01-02 19:37 UTC (permalink / raw)
  To: Alexander Viro, David Howells
  Cc: linux-fsdevel, cgroups, Andrei Vagin, Li Zefan

It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
commit was reverted by mistake.

$ mkdir /tmp/cgroup
$ mkdir /tmp/cgroup2
$ mount -t cgroup -o none,name=test test /tmp/cgroup
$ mount -t cgroup -o none,name=test test /tmp/cgroup2
$ umount /tmp/cgroup
$ umount /tmp/cgroup2
$ cat /proc/self/cgroup | grep test
12:name=test:/

You can see the test cgroup was not freed.

Cc: Li Zefan <lizefan@huawei.com>
Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
---

v2: clean up code and add the vfs/for-next tag
v3: fix a reference leak when kernfs_node_dentry fails
v4: call deactivate_locked_super() in a error case

 kernel/cgroup/cgroup.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index fb0717696895..309c8507fe6c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc)
 
 	ret = kernfs_get_tree(fc);
 	if (ret < 0)
-		goto out_cgrp;
+		return ret;
 
 	/*
 	 * In non-init cgroup namespace, instead of root cgroup's dentry,
@@ -2038,19 +2038,28 @@ int cgroup_do_get_tree(struct fs_context *fc)
 		mutex_unlock(&cgroup_mutex);
 
 		nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb);
-		if (IS_ERR(nsdentry))
-			return PTR_ERR(nsdentry);
+		if (IS_ERR(nsdentry)) {
+			ret = PTR_ERR(nsdentry);
+			goto out_cgrp;
+		}
 		dput(fc->root);
 		fc->root = nsdentry;
 	}
 
 	ret = 0;
-	if (ctx->kfc.new_sb_created)
-		goto out_cgrp;
-	apply_cgroup_root_flags(ctx->flags);
-	return 0;
+	if (!ctx->kfc.new_sb_created)
+		apply_cgroup_root_flags(ctx->flags);
 
 out_cgrp:
+	if (!ctx->kfc.new_sb_created)
+		cgroup_put(&ctx->root->cgrp);
+
+	if (unlikely(ret)) {
+		dput(fc->root);
+		deactivate_locked_super(fc->root->d_sb);
+		fc->root = NULL;
+	}
+
 	return ret;
 }
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v4] cgroup: fix top cgroup refcnt leak
  2019-01-02 19:37     ` [PATCH vfs/for-next v4] " Andrei Vagin
@ 2019-01-02 20:02       ` Al Viro
  2019-01-02 21:06         ` Andrei Vagin
  2019-01-03  0:26         ` David Howells
  0 siblings, 2 replies; 29+ messages in thread
From: Al Viro @ 2019-01-02 20:02 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: David Howells, linux-fsdevel, cgroups, Li Zefan

On Wed, Jan 02, 2019 at 11:37:56AM -0800, Andrei Vagin wrote:
> +	if (unlikely(ret)) {
> +		dput(fc->root);
> +		deactivate_locked_super(fc->root->d_sb);
> +		fc->root = NULL;

Er...  I'd rather not dereference the damn thing after dput().
It's _probably_ OK (fc->root here, AFAICS, is going to be
equal to its ->sb->s_root, and thus still pinned), but it's
better to fetch ->d_sb first.  If nothing else, it's easier
to prove correctness that way...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH vfs/for-next v4] cgroup: fix top cgroup refcnt leak
  2019-01-02 20:02       ` Al Viro
@ 2019-01-02 21:06         ` Andrei Vagin
  2019-01-03  0:26         ` David Howells
  1 sibling, 0 replies; 29+ messages in thread
From: Andrei Vagin @ 2019-01-02 21:06 UTC (permalink / raw)
  To: Alexander Viro, David Howells
  Cc: linux-fsdevel, cgroups, Andrei Vagin, Li Zefan

It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
commit was reverted by mistake.

$ mkdir /tmp/cgroup
$ mkdir /tmp/cgroup2
$ mount -t cgroup -o none,name=test test /tmp/cgroup
$ mount -t cgroup -o none,name=test test /tmp/cgroup2
$ umount /tmp/cgroup
$ umount /tmp/cgroup2
$ cat /proc/self/cgroup | grep test
12:name=test:/

You can see the test cgroup was not freed.

Cc: Li Zefan <lizefan@huawei.com>
Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
---

v2: clean up code and add the vfs/for-next tag
v3: fix a reference leak when kernfs_node_dentry fails
v4: call deactivate_locked_super() in a error case
v5: don't dereference fc->root after dput()

 kernel/cgroup/cgroup.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index fb0717696895..53b730cf1f7b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc)
 
 	ret = kernfs_get_tree(fc);
 	if (ret < 0)
-		goto out_cgrp;
+		return ret;
 
 	/*
 	 * In non-init cgroup namespace, instead of root cgroup's dentry,
@@ -2038,19 +2038,30 @@ int cgroup_do_get_tree(struct fs_context *fc)
 		mutex_unlock(&cgroup_mutex);
 
 		nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb);
-		if (IS_ERR(nsdentry))
-			return PTR_ERR(nsdentry);
+		if (IS_ERR(nsdentry)) {
+			ret = PTR_ERR(nsdentry);
+			goto out_cgrp;
+		}
 		dput(fc->root);
 		fc->root = nsdentry;
 	}
 
 	ret = 0;
-	if (ctx->kfc.new_sb_created)
-		goto out_cgrp;
-	apply_cgroup_root_flags(ctx->flags);
-	return 0;
+	if (!ctx->kfc.new_sb_created)
+		apply_cgroup_root_flags(ctx->flags);
 
 out_cgrp:
+	if (!ctx->kfc.new_sb_created)
+		cgroup_put(&ctx->root->cgrp);
+
+	if (unlikely(ret)) {
+		struct super_block *sb = fc->root->d_sb;
+
+		dput(fc->root);
+		deactivate_locked_super(sb);
+		fc->root = NULL;
+	}
+
 	return ret;
 }
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak
  2018-12-28 23:59 [PATCH] cgroup: fix top cgroup refcnt leak Andrei Vagin
  2018-12-29  0:04 ` [PATCH vfs/for-next v2] " Andrei Vagin
@ 2019-01-02 22:26 ` David Howells
  2019-01-02 23:06   ` Andrei Vagin
  1 sibling, 1 reply; 29+ messages in thread
From: David Howells @ 2019-01-02 22:26 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: dhowells, Alexander Viro, linux-fsdevel, cgroups, Li Zefan

Andrei Vagin <avagin@gmail.com> wrote:

> It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> commit was reverted by mistake.
> 
> $ mkdir /tmp/cgroup
> $ mkdir /tmp/cgroup2
> $ mount -t cgroup -o none,name=test test /tmp/cgroup
> $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> $ umount /tmp/cgroup
> $ umount /tmp/cgroup2
> $ cat /proc/self/cgroup | grep test
> 12:name=test:/
> 
> You can see the test cgroup was not freed.
> 
> Cc: Li Zefan <lizefan@huawei.com>
> Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> Signed-off-by: Andrei Vagin <avagin@gmail.com>

The kernel (Al's for-next branch, that is) seems to work fine without this
patch; this patch causes the kernel to go bang (trace below).

David
---
percpu ref (css_release) <= 0 (0) after switching to atomic
WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x90/0x1a0
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.20.0-rc1-fscache+ #1256
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0
Code: d8 48 85 c0 7f 26 80 3d 29 60 ff 00 00 75 1d 48 8b 53 d8 48 c7 c7 00 3c 16 82 c6 05 15 60 ff 00 01 48 8b 73 e8 e8 26 1f ac ff <0f> 0b 48 8d 43 d8 48 89 c7 49 89 c6 ff 53 f0 48 c7 43 f0 00 00 00
RSP: 0018:ffff8800c6c83ed8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8800d3a6e048 RCX: ffff8800c6c83dc4
RDX: 0000000000000003 RSI: ffff8800c5c9cb08 RDI: ffff8800c5c9c340
RBP: ffff8800c6c83ef8 R08: 000000000000003b R09: 0000000000021900
R10: ffff8800c6c83b00 R11: 0000004ac8b230c2 R12: 000060ff39019cd0
R13: ffffffff825b0be8 R14: ffffffffffffffff R15: 0000000000000009
FS:  0000000000000000(0000) GS:ffff8800c6c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f33b5ad1f54 CR3: 0000000002410001 CR4: 00000000001606e0
Call Trace:
 <IRQ>
 ? rcu_process_callbacks+0x469/0x6de
 ? percpu_ref_exit+0x26/0x26
 rcu_process_callbacks+0x4d7/0x6de
 __do_softirq+0x1a5/0x38f
 irq_exit+0x63/0xd1
 smp_apic_timer_interrupt+0x1cd/0x1e0
 apic_timer_interrupt+0xf/0x20
 </IRQ>
RIP: 0010:cpuidle_enter_state+0x24e/0x2b1
Code: ff e8 7b be 8a ff 45 84 ed 74 17 9c 58 0f ba e0 09 73 08 0f 0b fa e8 63 8d 94 ff 31 ff e8 14 af 8f ff e8 cd 8b 94 ff fb 85 db <78> 47 48 8b 14 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48
RSP: 0018:ffff8800c5d23ea0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
RAX: ffff8800c5c9c340 RBX: 0000000000000005 RCX: 000000000000001f
RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffff8800c5c9c340
RBP: 0000000000000005 R08: 0000000000000002 R09: 0000000000021900
R10: 071c71c71c71c71c R11: 0000004ac7f2c3ed R12: ffffe8ffffc89ed0
R13: 0000000000000000 R14: ffffffff8251d478 R15: 0000000000000000
 do_idle+0x163/0x1ea
 cpu_startup_entry+0x1d/0x1f
 start_secondary+0x175/0x190
 secondary_startup_64+0xa4/0xb0
irq event stamp: 99625
hardirqs last  enabled at (99624): [<ffffffff810b18a5>] vprintk_emit+0xe6/0x24a
hardirqs last disabled at (99625): [<ffffffff81001639>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last  enabled at (99590): [<ffffffff8105eba9>] irq_enter+0x42/0x5d
softirqs last disabled at (99591): [<ffffffff8105ec27>] irq_exit+0x63/0xd1
---[ end trace f59fd95ebc091779 ]---

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak
  2019-01-02 22:26 ` [PATCH vfs/for-next v2] " David Howells
@ 2019-01-02 23:06   ` Andrei Vagin
  2019-01-02 23:31     ` Andrei Vagin
                       ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Andrei Vagin @ 2019-01-02 23:06 UTC (permalink / raw)
  To: David Howells; +Cc: Alexander Viro, linux-fsdevel, cgroups, Li Zefan

On Wed, Jan 02, 2019 at 10:26:20PM +0000, David Howells wrote:
> Andrei Vagin <avagin@gmail.com> wrote:
> 
> > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> > commit was reverted by mistake.
> > 
> > $ mkdir /tmp/cgroup
> > $ mkdir /tmp/cgroup2
> > $ mount -t cgroup -o none,name=test test /tmp/cgroup
> > $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> > $ umount /tmp/cgroup
> > $ umount /tmp/cgroup2
> > $ cat /proc/self/cgroup | grep test
> > 12:name=test:/
> > 
> > You can see the test cgroup was not freed.
> > 
> > Cc: Li Zefan <lizefan@huawei.com>
> > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> > Signed-off-by: Andrei Vagin <avagin@gmail.com>
> 
> The kernel (Al's for-next branch, that is) seems to work fine without this
> patch;

It doesn't work for me. The mount system call stucks in cgroup1_get_tree:

[root@fc24 ~]# uname -a
Linux fc24 4.20.0-rc1-00071-g1fab5fff0a7a #4 SMP Wed Jan 2 14:59:36 PST 2019 x86_64 x86_64 x86_64 GNU/Linux

[avagin@laptop linux]$ git log | head -n 6
commit 1fab5fff0a7ae1fa3b78383a78f7a56f03a3d673
Merge: ea5751ccd665 fd6261f4322c 4addd2640fca a40612ef0ee1 f91528955d00
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Fri Dec 28 02:05:08 2018 -0500

    Merge branches 'work.mount', 'work.misc', 'misc.misc' and 'work.iov_iter' into for-next


$ ps axf
...
  636 ?        S      0:00      \_ sshd: root@pts/0
  643 pts/0    Ss     0:00          \_ -bash
  710 pts/0    S      0:00              \_ python test/zdtm.py run -p 4 --keep-going --report report -T .*cgroup.* --ignore-taint
 1449 pts/0    S      0:00              |   \_ flock zdtm_mount_cgroups.lock ./zdtm_umount_cgroups
 1450 pts/0    S      0:00              |       \_ /bin/sh ./zdtm_umount_cgroups
 1456 pts/0    D      0:00              |           \_ mount -t cgroup -o none,name=zdtmtst.defaultroot zdtm zdtm.EW1qB8


[root@fc24 criu]# cat /proc/1456/stack 
[<0>] msleep+0x38/0x40
[<0>] cgroup1_get_tree+0x47b/0x76b
[<0>] vfs_get_tree+0x3d/0x100
[<0>] do_mount+0x2d8/0xde0
[<0>] ksys_mount+0xba/0xd0
[<0>] __x64_sys_mount+0x21/0x30
[<0>] do_syscall_64+0x5a/0x200
[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[<0>] 0xffffffffffffffff

> this patch causes the kernel to go bang (trace below).
> 
> David
> ---
> percpu ref (css_release) <= 0 (0) after switching to atomic

I have updated to Al's vfs-next and now I see this error too.

I tested by patch on 40effd960bec ("Merge branches 'work.mount', 'work.misc', 'misc.misc' and 'work.iov_iter' into for-next")
and it works fine there...


> WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x90/0x1a0
> Modules linked in:
> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.20.0-rc1-fscache+ #1256
> Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
> RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0
> Code: d8 48 85 c0 7f 26 80 3d 29 60 ff 00 00 75 1d 48 8b 53 d8 48 c7 c7 00 3c 16 82 c6 05 15 60 ff 00 01 48 8b 73 e8 e8 26 1f ac ff <0f> 0b 48 8d 43 d8 48 89 c7 49 89 c6 ff 53 f0 48 c7 43 f0 00 00 00
> RSP: 0018:ffff8800c6c83ed8 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: ffff8800d3a6e048 RCX: ffff8800c6c83dc4
> RDX: 0000000000000003 RSI: ffff8800c5c9cb08 RDI: ffff8800c5c9c340
> RBP: ffff8800c6c83ef8 R08: 000000000000003b R09: 0000000000021900
> R10: ffff8800c6c83b00 R11: 0000004ac8b230c2 R12: 000060ff39019cd0
> R13: ffffffff825b0be8 R14: ffffffffffffffff R15: 0000000000000009
> FS:  0000000000000000(0000) GS:ffff8800c6c80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f33b5ad1f54 CR3: 0000000002410001 CR4: 00000000001606e0
> Call Trace:
>  <IRQ>
>  ? rcu_process_callbacks+0x469/0x6de
>  ? percpu_ref_exit+0x26/0x26
>  rcu_process_callbacks+0x4d7/0x6de
>  __do_softirq+0x1a5/0x38f
>  irq_exit+0x63/0xd1
>  smp_apic_timer_interrupt+0x1cd/0x1e0
>  apic_timer_interrupt+0xf/0x20
>  </IRQ>
> RIP: 0010:cpuidle_enter_state+0x24e/0x2b1
> Code: ff e8 7b be 8a ff 45 84 ed 74 17 9c 58 0f ba e0 09 73 08 0f 0b fa e8 63 8d 94 ff 31 ff e8 14 af 8f ff e8 cd 8b 94 ff fb 85 db <78> 47 48 8b 14 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48
> RSP: 0018:ffff8800c5d23ea0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
> RAX: ffff8800c5c9c340 RBX: 0000000000000005 RCX: 000000000000001f
> RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffff8800c5c9c340
> RBP: 0000000000000005 R08: 0000000000000002 R09: 0000000000021900
> R10: 071c71c71c71c71c R11: 0000004ac7f2c3ed R12: ffffe8ffffc89ed0
> R13: 0000000000000000 R14: ffffffff8251d478 R15: 0000000000000000
>  do_idle+0x163/0x1ea
>  cpu_startup_entry+0x1d/0x1f
>  start_secondary+0x175/0x190
>  secondary_startup_64+0xa4/0xb0
> irq event stamp: 99625
> hardirqs last  enabled at (99624): [<ffffffff810b18a5>] vprintk_emit+0xe6/0x24a
> hardirqs last disabled at (99625): [<ffffffff81001639>] trace_hardirqs_off_thunk+0x1a/0x1c
> softirqs last  enabled at (99590): [<ffffffff8105eba9>] irq_enter+0x42/0x5d
> softirqs last disabled at (99591): [<ffffffff8105ec27>] irq_exit+0x63/0xd1
> ---[ end trace f59fd95ebc091779 ]---

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak
  2019-01-02 23:06   ` Andrei Vagin
@ 2019-01-02 23:31     ` Andrei Vagin
  2019-01-03  0:33     ` David Howells
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Andrei Vagin @ 2019-01-02 23:31 UTC (permalink / raw)
  To: David Howells; +Cc: Alexander Viro, linux-fsdevel, cgroups, Li Zefan

On Wed, Jan 02, 2019 at 03:06:55PM -0800, Andrei Vagin wrote:
> On Wed, Jan 02, 2019 at 10:26:20PM +0000, David Howells wrote:
> > Andrei Vagin <avagin@gmail.com> wrote:
> > 
> > > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> > > commit was reverted by mistake.
> > > 
> > > $ mkdir /tmp/cgroup
> > > $ mkdir /tmp/cgroup2
> > > $ mount -t cgroup -o none,name=test test /tmp/cgroup
> > > $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> > > $ umount /tmp/cgroup
> > > $ umount /tmp/cgroup2
> > > $ cat /proc/self/cgroup | grep test
> > > 12:name=test:/
> > > 
> > > You can see the test cgroup was not freed.
> > > 
> > > Cc: Li Zefan <lizefan@huawei.com>
> > > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> > > Signed-off-by: Andrei Vagin <avagin@gmail.com>
> > 
> > The kernel (Al's for-next branch, that is) seems to work fine without this
> > patch;
> 
> It doesn't work for me. The mount system call stucks in cgroup1_get_tree:

Here is a reproducer for this problem:

[root@fc24 ~]# cat fs-vs-cg 
set -m
d=$(mktemp -d /tmp/cg.XXXXXX)
for i in `seq 2`; do 
	mkdir $d/a$i
	mount -t cgroup -o none,name=xxxxy xxx $d/a$i
done
mkdir -p $d/a1/test
for i in `seq 2`; do 
	umount $d/a$i
done
d=$(mktemp -d /tmp/cg.XXXXXX)
for i in `seq 2`; do 
	mkdir $d/a$i
	mount -t cgroup -o none,name=xxxxy xxx $d/a$i
done
rmdir $d/a1/test
for i in `seq 2`; do 
	umount $d/a$i
done

You need to execute this script twice and it will stuck on a second
attmept.

I tested it on Al's vfs-next without my patch.

You will see this warning in the kernel log:

[   47.043634] ------------[ cut here ]------------
[   47.044522] percpu ref (css_release) <= 0 (0) after switching to atomic
[   47.044544] WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1dc/0x210
[   47.047369] Modules linked in:
[   47.047911] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.20.0-rc1-00071-g1fab5fff0a7a #4

So my patch is probably correct and the problem is in another place.

> 
> [root@fc24 ~]# uname -a
> Linux fc24 4.20.0-rc1-00071-g1fab5fff0a7a #4 SMP Wed Jan 2 14:59:36 PST 2019 x86_64 x86_64 x86_64 GNU/Linux
> 
> [avagin@laptop linux]$ git log | head -n 6
> commit 1fab5fff0a7ae1fa3b78383a78f7a56f03a3d673
> Merge: ea5751ccd665 fd6261f4322c 4addd2640fca a40612ef0ee1 f91528955d00
> Author: Al Viro <viro@zeniv.linux.org.uk>
> Date:   Fri Dec 28 02:05:08 2018 -0500
> 
>     Merge branches 'work.mount', 'work.misc', 'misc.misc' and 'work.iov_iter' into for-next
> 
> 
> $ ps axf
> ...
>   636 ?        S      0:00      \_ sshd: root@pts/0
>   643 pts/0    Ss     0:00          \_ -bash
>   710 pts/0    S      0:00              \_ python test/zdtm.py run -p 4 --keep-going --report report -T .*cgroup.* --ignore-taint
>  1449 pts/0    S      0:00              |   \_ flock zdtm_mount_cgroups.lock ./zdtm_umount_cgroups
>  1450 pts/0    S      0:00              |       \_ /bin/sh ./zdtm_umount_cgroups
>  1456 pts/0    D      0:00              |           \_ mount -t cgroup -o none,name=zdtmtst.defaultroot zdtm zdtm.EW1qB8
> 
> 
> [root@fc24 criu]# cat /proc/1456/stack 
> [<0>] msleep+0x38/0x40
> [<0>] cgroup1_get_tree+0x47b/0x76b
> [<0>] vfs_get_tree+0x3d/0x100
> [<0>] do_mount+0x2d8/0xde0
> [<0>] ksys_mount+0xba/0xd0
> [<0>] __x64_sys_mount+0x21/0x30
> [<0>] do_syscall_64+0x5a/0x200
> [<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [<0>] 0xffffffffffffffff
> 
> > this patch causes the kernel to go bang (trace below).
> > 
> > David
> > ---
> > percpu ref (css_release) <= 0 (0) after switching to atomic
> 
> I have updated to Al's vfs-next and now I see this error too.
> 
> I tested by patch on 40effd960bec ("Merge branches 'work.mount', 'work.misc', 'misc.misc' and 'work.iov_iter' into for-next")
> and it works fine there...
> 
> 
> > WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x90/0x1a0
> > Modules linked in:
> > CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.20.0-rc1-fscache+ #1256
> > Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
> > RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0
> > Code: d8 48 85 c0 7f 26 80 3d 29 60 ff 00 00 75 1d 48 8b 53 d8 48 c7 c7 00 3c 16 82 c6 05 15 60 ff 00 01 48 8b 73 e8 e8 26 1f ac ff <0f> 0b 48 8d 43 d8 48 89 c7 49 89 c6 ff 53 f0 48 c7 43 f0 00 00 00
> > RSP: 0018:ffff8800c6c83ed8 EFLAGS: 00010286
> > RAX: 0000000000000000 RBX: ffff8800d3a6e048 RCX: ffff8800c6c83dc4
> > RDX: 0000000000000003 RSI: ffff8800c5c9cb08 RDI: ffff8800c5c9c340
> > RBP: ffff8800c6c83ef8 R08: 000000000000003b R09: 0000000000021900
> > R10: ffff8800c6c83b00 R11: 0000004ac8b230c2 R12: 000060ff39019cd0
> > R13: ffffffff825b0be8 R14: ffffffffffffffff R15: 0000000000000009
> > FS:  0000000000000000(0000) GS:ffff8800c6c80000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f33b5ad1f54 CR3: 0000000002410001 CR4: 00000000001606e0
> > Call Trace:
> >  <IRQ>
> >  ? rcu_process_callbacks+0x469/0x6de
> >  ? percpu_ref_exit+0x26/0x26
> >  rcu_process_callbacks+0x4d7/0x6de
> >  __do_softirq+0x1a5/0x38f
> >  irq_exit+0x63/0xd1
> >  smp_apic_timer_interrupt+0x1cd/0x1e0
> >  apic_timer_interrupt+0xf/0x20
> >  </IRQ>
> > RIP: 0010:cpuidle_enter_state+0x24e/0x2b1
> > Code: ff e8 7b be 8a ff 45 84 ed 74 17 9c 58 0f ba e0 09 73 08 0f 0b fa e8 63 8d 94 ff 31 ff e8 14 af 8f ff e8 cd 8b 94 ff fb 85 db <78> 47 48 8b 14 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48
> > RSP: 0018:ffff8800c5d23ea0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
> > RAX: ffff8800c5c9c340 RBX: 0000000000000005 RCX: 000000000000001f
> > RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffff8800c5c9c340
> > RBP: 0000000000000005 R08: 0000000000000002 R09: 0000000000021900
> > R10: 071c71c71c71c71c R11: 0000004ac7f2c3ed R12: ffffe8ffffc89ed0
> > R13: 0000000000000000 R14: ffffffff8251d478 R15: 0000000000000000
> >  do_idle+0x163/0x1ea
> >  cpu_startup_entry+0x1d/0x1f
> >  start_secondary+0x175/0x190
> >  secondary_startup_64+0xa4/0xb0
> > irq event stamp: 99625
> > hardirqs last  enabled at (99624): [<ffffffff810b18a5>] vprintk_emit+0xe6/0x24a
> > hardirqs last disabled at (99625): [<ffffffff81001639>] trace_hardirqs_off_thunk+0x1a/0x1c
> > softirqs last  enabled at (99590): [<ffffffff8105eba9>] irq_enter+0x42/0x5d
> > softirqs last disabled at (99591): [<ffffffff8105ec27>] irq_exit+0x63/0xd1
> > ---[ end trace f59fd95ebc091779 ]---

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v4] cgroup: fix top cgroup refcnt leak
  2019-01-02 20:02       ` Al Viro
  2019-01-02 21:06         ` Andrei Vagin
@ 2019-01-03  0:26         ` David Howells
  2019-01-03  0:43           ` Andrei Vagin
  1 sibling, 1 reply; 29+ messages in thread
From: David Howells @ 2019-01-03  0:26 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: dhowells, Alexander Viro, linux-fsdevel, cgroups, Li Zefan

Andrei Vagin <avagin@gmail.com> wrote:

> It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> commit was reverted by mistake.
> 
> $ mkdir /tmp/cgroup
> $ mkdir /tmp/cgroup2
> $ mount -t cgroup -o none,name=test test /tmp/cgroup
> $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> $ umount /tmp/cgroup
> $ umount /tmp/cgroup2
> $ cat /proc/self/cgroup | grep test
> 12:name=test:/
> 
> You can see the test cgroup was not freed.
> 
> Cc: Li Zefan <lizefan@huawei.com>
> Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> Signed-off-by: Andrei Vagin <avagin@gmail.com>
> ---
> 
> v2: clean up code and add the vfs/for-next tag
> v3: fix a reference leak when kernfs_node_dentry fails
> v4: call deactivate_locked_super() in a error case
> v5: don't dereference fc->root after dput()
> 
>  kernel/cgroup/cgroup.c | 25 ++++++++++++++++++-------
>  1 file changed, 18 insertions(+), 7 deletions(-)

This patch doesn't work either.

	percpu ref (css_release) <= 0 (0) after switching to atomic
	RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0

Btw, note that the subject says "v4" but the changelog says "v5".

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak
  2019-01-02 23:06   ` Andrei Vagin
  2019-01-02 23:31     ` Andrei Vagin
@ 2019-01-03  0:33     ` David Howells
  2019-01-03 13:41     ` David Howells
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: David Howells @ 2019-01-03  0:33 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: dhowells, Alexander Viro, linux-fsdevel, cgroups, Li Zefan

Andrei Vagin <avagin@gmail.com> wrote:

> Here is a reproducer for this problem:
> 
> [root@fc24 ~]# cat fs-vs-cg 
> set -m
> d=$(mktemp -d /tmp/cg.XXXXXX)
> for i in `seq 2`; do 
> 	mkdir $d/a$i
> 	mount -t cgroup -o none,name=xxxxy xxx $d/a$i
> done
> mkdir -p $d/a1/test
> for i in `seq 2`; do 
> 	umount $d/a$i
> done
> d=$(mktemp -d /tmp/cg.XXXXXX)
> for i in `seq 2`; do 
> 	mkdir $d/a$i
> 	mount -t cgroup -o none,name=xxxxy xxx $d/a$i
> done
> rmdir $d/a1/test
> for i in `seq 2`; do 
> 	umount $d/a$i
> done
> 
> You need to execute this script twice and it will stuck on a second
> attmept.

This causes:

	percpu ref (css_release) <= 0 (0) after switching to atomic

without your patch and:

	percpu ref (css_release) <= 0 (-2) after switching to atomic

with your patch, both followed by:

	WARNING: CPU: 0 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x90/0x1a0

on Al's for-next branch.

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v4] cgroup: fix top cgroup refcnt leak
  2019-01-03  0:26         ` David Howells
@ 2019-01-03  0:43           ` Andrei Vagin
  2019-01-03  1:00             ` Andrei Vagin
  0 siblings, 1 reply; 29+ messages in thread
From: Andrei Vagin @ 2019-01-03  0:43 UTC (permalink / raw)
  To: David Howells; +Cc: Alexander Viro, linux-fsdevel, cgroups, Li Zefan

On Thu, Jan 03, 2019 at 12:26:23AM +0000, David Howells wrote:
> Andrei Vagin <avagin@gmail.com> wrote:
> 
> > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> > commit was reverted by mistake.
> > 
> > $ mkdir /tmp/cgroup
> > $ mkdir /tmp/cgroup2
> > $ mount -t cgroup -o none,name=test test /tmp/cgroup
> > $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> > $ umount /tmp/cgroup
> > $ umount /tmp/cgroup2
> > $ cat /proc/self/cgroup | grep test
> > 12:name=test:/
> > 
> > You can see the test cgroup was not freed.
> > 
> > Cc: Li Zefan <lizefan@huawei.com>
> > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> > Signed-off-by: Andrei Vagin <avagin@gmail.com>
> > ---
> > 
> > v2: clean up code and add the vfs/for-next tag
> > v3: fix a reference leak when kernfs_node_dentry fails
> > v4: call deactivate_locked_super() in a error case
> > v5: don't dereference fc->root after dput()
> > 
> >  kernel/cgroup/cgroup.c | 25 ++++++++++++++++++-------
> >  1 file changed, 18 insertions(+), 7 deletions(-)
> 
> This patch doesn't work either.

I'm sorry, but we can't say anything about this patch now, because it
looks like recent changes in vfs-next break something else here...

> 
> 	percpu ref (css_release) <= 0 (0) after switching to atomic
> 	RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0
> 
> Btw, note that the subject says "v4" but the changelog says "v5".

It is v5.

> 
> David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v4] cgroup: fix top cgroup refcnt leak
  2019-01-03  0:43           ` Andrei Vagin
@ 2019-01-03  1:00             ` Andrei Vagin
  2019-01-03  3:54               ` [PATCH vfs/for-next v6] " Andrei Vagin
  0 siblings, 1 reply; 29+ messages in thread
From: Andrei Vagin @ 2019-01-03  1:00 UTC (permalink / raw)
  To: David Howells; +Cc: Alexander Viro, linux-fsdevel, cgroups, Li Zefan

On Wed, Jan 02, 2019 at 04:43:39PM -0800, Andrei Vagin wrote:
> On Thu, Jan 03, 2019 at 12:26:23AM +0000, David Howells wrote:
> > Andrei Vagin <avagin@gmail.com> wrote:
> > 
> > > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> > > commit was reverted by mistake.
> > > 
> > > $ mkdir /tmp/cgroup
> > > $ mkdir /tmp/cgroup2
> > > $ mount -t cgroup -o none,name=test test /tmp/cgroup
> > > $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> > > $ umount /tmp/cgroup
> > > $ umount /tmp/cgroup2
> > > $ cat /proc/self/cgroup | grep test
> > > 12:name=test:/
> > > 
> > > You can see the test cgroup was not freed.
> > > 
> > > Cc: Li Zefan <lizefan@huawei.com>
> > > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> > > Signed-off-by: Andrei Vagin <avagin@gmail.com>
> > > ---
> > > 
> > > v2: clean up code and add the vfs/for-next tag
> > > v3: fix a reference leak when kernfs_node_dentry fails
> > > v4: call deactivate_locked_super() in a error case
> > > v5: don't dereference fc->root after dput()
> > > 
> > >  kernel/cgroup/cgroup.c | 25 ++++++++++++++++++-------
> > >  1 file changed, 18 insertions(+), 7 deletions(-)
> > 
> > This patch doesn't work either.
> 
> I'm sorry, but we can't say anything about this patch now, because it
> looks like recent changes in vfs-next break something else here...

I found a reason why this patch doesn't work on Al's vfs/for-next:

[avagin@laptop linux]$ git diff 40effd960becd8a355b7aafc789712afd64f5759..vfs/for-next  kernel/cgroup/cgroup-v1.c | grep -B 5 -A 5 cgroup_get 
 	/*
@@ -1280,8 +1285,8 @@ int cgroup1_get_tree(struct fs_context *fc)
 		mutex_lock(&cgroup_mutex);
 		percpu_ref_reinit(&root->cgrp.self.refcnt);
 		mutex_unlock(&cgroup_mutex);
+		cgroup_get(&root->cgrp);
 	}
-	cgroup_get(&root->cgrp);
 
 	/*
 	 * If @pinned_sb, we're reusing an existing root and holding an

40effd960becd8a355b7aafc789712afd64f5759 is the previous head of vfs/for-next

I reverted this hunk, applied my patch and all criu test passed.

> 
> > 
> > 	percpu ref (css_release) <= 0 (0) after switching to atomic
> > 	RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0
> > 
> > Btw, note that the subject says "v4" but the changelog says "v5".
> 
> It is v5.
> 
> > 
> > David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH vfs/for-next v6] cgroup: fix top cgroup refcnt leak
  2019-01-03  1:00             ` Andrei Vagin
@ 2019-01-03  3:54               ` Andrei Vagin
  2019-01-03  8:32                 ` Al Viro
  0 siblings, 1 reply; 29+ messages in thread
From: Andrei Vagin @ 2019-01-03  3:54 UTC (permalink / raw)
  To: Alexander Viro, David Howells
  Cc: linux-fsdevel, cgroups, Andrei Vagin, Li Zefan

It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
commit was reverted by mistake.

$ mkdir /tmp/cgroup
$ mkdir /tmp/cgroup2
$ mount -t cgroup -o none,name=test test /tmp/cgroup
$ mount -t cgroup -o none,name=test test /tmp/cgroup2
$ umount /tmp/cgroup
$ umount /tmp/cgroup2
$ cat /proc/self/cgroup | grep test
12:name=test:/

You can see the test cgroup was not freed.

Cc: Li Zefan <lizefan@huawei.com>
Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
---

v2: clean up code and add the vfs/for-next tag
v3: fix a reference leak when kernfs_node_dentry fails
v4: call deactivate_locked_super() in a error case
v5: don't dereference fc->root after dput()
v6: rebase on today's vfs/for-next

 kernel/cgroup/cgroup-v1.c |  2 +-
 kernel/cgroup/cgroup.c    | 25 ++++++++++++++++++-------
 2 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 4b189e821cad..de7d625ec077 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -1285,8 +1285,8 @@ int cgroup1_get_tree(struct fs_context *fc)
 		mutex_lock(&cgroup_mutex);
 		percpu_ref_reinit(&root->cgrp.self.refcnt);
 		mutex_unlock(&cgroup_mutex);
-		cgroup_get(&root->cgrp);
 	}
+	cgroup_get(&root->cgrp);
 
 	/*
 	 * If @pinned_sb, we're reusing an existing root and holding an
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a19f0fec9d82..fe67b5e81f9a 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc)
 
 	ret = kernfs_get_tree(fc);
 	if (ret < 0)
-		goto out_cgrp;
+		return ret;
 
 	/*
 	 * In non-init cgroup namespace, instead of root cgroup's dentry,
@@ -2038,19 +2038,30 @@ int cgroup_do_get_tree(struct fs_context *fc)
 		mutex_unlock(&cgroup_mutex);
 
 		nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb);
-		if (IS_ERR(nsdentry))
-			return PTR_ERR(nsdentry);
+		if (IS_ERR(nsdentry)) {
+			ret = PTR_ERR(nsdentry);
+			goto out_cgrp;
+		}
 		dput(fc->root);
 		fc->root = nsdentry;
 	}
 
 	ret = 0;
-	if (ctx->kfc.new_sb_created)
-		goto out_cgrp;
-	apply_cgroup_root_flags(ctx->flags);
-	return 0;
+	if (!ctx->kfc.new_sb_created)
+		apply_cgroup_root_flags(ctx->flags);
 
 out_cgrp:
+	if (!ctx->kfc.new_sb_created)
+		cgroup_put(&ctx->root->cgrp);
+
+	if (unlikely(ret)) {
+		struct super_block *sb = fc->root->d_sb;
+
+		dput(fc->root);
+		deactivate_locked_super(sb);
+		fc->root = NULL;
+	}
+
 	return ret;
 }
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v6] cgroup: fix top cgroup refcnt leak
  2019-01-03  3:54               ` [PATCH vfs/for-next v6] " Andrei Vagin
@ 2019-01-03  8:32                 ` Al Viro
  2019-01-03 17:34                   ` Andrei Vagin
  2019-01-03 21:54                   ` David Howells
  0 siblings, 2 replies; 29+ messages in thread
From: Al Viro @ 2019-01-03  8:32 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: David Howells, linux-fsdevel, cgroups, Li Zefan

On Wed, Jan 02, 2019 at 07:54:26PM -0800, Andrei Vagin wrote:

[I'm thoroughly sick of refcounting in that thing, TBH ;-/]

> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index a19f0fec9d82..fe67b5e81f9a 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc)
>  
>  	ret = kernfs_get_tree(fc);
>  	if (ret < 0)
> -		goto out_cgrp;
> +		return ret;

Why does that case avoid needing cgroup_put()?  Note, BTW, that we
also have this:
/*
 * Destroy a cgroup filesystem context.
 */
static void cgroup_fs_context_free(struct fs_context *fc)
{
        struct cgroup_fs_context *ctx = cgroup_fc2context(fc);

        kfree(ctx->name);
        kfree(ctx->release_agent);
        if (ctx->root)
                cgroup_put(&ctx->root->cgrp);
        put_cgroup_ns(ctx->ns);
        kernfs_free_fs_context(fc);
        kfree(ctx);
}

which also needs to be taken into account.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak
  2019-01-02 23:06   ` Andrei Vagin
  2019-01-02 23:31     ` Andrei Vagin
  2019-01-03  0:33     ` David Howells
@ 2019-01-03 13:41     ` David Howells
  2019-01-03 13:42     ` David Howells
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: David Howells @ 2019-01-03 13:41 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: dhowells, Alexander Viro, linux-fsdevel, cgroups, Li Zefan

I'm seeing something a bit weird, but I'm not sure why.  If I expand Andrei's
test script to:

	set -m
	d=$(mktemp -d /tmp/cg.XXXXXX)
	mkdir $d/a1
	mount -t cgroup -o none,name=xxxxy xxx $d/a1
	mkdir $d/a2
	mount -t cgroup -o none,name=xxxxy xxx $d/a2
	mkdir -p $d/a1/test
	umount $d/a1
	umount $d/a2

	d=$(mktemp -d /tmp/cg.XXXXXX)
	mkdir $d/a1
	mount -t cgroup -o none,name=xxxxy xxx $d/a1
	mkdir $d/a2
	mount -t cgroup -o none,name=xxxxy xxx $d/a2

	rmdir $d/a1/test
	umount $d/a1
	umount $d/a2

then a cgroup gets leaked at the end.  Just retaining the final:

	for i in `seq 2`; do 
		umount $d/a$i
	done

in place of the two final expanded umount commands fixes the issue.

The mounts are getting released, but not the cgroups as far as I can tell.

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak
  2019-01-02 23:06   ` Andrei Vagin
                       ` (2 preceding siblings ...)
  2019-01-03 13:41     ` David Howells
@ 2019-01-03 13:42     ` David Howells
  2019-01-03 15:27     ` David Howells
  2019-01-03 15:44     ` David Howells
  5 siblings, 0 replies; 29+ messages in thread
From: David Howells @ 2019-01-03 13:42 UTC (permalink / raw)
  Cc: dhowells, Andrei Vagin, Alexander Viro, linux-fsdevel, cgroups, Li Zefan

David Howells <dhowells@redhat.com> wrote:

> 	umount $d/a1
> 	umount $d/a2

Putting a small sleep between the two fixes the problem too.

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak
  2019-01-02 23:06   ` Andrei Vagin
                       ` (3 preceding siblings ...)
  2019-01-03 13:42     ` David Howells
@ 2019-01-03 15:27     ` David Howells
  2019-01-03 15:44     ` David Howells
  5 siblings, 0 replies; 29+ messages in thread
From: David Howells @ 2019-01-03 15:27 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: dhowells, Alexander Viro, linux-fsdevel, cgroups, Li Zefan

David Howells <dhowells@redhat.com> wrote:

> 	umount $d/a1
> 	umount $d/a2
> 
> then a cgroup gets leaked at the end.  Just retaining the final:
> 
> 	for i in `seq 2`; do 
> 		umount $d/a$i
> 	done
> 
> in place of the two final expanded umount commands fixes the issue.

The issue also goes away if I set PERCPU_REF_INIT_ATOMIC on the refcount
object when initialising it:-/

For reference, the sequence of css_get/tryget/put calls is:

*** 00000000e2655b5f RIN   1: cgroup1_get_tree+0x5de/0x73b
*** 00000000e2655b5f GET   2: cgroup1_get_tree+0x6e1/0x73b
*** 00000000e2655b5f PUT   1: cgroup_fs_context_free+0x12c/0x15f

*** 00000000e2655b5f TRYO  2: cgroup1_get_tree+0x4aa/0x73b
*** 00000000e2655b5f PUT   1: cgroup_do_get_tree+0x1c5/0x1fa
*** 00000000e2655b5f GET   2: cgroup1_get_tree+0x6e1/0x73b
*** 00000000e2655b5f PUT   1: cgroup_fs_context_free+0x12c/0x15f

*** 00000000e2655b5f TRY   2: cgroup_kn_lock_live+0x145/0x18a
*** 00000000e2655b5f GET   3: cgroup_mkdir+0x241/0x454
*** 00000000e2655b5f PUT   2: cgroup_mkdir+0xad/0x454
*** 00000000e2655b5f PUT   1: cgroup_kill_sb+0x134/0x161

*** 00000000e2655b5f TRYO  2: cgroup1_get_tree+0x4aa/0x73b
*** 00000000e2655b5f GET   3: cgroup1_get_tree+0x6e1/0x73b
*** 00000000e2655b5f PUT   2: cgroup_fs_context_free+0x12c/0x15f

*** 00000000e2655b5f TRYO  3: cgroup1_get_tree+0x4aa/0x73b
*** 00000000e2655b5f PUT   2: cgroup_do_get_tree+0x1c5/0x1fa
*** 00000000e2655b5f GET   3: cgroup1_get_tree+0x6e1/0x73b
*** 00000000e2655b5f PUT   2: cgroup_fs_context_free+0x12c/0x15f

*** 00000000e2655b5f KIL   2: cgroup_kill_sb+0x146/0x161
*** 00000000e2655b5f PUT   0: css_free_rwork_fn+0x399/0x5ec

As obtained with the attached patch.

I'm not sure the refcount is correct after the second block (at the end of the
second mount).  Should it have one ref per active sb?

David
---
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5e1694fe035b..23d022eb669f 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -165,6 +165,8 @@ struct cgroup_subsys_state {
 	 * fields of the containing structure.
 	 */
 	struct cgroup_subsys_state *parent;
+
+	bool debug;
 };
 
 /*
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index bb0c7da50ed2..76ca2be4986d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -295,6 +295,8 @@ void css_task_iter_end(struct css_task_iter *it);
  * Inline functions.
  */
 
+extern void css_refcount(struct cgroup_subsys_state *css, const char *op);
+
 /**
  * css_get - obtain a reference on the specified css
  * @css: target css
@@ -303,8 +305,10 @@ void css_task_iter_end(struct css_task_iter *it);
  */
 static inline void css_get(struct cgroup_subsys_state *css)
 {
-	if (!(css->flags & CSS_NO_REF))
+	if (!(css->flags & CSS_NO_REF)) {
 		percpu_ref_get(&css->refcnt);
+		css_refcount(css, "GET");
+	}
 }
 
 /**
@@ -316,8 +320,10 @@ static inline void css_get(struct cgroup_subsys_state *css)
  */
 static inline void css_get_many(struct cgroup_subsys_state *css, unsigned int n)
 {
-	if (!(css->flags & CSS_NO_REF))
+	if (!(css->flags & CSS_NO_REF)) {
 		percpu_ref_get_many(&css->refcnt, n);
+		css_refcount(css, "GETM");
+	}
 }
 
 /**
@@ -333,8 +339,11 @@ static inline void css_get_many(struct cgroup_subsys_state *css, unsigned int n)
  */
 static inline bool css_tryget(struct cgroup_subsys_state *css)
 {
-	if (!(css->flags & CSS_NO_REF))
-		return percpu_ref_tryget(&css->refcnt);
+	if (css->flags & CSS_NO_REF)
+		return true;
+	if (!percpu_ref_tryget(&css->refcnt))
+		return false;
+	css_refcount(css, "TRY");
 	return true;
 }
 
@@ -350,8 +359,11 @@ static inline bool css_tryget(struct cgroup_subsys_state *css)
  */
 static inline bool css_tryget_online(struct cgroup_subsys_state *css)
 {
-	if (!(css->flags & CSS_NO_REF))
-		return percpu_ref_tryget_live(&css->refcnt);
+	if (css->flags & CSS_NO_REF)
+		return true;
+	if (!percpu_ref_tryget_live(&css->refcnt))
+		return false;
+	css_refcount(css, "TRYO");
 	return true;
 }
 
@@ -383,8 +395,10 @@ static inline bool css_is_dying(struct cgroup_subsys_state *css)
  */
 static inline void css_put(struct cgroup_subsys_state *css)
 {
-	if (!(css->flags & CSS_NO_REF))
+	if (!(css->flags & CSS_NO_REF)) {
 		percpu_ref_put(&css->refcnt);
+		css_refcount(css, "PUT");
+	}
 }
 
 /**
@@ -396,8 +410,10 @@ static inline void css_put(struct cgroup_subsys_state *css)
  */
 static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n)
 {
-	if (!(css->flags & CSS_NO_REF))
+	if (!(css->flags & CSS_NO_REF)) {
 		percpu_ref_put_many(&css->refcnt, n);
+		css_refcount(css, "PUTM");
+	}
 }
 
 static inline void cgroup_get(struct cgroup *cgrp)
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index de7d625ec077..d87a0b8c3441 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -1175,7 +1175,7 @@ int cgroup1_get_tree(struct fs_context *fc)
 		    ss->root == &cgrp_dfl_root)
 			continue;
 
-		if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
+		if (!css_tryget_online(&ss->root->cgrp.self)) {
 			mutex_unlock(&cgroup_mutex);
 			goto err_restart;
 		}
@@ -1228,7 +1228,7 @@ int cgroup1_get_tree(struct fs_context *fc)
 		 */
 		pinned_sb = kernfs_pin_sb(root->kf_root, NULL);
 		if (IS_ERR(pinned_sb) ||
-		    !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
+		    !css_tryget_online(&root->cgrp.self)) {
 			mutex_unlock(&cgroup_mutex);
 			if (!IS_ERR_OR_NULL(pinned_sb))
 				deactivate_super(pinned_sb);
@@ -1284,6 +1284,7 @@ int cgroup1_get_tree(struct fs_context *fc)
 	if (new_root) {
 		mutex_lock(&cgroup_mutex);
 		percpu_ref_reinit(&root->cgrp.self.refcnt);
+		css_refcount(&root->cgrp.self, "RIN");
 		mutex_unlock(&cgroup_mutex);
 	}
 	cgroup_get(&root->cgrp);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7dbe71b23941..299e3fb30731 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1904,8 +1904,11 @@ void init_cgroup_root(struct cgroup_fs_context *ctx)
 	root->flags = ctx->flags;
 	if (ctx->release_agent)
 		strscpy(root->release_agent_path, ctx->release_agent, PATH_MAX);
-	if (ctx->name)
+	if (ctx->name) {
 		strscpy(root->name, ctx->name, MAX_CGROUP_ROOT_NAMELEN);
+		if (strcmp(root->name, "xxxxy") == 0)
+			cgrp->self.debug = true;
+	}
 	if (ctx->cpuset_clone_children)
 		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
@@ -1927,7 +1930,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
 	root_cgrp->ancestor_ids[0] = ret;
 
 	ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release,
-			      ref_flags, GFP_KERNEL);
+			      ref_flags | PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
 	if (ret)
 		goto out;
 
@@ -2166,8 +2169,10 @@ static void cgroup_kill_sb(struct super_block *sb)
 	if (!list_empty(&root->cgrp.self.children) ||
 	    root == &cgrp_dfl_root)
 		cgroup_put(&root->cgrp);
-	else
+	else {
+		css_refcount(&root->cgrp.self, "KIL");
 		percpu_ref_kill(&root->cgrp.self.refcnt);
+	}
 
 	kernfs_kill_sb(sb);
 }
@@ -4891,7 +4896,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 
 	init_and_link_css(css, ss, cgrp);
 
-	err = percpu_ref_init(&css->refcnt, css_release, 0, GFP_KERNEL);
+	err = percpu_ref_init(&css->refcnt, css_release, PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
 	if (err)
 		goto err_free_css;
 
@@ -4946,7 +4951,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);
 
-	ret = percpu_ref_init(&cgrp->self.refcnt, css_release, 0, GFP_KERNEL);
+	ret = percpu_ref_init(&cgrp->self.refcnt, css_release, PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
 	if (ret)
 		goto out_free_cgrp;
 
@@ -5194,6 +5199,7 @@ static void kill_css(struct cgroup_subsys_state *css)
 	 * Use percpu_ref_kill_and_confirm() to get notifications as each
 	 * css is confirmed to be seen as killed on all CPUs.
 	 */
+	css_refcount(css, "KLC");
 	percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
 }
 
@@ -5278,6 +5284,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	cgroup1_check_for_release(parent);
 
 	/* put the base reference */
+	css_refcount(&cgrp->self, "KIL");
 	percpu_ref_kill(&cgrp->self.refcnt);
 
 	return 0;
@@ -6106,3 +6113,11 @@ static int __init cgroup_sysfs_init(void)
 }
 subsys_initcall(cgroup_sysfs_init);
 #endif /* CONFIG_SYSFS */
+
+noinline void css_refcount(struct cgroup_subsys_state *css, const char *op)
+{
+	if (css->debug)
+		printk("*** %p %s %2ld: %pSR\n",
+		       css, op, atomic_long_read(&css->refcnt.count),
+		       __builtin_return_address(0));
+}

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak
  2019-01-02 23:06   ` Andrei Vagin
                       ` (4 preceding siblings ...)
  2019-01-03 15:27     ` David Howells
@ 2019-01-03 15:44     ` David Howells
  5 siblings, 0 replies; 29+ messages in thread
From: David Howells @ 2019-01-03 15:44 UTC (permalink / raw)
  Cc: dhowells, Andrei Vagin, Alexander Viro, linux-fsdevel, cgroups, Li Zefan

David Howells <dhowells@redhat.com> wrote:

> I'm not sure the refcount is correct after the second block (at the end of the
> second mount).  Should it have one ref per active sb?

No, it shouldn't because there's only one sb.

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v6] cgroup: fix top cgroup refcnt leak
  2019-01-03  8:32                 ` Al Viro
@ 2019-01-03 17:34                   ` Andrei Vagin
  2019-01-03 21:54                   ` David Howells
  1 sibling, 0 replies; 29+ messages in thread
From: Andrei Vagin @ 2019-01-03 17:34 UTC (permalink / raw)
  To: Al Viro; +Cc: David Howells, linux-fsdevel, cgroups, Li Zefan

On Thu, Jan 03, 2019 at 08:32:29AM +0000, Al Viro wrote:
> On Wed, Jan 02, 2019 at 07:54:26PM -0800, Andrei Vagin wrote:
> 
> [I'm thoroughly sick of refcounting in that thing, TBH ;-/]

feel your pain...

> 
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index a19f0fec9d82..fe67b5e81f9a 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc)
> >  
> >  	ret = kernfs_get_tree(fc);
> >  	if (ret < 0)
> > -		goto out_cgrp;
> > +		return ret;
> 
> Why does that case avoid needing cgroup_put()?  Note, BTW, that we
> also have this:

The origin patch which added this cgroup_put says that it is needed
because mount() and kill_sb() is not a one-to-one match. sb is created
in kernfs_get_tree(), so I decided that this cgroup_put() is needed only
after kernfs_get_tree().

commit c6b3d5bcd67c75961a1e8b9564d1475c0f194a84
Author: Li Zefan <lizefan@huawei.com>
Date:   Fri Apr 4 17:14:41 2014 +0800

cgroup: fix top cgroup refcnt leak

As mount() and kill_sb() is not a one-to-one match, If we mount the same
cgroupfs in serveral mount points, and then umount all of them, kill_sb()
will be called only once.


> /*
>  * Destroy a cgroup filesystem context.
>  */
> static void cgroup_fs_context_free(struct fs_context *fc)
> {
>         struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
> 
>         kfree(ctx->name);
>         kfree(ctx->release_agent);
>         if (ctx->root)
>                 cgroup_put(&ctx->root->cgrp);
>         put_cgroup_ns(ctx->ns);
>         kernfs_free_fs_context(fc);
>         kfree(ctx);
> }
> 
> which also needs to be taken into account.

we have unconditional cgroup_get in cgroup1_get_tree() to match this
put, but we need to check error paths.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH vfs/for-next v6] cgroup: fix top cgroup refcnt leak
  2019-01-03  8:32                 ` Al Viro
  2019-01-03 17:34                   ` Andrei Vagin
@ 2019-01-03 21:54                   ` David Howells
  1 sibling, 0 replies; 29+ messages in thread
From: David Howells @ 2019-01-03 21:54 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: dhowells, Al Viro, linux-fsdevel, cgroups, Li Zefan

Hi Andrei,

It turns out that the cgroup-v1 refcounting is slightly broken upstream.  If
kernfs_get_inode() fails in kernfs_fill_super(), then there's refcount
breakage in the error handling.

This can be provoked by making the following change:

--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -240,6 +240,8 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 	sb->s_shrink.seeks = 0;
 
 	/* get root inode, initialize and unlock it */
+	if (strcmp(current->comm, "foobar") == 0)
+		return -ENOANO;
 	mutex_lock(&kernfs_mutex);
 	inode = kernfs_get_inode(sb, info->root->kn);
 	mutex_unlock(&kernfs_mutex);

and then copying /bin/mount to /tmp/foobar and doing:

[root@andromeda ~]# /tmp/foobar -t cgroup -o none,name=xxxxy xxx /tmp/x/a1
foobar: /tmp/x/a1: mount(2) system call failed: No anode.

In dmesg I see the attached traces (see below).

The problem appears to be because cgroup_do_mount() calls kernfs_mount(), but
the refcount on the new root cgroup object hasn't been properly initialised
yet.  However, because we make it past sget(), the superblock thinks it has
taken the caller's ref - and this gets eaten by cgroup_kill_sb().

Further, another ref appears to be released by cgroup_do_mount() in the event
that kernfs_mount() fails.

David

------------[ cut here ]------------
percpu_ref_kill_and_confirm called more than once on css_release!
WARNING: CPU: 3 PID: 3218 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x4b/0x14c
Modules linked in:
CPU: 3 PID: 3218 Comm: bugger Not tainted 4.20.0-fscache+ #1287
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
RIP: 0010:percpu_ref_kill_and_confirm+0x4b/0x14c
Code: c6 74 29 80 3d 42 0a 00 01 00 75 20 48 8b 53 10 48 c7 c6 f0 cd e9 81 48 c7 c7 82 99 16 82 c6 05 27 0a 00 01 01 e8 04 54 ac ff <0f> 0b 48 83 4b 08 02 4c 89 e6 48 89 df e8 fb fc ff ff 65 ff 05 54
RSP: 0018:ffff8880c5103c60 EFLAGS: 00010086
RAX: 0000000000000000 RBX: ffff8880d35e8020 RCX: ffff8880c5103b4c
RDX: 0000000000000046 RSI: ffffffff8245fef8 RDI: ffffffff810ac1ec
RBP: ffff8880c5103c78 R08: 0000000000000041 R09: 0000000000021900
R10: ffff8880c5103900 R11: 0000006423114dc0 R12: 0000000000000000
R13: 000000000027e0eb R14: 0000000000000246 R15: 0000000000000000
FS:  00007f8ec2dbb080(0000) GS:ffff8880c6d80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005586974b0198 CR3: 00000000c4bfe002 CR4: 00000000001606e0
Call Trace:
 cgroup_kill_sb+0x131/0x141
 deactivate_locked_super+0x29/0x5b
 kernfs_mount_ns+0x1fa/0x223
 cgroup_do_mount+0x36/0x1c8
 cgroup1_mount+0x5b2/0x610
 cgroup_mount+0x33b/0x37f
 mount_fs+0x6a/0x10b
 vfs_kern_mount+0x67/0x13c
 do_mount+0x90e/0xb7e
 ? kmem_cache_alloc_trace+0x241/0x27d
 ksys_mount+0x72/0x97
 __x64_sys_mount+0x21/0x24
 do_syscall_64+0x7d/0x1a0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f8ec1e14ada
Code: 48 8b 0d c9 a3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 96 a3 2b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe2fa32b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00005586974ae2a0 RCX: 00007f8ec1e14ada
RDX: 00005586974ae480 RSI: 00005586974ae500 RDI: 00005586974ae4e0
RBP: 0000000000000000 R08: 00005586974ae4a0 R09: 00005586974ae480
R10: 00000000c0ed0000 R11: 0000000000000246 R12: 00005586974ae4e0
R13: 00005586974ae480 R14: 0000000000000000 R15: 00007f8ec2ba8184
irq event stamp: 3532
hardirqs last  enabled at (3531): [<ffffffff811c0584>] kfree+0x152/0x159
hardirqs last disabled at (3532): [<ffffffff819e27e8>] _raw_spin_lock_irqsave+0x12/0x44
softirqs last  enabled at (3326): [<ffffffff81c00353>] __do_softirq+0x353/0x38f
softirqs last disabled at (3317): [<ffffffff81058c16>] irq_exit+0x63/0xd1
---[ end trace 9bca09dc135d9213 ]---
WARNING: CPU: 3 PID: 3218 at lib/percpu-refcount.c:359 percpu_ref_reinit+0x10/0x17
Modules linked in:
CPU: 3 PID: 3218 Comm: bugger Tainted: G        W         4.20.0-fscache+ #1287
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
RIP: 0010:percpu_ref_reinit+0x10/0x17
Code: fa ff ff 48 8d 65 f0 4c 89 f6 48 c7 c7 e0 8d 51 82 5b 41 5e 5d e9 3c 50 45 00 48 8b 47 08 a8 03 74 08 48 8b 07 48 85 c0 74 02 <0f> 0b e9 c4 fe ff ff 55 31 f6 53 48 89 fb 48 83 ec 30 65 48 8b 04
RSP: 0018:ffff8880c5103d38 EFLAGS: 00010282
RAX: fffffffffffffffe RBX: ffff8880d35e8000 RCX: ffffffff810f1a8f
RDX: 00000000ffffbf7e RSI: 00000000425b018d RDI: ffff8880d35e8020
RBP: ffff8880c5103da8 R08: 0000000000000001 R09: 0000000000000000
R10: ffff8880c5103d40 R11: 0000000000000001 R12: 0000000000000001
R13: 0000000000000000 R14: ffffffffffffffc9 R15: ffff88803f63f001
FS:  00007f8ec2dbb080(0000) GS:ffff8880c6d80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005586974b0198 CR3: 00000000c4bfe002 CR4: 00000000001606e0
Call Trace:
 cgroup1_mount+0x5d1/0x610
 cgroup_mount+0x33b/0x37f
 mount_fs+0x6a/0x10b
 vfs_kern_mount+0x67/0x13c
 do_mount+0x90e/0xb7e
 ? kmem_cache_alloc_trace+0x241/0x27d
 ksys_mount+0x72/0x97
 __x64_sys_mount+0x21/0x24
 do_syscall_64+0x7d/0x1a0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f8ec1e14ada
Code: 48 8b 0d c9 a3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 96 a3 2b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe2fa32b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00005586974ae2a0 RCX: 00007f8ec1e14ada
RDX: 00005586974ae480 RSI: 00005586974ae500 RDI: 00005586974ae4e0
RBP: 0000000000000000 R08: 00005586974ae4a0 R09: 00005586974ae480
R10: 00000000c0ed0000 R11: 0000000000000246 R12: 00005586974ae4e0
R13: 00005586974ae480 R14: 0000000000000000 R15: 00007f8ec2ba8184
irq event stamp: 3666
hardirqs last  enabled at (3665): [<ffffffff810bb0b1>] __call_rcu+0x1dc/0x1fa
hardirqs last disabled at (3666): [<ffffffff81001639>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last  enabled at (3608): [<ffffffff81c00353>] __do_softirq+0x353/0x38f
softirqs last disabled at (3535): [<ffffffff81058c16>] irq_exit+0x63/0xd1
---[ end trace 9bca09dc135d9214 ]---

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] cgroup: fix top cgroup refcnt leak
  2014-02-14 11:15   ` Li Zefan
  (?)
@ 2014-02-14 16:02   ` Tejun Heo
  -1 siblings, 0 replies; 29+ messages in thread
From: Tejun Heo @ 2014-02-14 16:02 UTC (permalink / raw)
  To: Li Zefan; +Cc: Li Zefan, LKML, Cgroups

On Fri, Feb 14, 2014 at 07:15:18PM +0800, Li Zefan wrote:
> >  		/*
> > +		 * A root's lifetime is governed by its top cgroup.  Zero
> > +		 * ref indicate that the root is being destroyed.  Wait for
> > +		 * destruction to complete so that the subsystems are free.
> > +		 * We can use wait_queue for the wait but this path is
> > +		 * super cold.  Let's just sleep for a bit and retry.
> > +		 */
> > +		if (!atomic_read(&root->top_cgroup.refcnt)) {
> 
> oops, this fix is wrong. We call kernfs_mount() without cgroup locks and it
> drops cgroup refcnt if failed.
> 
> I guess we need to bump the refcnt and then drop it after kernfs_mount().

Alright, will wait for the updated fix.

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] cgroup: fix top cgroup refcnt leak
@ 2014-02-14 11:15   ` Li Zefan
  0 siblings, 0 replies; 29+ messages in thread
From: Li Zefan @ 2014-02-14 11:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Li Zefan, LKML, Cgroups

于 2014年02月14日 17:36, Li Zefan 写道:
> If we mount the same cgroupfs in serveral mount points, and then
> umount all of them, kill_sb() will be called only once.
> 
> Therefore it's wrong to increment top_cgroup's refcnt when we find
> an existing cgroup_root.
> 
> Try:
> 	# mount -t cgroup -o cpuacct xxx /cgroup
> 	# mount -t cgroup -o cpuacct xxx /cgroup2
> 	# cat /proc/cgroups | grep cpuacct
> 	cpuacct 2       1       1
> 	# umount /cgroup
> 	# umount /cgroup2
> 	# cat /proc/cgroups | grep cpuacct
> 	cpuacct 2       1       1
> 
> You'll see cgroupfs will never be freed.
> 
> Also move this chunk of code upwards.
> 
> Signed-off-by: Li Zefan <lizefan@huawei.com>
> ---
>  kernel/cgroup.c | 32 ++++++++++++++++----------------
>  1 file changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 37d94a2..5bfe738 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1498,6 +1498,22 @@ retry:
>  		bool name_match = false;
>  
>  		/*
> +		 * A root's lifetime is governed by its top cgroup.  Zero
> +		 * ref indicate that the root is being destroyed.  Wait for
> +		 * destruction to complete so that the subsystems are free.
> +		 * We can use wait_queue for the wait but this path is
> +		 * super cold.  Let's just sleep for a bit and retry.
> +		 */
> +		if (!atomic_read(&root->top_cgroup.refcnt)) {

oops, this fix is wrong. We call kernfs_mount() without cgroup locks and it
drops cgroup refcnt if failed.

I guess we need to bump the refcnt and then drop it after kernfs_mount().

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] cgroup: fix top cgroup refcnt leak
@ 2014-02-14 11:15   ` Li Zefan
  0 siblings, 0 replies; 29+ messages in thread
From: Li Zefan @ 2014-02-14 11:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Li Zefan, LKML, Cgroups

于 2014年02月14日 17:36, Li Zefan 写道:
> If we mount the same cgroupfs in serveral mount points, and then
> umount all of them, kill_sb() will be called only once.
> 
> Therefore it's wrong to increment top_cgroup's refcnt when we find
> an existing cgroup_root.
> 
> Try:
> 	# mount -t cgroup -o cpuacct xxx /cgroup
> 	# mount -t cgroup -o cpuacct xxx /cgroup2
> 	# cat /proc/cgroups | grep cpuacct
> 	cpuacct 2       1       1
> 	# umount /cgroup
> 	# umount /cgroup2
> 	# cat /proc/cgroups | grep cpuacct
> 	cpuacct 2       1       1
> 
> You'll see cgroupfs will never be freed.
> 
> Also move this chunk of code upwards.
> 
> Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> ---
>  kernel/cgroup.c | 32 ++++++++++++++++----------------
>  1 file changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 37d94a2..5bfe738 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1498,6 +1498,22 @@ retry:
>  		bool name_match = false;
>  
>  		/*
> +		 * A root's lifetime is governed by its top cgroup.  Zero
> +		 * ref indicate that the root is being destroyed.  Wait for
> +		 * destruction to complete so that the subsystems are free.
> +		 * We can use wait_queue for the wait but this path is
> +		 * super cold.  Let's just sleep for a bit and retry.
> +		 */
> +		if (!atomic_read(&root->top_cgroup.refcnt)) {

oops, this fix is wrong. We call kernfs_mount() without cgroup locks and it
drops cgroup refcnt if failed.

I guess we need to bump the refcnt and then drop it after kernfs_mount().

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] cgroup: fix top cgroup refcnt leak
@ 2014-02-14  9:36 ` Li Zefan
  0 siblings, 0 replies; 29+ messages in thread
From: Li Zefan @ 2014-02-14  9:36 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups

If we mount the same cgroupfs in serveral mount points, and then
umount all of them, kill_sb() will be called only once.

Therefore it's wrong to increment top_cgroup's refcnt when we find
an existing cgroup_root.

Try:
	# mount -t cgroup -o cpuacct xxx /cgroup
	# mount -t cgroup -o cpuacct xxx /cgroup2
	# cat /proc/cgroups | grep cpuacct
	cpuacct 2       1       1
	# umount /cgroup
	# umount /cgroup2
	# cat /proc/cgroups | grep cpuacct
	cpuacct 2       1       1

You'll see cgroupfs will never be freed.

Also move this chunk of code upwards.

Signed-off-by: Li Zefan <lizefan@huawei.com>
---
 kernel/cgroup.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 37d94a2..5bfe738 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1498,6 +1498,22 @@ retry:
 		bool name_match = false;
 
 		/*
+		 * A root's lifetime is governed by its top cgroup.  Zero
+		 * ref indicate that the root is being destroyed.  Wait for
+		 * destruction to complete so that the subsystems are free.
+		 * We can use wait_queue for the wait but this path is
+		 * super cold.  Let's just sleep for a bit and retry.
+		 */
+		if (!atomic_read(&root->top_cgroup.refcnt)) {
+			mutex_unlock(&cgroup_mutex);
+			mutex_unlock(&cgroup_tree_mutex);
+			kfree(opts.release_agent);
+			kfree(opts.name);
+			msleep(10);
+			goto retry;
+		}
+
+		/*
 		 * If we asked for a name then it must match.  Also, if
 		 * name matches but sybsys_mask doesn't, we should fail.
 		 * Remember whether name matched.
@@ -1530,22 +1546,6 @@ retry:
 			}
 		}
 
-		/*
-		 * A root's lifetime is governed by its top cgroup.  Zero
-		 * ref indicate that the root is being destroyed.  Wait for
-		 * destruction to complete so that the subsystems are free.
-		 * We can use wait_queue for the wait but this path is
-		 * super cold.  Let's just sleep for a bit and retry.
-		 */
-		if (!atomic_inc_not_zero(&root->top_cgroup.refcnt)) {
-			mutex_unlock(&cgroup_mutex);
-			mutex_unlock(&cgroup_tree_mutex);
-			kfree(opts.release_agent);
-			kfree(opts.name);
-			msleep(10);
-			goto retry;
-		}
-
 		ret = 0;
 		goto out_unlock;
 	}
-- 
1.8.0.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH] cgroup: fix top cgroup refcnt leak
@ 2014-02-14  9:36 ` Li Zefan
  0 siblings, 0 replies; 29+ messages in thread
From: Li Zefan @ 2014-02-14  9:36 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups

If we mount the same cgroupfs in serveral mount points, and then
umount all of them, kill_sb() will be called only once.

Therefore it's wrong to increment top_cgroup's refcnt when we find
an existing cgroup_root.

Try:
	# mount -t cgroup -o cpuacct xxx /cgroup
	# mount -t cgroup -o cpuacct xxx /cgroup2
	# cat /proc/cgroups | grep cpuacct
	cpuacct 2       1       1
	# umount /cgroup
	# umount /cgroup2
	# cat /proc/cgroups | grep cpuacct
	cpuacct 2       1       1

You'll see cgroupfs will never be freed.

Also move this chunk of code upwards.

Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
 kernel/cgroup.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 37d94a2..5bfe738 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1498,6 +1498,22 @@ retry:
 		bool name_match = false;
 
 		/*
+		 * A root's lifetime is governed by its top cgroup.  Zero
+		 * ref indicate that the root is being destroyed.  Wait for
+		 * destruction to complete so that the subsystems are free.
+		 * We can use wait_queue for the wait but this path is
+		 * super cold.  Let's just sleep for a bit and retry.
+		 */
+		if (!atomic_read(&root->top_cgroup.refcnt)) {
+			mutex_unlock(&cgroup_mutex);
+			mutex_unlock(&cgroup_tree_mutex);
+			kfree(opts.release_agent);
+			kfree(opts.name);
+			msleep(10);
+			goto retry;
+		}
+
+		/*
 		 * If we asked for a name then it must match.  Also, if
 		 * name matches but sybsys_mask doesn't, we should fail.
 		 * Remember whether name matched.
@@ -1530,22 +1546,6 @@ retry:
 			}
 		}
 
-		/*
-		 * A root's lifetime is governed by its top cgroup.  Zero
-		 * ref indicate that the root is being destroyed.  Wait for
-		 * destruction to complete so that the subsystems are free.
-		 * We can use wait_queue for the wait but this path is
-		 * super cold.  Let's just sleep for a bit and retry.
-		 */
-		if (!atomic_inc_not_zero(&root->top_cgroup.refcnt)) {
-			mutex_unlock(&cgroup_mutex);
-			mutex_unlock(&cgroup_tree_mutex);
-			kfree(opts.release_agent);
-			kfree(opts.name);
-			msleep(10);
-			goto retry;
-		}
-
 		ret = 0;
 		goto out_unlock;
 	}
-- 
1.8.0.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2019-01-03 21:54 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-28 23:59 [PATCH] cgroup: fix top cgroup refcnt leak Andrei Vagin
2018-12-29  0:04 ` [PATCH vfs/for-next v2] " Andrei Vagin
2018-12-30 19:41   ` Andrei Vagin
2019-01-02  2:28   ` Al Viro
2019-01-02 18:14     ` [PATCH vfs/for-next v3] " Andrei Vagin
2019-01-02 19:37     ` [PATCH vfs/for-next v2] " Andrei Vagin
2019-01-02 19:37     ` [PATCH vfs/for-next v4] " Andrei Vagin
2019-01-02 20:02       ` Al Viro
2019-01-02 21:06         ` Andrei Vagin
2019-01-03  0:26         ` David Howells
2019-01-03  0:43           ` Andrei Vagin
2019-01-03  1:00             ` Andrei Vagin
2019-01-03  3:54               ` [PATCH vfs/for-next v6] " Andrei Vagin
2019-01-03  8:32                 ` Al Viro
2019-01-03 17:34                   ` Andrei Vagin
2019-01-03 21:54                   ` David Howells
2019-01-02 22:26 ` [PATCH vfs/for-next v2] " David Howells
2019-01-02 23:06   ` Andrei Vagin
2019-01-02 23:31     ` Andrei Vagin
2019-01-03  0:33     ` David Howells
2019-01-03 13:41     ` David Howells
2019-01-03 13:42     ` David Howells
2019-01-03 15:27     ` David Howells
2019-01-03 15:44     ` David Howells
  -- strict thread matches above, loose matches on Subject: below --
2014-02-14  9:36 [PATCH] " Li Zefan
2014-02-14  9:36 ` Li Zefan
2014-02-14 11:15 ` Li Zefan
2014-02-14 11:15   ` Li Zefan
2014-02-14 16:02   ` Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.