* [PATCH] cgroup: fix top cgroup refcnt leak @ 2018-12-28 23:59 Andrei Vagin 2018-12-29 0:04 ` [PATCH vfs/for-next v2] " Andrei Vagin 2019-01-02 22:26 ` [PATCH vfs/for-next v2] " David Howells 0 siblings, 2 replies; 24+ messages in thread From: Andrei Vagin @ 2018-12-28 23:59 UTC (permalink / raw) To: Alexander Viro, David Howells Cc: linux-fsdevel, cgroups, Andrei Vagin, Li Zefan It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") commit was reverted by mistake. $ mkdir /tmp/cgroup $ mkdir /tmp/cgroup2 $ mount -t cgroup -o none,name=test test /tmp/cgroup $ mount -t cgroup -o none,name=test test /tmp/cgroup2 $ umount /tmp/cgroup $ umount /tmp/cgroup2 $ cat /proc/self/cgroup | grep test 12:name=test:/ You can see the test cgroup was not freed. Cc: Li Zefan <lizefan@huawei.com> Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") Signed-off-by: Andrei Vagin <avagin@gmail.com> --- kernel/cgroup/cgroup.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index fb0717696895..dbb8805bf66c 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2045,8 +2045,11 @@ int cgroup_do_get_tree(struct fs_context *fc) } ret = 0; - if (ctx->kfc.new_sb_created) + if (ctx->kfc.new_sb_created) { goto out_cgrp; + } else { + cgroup_put(&ctx->root->cgrp); + } apply_cgroup_root_flags(ctx->flags); return 0; -- 2.17.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak 2018-12-28 23:59 [PATCH] cgroup: fix top cgroup refcnt leak Andrei Vagin @ 2018-12-29 0:04 ` Andrei Vagin 2018-12-30 19:41 ` Andrei Vagin 2019-01-02 2:28 ` Al Viro 2019-01-02 22:26 ` [PATCH vfs/for-next v2] " David Howells 1 sibling, 2 replies; 24+ messages in thread From: Andrei Vagin @ 2018-12-29 0:04 UTC (permalink / raw) To: Alexander Viro, David Howells Cc: linux-fsdevel, cgroups, Andrei Vagin, Li Zefan It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") commit was reverted by mistake. $ mkdir /tmp/cgroup $ mkdir /tmp/cgroup2 $ mount -t cgroup -o none,name=test test /tmp/cgroup $ mount -t cgroup -o none,name=test test /tmp/cgroup2 $ umount /tmp/cgroup $ umount /tmp/cgroup2 $ cat /proc/self/cgroup | grep test 12:name=test:/ You can see the test cgroup was not freed. Cc: Li Zefan <lizefan@huawei.com> Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") Signed-off-by: Andrei Vagin <avagin@gmail.com> --- v2: clean up code and add the vfs/for-next tag kernel/cgroup/cgroup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index fb0717696895..f63974a3725f 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2047,6 +2047,9 @@ int cgroup_do_get_tree(struct fs_context *fc) ret = 0; if (ctx->kfc.new_sb_created) goto out_cgrp; + else + cgroup_put(&ctx->root->cgrp); + apply_cgroup_root_flags(ctx->flags); return 0; -- 2.17.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak 2018-12-29 0:04 ` [PATCH vfs/for-next v2] " Andrei Vagin @ 2018-12-30 19:41 ` Andrei Vagin 2019-01-02 2:28 ` Al Viro 1 sibling, 0 replies; 24+ messages in thread From: Andrei Vagin @ 2018-12-30 19:41 UTC (permalink / raw) To: Alexander Viro, David Howells; +Cc: linux-fsdevel, cgroups, Li Zefan Alexander and David, This patch fixed the problem which was reported more than a mounth ago https://patchwork.kernel.org/patch/10610671/ Now this code is in the linux-next and the problem blocks running CRIU tests on the linux next kernels: https://travis-ci.org/avagin/linux/builds Thanks, Andrei On Fri, Dec 28, 2018 at 04:04:00PM -0800, Andrei Vagin wrote: > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") > commit was reverted by mistake. > > $ mkdir /tmp/cgroup > $ mkdir /tmp/cgroup2 > $ mount -t cgroup -o none,name=test test /tmp/cgroup > $ mount -t cgroup -o none,name=test test /tmp/cgroup2 > $ umount /tmp/cgroup > $ umount /tmp/cgroup2 > $ cat /proc/self/cgroup | grep test > 12:name=test:/ > > You can see the test cgroup was not freed. > > Cc: Li Zefan <lizefan@huawei.com> > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") > Signed-off-by: Andrei Vagin <avagin@gmail.com> > --- > > v2: clean up code and add the vfs/for-next tag > > kernel/cgroup/cgroup.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index fb0717696895..f63974a3725f 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -2047,6 +2047,9 @@ int cgroup_do_get_tree(struct fs_context *fc) > ret = 0; > if (ctx->kfc.new_sb_created) > goto out_cgrp; > + else > + cgroup_put(&ctx->root->cgrp); > + > apply_cgroup_root_flags(ctx->flags); > return 0; > > -- > 2.17.2 > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak 2018-12-29 0:04 ` [PATCH vfs/for-next v2] " Andrei Vagin 2018-12-30 19:41 ` Andrei Vagin @ 2019-01-02 2:28 ` Al Viro 2019-01-02 18:14 ` [PATCH vfs/for-next v3] " Andrei Vagin ` (2 more replies) 1 sibling, 3 replies; 24+ messages in thread From: Al Viro @ 2019-01-02 2:28 UTC (permalink / raw) To: Andrei Vagin; +Cc: David Howells, linux-fsdevel, cgroups, Li Zefan On Fri, Dec 28, 2018 at 04:04:00PM -0800, Andrei Vagin wrote: > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") > commit was reverted by mistake. > > $ mkdir /tmp/cgroup > $ mkdir /tmp/cgroup2 > $ mount -t cgroup -o none,name=test test /tmp/cgroup > $ mount -t cgroup -o none,name=test test /tmp/cgroup2 > $ umount /tmp/cgroup > $ umount /tmp/cgroup2 > $ cat /proc/self/cgroup | grep test > 12:name=test:/ > > You can see the test cgroup was not freed. > > Cc: Li Zefan <lizefan@huawei.com> > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") > Signed-off-by: Andrei Vagin <avagin@gmail.com> > --- > > v2: clean up code and add the vfs/for-next tag > > kernel/cgroup/cgroup.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index fb0717696895..f63974a3725f 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -2047,6 +2047,9 @@ int cgroup_do_get_tree(struct fs_context *fc) > ret = 0; > if (ctx->kfc.new_sb_created) > goto out_cgrp; > + else > + cgroup_put(&ctx->root->cgrp); > + > apply_cgroup_root_flags(ctx->flags); > return 0; That looks horrible, especially since out_cgrp is return ret; If anything, it should be if (!ctx->kfc.new_sb_created) { cgroup_put(&ctx->root->cgrp); apply_cgroup_root_flags(ctx->flags); } return 0; What I don't understand is why apply_cgroup_root_flags() is not called in "new superblock" case here. It used to, prior to that conversion... Another fishy place I see there is nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb); if (IS_ERR(nsdentry)) return PTR_ERR(nsdentry); dput(fc->root); fc->root = nsdentry; What happens if we get here with non-NULL fc->root (and we'd better, after successful from kernfs_get_tree() a bit earlier) and hit that failure exit? A leak? With apologies for being MIA for a week - it had been insane here... ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH vfs/for-next v3] cgroup: fix top cgroup refcnt leak 2019-01-02 2:28 ` Al Viro @ 2019-01-02 18:14 ` Andrei Vagin 2019-01-02 19:37 ` [PATCH vfs/for-next v2] " Andrei Vagin 2019-01-02 19:37 ` [PATCH vfs/for-next v4] " Andrei Vagin 2 siblings, 0 replies; 24+ messages in thread From: Andrei Vagin @ 2019-01-02 18:14 UTC (permalink / raw) To: Alexander Viro, David Howells Cc: linux-fsdevel, cgroups, Andrei Vagin, Li Zefan It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") commit was reverted by mistake. $ mkdir /tmp/cgroup $ mkdir /tmp/cgroup2 $ mount -t cgroup -o none,name=test test /tmp/cgroup $ mount -t cgroup -o none,name=test test /tmp/cgroup2 $ umount /tmp/cgroup $ umount /tmp/cgroup2 $ cat /proc/self/cgroup | grep test 12:name=test:/ You can see the test cgroup was not freed. Cc: Li Zefan <lizefan@huawei.com> Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") Signed-off-by: Andrei Vagin <avagin@gmail.com> --- v2: clean up code and add the vfs/for-next tag v3: fix a reference leak when kernfs_node_dentry fails kernel/cgroup/cgroup.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index fb0717696895..b43479d1f9a3 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc) ret = kernfs_get_tree(fc); if (ret < 0) - goto out_cgrp; + return ret; /* * In non-init cgroup namespace, instead of root cgroup's dentry, @@ -2038,19 +2038,22 @@ int cgroup_do_get_tree(struct fs_context *fc) mutex_unlock(&cgroup_mutex); nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb); - if (IS_ERR(nsdentry)) - return PTR_ERR(nsdentry); + if (IS_ERR(nsdentry)) { + ret = PTR_ERR(nsdentry); + goto out_cgrp; + } dput(fc->root); fc->root = nsdentry; } ret = 0; - if (ctx->kfc.new_sb_created) - goto out_cgrp; - apply_cgroup_root_flags(ctx->flags); - return 0; + if (!ctx->kfc.new_sb_created) + apply_cgroup_root_flags(ctx->flags); out_cgrp: + if (!ctx->kfc.new_sb_created) + cgroup_put(&ctx->root->cgrp); + return ret; } -- 2.17.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak 2019-01-02 2:28 ` Al Viro 2019-01-02 18:14 ` [PATCH vfs/for-next v3] " Andrei Vagin @ 2019-01-02 19:37 ` Andrei Vagin 2019-01-02 19:37 ` [PATCH vfs/for-next v4] " Andrei Vagin 2 siblings, 0 replies; 24+ messages in thread From: Andrei Vagin @ 2019-01-02 19:37 UTC (permalink / raw) To: Al Viro; +Cc: David Howells, linux-fsdevel, cgroups, Li Zefan On Wed, Jan 02, 2019 at 02:28:04AM +0000, Al Viro wrote: > On Fri, Dec 28, 2018 at 04:04:00PM -0800, Andrei Vagin wrote: > > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") > > commit was reverted by mistake. > > > > $ mkdir /tmp/cgroup > > $ mkdir /tmp/cgroup2 > > $ mount -t cgroup -o none,name=test test /tmp/cgroup > > $ mount -t cgroup -o none,name=test test /tmp/cgroup2 > > $ umount /tmp/cgroup > > $ umount /tmp/cgroup2 > > $ cat /proc/self/cgroup | grep test > > 12:name=test:/ > > > > You can see the test cgroup was not freed. > > > > Cc: Li Zefan <lizefan@huawei.com> > > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") > > Signed-off-by: Andrei Vagin <avagin@gmail.com> > > --- > > > > v2: clean up code and add the vfs/for-next tag > > > > kernel/cgroup/cgroup.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > > index fb0717696895..f63974a3725f 100644 > > --- a/kernel/cgroup/cgroup.c > > +++ b/kernel/cgroup/cgroup.c > > @@ -2047,6 +2047,9 @@ int cgroup_do_get_tree(struct fs_context *fc) > > ret = 0; > > if (ctx->kfc.new_sb_created) > > goto out_cgrp; > > + else > > + cgroup_put(&ctx->root->cgrp); > > + > > apply_cgroup_root_flags(ctx->flags); > > return 0; > > That looks horrible, especially since out_cgrp is return ret; > If anything, it should be > if (!ctx->kfc.new_sb_created) { > cgroup_put(&ctx->root->cgrp); > apply_cgroup_root_flags(ctx->flags); > } > return 0; > > What I don't understand is why apply_cgroup_root_flags() is not > called in "new superblock" case here. It used to, prior to that > conversion... It is a good question and I don't have an answer on it. I think David can tell more about this. > > Another fishy place I see there is > nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb); > if (IS_ERR(nsdentry)) > return PTR_ERR(nsdentry); > dput(fc->root); > fc->root = nsdentry; > What happens if we get here with non-NULL fc->root (and we'd better, > after successful from kernfs_get_tree() a bit earlier) and hit that > failure exit? A leak? Yes, here is a leak too. I fixed it and sent v3, but then I decided that it would be good to test this error path and found one more problem: [ 22.669696] ================================================ [ 22.670468] WARNING: lock held when returning to user space! [ 22.671225] 4.20.0-rc1-00081-g01a72fa4bd0e-dirty #12 Not tainted [ 22.672018] ------------------------------------------------ [ 22.672817] mount/1148 is leaving the kernel with locks still held! [ 22.673660] 1 lock held by mount/1148: [ 22.674165] #0: 00000000c07be72c (&fc->fs_type->s_umount_key#41){+.+.}, at: grab_super+0x29/0x90 deactivate_locked_super() has to be called on the error path. I sent v4 with this fix. I'm sorry for the noise in the mailing list, I had to test this code before sending v3; > With apologies for being MIA for a week - it had been insane here... Good to see you back in stride. ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH vfs/for-next v4] cgroup: fix top cgroup refcnt leak 2019-01-02 2:28 ` Al Viro 2019-01-02 18:14 ` [PATCH vfs/for-next v3] " Andrei Vagin 2019-01-02 19:37 ` [PATCH vfs/for-next v2] " Andrei Vagin @ 2019-01-02 19:37 ` Andrei Vagin 2019-01-02 20:02 ` Al Viro 2 siblings, 1 reply; 24+ messages in thread From: Andrei Vagin @ 2019-01-02 19:37 UTC (permalink / raw) To: Alexander Viro, David Howells Cc: linux-fsdevel, cgroups, Andrei Vagin, Li Zefan It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") commit was reverted by mistake. $ mkdir /tmp/cgroup $ mkdir /tmp/cgroup2 $ mount -t cgroup -o none,name=test test /tmp/cgroup $ mount -t cgroup -o none,name=test test /tmp/cgroup2 $ umount /tmp/cgroup $ umount /tmp/cgroup2 $ cat /proc/self/cgroup | grep test 12:name=test:/ You can see the test cgroup was not freed. Cc: Li Zefan <lizefan@huawei.com> Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") Signed-off-by: Andrei Vagin <avagin@gmail.com> --- v2: clean up code and add the vfs/for-next tag v3: fix a reference leak when kernfs_node_dentry fails v4: call deactivate_locked_super() in a error case kernel/cgroup/cgroup.c | 23 ++++++++++++++++------- 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index fb0717696895..309c8507fe6c 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc) ret = kernfs_get_tree(fc); if (ret < 0) - goto out_cgrp; + return ret; /* * In non-init cgroup namespace, instead of root cgroup's dentry, @@ -2038,19 +2038,28 @@ int cgroup_do_get_tree(struct fs_context *fc) mutex_unlock(&cgroup_mutex); nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb); - if (IS_ERR(nsdentry)) - return PTR_ERR(nsdentry); + if (IS_ERR(nsdentry)) { + ret = PTR_ERR(nsdentry); + goto out_cgrp; + } dput(fc->root); fc->root = nsdentry; } ret = 0; - if (ctx->kfc.new_sb_created) - goto out_cgrp; - apply_cgroup_root_flags(ctx->flags); - return 0; + if (!ctx->kfc.new_sb_created) + apply_cgroup_root_flags(ctx->flags); out_cgrp: + if (!ctx->kfc.new_sb_created) + cgroup_put(&ctx->root->cgrp); + + if (unlikely(ret)) { + dput(fc->root); + deactivate_locked_super(fc->root->d_sb); + fc->root = NULL; + } + return ret; } -- 2.17.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v4] cgroup: fix top cgroup refcnt leak 2019-01-02 19:37 ` [PATCH vfs/for-next v4] " Andrei Vagin @ 2019-01-02 20:02 ` Al Viro 2019-01-02 21:06 ` Andrei Vagin 2019-01-03 0:26 ` David Howells 0 siblings, 2 replies; 24+ messages in thread From: Al Viro @ 2019-01-02 20:02 UTC (permalink / raw) To: Andrei Vagin; +Cc: David Howells, linux-fsdevel, cgroups, Li Zefan On Wed, Jan 02, 2019 at 11:37:56AM -0800, Andrei Vagin wrote: > + if (unlikely(ret)) { > + dput(fc->root); > + deactivate_locked_super(fc->root->d_sb); > + fc->root = NULL; Er... I'd rather not dereference the damn thing after dput(). It's _probably_ OK (fc->root here, AFAICS, is going to be equal to its ->sb->s_root, and thus still pinned), but it's better to fetch ->d_sb first. If nothing else, it's easier to prove correctness that way... ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH vfs/for-next v4] cgroup: fix top cgroup refcnt leak 2019-01-02 20:02 ` Al Viro @ 2019-01-02 21:06 ` Andrei Vagin 2019-01-03 0:26 ` David Howells 1 sibling, 0 replies; 24+ messages in thread From: Andrei Vagin @ 2019-01-02 21:06 UTC (permalink / raw) To: Alexander Viro, David Howells Cc: linux-fsdevel, cgroups, Andrei Vagin, Li Zefan It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") commit was reverted by mistake. $ mkdir /tmp/cgroup $ mkdir /tmp/cgroup2 $ mount -t cgroup -o none,name=test test /tmp/cgroup $ mount -t cgroup -o none,name=test test /tmp/cgroup2 $ umount /tmp/cgroup $ umount /tmp/cgroup2 $ cat /proc/self/cgroup | grep test 12:name=test:/ You can see the test cgroup was not freed. Cc: Li Zefan <lizefan@huawei.com> Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") Signed-off-by: Andrei Vagin <avagin@gmail.com> --- v2: clean up code and add the vfs/for-next tag v3: fix a reference leak when kernfs_node_dentry fails v4: call deactivate_locked_super() in a error case v5: don't dereference fc->root after dput() kernel/cgroup/cgroup.c | 25 ++++++++++++++++++------- 1 file changed, 18 insertions(+), 7 deletions(-) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index fb0717696895..53b730cf1f7b 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc) ret = kernfs_get_tree(fc); if (ret < 0) - goto out_cgrp; + return ret; /* * In non-init cgroup namespace, instead of root cgroup's dentry, @@ -2038,19 +2038,30 @@ int cgroup_do_get_tree(struct fs_context *fc) mutex_unlock(&cgroup_mutex); nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb); - if (IS_ERR(nsdentry)) - return PTR_ERR(nsdentry); + if (IS_ERR(nsdentry)) { + ret = PTR_ERR(nsdentry); + goto out_cgrp; + } dput(fc->root); fc->root = nsdentry; } ret = 0; - if (ctx->kfc.new_sb_created) - goto out_cgrp; - apply_cgroup_root_flags(ctx->flags); - return 0; + if (!ctx->kfc.new_sb_created) + apply_cgroup_root_flags(ctx->flags); out_cgrp: + if (!ctx->kfc.new_sb_created) + cgroup_put(&ctx->root->cgrp); + + if (unlikely(ret)) { + struct super_block *sb = fc->root->d_sb; + + dput(fc->root); + deactivate_locked_super(sb); + fc->root = NULL; + } + return ret; } -- 2.17.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v4] cgroup: fix top cgroup refcnt leak 2019-01-02 20:02 ` Al Viro 2019-01-02 21:06 ` Andrei Vagin @ 2019-01-03 0:26 ` David Howells 2019-01-03 0:43 ` Andrei Vagin 1 sibling, 1 reply; 24+ messages in thread From: David Howells @ 2019-01-03 0:26 UTC (permalink / raw) To: Andrei Vagin; +Cc: dhowells, Alexander Viro, linux-fsdevel, cgroups, Li Zefan Andrei Vagin <avagin@gmail.com> wrote: > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") > commit was reverted by mistake. > > $ mkdir /tmp/cgroup > $ mkdir /tmp/cgroup2 > $ mount -t cgroup -o none,name=test test /tmp/cgroup > $ mount -t cgroup -o none,name=test test /tmp/cgroup2 > $ umount /tmp/cgroup > $ umount /tmp/cgroup2 > $ cat /proc/self/cgroup | grep test > 12:name=test:/ > > You can see the test cgroup was not freed. > > Cc: Li Zefan <lizefan@huawei.com> > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") > Signed-off-by: Andrei Vagin <avagin@gmail.com> > --- > > v2: clean up code and add the vfs/for-next tag > v3: fix a reference leak when kernfs_node_dentry fails > v4: call deactivate_locked_super() in a error case > v5: don't dereference fc->root after dput() > > kernel/cgroup/cgroup.c | 25 ++++++++++++++++++------- > 1 file changed, 18 insertions(+), 7 deletions(-) This patch doesn't work either. percpu ref (css_release) <= 0 (0) after switching to atomic RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0 Btw, note that the subject says "v4" but the changelog says "v5". David ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v4] cgroup: fix top cgroup refcnt leak 2019-01-03 0:26 ` David Howells @ 2019-01-03 0:43 ` Andrei Vagin 2019-01-03 1:00 ` Andrei Vagin 0 siblings, 1 reply; 24+ messages in thread From: Andrei Vagin @ 2019-01-03 0:43 UTC (permalink / raw) To: David Howells; +Cc: Alexander Viro, linux-fsdevel, cgroups, Li Zefan On Thu, Jan 03, 2019 at 12:26:23AM +0000, David Howells wrote: > Andrei Vagin <avagin@gmail.com> wrote: > > > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") > > commit was reverted by mistake. > > > > $ mkdir /tmp/cgroup > > $ mkdir /tmp/cgroup2 > > $ mount -t cgroup -o none,name=test test /tmp/cgroup > > $ mount -t cgroup -o none,name=test test /tmp/cgroup2 > > $ umount /tmp/cgroup > > $ umount /tmp/cgroup2 > > $ cat /proc/self/cgroup | grep test > > 12:name=test:/ > > > > You can see the test cgroup was not freed. > > > > Cc: Li Zefan <lizefan@huawei.com> > > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") > > Signed-off-by: Andrei Vagin <avagin@gmail.com> > > --- > > > > v2: clean up code and add the vfs/for-next tag > > v3: fix a reference leak when kernfs_node_dentry fails > > v4: call deactivate_locked_super() in a error case > > v5: don't dereference fc->root after dput() > > > > kernel/cgroup/cgroup.c | 25 ++++++++++++++++++------- > > 1 file changed, 18 insertions(+), 7 deletions(-) > > This patch doesn't work either. I'm sorry, but we can't say anything about this patch now, because it looks like recent changes in vfs-next break something else here... > > percpu ref (css_release) <= 0 (0) after switching to atomic > RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0 > > Btw, note that the subject says "v4" but the changelog says "v5". It is v5. > > David ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v4] cgroup: fix top cgroup refcnt leak 2019-01-03 0:43 ` Andrei Vagin @ 2019-01-03 1:00 ` Andrei Vagin 2019-01-03 3:54 ` [PATCH vfs/for-next v6] " Andrei Vagin 0 siblings, 1 reply; 24+ messages in thread From: Andrei Vagin @ 2019-01-03 1:00 UTC (permalink / raw) To: David Howells; +Cc: Alexander Viro, linux-fsdevel, cgroups, Li Zefan On Wed, Jan 02, 2019 at 04:43:39PM -0800, Andrei Vagin wrote: > On Thu, Jan 03, 2019 at 12:26:23AM +0000, David Howells wrote: > > Andrei Vagin <avagin@gmail.com> wrote: > > > > > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") > > > commit was reverted by mistake. > > > > > > $ mkdir /tmp/cgroup > > > $ mkdir /tmp/cgroup2 > > > $ mount -t cgroup -o none,name=test test /tmp/cgroup > > > $ mount -t cgroup -o none,name=test test /tmp/cgroup2 > > > $ umount /tmp/cgroup > > > $ umount /tmp/cgroup2 > > > $ cat /proc/self/cgroup | grep test > > > 12:name=test:/ > > > > > > You can see the test cgroup was not freed. > > > > > > Cc: Li Zefan <lizefan@huawei.com> > > > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") > > > Signed-off-by: Andrei Vagin <avagin@gmail.com> > > > --- > > > > > > v2: clean up code and add the vfs/for-next tag > > > v3: fix a reference leak when kernfs_node_dentry fails > > > v4: call deactivate_locked_super() in a error case > > > v5: don't dereference fc->root after dput() > > > > > > kernel/cgroup/cgroup.c | 25 ++++++++++++++++++------- > > > 1 file changed, 18 insertions(+), 7 deletions(-) > > > > This patch doesn't work either. > > I'm sorry, but we can't say anything about this patch now, because it > looks like recent changes in vfs-next break something else here... I found a reason why this patch doesn't work on Al's vfs/for-next: [avagin@laptop linux]$ git diff 40effd960becd8a355b7aafc789712afd64f5759..vfs/for-next kernel/cgroup/cgroup-v1.c | grep -B 5 -A 5 cgroup_get /* @@ -1280,8 +1285,8 @@ int cgroup1_get_tree(struct fs_context *fc) mutex_lock(&cgroup_mutex); percpu_ref_reinit(&root->cgrp.self.refcnt); mutex_unlock(&cgroup_mutex); + cgroup_get(&root->cgrp); } - cgroup_get(&root->cgrp); /* * If @pinned_sb, we're reusing an existing root and holding an 40effd960becd8a355b7aafc789712afd64f5759 is the previous head of vfs/for-next I reverted this hunk, applied my patch and all criu test passed. > > > > > percpu ref (css_release) <= 0 (0) after switching to atomic > > RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0 > > > > Btw, note that the subject says "v4" but the changelog says "v5". > > It is v5. > > > > > David ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH vfs/for-next v6] cgroup: fix top cgroup refcnt leak 2019-01-03 1:00 ` Andrei Vagin @ 2019-01-03 3:54 ` Andrei Vagin 2019-01-03 8:32 ` Al Viro 0 siblings, 1 reply; 24+ messages in thread From: Andrei Vagin @ 2019-01-03 3:54 UTC (permalink / raw) To: Alexander Viro, David Howells Cc: linux-fsdevel, cgroups, Andrei Vagin, Li Zefan It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") commit was reverted by mistake. $ mkdir /tmp/cgroup $ mkdir /tmp/cgroup2 $ mount -t cgroup -o none,name=test test /tmp/cgroup $ mount -t cgroup -o none,name=test test /tmp/cgroup2 $ umount /tmp/cgroup $ umount /tmp/cgroup2 $ cat /proc/self/cgroup | grep test 12:name=test:/ You can see the test cgroup was not freed. Cc: Li Zefan <lizefan@huawei.com> Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") Signed-off-by: Andrei Vagin <avagin@gmail.com> --- v2: clean up code and add the vfs/for-next tag v3: fix a reference leak when kernfs_node_dentry fails v4: call deactivate_locked_super() in a error case v5: don't dereference fc->root after dput() v6: rebase on today's vfs/for-next kernel/cgroup/cgroup-v1.c | 2 +- kernel/cgroup/cgroup.c | 25 ++++++++++++++++++------- 2 files changed, 19 insertions(+), 8 deletions(-) diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 4b189e821cad..de7d625ec077 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -1285,8 +1285,8 @@ int cgroup1_get_tree(struct fs_context *fc) mutex_lock(&cgroup_mutex); percpu_ref_reinit(&root->cgrp.self.refcnt); mutex_unlock(&cgroup_mutex); - cgroup_get(&root->cgrp); } + cgroup_get(&root->cgrp); /* * If @pinned_sb, we're reusing an existing root and holding an diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index a19f0fec9d82..fe67b5e81f9a 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc) ret = kernfs_get_tree(fc); if (ret < 0) - goto out_cgrp; + return ret; /* * In non-init cgroup namespace, instead of root cgroup's dentry, @@ -2038,19 +2038,30 @@ int cgroup_do_get_tree(struct fs_context *fc) mutex_unlock(&cgroup_mutex); nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb); - if (IS_ERR(nsdentry)) - return PTR_ERR(nsdentry); + if (IS_ERR(nsdentry)) { + ret = PTR_ERR(nsdentry); + goto out_cgrp; + } dput(fc->root); fc->root = nsdentry; } ret = 0; - if (ctx->kfc.new_sb_created) - goto out_cgrp; - apply_cgroup_root_flags(ctx->flags); - return 0; + if (!ctx->kfc.new_sb_created) + apply_cgroup_root_flags(ctx->flags); out_cgrp: + if (!ctx->kfc.new_sb_created) + cgroup_put(&ctx->root->cgrp); + + if (unlikely(ret)) { + struct super_block *sb = fc->root->d_sb; + + dput(fc->root); + deactivate_locked_super(sb); + fc->root = NULL; + } + return ret; } -- 2.17.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v6] cgroup: fix top cgroup refcnt leak 2019-01-03 3:54 ` [PATCH vfs/for-next v6] " Andrei Vagin @ 2019-01-03 8:32 ` Al Viro 2019-01-03 17:34 ` Andrei Vagin 2019-01-03 21:54 ` David Howells 0 siblings, 2 replies; 24+ messages in thread From: Al Viro @ 2019-01-03 8:32 UTC (permalink / raw) To: Andrei Vagin; +Cc: David Howells, linux-fsdevel, cgroups, Li Zefan On Wed, Jan 02, 2019 at 07:54:26PM -0800, Andrei Vagin wrote: [I'm thoroughly sick of refcounting in that thing, TBH ;-/] > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index a19f0fec9d82..fe67b5e81f9a 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc) > > ret = kernfs_get_tree(fc); > if (ret < 0) > - goto out_cgrp; > + return ret; Why does that case avoid needing cgroup_put()? Note, BTW, that we also have this: /* * Destroy a cgroup filesystem context. */ static void cgroup_fs_context_free(struct fs_context *fc) { struct cgroup_fs_context *ctx = cgroup_fc2context(fc); kfree(ctx->name); kfree(ctx->release_agent); if (ctx->root) cgroup_put(&ctx->root->cgrp); put_cgroup_ns(ctx->ns); kernfs_free_fs_context(fc); kfree(ctx); } which also needs to be taken into account. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v6] cgroup: fix top cgroup refcnt leak 2019-01-03 8:32 ` Al Viro @ 2019-01-03 17:34 ` Andrei Vagin 2019-01-03 21:54 ` David Howells 1 sibling, 0 replies; 24+ messages in thread From: Andrei Vagin @ 2019-01-03 17:34 UTC (permalink / raw) To: Al Viro; +Cc: David Howells, linux-fsdevel, cgroups, Li Zefan On Thu, Jan 03, 2019 at 08:32:29AM +0000, Al Viro wrote: > On Wed, Jan 02, 2019 at 07:54:26PM -0800, Andrei Vagin wrote: > > [I'm thoroughly sick of refcounting in that thing, TBH ;-/] feel your pain... > > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > > index a19f0fec9d82..fe67b5e81f9a 100644 > > --- a/kernel/cgroup/cgroup.c > > +++ b/kernel/cgroup/cgroup.c > > @@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc) > > > > ret = kernfs_get_tree(fc); > > if (ret < 0) > > - goto out_cgrp; > > + return ret; > > Why does that case avoid needing cgroup_put()? Note, BTW, that we > also have this: The origin patch which added this cgroup_put says that it is needed because mount() and kill_sb() is not a one-to-one match. sb is created in kernfs_get_tree(), so I decided that this cgroup_put() is needed only after kernfs_get_tree(). commit c6b3d5bcd67c75961a1e8b9564d1475c0f194a84 Author: Li Zefan <lizefan@huawei.com> Date: Fri Apr 4 17:14:41 2014 +0800 cgroup: fix top cgroup refcnt leak As mount() and kill_sb() is not a one-to-one match, If we mount the same cgroupfs in serveral mount points, and then umount all of them, kill_sb() will be called only once. > /* > * Destroy a cgroup filesystem context. > */ > static void cgroup_fs_context_free(struct fs_context *fc) > { > struct cgroup_fs_context *ctx = cgroup_fc2context(fc); > > kfree(ctx->name); > kfree(ctx->release_agent); > if (ctx->root) > cgroup_put(&ctx->root->cgrp); > put_cgroup_ns(ctx->ns); > kernfs_free_fs_context(fc); > kfree(ctx); > } > > which also needs to be taken into account. we have unconditional cgroup_get in cgroup1_get_tree() to match this put, but we need to check error paths. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v6] cgroup: fix top cgroup refcnt leak 2019-01-03 8:32 ` Al Viro 2019-01-03 17:34 ` Andrei Vagin @ 2019-01-03 21:54 ` David Howells 1 sibling, 0 replies; 24+ messages in thread From: David Howells @ 2019-01-03 21:54 UTC (permalink / raw) To: Andrei Vagin; +Cc: dhowells, Al Viro, linux-fsdevel, cgroups, Li Zefan Hi Andrei, It turns out that the cgroup-v1 refcounting is slightly broken upstream. If kernfs_get_inode() fails in kernfs_fill_super(), then there's refcount breakage in the error handling. This can be provoked by making the following change: --- a/fs/kernfs/mount.c +++ b/fs/kernfs/mount.c @@ -240,6 +240,8 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic) sb->s_shrink.seeks = 0; /* get root inode, initialize and unlock it */ + if (strcmp(current->comm, "foobar") == 0) + return -ENOANO; mutex_lock(&kernfs_mutex); inode = kernfs_get_inode(sb, info->root->kn); mutex_unlock(&kernfs_mutex); and then copying /bin/mount to /tmp/foobar and doing: [root@andromeda ~]# /tmp/foobar -t cgroup -o none,name=xxxxy xxx /tmp/x/a1 foobar: /tmp/x/a1: mount(2) system call failed: No anode. In dmesg I see the attached traces (see below). The problem appears to be because cgroup_do_mount() calls kernfs_mount(), but the refcount on the new root cgroup object hasn't been properly initialised yet. However, because we make it past sget(), the superblock thinks it has taken the caller's ref - and this gets eaten by cgroup_kill_sb(). Further, another ref appears to be released by cgroup_do_mount() in the event that kernfs_mount() fails. David ------------[ cut here ]------------ percpu_ref_kill_and_confirm called more than once on css_release! WARNING: CPU: 3 PID: 3218 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x4b/0x14c Modules linked in: CPU: 3 PID: 3218 Comm: bugger Not tainted 4.20.0-fscache+ #1287 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 RIP: 0010:percpu_ref_kill_and_confirm+0x4b/0x14c Code: c6 74 29 80 3d 42 0a 00 01 00 75 20 48 8b 53 10 48 c7 c6 f0 cd e9 81 48 c7 c7 82 99 16 82 c6 05 27 0a 00 01 01 e8 04 54 ac ff <0f> 0b 48 83 4b 08 02 4c 89 e6 48 89 df e8 fb fc ff ff 65 ff 05 54 RSP: 0018:ffff8880c5103c60 EFLAGS: 00010086 RAX: 0000000000000000 RBX: ffff8880d35e8020 RCX: ffff8880c5103b4c RDX: 0000000000000046 RSI: ffffffff8245fef8 RDI: ffffffff810ac1ec RBP: ffff8880c5103c78 R08: 0000000000000041 R09: 0000000000021900 R10: ffff8880c5103900 R11: 0000006423114dc0 R12: 0000000000000000 R13: 000000000027e0eb R14: 0000000000000246 R15: 0000000000000000 FS: 00007f8ec2dbb080(0000) GS:ffff8880c6d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005586974b0198 CR3: 00000000c4bfe002 CR4: 00000000001606e0 Call Trace: cgroup_kill_sb+0x131/0x141 deactivate_locked_super+0x29/0x5b kernfs_mount_ns+0x1fa/0x223 cgroup_do_mount+0x36/0x1c8 cgroup1_mount+0x5b2/0x610 cgroup_mount+0x33b/0x37f mount_fs+0x6a/0x10b vfs_kern_mount+0x67/0x13c do_mount+0x90e/0xb7e ? kmem_cache_alloc_trace+0x241/0x27d ksys_mount+0x72/0x97 __x64_sys_mount+0x21/0x24 do_syscall_64+0x7d/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f8ec1e14ada Code: 48 8b 0d c9 a3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 96 a3 2b 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe2fa32b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00005586974ae2a0 RCX: 00007f8ec1e14ada RDX: 00005586974ae480 RSI: 00005586974ae500 RDI: 00005586974ae4e0 RBP: 0000000000000000 R08: 00005586974ae4a0 R09: 00005586974ae480 R10: 00000000c0ed0000 R11: 0000000000000246 R12: 00005586974ae4e0 R13: 00005586974ae480 R14: 0000000000000000 R15: 00007f8ec2ba8184 irq event stamp: 3532 hardirqs last enabled at (3531): [<ffffffff811c0584>] kfree+0x152/0x159 hardirqs last disabled at (3532): [<ffffffff819e27e8>] _raw_spin_lock_irqsave+0x12/0x44 softirqs last enabled at (3326): [<ffffffff81c00353>] __do_softirq+0x353/0x38f softirqs last disabled at (3317): [<ffffffff81058c16>] irq_exit+0x63/0xd1 ---[ end trace 9bca09dc135d9213 ]--- WARNING: CPU: 3 PID: 3218 at lib/percpu-refcount.c:359 percpu_ref_reinit+0x10/0x17 Modules linked in: CPU: 3 PID: 3218 Comm: bugger Tainted: G W 4.20.0-fscache+ #1287 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 RIP: 0010:percpu_ref_reinit+0x10/0x17 Code: fa ff ff 48 8d 65 f0 4c 89 f6 48 c7 c7 e0 8d 51 82 5b 41 5e 5d e9 3c 50 45 00 48 8b 47 08 a8 03 74 08 48 8b 07 48 85 c0 74 02 <0f> 0b e9 c4 fe ff ff 55 31 f6 53 48 89 fb 48 83 ec 30 65 48 8b 04 RSP: 0018:ffff8880c5103d38 EFLAGS: 00010282 RAX: fffffffffffffffe RBX: ffff8880d35e8000 RCX: ffffffff810f1a8f RDX: 00000000ffffbf7e RSI: 00000000425b018d RDI: ffff8880d35e8020 RBP: ffff8880c5103da8 R08: 0000000000000001 R09: 0000000000000000 R10: ffff8880c5103d40 R11: 0000000000000001 R12: 0000000000000001 R13: 0000000000000000 R14: ffffffffffffffc9 R15: ffff88803f63f001 FS: 00007f8ec2dbb080(0000) GS:ffff8880c6d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005586974b0198 CR3: 00000000c4bfe002 CR4: 00000000001606e0 Call Trace: cgroup1_mount+0x5d1/0x610 cgroup_mount+0x33b/0x37f mount_fs+0x6a/0x10b vfs_kern_mount+0x67/0x13c do_mount+0x90e/0xb7e ? kmem_cache_alloc_trace+0x241/0x27d ksys_mount+0x72/0x97 __x64_sys_mount+0x21/0x24 do_syscall_64+0x7d/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f8ec1e14ada Code: 48 8b 0d c9 a3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 96 a3 2b 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe2fa32b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00005586974ae2a0 RCX: 00007f8ec1e14ada RDX: 00005586974ae480 RSI: 00005586974ae500 RDI: 00005586974ae4e0 RBP: 0000000000000000 R08: 00005586974ae4a0 R09: 00005586974ae480 R10: 00000000c0ed0000 R11: 0000000000000246 R12: 00005586974ae4e0 R13: 00005586974ae480 R14: 0000000000000000 R15: 00007f8ec2ba8184 irq event stamp: 3666 hardirqs last enabled at (3665): [<ffffffff810bb0b1>] __call_rcu+0x1dc/0x1fa hardirqs last disabled at (3666): [<ffffffff81001639>] trace_hardirqs_off_thunk+0x1a/0x1c softirqs last enabled at (3608): [<ffffffff81c00353>] __do_softirq+0x353/0x38f softirqs last disabled at (3535): [<ffffffff81058c16>] irq_exit+0x63/0xd1 ---[ end trace 9bca09dc135d9214 ]--- ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak 2018-12-28 23:59 [PATCH] cgroup: fix top cgroup refcnt leak Andrei Vagin 2018-12-29 0:04 ` [PATCH vfs/for-next v2] " Andrei Vagin @ 2019-01-02 22:26 ` David Howells 2019-01-02 23:06 ` Andrei Vagin 1 sibling, 1 reply; 24+ messages in thread From: David Howells @ 2019-01-02 22:26 UTC (permalink / raw) To: Andrei Vagin; +Cc: dhowells, Alexander Viro, linux-fsdevel, cgroups, Li Zefan Andrei Vagin <avagin@gmail.com> wrote: > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") > commit was reverted by mistake. > > $ mkdir /tmp/cgroup > $ mkdir /tmp/cgroup2 > $ mount -t cgroup -o none,name=test test /tmp/cgroup > $ mount -t cgroup -o none,name=test test /tmp/cgroup2 > $ umount /tmp/cgroup > $ umount /tmp/cgroup2 > $ cat /proc/self/cgroup | grep test > 12:name=test:/ > > You can see the test cgroup was not freed. > > Cc: Li Zefan <lizefan@huawei.com> > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") > Signed-off-by: Andrei Vagin <avagin@gmail.com> The kernel (Al's for-next branch, that is) seems to work fine without this patch; this patch causes the kernel to go bang (trace below). David --- percpu ref (css_release) <= 0 (0) after switching to atomic WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x90/0x1a0 Modules linked in: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.20.0-rc1-fscache+ #1256 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0 Code: d8 48 85 c0 7f 26 80 3d 29 60 ff 00 00 75 1d 48 8b 53 d8 48 c7 c7 00 3c 16 82 c6 05 15 60 ff 00 01 48 8b 73 e8 e8 26 1f ac ff <0f> 0b 48 8d 43 d8 48 89 c7 49 89 c6 ff 53 f0 48 c7 43 f0 00 00 00 RSP: 0018:ffff8800c6c83ed8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff8800d3a6e048 RCX: ffff8800c6c83dc4 RDX: 0000000000000003 RSI: ffff8800c5c9cb08 RDI: ffff8800c5c9c340 RBP: ffff8800c6c83ef8 R08: 000000000000003b R09: 0000000000021900 R10: ffff8800c6c83b00 R11: 0000004ac8b230c2 R12: 000060ff39019cd0 R13: ffffffff825b0be8 R14: ffffffffffffffff R15: 0000000000000009 FS: 0000000000000000(0000) GS:ffff8800c6c80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f33b5ad1f54 CR3: 0000000002410001 CR4: 00000000001606e0 Call Trace: <IRQ> ? rcu_process_callbacks+0x469/0x6de ? percpu_ref_exit+0x26/0x26 rcu_process_callbacks+0x4d7/0x6de __do_softirq+0x1a5/0x38f irq_exit+0x63/0xd1 smp_apic_timer_interrupt+0x1cd/0x1e0 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:cpuidle_enter_state+0x24e/0x2b1 Code: ff e8 7b be 8a ff 45 84 ed 74 17 9c 58 0f ba e0 09 73 08 0f 0b fa e8 63 8d 94 ff 31 ff e8 14 af 8f ff e8 cd 8b 94 ff fb 85 db <78> 47 48 8b 14 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48 RSP: 0018:ffff8800c5d23ea0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13 RAX: ffff8800c5c9c340 RBX: 0000000000000005 RCX: 000000000000001f RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffff8800c5c9c340 RBP: 0000000000000005 R08: 0000000000000002 R09: 0000000000021900 R10: 071c71c71c71c71c R11: 0000004ac7f2c3ed R12: ffffe8ffffc89ed0 R13: 0000000000000000 R14: ffffffff8251d478 R15: 0000000000000000 do_idle+0x163/0x1ea cpu_startup_entry+0x1d/0x1f start_secondary+0x175/0x190 secondary_startup_64+0xa4/0xb0 irq event stamp: 99625 hardirqs last enabled at (99624): [<ffffffff810b18a5>] vprintk_emit+0xe6/0x24a hardirqs last disabled at (99625): [<ffffffff81001639>] trace_hardirqs_off_thunk+0x1a/0x1c softirqs last enabled at (99590): [<ffffffff8105eba9>] irq_enter+0x42/0x5d softirqs last disabled at (99591): [<ffffffff8105ec27>] irq_exit+0x63/0xd1 ---[ end trace f59fd95ebc091779 ]--- ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak 2019-01-02 22:26 ` [PATCH vfs/for-next v2] " David Howells @ 2019-01-02 23:06 ` Andrei Vagin 2019-01-02 23:31 ` Andrei Vagin ` (5 more replies) 0 siblings, 6 replies; 24+ messages in thread From: Andrei Vagin @ 2019-01-02 23:06 UTC (permalink / raw) To: David Howells; +Cc: Alexander Viro, linux-fsdevel, cgroups, Li Zefan On Wed, Jan 02, 2019 at 10:26:20PM +0000, David Howells wrote: > Andrei Vagin <avagin@gmail.com> wrote: > > > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") > > commit was reverted by mistake. > > > > $ mkdir /tmp/cgroup > > $ mkdir /tmp/cgroup2 > > $ mount -t cgroup -o none,name=test test /tmp/cgroup > > $ mount -t cgroup -o none,name=test test /tmp/cgroup2 > > $ umount /tmp/cgroup > > $ umount /tmp/cgroup2 > > $ cat /proc/self/cgroup | grep test > > 12:name=test:/ > > > > You can see the test cgroup was not freed. > > > > Cc: Li Zefan <lizefan@huawei.com> > > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") > > Signed-off-by: Andrei Vagin <avagin@gmail.com> > > The kernel (Al's for-next branch, that is) seems to work fine without this > patch; It doesn't work for me. The mount system call stucks in cgroup1_get_tree: [root@fc24 ~]# uname -a Linux fc24 4.20.0-rc1-00071-g1fab5fff0a7a #4 SMP Wed Jan 2 14:59:36 PST 2019 x86_64 x86_64 x86_64 GNU/Linux [avagin@laptop linux]$ git log | head -n 6 commit 1fab5fff0a7ae1fa3b78383a78f7a56f03a3d673 Merge: ea5751ccd665 fd6261f4322c 4addd2640fca a40612ef0ee1 f91528955d00 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Fri Dec 28 02:05:08 2018 -0500 Merge branches 'work.mount', 'work.misc', 'misc.misc' and 'work.iov_iter' into for-next $ ps axf ... 636 ? S 0:00 \_ sshd: root@pts/0 643 pts/0 Ss 0:00 \_ -bash 710 pts/0 S 0:00 \_ python test/zdtm.py run -p 4 --keep-going --report report -T .*cgroup.* --ignore-taint 1449 pts/0 S 0:00 | \_ flock zdtm_mount_cgroups.lock ./zdtm_umount_cgroups 1450 pts/0 S 0:00 | \_ /bin/sh ./zdtm_umount_cgroups 1456 pts/0 D 0:00 | \_ mount -t cgroup -o none,name=zdtmtst.defaultroot zdtm zdtm.EW1qB8 [root@fc24 criu]# cat /proc/1456/stack [<0>] msleep+0x38/0x40 [<0>] cgroup1_get_tree+0x47b/0x76b [<0>] vfs_get_tree+0x3d/0x100 [<0>] do_mount+0x2d8/0xde0 [<0>] ksys_mount+0xba/0xd0 [<0>] __x64_sys_mount+0x21/0x30 [<0>] do_syscall_64+0x5a/0x200 [<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe [<0>] 0xffffffffffffffff > this patch causes the kernel to go bang (trace below). > > David > --- > percpu ref (css_release) <= 0 (0) after switching to atomic I have updated to Al's vfs-next and now I see this error too. I tested by patch on 40effd960bec ("Merge branches 'work.mount', 'work.misc', 'misc.misc' and 'work.iov_iter' into for-next") and it works fine there... > WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x90/0x1a0 > Modules linked in: > CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.20.0-rc1-fscache+ #1256 > Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 > RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0 > Code: d8 48 85 c0 7f 26 80 3d 29 60 ff 00 00 75 1d 48 8b 53 d8 48 c7 c7 00 3c 16 82 c6 05 15 60 ff 00 01 48 8b 73 e8 e8 26 1f ac ff <0f> 0b 48 8d 43 d8 48 89 c7 49 89 c6 ff 53 f0 48 c7 43 f0 00 00 00 > RSP: 0018:ffff8800c6c83ed8 EFLAGS: 00010286 > RAX: 0000000000000000 RBX: ffff8800d3a6e048 RCX: ffff8800c6c83dc4 > RDX: 0000000000000003 RSI: ffff8800c5c9cb08 RDI: ffff8800c5c9c340 > RBP: ffff8800c6c83ef8 R08: 000000000000003b R09: 0000000000021900 > R10: ffff8800c6c83b00 R11: 0000004ac8b230c2 R12: 000060ff39019cd0 > R13: ffffffff825b0be8 R14: ffffffffffffffff R15: 0000000000000009 > FS: 0000000000000000(0000) GS:ffff8800c6c80000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f33b5ad1f54 CR3: 0000000002410001 CR4: 00000000001606e0 > Call Trace: > <IRQ> > ? rcu_process_callbacks+0x469/0x6de > ? percpu_ref_exit+0x26/0x26 > rcu_process_callbacks+0x4d7/0x6de > __do_softirq+0x1a5/0x38f > irq_exit+0x63/0xd1 > smp_apic_timer_interrupt+0x1cd/0x1e0 > apic_timer_interrupt+0xf/0x20 > </IRQ> > RIP: 0010:cpuidle_enter_state+0x24e/0x2b1 > Code: ff e8 7b be 8a ff 45 84 ed 74 17 9c 58 0f ba e0 09 73 08 0f 0b fa e8 63 8d 94 ff 31 ff e8 14 af 8f ff e8 cd 8b 94 ff fb 85 db <78> 47 48 8b 14 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48 > RSP: 0018:ffff8800c5d23ea0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13 > RAX: ffff8800c5c9c340 RBX: 0000000000000005 RCX: 000000000000001f > RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffff8800c5c9c340 > RBP: 0000000000000005 R08: 0000000000000002 R09: 0000000000021900 > R10: 071c71c71c71c71c R11: 0000004ac7f2c3ed R12: ffffe8ffffc89ed0 > R13: 0000000000000000 R14: ffffffff8251d478 R15: 0000000000000000 > do_idle+0x163/0x1ea > cpu_startup_entry+0x1d/0x1f > start_secondary+0x175/0x190 > secondary_startup_64+0xa4/0xb0 > irq event stamp: 99625 > hardirqs last enabled at (99624): [<ffffffff810b18a5>] vprintk_emit+0xe6/0x24a > hardirqs last disabled at (99625): [<ffffffff81001639>] trace_hardirqs_off_thunk+0x1a/0x1c > softirqs last enabled at (99590): [<ffffffff8105eba9>] irq_enter+0x42/0x5d > softirqs last disabled at (99591): [<ffffffff8105ec27>] irq_exit+0x63/0xd1 > ---[ end trace f59fd95ebc091779 ]--- ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak 2019-01-02 23:06 ` Andrei Vagin @ 2019-01-02 23:31 ` Andrei Vagin 2019-01-03 0:33 ` David Howells ` (4 subsequent siblings) 5 siblings, 0 replies; 24+ messages in thread From: Andrei Vagin @ 2019-01-02 23:31 UTC (permalink / raw) To: David Howells; +Cc: Alexander Viro, linux-fsdevel, cgroups, Li Zefan On Wed, Jan 02, 2019 at 03:06:55PM -0800, Andrei Vagin wrote: > On Wed, Jan 02, 2019 at 10:26:20PM +0000, David Howells wrote: > > Andrei Vagin <avagin@gmail.com> wrote: > > > > > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") > > > commit was reverted by mistake. > > > > > > $ mkdir /tmp/cgroup > > > $ mkdir /tmp/cgroup2 > > > $ mount -t cgroup -o none,name=test test /tmp/cgroup > > > $ mount -t cgroup -o none,name=test test /tmp/cgroup2 > > > $ umount /tmp/cgroup > > > $ umount /tmp/cgroup2 > > > $ cat /proc/self/cgroup | grep test > > > 12:name=test:/ > > > > > > You can see the test cgroup was not freed. > > > > > > Cc: Li Zefan <lizefan@huawei.com> > > > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") > > > Signed-off-by: Andrei Vagin <avagin@gmail.com> > > > > The kernel (Al's for-next branch, that is) seems to work fine without this > > patch; > > It doesn't work for me. The mount system call stucks in cgroup1_get_tree: Here is a reproducer for this problem: [root@fc24 ~]# cat fs-vs-cg set -m d=$(mktemp -d /tmp/cg.XXXXXX) for i in `seq 2`; do mkdir $d/a$i mount -t cgroup -o none,name=xxxxy xxx $d/a$i done mkdir -p $d/a1/test for i in `seq 2`; do umount $d/a$i done d=$(mktemp -d /tmp/cg.XXXXXX) for i in `seq 2`; do mkdir $d/a$i mount -t cgroup -o none,name=xxxxy xxx $d/a$i done rmdir $d/a1/test for i in `seq 2`; do umount $d/a$i done You need to execute this script twice and it will stuck on a second attmept. I tested it on Al's vfs-next without my patch. You will see this warning in the kernel log: [ 47.043634] ------------[ cut here ]------------ [ 47.044522] percpu ref (css_release) <= 0 (0) after switching to atomic [ 47.044544] WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1dc/0x210 [ 47.047369] Modules linked in: [ 47.047911] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.20.0-rc1-00071-g1fab5fff0a7a #4 So my patch is probably correct and the problem is in another place. > > [root@fc24 ~]# uname -a > Linux fc24 4.20.0-rc1-00071-g1fab5fff0a7a #4 SMP Wed Jan 2 14:59:36 PST 2019 x86_64 x86_64 x86_64 GNU/Linux > > [avagin@laptop linux]$ git log | head -n 6 > commit 1fab5fff0a7ae1fa3b78383a78f7a56f03a3d673 > Merge: ea5751ccd665 fd6261f4322c 4addd2640fca a40612ef0ee1 f91528955d00 > Author: Al Viro <viro@zeniv.linux.org.uk> > Date: Fri Dec 28 02:05:08 2018 -0500 > > Merge branches 'work.mount', 'work.misc', 'misc.misc' and 'work.iov_iter' into for-next > > > $ ps axf > ... > 636 ? S 0:00 \_ sshd: root@pts/0 > 643 pts/0 Ss 0:00 \_ -bash > 710 pts/0 S 0:00 \_ python test/zdtm.py run -p 4 --keep-going --report report -T .*cgroup.* --ignore-taint > 1449 pts/0 S 0:00 | \_ flock zdtm_mount_cgroups.lock ./zdtm_umount_cgroups > 1450 pts/0 S 0:00 | \_ /bin/sh ./zdtm_umount_cgroups > 1456 pts/0 D 0:00 | \_ mount -t cgroup -o none,name=zdtmtst.defaultroot zdtm zdtm.EW1qB8 > > > [root@fc24 criu]# cat /proc/1456/stack > [<0>] msleep+0x38/0x40 > [<0>] cgroup1_get_tree+0x47b/0x76b > [<0>] vfs_get_tree+0x3d/0x100 > [<0>] do_mount+0x2d8/0xde0 > [<0>] ksys_mount+0xba/0xd0 > [<0>] __x64_sys_mount+0x21/0x30 > [<0>] do_syscall_64+0x5a/0x200 > [<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe > [<0>] 0xffffffffffffffff > > > this patch causes the kernel to go bang (trace below). > > > > David > > --- > > percpu ref (css_release) <= 0 (0) after switching to atomic > > I have updated to Al's vfs-next and now I see this error too. > > I tested by patch on 40effd960bec ("Merge branches 'work.mount', 'work.misc', 'misc.misc' and 'work.iov_iter' into for-next") > and it works fine there... > > > > WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x90/0x1a0 > > Modules linked in: > > CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.20.0-rc1-fscache+ #1256 > > Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 > > RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0 > > Code: d8 48 85 c0 7f 26 80 3d 29 60 ff 00 00 75 1d 48 8b 53 d8 48 c7 c7 00 3c 16 82 c6 05 15 60 ff 00 01 48 8b 73 e8 e8 26 1f ac ff <0f> 0b 48 8d 43 d8 48 89 c7 49 89 c6 ff 53 f0 48 c7 43 f0 00 00 00 > > RSP: 0018:ffff8800c6c83ed8 EFLAGS: 00010286 > > RAX: 0000000000000000 RBX: ffff8800d3a6e048 RCX: ffff8800c6c83dc4 > > RDX: 0000000000000003 RSI: ffff8800c5c9cb08 RDI: ffff8800c5c9c340 > > RBP: ffff8800c6c83ef8 R08: 000000000000003b R09: 0000000000021900 > > R10: ffff8800c6c83b00 R11: 0000004ac8b230c2 R12: 000060ff39019cd0 > > R13: ffffffff825b0be8 R14: ffffffffffffffff R15: 0000000000000009 > > FS: 0000000000000000(0000) GS:ffff8800c6c80000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 00007f33b5ad1f54 CR3: 0000000002410001 CR4: 00000000001606e0 > > Call Trace: > > <IRQ> > > ? rcu_process_callbacks+0x469/0x6de > > ? percpu_ref_exit+0x26/0x26 > > rcu_process_callbacks+0x4d7/0x6de > > __do_softirq+0x1a5/0x38f > > irq_exit+0x63/0xd1 > > smp_apic_timer_interrupt+0x1cd/0x1e0 > > apic_timer_interrupt+0xf/0x20 > > </IRQ> > > RIP: 0010:cpuidle_enter_state+0x24e/0x2b1 > > Code: ff e8 7b be 8a ff 45 84 ed 74 17 9c 58 0f ba e0 09 73 08 0f 0b fa e8 63 8d 94 ff 31 ff e8 14 af 8f ff e8 cd 8b 94 ff fb 85 db <78> 47 48 8b 14 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48 > > RSP: 0018:ffff8800c5d23ea0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13 > > RAX: ffff8800c5c9c340 RBX: 0000000000000005 RCX: 000000000000001f > > RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffff8800c5c9c340 > > RBP: 0000000000000005 R08: 0000000000000002 R09: 0000000000021900 > > R10: 071c71c71c71c71c R11: 0000004ac7f2c3ed R12: ffffe8ffffc89ed0 > > R13: 0000000000000000 R14: ffffffff8251d478 R15: 0000000000000000 > > do_idle+0x163/0x1ea > > cpu_startup_entry+0x1d/0x1f > > start_secondary+0x175/0x190 > > secondary_startup_64+0xa4/0xb0 > > irq event stamp: 99625 > > hardirqs last enabled at (99624): [<ffffffff810b18a5>] vprintk_emit+0xe6/0x24a > > hardirqs last disabled at (99625): [<ffffffff81001639>] trace_hardirqs_off_thunk+0x1a/0x1c > > softirqs last enabled at (99590): [<ffffffff8105eba9>] irq_enter+0x42/0x5d > > softirqs last disabled at (99591): [<ffffffff8105ec27>] irq_exit+0x63/0xd1 > > ---[ end trace f59fd95ebc091779 ]--- ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak 2019-01-02 23:06 ` Andrei Vagin 2019-01-02 23:31 ` Andrei Vagin @ 2019-01-03 0:33 ` David Howells 2019-01-03 13:41 ` David Howells ` (3 subsequent siblings) 5 siblings, 0 replies; 24+ messages in thread From: David Howells @ 2019-01-03 0:33 UTC (permalink / raw) To: Andrei Vagin; +Cc: dhowells, Alexander Viro, linux-fsdevel, cgroups, Li Zefan Andrei Vagin <avagin@gmail.com> wrote: > Here is a reproducer for this problem: > > [root@fc24 ~]# cat fs-vs-cg > set -m > d=$(mktemp -d /tmp/cg.XXXXXX) > for i in `seq 2`; do > mkdir $d/a$i > mount -t cgroup -o none,name=xxxxy xxx $d/a$i > done > mkdir -p $d/a1/test > for i in `seq 2`; do > umount $d/a$i > done > d=$(mktemp -d /tmp/cg.XXXXXX) > for i in `seq 2`; do > mkdir $d/a$i > mount -t cgroup -o none,name=xxxxy xxx $d/a$i > done > rmdir $d/a1/test > for i in `seq 2`; do > umount $d/a$i > done > > You need to execute this script twice and it will stuck on a second > attmept. This causes: percpu ref (css_release) <= 0 (0) after switching to atomic without your patch and: percpu ref (css_release) <= 0 (-2) after switching to atomic with your patch, both followed by: WARNING: CPU: 0 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x90/0x1a0 on Al's for-next branch. David ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak 2019-01-02 23:06 ` Andrei Vagin 2019-01-02 23:31 ` Andrei Vagin 2019-01-03 0:33 ` David Howells @ 2019-01-03 13:41 ` David Howells 2019-01-03 13:42 ` David Howells ` (2 subsequent siblings) 5 siblings, 0 replies; 24+ messages in thread From: David Howells @ 2019-01-03 13:41 UTC (permalink / raw) To: Andrei Vagin; +Cc: dhowells, Alexander Viro, linux-fsdevel, cgroups, Li Zefan I'm seeing something a bit weird, but I'm not sure why. If I expand Andrei's test script to: set -m d=$(mktemp -d /tmp/cg.XXXXXX) mkdir $d/a1 mount -t cgroup -o none,name=xxxxy xxx $d/a1 mkdir $d/a2 mount -t cgroup -o none,name=xxxxy xxx $d/a2 mkdir -p $d/a1/test umount $d/a1 umount $d/a2 d=$(mktemp -d /tmp/cg.XXXXXX) mkdir $d/a1 mount -t cgroup -o none,name=xxxxy xxx $d/a1 mkdir $d/a2 mount -t cgroup -o none,name=xxxxy xxx $d/a2 rmdir $d/a1/test umount $d/a1 umount $d/a2 then a cgroup gets leaked at the end. Just retaining the final: for i in `seq 2`; do umount $d/a$i done in place of the two final expanded umount commands fixes the issue. The mounts are getting released, but not the cgroups as far as I can tell. David ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak 2019-01-02 23:06 ` Andrei Vagin ` (2 preceding siblings ...) 2019-01-03 13:41 ` David Howells @ 2019-01-03 13:42 ` David Howells 2019-01-03 15:27 ` David Howells 2019-01-03 15:44 ` David Howells 5 siblings, 0 replies; 24+ messages in thread From: David Howells @ 2019-01-03 13:42 UTC (permalink / raw) Cc: dhowells, Andrei Vagin, Alexander Viro, linux-fsdevel, cgroups, Li Zefan David Howells <dhowells@redhat.com> wrote: > umount $d/a1 > umount $d/a2 Putting a small sleep between the two fixes the problem too. David ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak 2019-01-02 23:06 ` Andrei Vagin ` (3 preceding siblings ...) 2019-01-03 13:42 ` David Howells @ 2019-01-03 15:27 ` David Howells 2019-01-03 15:44 ` David Howells 5 siblings, 0 replies; 24+ messages in thread From: David Howells @ 2019-01-03 15:27 UTC (permalink / raw) To: Andrei Vagin; +Cc: dhowells, Alexander Viro, linux-fsdevel, cgroups, Li Zefan David Howells <dhowells@redhat.com> wrote: > umount $d/a1 > umount $d/a2 > > then a cgroup gets leaked at the end. Just retaining the final: > > for i in `seq 2`; do > umount $d/a$i > done > > in place of the two final expanded umount commands fixes the issue. The issue also goes away if I set PERCPU_REF_INIT_ATOMIC on the refcount object when initialising it:-/ For reference, the sequence of css_get/tryget/put calls is: *** 00000000e2655b5f RIN 1: cgroup1_get_tree+0x5de/0x73b *** 00000000e2655b5f GET 2: cgroup1_get_tree+0x6e1/0x73b *** 00000000e2655b5f PUT 1: cgroup_fs_context_free+0x12c/0x15f *** 00000000e2655b5f TRYO 2: cgroup1_get_tree+0x4aa/0x73b *** 00000000e2655b5f PUT 1: cgroup_do_get_tree+0x1c5/0x1fa *** 00000000e2655b5f GET 2: cgroup1_get_tree+0x6e1/0x73b *** 00000000e2655b5f PUT 1: cgroup_fs_context_free+0x12c/0x15f *** 00000000e2655b5f TRY 2: cgroup_kn_lock_live+0x145/0x18a *** 00000000e2655b5f GET 3: cgroup_mkdir+0x241/0x454 *** 00000000e2655b5f PUT 2: cgroup_mkdir+0xad/0x454 *** 00000000e2655b5f PUT 1: cgroup_kill_sb+0x134/0x161 *** 00000000e2655b5f TRYO 2: cgroup1_get_tree+0x4aa/0x73b *** 00000000e2655b5f GET 3: cgroup1_get_tree+0x6e1/0x73b *** 00000000e2655b5f PUT 2: cgroup_fs_context_free+0x12c/0x15f *** 00000000e2655b5f TRYO 3: cgroup1_get_tree+0x4aa/0x73b *** 00000000e2655b5f PUT 2: cgroup_do_get_tree+0x1c5/0x1fa *** 00000000e2655b5f GET 3: cgroup1_get_tree+0x6e1/0x73b *** 00000000e2655b5f PUT 2: cgroup_fs_context_free+0x12c/0x15f *** 00000000e2655b5f KIL 2: cgroup_kill_sb+0x146/0x161 *** 00000000e2655b5f PUT 0: css_free_rwork_fn+0x399/0x5ec As obtained with the attached patch. I'm not sure the refcount is correct after the second block (at the end of the second mount). Should it have one ref per active sb? David --- diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 5e1694fe035b..23d022eb669f 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -165,6 +165,8 @@ struct cgroup_subsys_state { * fields of the containing structure. */ struct cgroup_subsys_state *parent; + + bool debug; }; /* diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index bb0c7da50ed2..76ca2be4986d 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -295,6 +295,8 @@ void css_task_iter_end(struct css_task_iter *it); * Inline functions. */ +extern void css_refcount(struct cgroup_subsys_state *css, const char *op); + /** * css_get - obtain a reference on the specified css * @css: target css @@ -303,8 +305,10 @@ void css_task_iter_end(struct css_task_iter *it); */ static inline void css_get(struct cgroup_subsys_state *css) { - if (!(css->flags & CSS_NO_REF)) + if (!(css->flags & CSS_NO_REF)) { percpu_ref_get(&css->refcnt); + css_refcount(css, "GET"); + } } /** @@ -316,8 +320,10 @@ static inline void css_get(struct cgroup_subsys_state *css) */ static inline void css_get_many(struct cgroup_subsys_state *css, unsigned int n) { - if (!(css->flags & CSS_NO_REF)) + if (!(css->flags & CSS_NO_REF)) { percpu_ref_get_many(&css->refcnt, n); + css_refcount(css, "GETM"); + } } /** @@ -333,8 +339,11 @@ static inline void css_get_many(struct cgroup_subsys_state *css, unsigned int n) */ static inline bool css_tryget(struct cgroup_subsys_state *css) { - if (!(css->flags & CSS_NO_REF)) - return percpu_ref_tryget(&css->refcnt); + if (css->flags & CSS_NO_REF) + return true; + if (!percpu_ref_tryget(&css->refcnt)) + return false; + css_refcount(css, "TRY"); return true; } @@ -350,8 +359,11 @@ static inline bool css_tryget(struct cgroup_subsys_state *css) */ static inline bool css_tryget_online(struct cgroup_subsys_state *css) { - if (!(css->flags & CSS_NO_REF)) - return percpu_ref_tryget_live(&css->refcnt); + if (css->flags & CSS_NO_REF) + return true; + if (!percpu_ref_tryget_live(&css->refcnt)) + return false; + css_refcount(css, "TRYO"); return true; } @@ -383,8 +395,10 @@ static inline bool css_is_dying(struct cgroup_subsys_state *css) */ static inline void css_put(struct cgroup_subsys_state *css) { - if (!(css->flags & CSS_NO_REF)) + if (!(css->flags & CSS_NO_REF)) { percpu_ref_put(&css->refcnt); + css_refcount(css, "PUT"); + } } /** @@ -396,8 +410,10 @@ static inline void css_put(struct cgroup_subsys_state *css) */ static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n) { - if (!(css->flags & CSS_NO_REF)) + if (!(css->flags & CSS_NO_REF)) { percpu_ref_put_many(&css->refcnt, n); + css_refcount(css, "PUTM"); + } } static inline void cgroup_get(struct cgroup *cgrp) diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index de7d625ec077..d87a0b8c3441 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -1175,7 +1175,7 @@ int cgroup1_get_tree(struct fs_context *fc) ss->root == &cgrp_dfl_root) continue; - if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) { + if (!css_tryget_online(&ss->root->cgrp.self)) { mutex_unlock(&cgroup_mutex); goto err_restart; } @@ -1228,7 +1228,7 @@ int cgroup1_get_tree(struct fs_context *fc) */ pinned_sb = kernfs_pin_sb(root->kf_root, NULL); if (IS_ERR(pinned_sb) || - !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) { + !css_tryget_online(&root->cgrp.self)) { mutex_unlock(&cgroup_mutex); if (!IS_ERR_OR_NULL(pinned_sb)) deactivate_super(pinned_sb); @@ -1284,6 +1284,7 @@ int cgroup1_get_tree(struct fs_context *fc) if (new_root) { mutex_lock(&cgroup_mutex); percpu_ref_reinit(&root->cgrp.self.refcnt); + css_refcount(&root->cgrp.self, "RIN"); mutex_unlock(&cgroup_mutex); } cgroup_get(&root->cgrp); diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 7dbe71b23941..299e3fb30731 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1904,8 +1904,11 @@ void init_cgroup_root(struct cgroup_fs_context *ctx) root->flags = ctx->flags; if (ctx->release_agent) strscpy(root->release_agent_path, ctx->release_agent, PATH_MAX); - if (ctx->name) + if (ctx->name) { strscpy(root->name, ctx->name, MAX_CGROUP_ROOT_NAMELEN); + if (strcmp(root->name, "xxxxy") == 0) + cgrp->self.debug = true; + } if (ctx->cpuset_clone_children) set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags); } @@ -1927,7 +1930,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags) root_cgrp->ancestor_ids[0] = ret; ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release, - ref_flags, GFP_KERNEL); + ref_flags | PERCPU_REF_INIT_ATOMIC, GFP_KERNEL); if (ret) goto out; @@ -2166,8 +2169,10 @@ static void cgroup_kill_sb(struct super_block *sb) if (!list_empty(&root->cgrp.self.children) || root == &cgrp_dfl_root) cgroup_put(&root->cgrp); - else + else { + css_refcount(&root->cgrp.self, "KIL"); percpu_ref_kill(&root->cgrp.self.refcnt); + } kernfs_kill_sb(sb); } @@ -4891,7 +4896,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp, init_and_link_css(css, ss, cgrp); - err = percpu_ref_init(&css->refcnt, css_release, 0, GFP_KERNEL); + err = percpu_ref_init(&css->refcnt, css_release, PERCPU_REF_INIT_ATOMIC, GFP_KERNEL); if (err) goto err_free_css; @@ -4946,7 +4951,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent) if (!cgrp) return ERR_PTR(-ENOMEM); - ret = percpu_ref_init(&cgrp->self.refcnt, css_release, 0, GFP_KERNEL); + ret = percpu_ref_init(&cgrp->self.refcnt, css_release, PERCPU_REF_INIT_ATOMIC, GFP_KERNEL); if (ret) goto out_free_cgrp; @@ -5194,6 +5199,7 @@ static void kill_css(struct cgroup_subsys_state *css) * Use percpu_ref_kill_and_confirm() to get notifications as each * css is confirmed to be seen as killed on all CPUs. */ + css_refcount(css, "KLC"); percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn); } @@ -5278,6 +5284,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) cgroup1_check_for_release(parent); /* put the base reference */ + css_refcount(&cgrp->self, "KIL"); percpu_ref_kill(&cgrp->self.refcnt); return 0; @@ -6106,3 +6113,11 @@ static int __init cgroup_sysfs_init(void) } subsys_initcall(cgroup_sysfs_init); #endif /* CONFIG_SYSFS */ + +noinline void css_refcount(struct cgroup_subsys_state *css, const char *op) +{ + if (css->debug) + printk("*** %p %s %2ld: %pSR\n", + css, op, atomic_long_read(&css->refcnt.count), + __builtin_return_address(0)); +} ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak 2019-01-02 23:06 ` Andrei Vagin ` (4 preceding siblings ...) 2019-01-03 15:27 ` David Howells @ 2019-01-03 15:44 ` David Howells 5 siblings, 0 replies; 24+ messages in thread From: David Howells @ 2019-01-03 15:44 UTC (permalink / raw) Cc: dhowells, Andrei Vagin, Alexander Viro, linux-fsdevel, cgroups, Li Zefan David Howells <dhowells@redhat.com> wrote: > I'm not sure the refcount is correct after the second block (at the end of the > second mount). Should it have one ref per active sb? No, it shouldn't because there's only one sb. David ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2019-01-03 21:54 UTC | newest] Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-12-28 23:59 [PATCH] cgroup: fix top cgroup refcnt leak Andrei Vagin 2018-12-29 0:04 ` [PATCH vfs/for-next v2] " Andrei Vagin 2018-12-30 19:41 ` Andrei Vagin 2019-01-02 2:28 ` Al Viro 2019-01-02 18:14 ` [PATCH vfs/for-next v3] " Andrei Vagin 2019-01-02 19:37 ` [PATCH vfs/for-next v2] " Andrei Vagin 2019-01-02 19:37 ` [PATCH vfs/for-next v4] " Andrei Vagin 2019-01-02 20:02 ` Al Viro 2019-01-02 21:06 ` Andrei Vagin 2019-01-03 0:26 ` David Howells 2019-01-03 0:43 ` Andrei Vagin 2019-01-03 1:00 ` Andrei Vagin 2019-01-03 3:54 ` [PATCH vfs/for-next v6] " Andrei Vagin 2019-01-03 8:32 ` Al Viro 2019-01-03 17:34 ` Andrei Vagin 2019-01-03 21:54 ` David Howells 2019-01-02 22:26 ` [PATCH vfs/for-next v2] " David Howells 2019-01-02 23:06 ` Andrei Vagin 2019-01-02 23:31 ` Andrei Vagin 2019-01-03 0:33 ` David Howells 2019-01-03 13:41 ` David Howells 2019-01-03 13:42 ` David Howells 2019-01-03 15:27 ` David Howells 2019-01-03 15:44 ` David Howells
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.