linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* linux-next: cgroup_mount() falls asleep forever
@ 2014-09-24 10:31 Andrey Wagin
  2014-09-24 14:29 ` Andrey Wagin
  2014-09-24 16:13 ` Andrey Wagin
  0 siblings, 2 replies; 11+ messages in thread
From: Andrey Wagin @ 2014-09-24 10:31 UTC (permalink / raw)
  To: LKML, Tejun Heo, Li Zefan

Hi All,

I execute CRIU tests on linux-next. Today I found that one of tests
hangs up forever.

[root@linux-next-test linux-next]# git describe HEAD
next-20140922
[root@linux-next-test ~]# ps axf
...
  698 ?        Ss     0:05  \_ sshd: root@notty
  700 ?        Ss     0:00  |   \_ bash -x
jenkins-scripts/jenkins-ct.sh jenkins.sh
  706 ?        S      0:00  |       \_ bash -c ( mount --make-rprivate
/ && umount -l  /proc && mount -t proc proc /proc/ && bash -x
jenkins.sh)
  707 ?        S      0:00  |           \_ bash -c ( mount
--make-rprivate / && umount -l  /proc && mount -t proc proc /proc/ &&
bash -x jenkins.sh)
  711 ?        S      0:00  |               \_ bash -x jenkins.sh
 2981 ?        S      0:05  |                   \_ bash -x
test/zdtm.sh -C -x .*\(maps01\|maps04\)
 6717 ?        S      0:00  |                       \_ /bin/bash
zdtm/live/static/cgroup00.hook --clean
 6719 ?        D     11:13  |                           \_ mount -t
cgroup none cgclean.cWgK71 -o none,name=zdtmtst

[root@linux-next-test ~]# cat /proc/6719/stack
[<ffffffff8111fcb7>] msleep+0x37/0x50
[<ffffffff8114bde2>] cgroup_mount+0x712/0xa50
[<ffffffff8124d0e9>] mount_fs+0x39/0x1b0
[<ffffffff8126d76b>] vfs_kern_mount+0x6b/0x150
[<ffffffff8127075b>] do_mount+0x22b/0xc20
[<ffffffff81271492>] SyS_mount+0xa2/0x110
[<ffffffff817e3ba9>] system_call_fastpath+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff

Let me know which info do you need for investigation.

Steps to reproduce:

git clone https://github.com/xemul/criu.git
cd criu
make
bash test/zdtm.sh static/cgroup00

Thanks,
Andrew

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: linux-next: cgroup_mount() falls asleep forever
  2014-09-24 10:31 linux-next: cgroup_mount() falls asleep forever Andrey Wagin
@ 2014-09-24 14:29 ` Andrey Wagin
  2014-09-24 18:31   ` Cong Wang
  2014-09-24 18:52   ` Al Viro
  2014-09-24 16:13 ` Andrey Wagin
  1 sibling, 2 replies; 11+ messages in thread
From: Andrey Wagin @ 2014-09-24 14:29 UTC (permalink / raw)
  To: LKML, Tejun Heo, Li Zefan

2014-09-24 14:31 GMT+04:00 Andrey Wagin <avagin@gmail.com>:
> Hi All,

The problem is in a following commit:

commit 0c7bf3e8cab7900e17ce7f97104c39927d835469
Author: Zefan Li <lizefan@huawei.com>
Date:   Sat Sep 20 14:49:10 2014 +0800

    cgroup: remove redundant variable in cgroup_mount()

    Both pinned_sb and new_sb indicate if a new superblock is needed,
    so we can just remove new_sb.

    Note now we must check if kernfs_tryget_sb() returns NULL, because
    when it returns NULL, kernfs_mount() may still re-use an existing
    superblock, which is just allocated by another concurent mount.

    Suggested-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Zefan Li <lizefan@huawei.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

>
> I execute CRIU tests on linux-next. Today I found that one of tests
> hangs up forever.
>
> [root@linux-next-test linux-next]# git describe HEAD
> next-20140922
> [root@linux-next-test ~]# ps axf
> ...
>   698 ?        Ss     0:05  \_ sshd: root@notty
>   700 ?        Ss     0:00  |   \_ bash -x
> jenkins-scripts/jenkins-ct.sh jenkins.sh
>   706 ?        S      0:00  |       \_ bash -c ( mount --make-rprivate
> / && umount -l  /proc && mount -t proc proc /proc/ && bash -x
> jenkins.sh)
>   707 ?        S      0:00  |           \_ bash -c ( mount
> --make-rprivate / && umount -l  /proc && mount -t proc proc /proc/ &&
> bash -x jenkins.sh)
>   711 ?        S      0:00  |               \_ bash -x jenkins.sh
>  2981 ?        S      0:05  |                   \_ bash -x
> test/zdtm.sh -C -x .*\(maps01\|maps04\)
>  6717 ?        S      0:00  |                       \_ /bin/bash
> zdtm/live/static/cgroup00.hook --clean
>  6719 ?        D     11:13  |                           \_ mount -t
> cgroup none cgclean.cWgK71 -o none,name=zdtmtst
>
> [root@linux-next-test ~]# cat /proc/6719/stack
> [<ffffffff8111fcb7>] msleep+0x37/0x50
> [<ffffffff8114bde2>] cgroup_mount+0x712/0xa50
> [<ffffffff8124d0e9>] mount_fs+0x39/0x1b0
> [<ffffffff8126d76b>] vfs_kern_mount+0x6b/0x150
> [<ffffffff8127075b>] do_mount+0x22b/0xc20
> [<ffffffff81271492>] SyS_mount+0xa2/0x110
> [<ffffffff817e3ba9>] system_call_fastpath+0x12/0x17
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Let me know which info do you need for investigation.
>
> Steps to reproduce:
>
> git clone https://github.com/xemul/criu.git
> cd criu
> make
> bash test/zdtm.sh static/cgroup00
>
> Thanks,
> Andrew

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: linux-next: cgroup_mount() falls asleep forever
  2014-09-24 10:31 linux-next: cgroup_mount() falls asleep forever Andrey Wagin
  2014-09-24 14:29 ` Andrey Wagin
@ 2014-09-24 16:13 ` Andrey Wagin
  1 sibling, 0 replies; 11+ messages in thread
From: Andrey Wagin @ 2014-09-24 16:13 UTC (permalink / raw)
  To: LKML, Tejun Heo, Li Zefan

2014-09-24 14:31 GMT+04:00 Andrey Wagin <avagin@gmail.com>:
> Hi All,
>
> I execute CRIU tests on linux-next. Today I found that one of tests
> hangs up forever.
>
> [root@linux-next-test linux-next]# git describe HEAD
> next-20140922
> [root@linux-next-test ~]# ps axf
> ...
>   698 ?        Ss     0:05  \_ sshd: root@notty
>   700 ?        Ss     0:00  |   \_ bash -x
> jenkins-scripts/jenkins-ct.sh jenkins.sh
>   706 ?        S      0:00  |       \_ bash -c ( mount --make-rprivate
> / && umount -l  /proc && mount -t proc proc /proc/ && bash -x
> jenkins.sh)
>   707 ?        S      0:00  |           \_ bash -c ( mount
> --make-rprivate / && umount -l  /proc && mount -t proc proc /proc/ &&
> bash -x jenkins.sh)
>   711 ?        S      0:00  |               \_ bash -x jenkins.sh
>  2981 ?        S      0:05  |                   \_ bash -x
> test/zdtm.sh -C -x .*\(maps01\|maps04\)
>  6717 ?        S      0:00  |                       \_ /bin/bash
> zdtm/live/static/cgroup00.hook --clean
>  6719 ?        D     11:13  |                           \_ mount -t
> cgroup none cgclean.cWgK71 -o none,name=zdtmtst
>
> [root@linux-next-test ~]# cat /proc/6719/stack
> [<ffffffff8111fcb7>] msleep+0x37/0x50
> [<ffffffff8114bde2>] cgroup_mount+0x712/0xa50
> [<ffffffff8124d0e9>] mount_fs+0x39/0x1b0
> [<ffffffff8126d76b>] vfs_kern_mount+0x6b/0x150
> [<ffffffff8127075b>] do_mount+0x22b/0xc20
> [<ffffffff81271492>] SyS_mount+0xa2/0x110
> [<ffffffff817e3ba9>] system_call_fastpath+0x12/0x17
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Let me know which info do you need for investigation.
>
> Steps to reproduce:
>
> git clone https://github.com/xemul/criu.git
> cd criu
> make
> bash test/zdtm.sh static/cgroup00

Steps to reproduce without CRIU:

[root@avagin-fc19-cr ~]# mount -t cgroup -o none,name=zdtmtst xxx /mnt/test/
[root@avagin-fc19-cr ~]# mkdir /mnt/test/xxx
[root@avagin-fc19-cr ~]# umount /mnt/test/
[root@avagin-fc19-cr ~]# mount -t cgroup -o none,name=zdtmtst xxx /mnt/test/
[1]+  Stopped                 mount -t cgroup -o none,name=zdtmtst xxx
/mnt/test/
[root@avagin-fc19-cr ~]# bg
[root@avagin-fc19-cr ~]# cat /proc/633/stack
[<ffffffff810e9377>] msleep+0x37/0x50
[<ffffffff811118a2>] cgroup_mount+0x482/0x670
[<ffffffff811e3769>] mount_fs+0x39/0x1b0
[<ffffffff8120157b>] vfs_kern_mount+0x6b/0x150
[<ffffffff8120422b>] do_mount+0x22b/0xc20
[<ffffffff81204f62>] SyS_mount+0xa2/0x110
[<ffffffff816fcc69>] system_call_fastpath+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff

>
> Thanks,
> Andrew

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: linux-next: cgroup_mount() falls asleep forever
  2014-09-24 14:29 ` Andrey Wagin
@ 2014-09-24 18:31   ` Cong Wang
  2014-09-24 18:52   ` Al Viro
  1 sibling, 0 replies; 11+ messages in thread
From: Cong Wang @ 2014-09-24 18:31 UTC (permalink / raw)
  To: Andrey Wagin; +Cc: LKML, Tejun Heo, Li Zefan

On Wed, Sep 24, 2014 at 7:29 AM, Andrey Wagin <avagin@gmail.com> wrote:
> 2014-09-24 14:31 GMT+04:00 Andrey Wagin <avagin@gmail.com>:
>> Hi All,
>
> The problem is in a following commit:
>
> commit 0c7bf3e8cab7900e17ce7f97104c39927d835469
> Author: Zefan Li <lizefan@huawei.com>
> Date:   Sat Sep 20 14:49:10 2014 +0800
>
>     cgroup: remove redundant variable in cgroup_mount()
>
>     Both pinned_sb and new_sb indicate if a new superblock is needed,
>     so we can just remove new_sb.
>
>     Note now we must check if kernfs_tryget_sb() returns NULL, because
>     when it returns NULL, kernfs_mount() may still re-use an existing
>     superblock, which is just allocated by another concurent mount.
>

I guess the check for NULL is incorrect, the comment on kernfs_pin_sb()
says:

Returns NULL if there's no superblock associated to this kernfs_root,

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: linux-next: cgroup_mount() falls asleep forever
  2014-09-24 14:29 ` Andrey Wagin
  2014-09-24 18:31   ` Cong Wang
@ 2014-09-24 18:52   ` Al Viro
  2014-09-24 19:24     ` Tejun Heo
  2014-09-24 20:16     ` Al Viro
  1 sibling, 2 replies; 11+ messages in thread
From: Al Viro @ 2014-09-24 18:52 UTC (permalink / raw)
  To: Andrey Wagin; +Cc: LKML, Tejun Heo, Li Zefan

On Wed, Sep 24, 2014 at 06:29:27PM +0400, Andrey Wagin wrote:
> 2014-09-24 14:31 GMT+04:00 Andrey Wagin <avagin@gmail.com>:
> > Hi All,
> 
> The problem is in a following commit:
> 
> commit 0c7bf3e8cab7900e17ce7f97104c39927d835469
> Author: Zefan Li <lizefan@huawei.com>
> Date:   Sat Sep 20 14:49:10 2014 +0800
> 
>     cgroup: remove redundant variable in cgroup_mount()
> 
>     Both pinned_sb and new_sb indicate if a new superblock is needed,
>     so we can just remove new_sb.
> 
>     Note now we must check if kernfs_tryget_sb() returns NULL, because
>     when it returns NULL, kernfs_mount() may still re-use an existing
>     superblock, which is just allocated by another concurent mount.
> 
>     Suggested-by: Tejun Heo <tj@kernel.org>
>     Signed-off-by: Zefan Li <lizefan@huawei.com>
>     Signed-off-by: Tejun Heo <tj@kernel.org>

Lovely...  First of all, that thing is obviously racy - there's nothing
to prevent another mount happening between these two places.  Moreover,
kernfs_mount() calling conventions are really atrocious - pointer to
bool just to indicate that superblock is new?

Could somebody explain WTF is the whole construction trying to do?  Not
to mention anything else, what *does* this pinning a superblock protect
from?  Suppose we have a superblock for the same root with non-NULL ns
and _that_ gets killed.  We get hit by the same
	percpu_ref_kill(&root->cgrp.self.refcnt);
so what's the point of pinned_sb?  Might as well have just bumped the
refcount, superblock or no superblock.  And no, delaying that kernfs_kill_sb()
does you no good whatsoever - again, pinned_sb might have nothing to do with
the superblock we are after.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: linux-next: cgroup_mount() falls asleep forever
  2014-09-24 18:52   ` Al Viro
@ 2014-09-24 19:24     ` Tejun Heo
  2014-09-25  2:47       ` Al Viro
  2014-09-24 20:16     ` Al Viro
  1 sibling, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2014-09-24 19:24 UTC (permalink / raw)
  To: Al Viro; +Cc: Andrey Wagin, LKML, Li Zefan

Hey, Al.

On Wed, Sep 24, 2014 at 07:52:14PM +0100, Al Viro wrote:
> On Wed, Sep 24, 2014 at 06:29:27PM +0400, Andrey Wagin wrote:
> > 2014-09-24 14:31 GMT+04:00 Andrey Wagin <avagin@gmail.com>:
> > > Hi All,
> > 
> > The problem is in a following commit:
> > 
> > commit 0c7bf3e8cab7900e17ce7f97104c39927d835469
> > Author: Zefan Li <lizefan@huawei.com>
> > Date:   Sat Sep 20 14:49:10 2014 +0800
> > 
> >     cgroup: remove redundant variable in cgroup_mount()
> > 
> >     Both pinned_sb and new_sb indicate if a new superblock is needed,
> >     so we can just remove new_sb.
> > 
> >     Note now we must check if kernfs_tryget_sb() returns NULL, because
> >     when it returns NULL, kernfs_mount() may still re-use an existing
> >     superblock, which is just allocated by another concurent mount.
> > 
> >     Suggested-by: Tejun Heo <tj@kernel.org>
> >     Signed-off-by: Zefan Li <lizefan@huawei.com>
> >     Signed-off-by: Tejun Heo <tj@kernel.org>

I'm gonna wait for Li's response for a couple days and then revert it
if it can't be fixed differently.

> Lovely...  First of all, that thing is obviously racy - there's nothing
> to prevent another mount happening between these two places.  Moreover,
> kernfs_mount() calling conventions are really atrocious - pointer to
> bool just to indicate that superblock is new?
> 
> Could somebody explain WTF is the whole construction trying to do?  Not
> to mention anything else, what *does* this pinning a superblock protect
> from?  Suppose we have a superblock for the same root with non-NULL ns
> and _that_ gets killed.  We get hit by the same
> 	percpu_ref_kill(&root->cgrp.self.refcnt);
> so what's the point of pinned_sb?  Might as well have just bumped the
> refcount, superblock or no superblock.  And no, delaying that kernfs_kill_sb()
> does you no good whatsoever - again, pinned_sb might have nothing to do with
> the superblock we are after.

Yeah, it's an ugly thing to work around vfs interface not very
conducive for filesystems which conditionally create or reuse
superblocks during mount.  There was a thread explaining what's going
on.  Looking up...

  http://thread.gmane.org/gmane.linux.kernel.containers/27623/focus=10635

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: linux-next: cgroup_mount() falls asleep forever
  2014-09-24 18:52   ` Al Viro
  2014-09-24 19:24     ` Tejun Heo
@ 2014-09-24 20:16     ` Al Viro
  1 sibling, 0 replies; 11+ messages in thread
From: Al Viro @ 2014-09-24 20:16 UTC (permalink / raw)
  To: Andrey Wagin; +Cc: LKML, Tejun Heo, Li Zefan

On Wed, Sep 24, 2014 at 07:52:14PM +0100, Al Viro wrote:

> Could somebody explain WTF is the whole construction trying to do?  Not
> to mention anything else, what *does* this pinning a superblock protect
> from?  Suppose we have a superblock for the same root with non-NULL ns
> and _that_ gets killed.  We get hit by the same
> 	percpu_ref_kill(&root->cgrp.self.refcnt);
> so what's the point of pinned_sb?  Might as well have just bumped the
> refcount, superblock or no superblock.  And no, delaying that kernfs_kill_sb()
> does you no good whatsoever - again, pinned_sb might have nothing to do with
> the superblock we are after.

Hrm...  Scratch the comments re "different superblock" (we are passing NULL
ns in that kernfs_mount() below), but...  then WTF is that thing trying to
do?  OK, you've got your active reference to a superblock from
kernfs_pin_sb().  You grab root->cgrp.self.refcnt.  Suppose it also worked.
Now what?  You drop cgroup_mutex and proceed to kernfs_mount().  Which
calls sget(), looking for exact same thing as your kernfs_pin_sb().  So it
finds the same superblock and grab it for you (with ->s_umount held).
At which point you drop root->cgrp.self.refcnt and drop the active reference
you've got from kernfs_pin_sb().  Pardon me, but what's the point of that
song and dance?  Who else can make that attempt at grabbing
root->cgrp.self.refcnt fail?

BTW, what happens if kernfs_fill_super() fails?  You get cgroup_kill_sb()
triggered by deactivate_locked_super(), which calls kernfs_kill_sb(), which
does kernfs_put().  Balancing the kernfs_get() we'd never got around to
in kernfs_fill_super()...

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: linux-next: cgroup_mount() falls asleep forever
  2014-09-24 19:24     ` Tejun Heo
@ 2014-09-25  2:47       ` Al Viro
  2014-09-25  3:25         ` Tejun Heo
  0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2014-09-25  2:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Andrey Wagin, LKML, Li Zefan

On Wed, Sep 24, 2014 at 03:24:56PM -0400, Tejun Heo wrote:

> > Lovely...  First of all, that thing is obviously racy - there's nothing
> > to prevent another mount happening between these two places.  Moreover,
> > kernfs_mount() calling conventions are really atrocious - pointer to
> > bool just to indicate that superblock is new?
> > 
> > Could somebody explain WTF is the whole construction trying to do?  Not
> > to mention anything else, what *does* this pinning a superblock protect
> > from?  Suppose we have a superblock for the same root with non-NULL ns
> > and _that_ gets killed.  We get hit by the same
> > 	percpu_ref_kill(&root->cgrp.self.refcnt);
> > so what's the point of pinned_sb?  Might as well have just bumped the
> > refcount, superblock or no superblock.  And no, delaying that kernfs_kill_sb()
> > does you no good whatsoever - again, pinned_sb might have nothing to do with
> > the superblock we are after.
> 
> Yeah, it's an ugly thing to work around vfs interface not very
> conducive for filesystems which conditionally create or reuse
> superblocks during mount.  There was a thread explaining what's going
> on.  Looking up...
> 
>   http://thread.gmane.org/gmane.linux.kernel.containers/27623/focus=10635

Umm...  I still don't get it.  Could you describe the screnario in which
that percpu_ref_tryget_live() would be called and managed to fail?

It smells to me like most of the problems here are simply due to having too
many locks and not being able to decide where should they live relative to
->s_umount.  That cgroup_mutex thing feels like something way to coarse...
You have it grabbed/dropped in
	cgroup_destroy_root(), cgroup_remount(), cgroup_mount(),
	task_cgroup_path(), cgroup_attach_task_all(), cgroup_rename(),
	cgroup_rm_cftypes(), cgroup_add_cftypes(), cgroup_transfer_tasks(),
	cgroupstats_build(),
	css_release_work_fn() [async, queue_work()],
	css_killed_work_fn() [async, queue_work()],
	cgroup_init_subsys(), cgroup_init(), proc_cgroup_show(),
	proc_cgroupstats_show(), cgroup_release_agent(), __cgroup_procs_write(),
	cgroup_release_agent_write(), cgroup_subtree_control_write(),
	cgroup_mkdir(), cgroup_rmdir()
And that's a single system-wide mutex.  Plus there's a slew of workqueues
and really unpleasant abuse of restart_syscall() tossed in for fun - what
happens if some joker triggers that ->mount() _not_ from mount(2)?

Then there's a global rwsem nesting inside that sucker.  And there's another
mutex in fs/kernfs - also a global one.  Are the locking rules documented
anywhere?  Lifetime rules, as well...

Frankly, my first inclination here would be to try using sget() instead of
scanning the list of roots.  How painful would it be to split the allocation
and setup of roots, always allocate a new root and have sget() wait
for fs shutdown in progress and decide whether it wants to reuse the existing
one.  You can easily tell reuse existing vs. set up a new one from each other -
just look at the root associated with the superblock you've got and check
if it's the one you've allocated.  Freeing the damn thing if we'd reused
an existing one and doing setup otherwise.

I realize that it won't do in such form; your for_each_subsys() loop in there
really depends on holding cgroup_mutex all the way through.  But do we really
need it there?  Would just skipping the ones that doomed in rebind_subsystems()
suffice?

Just looking at the size of the damn thing is depressing, TBH - it's quite
a bit bigger than *anything* in fs/*.c.  And callers are all over the tree ;-/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: linux-next: cgroup_mount() falls asleep forever
  2014-09-25  2:47       ` Al Viro
@ 2014-09-25  3:25         ` Tejun Heo
  2014-09-25 10:23           ` Zefan Li
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2014-09-25  3:25 UTC (permalink / raw)
  To: Al Viro; +Cc: Andrey Wagin, LKML, Li Zefan

Hello, Al.

On Thu, Sep 25, 2014 at 03:47:19AM +0100, Al Viro wrote:
> > Yeah, it's an ugly thing to work around vfs interface not very
> > conducive for filesystems which conditionally create or reuse
> > superblocks during mount.  There was a thread explaining what's going
> > on.  Looking up...
> > 
> >   http://thread.gmane.org/gmane.linux.kernel.containers/27623/focus=10635
> 
> Umm...  I still don't get it.  Could you describe the screnario in which
> that percpu_ref_tryget_live() would be called and managed to fail?

That was for the initial fix and Li later added the pinning to fix
something else.  Let's wait for Li to chime in.  He knows this part
better.

> It smells to me like most of the problems here are simply due to having too
> many locks and not being able to decide where should they live relative to
> ->s_umount.  That cgroup_mutex thing feels like something way to coarse...
> You have it grabbed/dropped in

cgroup_mutex is the outer-most lock as far as cgroup is concerned and
not expected to nest under anything which is used by individual
controllers.  Most of what cgroup core does is low-freq managerial
things which don't benefit from finer grained locking and the mount
path is one of few surfaces where it interacts with outside in terms
of locking, so it's better to keep that path special and everything
else simpler.

> And that's a single system-wide mutex.  Plus there's a slew of workqueues
> and really unpleasant abuse of restart_syscall() tossed in for fun - what
> happens if some joker triggers that ->mount() _not_ from mount(2)?

For cgroup, mount is the userland-visible init interface.  It gotta be
called from userland.  It originally had internal retry loop but
syscall restart is simpler.  Reviving that loop isn't difficult if it
ever becomes necessary.

> Then there's a global rwsem nesting inside that sucker.  And there's another

The rwsem nests inside cgroup_mutex and exists to allow multiple
reader accesses to a particular data structure.

> mutex in fs/kernfs - also a global one.  Are the locking rules documented
> anywhere?  Lifetime rules, as well...

kernfs one shouldn't interact with anything outside kernfs.  Its
dependency is terminated within kernfs.

> Frankly, my first inclination here would be to try using sget() instead of
> scanning the list of roots.  How painful would it be to split the allocation
> and setup of roots, always allocate a new root and have sget() wait
> for fs shutdown in progress and decide whether it wants to reuse the existing
> one.  You can easily tell reuse existing vs. set up a new one from each other -
> just look at the root associated with the superblock you've got and check
> if it's the one you've allocated.  Freeing the damn thing if we'd reused
> an existing one and doing setup otherwise.
> 
> I realize that it won't do in such form; your for_each_subsys() loop in there
> really depends on holding cgroup_mutex all the way through.  But do we really
> need it there?  Would just skipping the ones that doomed in rebind_subsystems()
> suffice?

At this point, cgroup core locking is heavily focused on simplicity -
cgroup_mutex for the whole thing and css_set_rwsem for css_set reader
accesses.  It works out pretty well for the rest of the code but the
mount path does get tricky.  We can definitely relax / separate out
locking on subsys iteration for mount path but if possible I'd prefer
to pay isolated complexity there instead of spilling it to other
places.

Anyways, let's wait for Li.  At least nobody reported breakage before
the recent commit, so we can revert the offending commit for the short
term.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: linux-next: cgroup_mount() falls asleep forever
  2014-09-25  3:25         ` Tejun Heo
@ 2014-09-25 10:23           ` Zefan Li
  2014-09-25 15:14             ` Tejun Heo
  0 siblings, 1 reply; 11+ messages in thread
From: Zefan Li @ 2014-09-25 10:23 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Al Viro, Andrey Wagin, LKML

On 2014/9/25 11:25, Tejun Heo wrote:
> Hello, Al.
> 
> On Thu, Sep 25, 2014 at 03:47:19AM +0100, Al Viro wrote:
>>> Yeah, it's an ugly thing to work around vfs interface not very
>>> conducive for filesystems which conditionally create or reuse
>>> superblocks during mount.  There was a thread explaining what's going
>>> on.  Looking up...
>>>
>>>   http://thread.gmane.org/gmane.linux.kernel.containers/27623/focus=10635
>>
>> Umm...  I still don't get it.  Could you describe the screnario in which
>> that percpu_ref_tryget_live() would be called and managed to fail?
> 

See this:

https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/kernel/cgroup.c?id=3a32bd72d77058d768dbb38183ad517f720dd1bc

Without the above commit, A scenario like this can happen:

1. Thread A has dropped sb refcnt to 0 but hasn't called percpu_ref_kill(). 
2. At this time Thread B calls cgroup_mount() and percpu_ref_tryget() succeeds.
3. After a while Thread C calls cgroup_mount(), but percpu_ref_tryget() keeps
returning failure, because the ref has been killed by percpu_ref_kill().

So I had to use kernfs_pin_sb() to prevent thread B from getting the percpu refcnt.

Any better idea to fix this?

> That was for the initial fix and Li later added the pinning to fix
> something else.  Let's wait for Li to chime in.  He knows this part
> better.
> 
>> It smells to me like most of the problems here are simply due to having too
>> many locks and not being able to decide where should they live relative to
>> ->s_umount.  That cgroup_mutex thing feels like something way to coarse...
>> You have it grabbed/dropped in
> 
> cgroup_mutex is the outer-most lock as far as cgroup is concerned and
> not expected to nest under anything which is used by individual
> controllers.  Most of what cgroup core does is low-freq managerial
> things which don't benefit from finer grained locking and the mount
> path is one of few surfaces where it interacts with outside in terms
> of locking, so it's better to keep that path special and everything
> else simpler.
> 
>> And that's a single system-wide mutex.  Plus there's a slew of workqueues
>> and really unpleasant abuse of restart_syscall() tossed in for fun - what
>> happens if some joker triggers that ->mount() _not_ from mount(2)?
> 
> For cgroup, mount is the userland-visible init interface.  It gotta be
> called from userland.  It originally had internal retry loop but
> syscall restart is simpler.  Reviving that loop isn't difficult if it
> ever becomes necessary.
> 
>> Then there's a global rwsem nesting inside that sucker.  And there's another
> 
> The rwsem nests inside cgroup_mutex and exists to allow multiple
> reader accesses to a particular data structure.
> 
>> mutex in fs/kernfs - also a global one.  Are the locking rules documented
>> anywhere?  Lifetime rules, as well...
> 
> kernfs one shouldn't interact with anything outside kernfs.  Its
> dependency is terminated within kernfs.
> 
>> Frankly, my first inclination here would be to try using sget() instead of
>> scanning the list of roots.  How painful would it be to split the allocation
>> and setup of roots, always allocate a new root and have sget() wait
>> for fs shutdown in progress and decide whether it wants to reuse the existing
>> one.  You can easily tell reuse existing vs. set up a new one from each other -
>> just look at the root associated with the superblock you've got and check
>> if it's the one you've allocated.  Freeing the damn thing if we'd reused
>> an existing one and doing setup otherwise.
>>
>> I realize that it won't do in such form; your for_each_subsys() loop in there
>> really depends on holding cgroup_mutex all the way through.  But do we really
>> need it there?  Would just skipping the ones that doomed in rebind_subsystems()
>> suffice?
> 
> At this point, cgroup core locking is heavily focused on simplicity -
> cgroup_mutex for the whole thing and css_set_rwsem for css_set reader
> accesses.  It works out pretty well for the rest of the code but the
> mount path does get tricky.  We can definitely relax / separate out
> locking on subsys iteration for mount path but if possible I'd prefer
> to pay isolated complexity there instead of spilling it to other
> places.
> 
> Anyways, let's wait for Li.  At least nobody reported breakage before
> the recent commit, so we can revert the offending commit for the short
> term.
> 

That patch is wrong. We have to use both pinned_sb and new_sb, so
please revert it. :(

I think we need to put some test cases in tools/testing/selftests/, to
prevent this fragile thing from breaking again.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: linux-next: cgroup_mount() falls asleep forever
  2014-09-25 10:23           ` Zefan Li
@ 2014-09-25 15:14             ` Tejun Heo
  0 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2014-09-25 15:14 UTC (permalink / raw)
  To: Zefan Li; +Cc: Al Viro, Andrey Wagin, LKML

On Thu, Sep 25, 2014 at 06:23:39PM +0800, Zefan Li wrote:
> That patch is wrong. We have to use both pinned_sb and new_sb, so
> please revert it. :(
> 
> I think we need to put some test cases in tools/testing/selftests/, to
> prevent this fragile thing from breaking again.

Can you please send a patch to revert it?  There have been a couple
changes dependent on it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-09-25 15:14 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-24 10:31 linux-next: cgroup_mount() falls asleep forever Andrey Wagin
2014-09-24 14:29 ` Andrey Wagin
2014-09-24 18:31   ` Cong Wang
2014-09-24 18:52   ` Al Viro
2014-09-24 19:24     ` Tejun Heo
2014-09-25  2:47       ` Al Viro
2014-09-25  3:25         ` Tejun Heo
2014-09-25 10:23           ` Zefan Li
2014-09-25 15:14             ` Tejun Heo
2014-09-24 20:16     ` Al Viro
2014-09-24 16:13 ` Andrey Wagin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).